Vik Paruchuri committed
Commit 13fe745 · 1 Parent(s): 88f20eb

Update benchmarks

.gitignore CHANGED
@@ -4,7 +4,6 @@ local.env
4
  experiments
5
  test_data
6
  training
7
- benchmark_data
8
  wandb
9
 
10
  # Byte-compiled / optimized / DLL files
 
4
  experiments
5
  test_data
6
  training
 
7
  wandb
8
 
9
  # Byte-compiled / optimized / DLL files
README.md CHANGED
@@ -51,14 +51,14 @@ First, clone the repo:
51
  - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
52
  - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
53
  - Set the tesseract data folder path
54
- - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the right tesseract version if you have multiple!
55
  - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
56
  - Install python requirements
57
  - `poetry install`
58
  - `poetry shell` to activate your poetry venv
59
  - Update pytorch as needed since poetry doesn't play nicely with it
60
  - GPU only: run `pip install torch` to install other torch dependencies.
61
- - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
62
 
63
  ## Mac
64
 
@@ -126,7 +126,7 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
126
 
127
  # Benchmarks
128
 
129
- Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compared the reference to the output of text extraction methods.
130
 
131
  Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
132
 
@@ -142,21 +142,21 @@ nougat 0.614548 810.756
142
 
143
  **Accuracy**
144
 
145
- First 4 are non-arXiv books, last 3 are arXiv papers.
146
 
147
- Method thinkos.pdf thinkdsp.pdf thinkpython.pdf paip.pdf switch_trans.pdf crowd.pdf multicolcnn.pdf
148
- -------- ------------- -------------- ----------------- ---------- ------------------ ----------- -----------------
149
- naive 0.366817 0.412014 0.468147 0.735464 0.244739 0.14489 0.0890217
150
- marker 0.753291 0.787938 0.779262 0.679189 0.478387 0.446068 0.533737
151
- nougat 0.638434 0.632723 0.637626 0.462495 0.690028 0.540994 0.699539
152
 
153
- Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `2.7GB` for marker. Benchmarks were run on an A6000.
154
 
155
  ## Running your own benchmarks
156
 
157
- You can benchmark the performance of marker on your machine.
158
 
159
- Run `benchmark.py` like this:
160
 
161
  ```
162
  python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
168
 
169
  # Commercial usage
170
 
171
- Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
172
 
173
  # Thanks
174
 
@@ -177,4 +187,6 @@ This work would not have been possible without amazing open source models and da
177
  - Nougat from Meta
178
  - Layoutlmv3 from Microsoft
179
  - DocLayNet from IBM
180
- - ByT5 from Google
 
 
 
51
  - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
52
  - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
53
  - Set the tesseract data folder path
54
+ - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
55
  - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
56
  - Install python requirements
57
  - `poetry install`
58
  - `poetry shell` to activate your poetry venv
59
  - Update pytorch as needed since poetry doesn't play nicely with it
60
  - GPU only: run `pip install torch` to install other torch dependencies.
61
+ - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
62
 
63
  ## Mac
64
 
 
126
 
127
  # Benchmarks
128
 
129
+ Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
130
 
131
  Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
132
 
 
142
 
143
  **Accuracy**
144
 
145
+ First 3 are non-arXiv books, last 3 are arXiv papers.
146
 
147
+ Method thinkos.pdf thinkdsp.pdf thinkpython.pdf switch_trans.pdf crowd.pdf multicolcnn.pdf
148
+ -------- ------------- -------------- ----------------- ------------------ ----------- -----------------
149
+ naive 0.366817 0.412014 0.468147 0.244739 0.14489 0.0890217
150
+ marker 0.753291 0.787938 0.779262 0.478387 0.446068 0.533737
151
+ nougat 0.638434 0.632723 0.637626 0.690028 0.540994 0.699539
152
 
153
+ Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
154
 
155
  ## Running your own benchmarks
156
 
157
+ You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
158
 
159
+ Then run `benchmark.py` like this:
160
 
161
  ```
162
  python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
 
168
 
169
  # Commercial usage
170
 
171
+ Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
172
+
173
+ I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
174
+
175
+ Here are the non-commercial/restrictive dependencies:
176
+
177
+ - LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
178
+ - Nougat: CC-BY-NC . [Source](https://github.com/facebookresearch/nougat)
179
+ - PyMuPDF - GPL . [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
180
+
181
+ Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
182
 
183
  # Thanks
184
 
 
187
  - Nougat from Meta
188
  - Layoutlmv3 from Microsoft
189
  - DocLayNet from IBM
190
+ - ByT5 from Google
191
+
192
+ Thank you to the authors of these models and datasets for making them available to the community.
benchmark.py CHANGED
@@ -40,7 +40,8 @@ if __name__ == "__main__":
40
  parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
41
  parser.add_argument("out_file", help="Output filename")
42
  parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
43
- parser.add_argument("--nougat_batch_size", type=int, default=2, help="Batch size to use for nougat when making predictions.")
 
44
  parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
45
  parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
46
  args = parser.parse_args()
 
40
  parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
41
  parser.add_argument("out_file", help="Output filename")
42
  parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
43
+ # Nougat batch size 1 uses about as much VRAM as default marker settings
44
+ parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
45
  parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
46
  parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
47
  args = parser.parse_args()
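
The effect of the new `--nougat_batch_size` default can be checked with a small, self-contained sketch of the parser. The argument names and help strings below are copied from the hunk above; the first positional argument is not visible in this hunk, so `in_folder` is an assumed name, and `build_parser` is a hypothetical helper rather than a function in `benchmark.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the argument parser shown in the hunk above (sketch only)."""
    parser = argparse.ArgumentParser(description="Benchmark argument sketch")
    # Assumed name: the hunk does not show the first positional argument.
    parser.add_argument("in_folder", help="Input PDF folder (assumed name)")
    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
    parser.add_argument("out_file", help="Output filename")
    parser.add_argument("--nougat", action="store_true", default=False, help="Run nougat and compare")
    # New default of 1 after this commit (previously 2).
    parser.add_argument("--nougat_batch_size", type=int, default=1,
                        help="Batch size to use for nougat when making predictions.")
    parser.add_argument("--marker_parallel_factor", type=int, default=1,
                        help="How much to multiply default parallel OCR workers and model batch sizes by.")
    parser.add_argument("--md_out_path", type=str, default=None,
                        help="Output path for generated markdown files")
    return parser


# Same arguments as the README command line.
args = build_parser().parse_args(
    ["benchmark_data/pdfs", "benchmark_data/references", "report.json", "--nougat"]
)
print(args.nougat, args.nougat_batch_size)
```

Passing `--nougat_batch_size 4` on the command line would still override the default.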
benchmark_data/.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ latex
2
+ pdfs
3
+ references
benchmark_data/latex_to_md.sh ADDED
@@ -0,0 +1,21 @@
1
+ #!/bin/bash
2
+
3
+ # List all .tex files in the latex folder
4
+ FILES=$(find latex -name "*.tex")
5
+
6
+ for f in $FILES
7
+ do
8
+ echo "Processing $f file..."
9
+ base_name=$(basename "$f" .tex)
10
+ out_file="references/${base_name}.md"
11
+
12
+ pandoc --wrap=none --no-highlight --strip-comments=true -s "$f" -t plain -o "$out_file"
13
+ # Replace non-breaking spaces
14
+ sed -i .bak 's/ / /g' "$out_file"
15
+ sed -i .bak 's/ / /g' "$out_file"
16
+ sed -i .bak 's/ / /g' "$out_file"
17
+ sed -i .bak 's/ / /g' "$out_file"
18
+ # Remove .bak file
19
+ rm "$out_file.bak"
20
+ done
21
+
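
In this view the four `sed` patterns look identical, most likely because the distinct Unicode space characters in the script were flattened to plain spaces when the page was rendered. The exact characters are not recoverable here. As a rough Python sketch of the same cleanup step, assuming the targets are common non-breaking space code points (the set below is an assumption, not taken from the script, and `normalize_spaces` is a hypothetical helper):

```python
# Assumed set of space-like code points; the script's real patterns are
# not visible in this diff, so treat this mapping as illustrative only.
ASSUMED_SPACES = {
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u2007": " ",  # FIGURE SPACE
    "\u202f": " ",  # NARROW NO-BREAK SPACE
}
TRANSLATION_TABLE = str.maketrans(ASSUMED_SPACES)


def normalize_spaces(text: str) -> str:
    """Replace each assumed non-breaking space with a plain ASCII space."""
    return text.translate(TRANSLATION_TABLE)


sample = "Figure\u00a01 shows 10\u202fms latency"
print(normalize_spaces(sample))
```

Note also that `sed -i .bak` (with a space before the suffix) is the BSD/macOS form; GNU sed expects the suffix attached, as in `-i.bak`.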
marker/settings.py CHANGED
@@ -11,7 +11,7 @@ class Settings(BaseSettings):
11
  # General
12
  TORCH_DEVICE: str = "cpu"
13
  INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
14
- VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB)
15
  DEBUG: bool = False # Enable debug logging
16
  DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
17
 
@@ -57,7 +57,7 @@ class Settings(BaseSettings):
57
  "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
58
  NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
59
  NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
60
- NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
61
 
62
  # Layout model
63
  BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
@@ -73,7 +73,7 @@ class Settings(BaseSettings):
73
 
74
  # Final editing model
75
  EDITOR_BATCH_SIZE: int = 4
76
- EDITOR_MAX_LENGTH: int = 1024
77
  EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
78
  ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
79
 
 
11
  # General
12
  TORCH_DEVICE: str = "cpu"
13
  INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
14
+ VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB). Peak marker VRAM usage is around 3GB, but avg across workers is lower.
15
  DEBUG: bool = False # Enable debug logging
16
  DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
17
 
 
57
  "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
58
  NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
59
  NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
60
+ NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
61
 
62
  # Layout model
63
  BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
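
One thing worth checking about the `NOUGAT_BATCH_SIZE` line: in plain Python, a conditional default written in a class body is evaluated once, while the class is being defined, against the value `TORCH_DEVICE` has at that moment. The sketch below uses an ordinary class rather than the project's `BaseSettings` subclass, so it only illustrates the language behaviour; `SettingsSketch` and `nougat_batch_size` are hypothetical names.

```python
class SettingsSketch:
    TORCH_DEVICE: str = "cpu"
    # Evaluated once, while the class body runs, against the default above.
    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1


def nougat_batch_size(torch_device: str) -> int:
    """Hypothetical helper: derive the batch size from the device in use."""
    return 6 if torch_device == "cuda" else 1


print(SettingsSketch.NOUGAT_BATCH_SIZE)
print(nougat_batch_size("cuda"))
```

If the settings class behaves the same way, a later override of `TORCH_DEVICE` (for example through `local.env`) would not change the precomputed default, and `NOUGAT_BATCH_SIZE` would need to be set explicitly or computed at use time.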
 
73
 
74
  # Final editing model
75
  EDITOR_BATCH_SIZE: int = 4
76
+ EDITOR_MAX_LENGTH: int = 2048
77
  EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
78
  ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
79
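
To make the new `VRAM_PER_TASK` comment concrete, here is a small sketch of how a per-GPU task count could follow from the two memory settings. It assumes the count is the floor of `INFERENCE_RAM / VRAM_PER_TASK`; this diff does not show how marker actually consumes these values, and `tasks_per_gpu` is a hypothetical helper.

```python
def tasks_per_gpu(inference_ram_gb: float, vram_per_task_gb: float) -> int:
    """Number of tasks that fit on one GPU under a simple floor-division budget."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    return int(inference_ram_gb // vram_per_task_gb)


# Defaults from settings.py: 40 GB per GPU, 2.5 GB budgeted per task.
print(tasks_per_gpu(40, 2.5))
# Budgeting for the roughly 3 GB peak mentioned in the comment instead.
print(tasks_per_gpu(40, 3.0))
```

Under this assumption, budgeting by the average (2.5 GB) rather than the peak (about 3 GB) allows a few more concurrent tasks per GPU, at the cost of less headroom if several workers peak at once.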