Commit 13fe745 · 1 Parent(s): 88f20eb
Vik Paruchuri committed

Update benchmarks

Browse files:
- .gitignore +0 -1
- README.md +26 -14
- benchmark.py +2 -1
- benchmark_data/.gitignore +3 -0
- benchmark_data/latex_to_md.sh +21 -0
- marker/settings.py +3 -3
.gitignore
CHANGED
@@ -4,7 +4,6 @@ local.env
 experiments
 test_data
 training
-benchmark_data
 wandb
 
 # Byte-compiled / optimized / DLL files
README.md
CHANGED
@@ -51,14 +51,14 @@ First, clone the repo:
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
 - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-  - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the
+  - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
   - `poetry install`
   - `poetry shell` to activate your poetry venv
 - Update pytorch as needed since poetry doesn't play nicely with it
   - GPU only: run `pip install torch` to install other torch dependencies.
-  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
+  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
 
 ## Mac
 
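The `TESSDATA_PREFIX` value placed in `local.env` is presumably picked up through marker's settings (the hunk headers in `marker/settings.py` below show a pydantic `BaseSettings` subclass). As a minimal sketch of that pattern, not marker's actual settings class, and assuming pydantic v1 with python-dotenv installed, a field can be filled from `local.env` like this:

```
# Minimal sketch, not marker's settings class: a pydantic v1 BaseSettings
# field picked up from local.env (env_file support requires python-dotenv).
from pydantic import BaseSettings

class ExampleSettings(BaseSettings):
    TESSDATA_PREFIX: str = ""  # set in local.env, e.g. TESSDATA_PREFIX=/path/to/tessdata

    class Config:
        env_file = "local.env"

settings = ExampleSettings()
print(settings.TESSDATA_PREFIX)
```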
@@ -126,7 +126,7 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 
 # Benchmarks
 
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
 
 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
 
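How the reference text and the extracted text are actually scored is not shown in this diff. Purely as an illustration of the comparison described above, and not the metric `benchmark.py` uses, a rough 0-1 similarity between a reference file and an extraction output could be computed with difflib:

```
# Illustrative stand-in metric, not the scoring used by benchmark.py:
# a rough 0-1 similarity between reference text and extracted text.
from difflib import SequenceMatcher

def rough_score(reference: str, hypothesis: str) -> float:
    # Collapse whitespace so line-wrapping differences don't dominate
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    return SequenceMatcher(None, ref, hyp).ratio()

print(rough_score("Benchmarking PDF extraction quality is hard.",
                  "Benchmarking PDF extraction is hard."))
```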
@@ -142,21 +142,21 @@ nougat 0.614548 810.756
 
 **Accuracy**
 
-First
+First 3 are non-arXiv books, last 3 are arXiv papers.
 
-Method thinkos.pdf thinkdsp.pdf thinkpython.pdf
--------- ------------- -------------- -----------------
-naive 0.366817 0.412014 0.468147
-marker 0.753291 0.787938 0.779262
-nougat 0.638434 0.632723 0.637626
+Method thinkos.pdf thinkdsp.pdf thinkpython.pdf switch_trans.pdf crowd.pdf multicolcnn.pdf
+-------- ------------- -------------- ----------------- ------------------ ----------- -----------------
+naive 0.366817 0.412014 0.468147 0.244739 0.14489 0.0890217
+marker 0.753291 0.787938 0.779262 0.478387 0.446068 0.533737
+nougat 0.638434 0.632723 0.637626 0.690028 0.540994 0.699539
 
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
 
 ## Running your own benchmarks
 
-You can benchmark the performance of marker on your machine.
+You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
 
-
+Then run `benchmark.py` like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
 
 # Commercial usage
 
-Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
+
+I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
+
+Here are the non-commercial/restrictive dependencies:
+
+- LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
+- Nougat: CC-BY-NC . [Source](https://github.com/facebookresearch/nougat)
+- PyMuPDF - GPL . [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
+
+Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
 
 # Thanks
 
@@ -177,4 +187,6 @@ This work would not have been possible without amazing open source models and da
 - Nougat from Meta
 - Layoutlmv3 from Microsoft
 - DocLayNet from IBM
-- ByT5 from Google
+- ByT5 from Google
+
+Thank you to the authors of these models and datasets for making them available to the community.
benchmark.py
CHANGED
@@ -40,7 +40,8 @@ if __name__ == "__main__":
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-
+    # Nougat batch size 1 uses about as much VRAM as default marker settings
+    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
     parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
     args = parser.parse_args()
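Usage note: combined with the invocation shown in the README diff above, the new flag would be passed as `python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat --nougat_batch_size 2` (the batch size value here is illustrative, not a recommendation from this commit). The three positional arguments correspond to the folder of input PDFs, the `reference_folder` of reference markdown files, and the `out_file` report path defined in the argparse calls above.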
benchmark_data/.gitignore
ADDED
@@ -0,0 +1,3 @@
+latex
+pdfs
+references
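The three ignored folders match the layout the rest of this commit assumes: `latex/` holds the LaTeX sources that `latex_to_md.sh` (below) converts, `references/` receives its converted output, and `pdfs/` holds the files passed to `benchmark.py` in the README instructions.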
benchmark_data/latex_to_md.sh
ADDED
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# List all .tex files in the latex folder
+FILES=$(find latex -name "*.tex")
+
+for f in $FILES
+do
+  echo "Processing $f file..."
+  base_name=$(basename "$f" .tex)
+  out_file="references/${base_name}.md"
+
+  pandoc --wrap=none --no-highlight --strip-comments=true -s "$f" -t plain -o "$out_file"
+  # Replace non-breaking spaces
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  # Remove .bak file
+  rm "$out_file.bak"
+done
+
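As written, the script uses paths relative to the working directory, so it expects to be run from inside `benchmark_data/` with `latex/` and `references/` present. Two portability notes: `sed -i .bak` is the BSD/macOS spelling of in-place editing (GNU sed wants the suffix attached, as `sed -i.bak`), and the characters being substituted are non-breaking/unicode spaces, which may render as ordinary spaces here.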
marker/settings.py
CHANGED
@@ -11,7 +11,7 @@ class Settings(BaseSettings):
     # General
     TORCH_DEVICE: str = "cpu"
     INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
-    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB)
+    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB). Peak marker VRAM usage is around 3GB, but avg across workers is lower.
     DEBUG: bool = False # Enable debug logging
     DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
 
@@ -57,7 +57,7 @@ class Settings(BaseSettings):
         "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
     NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
     NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
-    NOUGAT_BATCH_SIZE: int =
+    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
 
     # Layout model
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
@@ -73,7 +73,7 @@ class Settings(BaseSettings):
 
     # Final editing model
     EDITOR_BATCH_SIZE: int = 4
-    EDITOR_MAX_LENGTH: int =
+    EDITOR_MAX_LENGTH: int = 2048
     EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
     ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
 
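One subtlety in the `NOUGAT_BATCH_SIZE` default above: because the conditional sits directly in the class body, it is evaluated once when the class is defined, against the in-class default of `TORCH_DEVICE`, so a later override of `TORCH_DEVICE` (for example via `local.env`) would not change the already-computed batch size default (though `NOUGAT_BATCH_SIZE` can still be set explicitly). A minimal sketch of that behavior, using the names from the diff but not marker's actual class:

```
# Minimal sketch of the class-body conditional default pattern, not marker's code.
class ExampleSettings:
    TORCH_DEVICE: str = "cpu"
    # Evaluated once while the class body runs, using the "cpu" default above.
    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1
    EDITOR_MAX_LENGTH: int = 2048

print(ExampleSettings.NOUGAT_BATCH_SIZE)  # prints 1
```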