Commit 13fe745 · 1 Parent(s): 88f20eb
Vik Paruchuri committed

Update benchmarks

Browse files:
- .gitignore +0 -1
- README.md +26 -14
- benchmark.py +2 -1
- benchmark_data/.gitignore +3 -0
- benchmark_data/latex_to_md.sh +21 -0
- marker/settings.py +3 -3
.gitignore
CHANGED
@@ -4,7 +4,6 @@ local.env
 experiments
 test_data
 training
-benchmark_data
 wandb
 
 # Byte-compiled / optimized / DLL files
README.md
CHANGED
@@ -51,14 +51,14 @@ First, clone the repo:
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `install/ghostscript_install.sh`.
 - Install other requirements with `cat install/apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
-  - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the
+  - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple!
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 - Install python requirements
   - `poetry install`
   - `poetry shell` to activate your poetry venv
 - Update pytorch as needed since poetry doesn't play nicely with it
   - GPU only: run `pip install torch` to install other torch dependencies.
-  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/) instructions.
+  - CPU only: Uninstall torch, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
 
 ## Mac
 
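The `TESSDATA_PREFIX` value placed in `local.env` is presumably picked up through marker's settings (the hunk headers in `marker/settings.py` below show a pydantic `BaseSettings` subclass). As a minimal sketch of that pattern, not marker's actual settings class, and assuming pydantic v1 with python-dotenv installed, a field can be filled from `local.env` like this:

```
# Minimal sketch, not marker's settings class: a pydantic v1 BaseSettings
# field picked up from local.env (env_file support requires python-dotenv).
from pydantic import BaseSettings

class ExampleSettings(BaseSettings):
    TESSDATA_PREFIX: str = ""  # set in local.env, e.g. TESSDATA_PREFIX=/path/to/tessdata

    class Config:
        env_file = "local.env"

settings = ExampleSettings()
print(settings.TESSDATA_PREFIX)
```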
@@ -126,7 +126,7 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 
 # Benchmarks
 
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I converted the latex to text, and compare the reference to the output of text extraction methods.
 
 Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
 
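How the reference text and the extracted text are actually scored is not shown in this diff. Purely as an illustration of the comparison described above, and not the metric `benchmark.py` uses, a rough 0-1 similarity between a reference file and an extraction output could be computed with difflib:

```
# Illustrative stand-in metric, not the scoring used by benchmark.py:
# a rough 0-1 similarity between reference text and extracted text.
from difflib import SequenceMatcher

def rough_score(reference: str, hypothesis: str) -> float:
    # Collapse whitespace so line-wrapping differences don't dominate
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    return SequenceMatcher(None, ref, hyp).ratio()

print(rough_score("Benchmarking PDF extraction quality is hard.",
                  "Benchmarking PDF extraction is hard."))
```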
@@ -142,21 +142,21 @@ nougat 0.614548 810.756
 
 **Accuracy**
 
-First
+First 3 are non-arXiv books, last 3 are arXiv papers.
 
-Method thinkos.pdf thinkdsp.pdf thinkpython.pdf
--------- ------------- -------------- -----------------
-naive 0.366817 0.412014 0.468147
-marker 0.753291 0.787938 0.779262
-nougat 0.638434 0.632723 0.637626
+Method thinkos.pdf thinkdsp.pdf thinkpython.pdf switch_trans.pdf crowd.pdf multicolcnn.pdf
+-------- ------------- -------------- ----------------- ------------------ ----------- -----------------
+naive 0.366817 0.412014 0.468147 0.244739 0.14489 0.0890217
+marker 0.753291 0.787938 0.779262 0.478387 0.446068 0.533737
+nougat 0.638434 0.632723 0.637626 0.690028 0.540994 0.699539
 
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
 
 ## Running your own benchmarks
 
-You can benchmark the performance of marker on your machine.
+You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
 
-
+Then run `benchmark.py` like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -168,7 +168,17 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
 
 # Commercial usage
 
-Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
+
+I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
+
+Here are the non-commercial/restrictive dependencies:
+
+- LayoutLMv3: CC BY-NC-SA 4.0 . [Source](https://huggingface.co/microsoft/layoutlmv3-base)
+- Nougat: CC-BY-NC . [Source](https://github.com/facebookresearch/nougat)
+- PyMuPDF - GPL . [Source](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright)
+
+Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
 
 # Thanks
 
@@ -177,4 +187,6 @@ This work would not have been possible without amazing open source models and da
 - Nougat from Meta
 - Layoutlmv3 from Microsoft
 - DocLayNet from IBM
-- ByT5 from Google
+- ByT5 from Google
+
+Thank you to the authors of these models and datasets for making them available to the community.
benchmark.py
CHANGED
@@ -40,7 +40,8 @@ if __name__ == "__main__":
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-
+    # Nougat batch size 1 uses about as much VRAM as default marker settings
+    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
     parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
     args = parser.parse_args()
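Usage note: combined with the invocation shown in the README diff above, the new flag would be passed as `python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat --nougat_batch_size 2` (the batch size value here is illustrative, not a recommendation from this commit). The three positional arguments correspond to the folder of input PDFs, the `reference_folder` of reference markdown files, and the `out_file` report path defined in the argparse calls above.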
benchmark_data/.gitignore
ADDED
@@ -0,0 +1,3 @@
+latex
+pdfs
+references
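The three ignored folders match the layout the rest of this commit assumes: `latex/` holds the LaTeX sources that `latex_to_md.sh` (below) converts, `references/` receives its converted output, and `pdfs/` holds the files passed to `benchmark.py` in the README instructions.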
benchmark_data/latex_to_md.sh
ADDED
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# List all .tex files in the latex folder
+FILES=$(find latex -name "*.tex")
+
+for f in $FILES
+do
+  echo "Processing $f file..."
+  base_name=$(basename "$f" .tex)
+  out_file="references/${base_name}.md"
+
+  pandoc --wrap=none --no-highlight --strip-comments=true -s "$f" -t plain -o "$out_file"
+  # Replace non-breaking spaces
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  sed -i .bak 's/ / /g' "$out_file"
+  # Remove .bak file
+  rm "$out_file.bak"
+done
+
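As written, the script uses paths relative to the working directory, so it expects to be run from inside `benchmark_data/` with `latex/` and `references/` present. Two portability notes: `sed -i .bak` is the BSD/macOS spelling of in-place editing (GNU sed wants the suffix attached, as `sed -i.bak`), and the characters being substituted are non-breaking/unicode spaces, which may render as ordinary spaces here.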
marker/settings.py
CHANGED
@@ -11,7 +11,7 @@ class Settings(BaseSettings):
     # General
     TORCH_DEVICE: str = "cpu"
     INFERENCE_RAM: int = 40 # How much VRAM each GPU has (in GB).
-    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB)
+    VRAM_PER_TASK: float = 2.5 # How much VRAM to allocate per task (in GB). Peak marker VRAM usage is around 3GB, but avg across workers is lower.
     DEBUG: bool = False # Enable debug logging
     DEFAULT_LANG: str = "English" # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
 
@@ -57,7 +57,7 @@ class Settings(BaseSettings):
         "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]", "### "]
     NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
     NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
-    NOUGAT_BATCH_SIZE: int =
+    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu
 
     # Layout model
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
@@ -73,7 +73,7 @@ class Settings(BaseSettings):
 
     # Final editing model
     EDITOR_BATCH_SIZE: int = 4
-    EDITOR_MAX_LENGTH: int =
+    EDITOR_MAX_LENGTH: int = 2048
     EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor"
     ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
 
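One subtlety in the `NOUGAT_BATCH_SIZE` default above: because the conditional sits directly in the class body, it is evaluated once when the class is defined, against the in-class default of `TORCH_DEVICE`, so a later override of `TORCH_DEVICE` (for example via `local.env`) would not change the already-computed batch size default (though `NOUGAT_BATCH_SIZE` can still be set explicitly). A minimal sketch of that behavior, using the names from the diff but not marker's actual class:

```
# Minimal sketch of the class-body conditional default pattern, not marker's code.
class ExampleSettings:
    TORCH_DEVICE: str = "cpu"
    # Evaluated once while the class body runs, using the "cpu" default above.
    NOUGAT_BATCH_SIZE: int = 6 if TORCH_DEVICE == "cuda" else 1
    EDITOR_MAX_LENGTH: int = 2048

print(ExampleSettings.NOUGAT_BATCH_SIZE)  # prints 1
```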