Vik Paruchuri committed on May 9, 2024

Commit e726f18 (1 parent: 76f26de)

Update benchmarks

Files changed (10)

.github/workflows/tests.yml +1 -1
README.md +40 -55
benchmark.py +1 -1
chunk_convert.sh +1 -1
marker/benchmark/scoring.py +2 -9
marker/layout/layout.py +1 -1
marker/ocr/detection.py +1 -1
marker/postprocessors/markdown.py +3 -3
poetry.lock +14 -20
pyproject.toml +2 -4

.github/workflows/tests.yml CHANGED

@@ -23,7 +23,7 @@ jobs:
           poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Download benchmark data
         run: |
-          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1ktVDYPEeyHlKLaF56FnHjI5VjVnYa1xL"
           unzip benchmark_data.zip
       - name: Run benchmark test
         run: |

           poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Download benchmark data
         run: |
+          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
           unzip benchmark_data.zip
       - name: Run benchmark test
         run: |

README.md CHANGED

@@ -1,12 +1,13 @@
 # Marker
-Marker converts PDF to markdown.  It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
-- Support for a range of documents (optimized for books and scientific papers)
 - Removes headers/footers/other artifacts
-- Converts most equations to latex
 - Formats tables and code blocks
-- Support for all languages (although most testing is done in English).
 - Works on GPU, CPU, or MPS
 ## How it works
@@ -53,26 +54,20 @@ PDF is a tricky format, so marker will not always work perfectly.  Here are some
 # Installation
-This has been tested on Mac and Linux (Ubuntu and Debian).  You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
-First, clone the repo:
-- `git clone https://github.com/VikParuchuri/marker.git`
-- `cd marker`
-## Linux
-- Install python requirements
-  - `poetry install`
-  - `poetry shell` to activate your poetry venv
-- Update pytorch since poetry doesn't play nicely with it
-  - GPU only: run `pip install torch` to install other torch dependencies.
-  - CPU only: Uninstall torch with `poetry remove torch`, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
-**Optional**
 Only needed if using `ocrmypdf` as the ocr backend.
 - Run `pip install ocrmypdf`
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
 - Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
@@ -80,13 +75,7 @@ Only needed if using `ocrmypdf` as the ocr backend.
   - Find the tesseract data folder `tessdata` with `find / -name tessdata`.  Make sure to use the one corresponding to the latest tesseract version if you have multiple.
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
-## Mac
-- Install python requirements
-  - `poetry install`
-  - `poetry shell` to activate your poetry venv
-**Optional**
 Only needed if using `ocrmypdf` as the ocr backend.
@@ -98,21 +87,18 @@ Only needed if using `ocrmypdf` as the ocr backend.
 # Usage
-First, some configuration.  Note that settings can be overridden with env vars, or in a `local.env` file in the root `marker` folder.
-- Your torch device will be automatically detected, but you can manually set it also.  For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
   - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU).  For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
   - Depending on your document types, marker's average memory usage per task can vary slightly.  You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
 - By default, marker will use `surya` for OCR.  Surya is slower on CPU, but more accurate than tesseract.  If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).  If you don't want OCR at all, set `OCR_ENGINE` to `None`.
-- Inspect the other settings in `marker/settings.py`.  You can override any settings in the `local.env` file, or by setting environment variables.
 ## Convert a single file
-Run `convert_single.py`, like this:
-```
-python convert_single.py /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
 ```
 - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM.  Higher numbers will take more VRAM, but process faster.  Set to 2 by default.  The default batch sizes will take ~3GB of VRAM.
@@ -123,10 +109,8 @@ Make sure the `DEFAULT_LANG` setting is set appropriately for your document.  Th
 ## Convert multiple files
-Run `convert.py`, like this:
-```
-python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
 ```
 - `--workers` is the number of pdfs to convert at once.  This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
@@ -146,10 +130,8 @@ You can use language names or codes.  The exact codes depend on the OCR engine.
 ## Convert multiple files on multiple GPUs
-Run `chunk_convert.sh`, like this:
-```
-MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
 ```
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs.  See above for the format.
@@ -161,29 +143,27 @@ Note that the env variables above are specific to this script, and cannot be set
 # Benchmarks
-Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I convert the latex to text, and compare the reference to the output of text extraction methods.
-Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).  We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
 **Speed**
 | Method | Average Score | Time per page | Time per document |
 |--------|---------------|---------------|-------------------|
-| naive  | 0.350727      | 0.00152378    | 0.326524          |
-| marker | 0.641062      | 0.360622      | 77.2762           |
-| nougat | 0.629211      | 3.77259       | 808.413           |
 **Accuracy**
 First 3 are non-arXiv books, last 3 are arXiv papers.
-| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
-|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
-| naive  | 0.244114         | 0.140669  | 0.0868221       | 0.366856    | 0.412521     | 0.468281        |
-| marker | 0.482091         | 0.466882  | 0.537062        | 0.754347    | 0.78825      | 0.779536        |
-| nougat | 0.696458         | 0.552337  | 0.735099        | 0.655002    | 0.645704     | 0.650282        |
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker.  Benchmarks were run on an A6000.
 **Throughput**
@@ -193,11 +173,16 @@ Marker takes about 3GB of VRAM on average per task, so you can convert 16 docume
 ## Running your own benchmarks
-You can benchmark the performance of marker on your machine.  First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
-Then run `benchmark.py` like this:
 ```
 python benchmark.py data/pdfs data/references report.json --nougat
 ```

 # Marker
+Marker converts PDF to markdown quickly and accurately.
+- Supports a wide range of documents (optimized for books and scientific papers)
+- Supports all languages
 - Removes headers/footers/other artifacts
 - Formats tables and code blocks
+- Extracts and saves images along with the markdown
+- Converts most equations to latex
 - Works on GPU, CPU, or MPS
 ## How it works
 # Installation
+This has been tested on Mac and Linux (Ubuntu and Debian).  You'll need python 3.9+ and PyTorch.  You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine.  See [here](https://pytorch.org/get-started/locally/) for more details.
+Install with:
+```shell
+pip install marker-pdf
+```
+## Optional
 Only needed if using `ocrmypdf` as the ocr backend.
+**Linux**
 - Run `pip install ocrmypdf`
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
 - Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
   - Find the tesseract data folder `tessdata` with `find / -name tessdata`.  Make sure to use the one corresponding to the latest tesseract version if you have multiple.
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
+**Mac**
 Only needed if using `ocrmypdf` as the ocr backend.
 # Usage
+First, some configuration.  Note that settings can be overridden with env vars.
+- Inspect the settings in `marker/settings.py`.  You can override any settings with environment variables.
+- Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
   - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU).  For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
   - Depending on your document types, marker's average memory usage per task can vary slightly.  You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
 - By default, marker will use `surya` for OCR.  Surya is slower on CPU, but more accurate than tesseract.  If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).  If you don't want OCR at all, set `OCR_ENGINE` to `None`.
 ## Convert a single file
+```shell
+marker_single /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
 ```
 - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM.  Higher numbers will take more VRAM, but process faster.  Set to 2 by default.  The default batch sizes will take ~3GB of VRAM.
 ## Convert multiple files
+```shell
+marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
 ```
 - `--workers` is the number of pdfs to convert at once.  This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
 ## Convert multiple files on multiple GPUs
+```shell
+MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 ```
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs.  See above for the format.
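
The `--workers` bullet under "Convert multiple files" says GPU parallelism will not exceed `INFERENCE_RAM / VRAM_PER_TASK`. A minimal sketch of that cap, assuming the quotient is floored. `effective_workers` is a hypothetical helper (not part of marker), and the `VRAM_PER_TASK` value of 3 is only illustrative, borrowed from the README's "~3GB" figure for default batch sizes rather than from a confirmed default:

```python
def effective_workers(requested_workers, inference_ram_gb, vram_per_task_gb):
    """Cap the requested worker count by how many tasks fit in GPU memory.

    Assumption: the cap is floor(INFERENCE_RAM / VRAM_PER_TASK).
    """
    gpu_cap = int(inference_ram_gb // vram_per_task_gb)
    return min(requested_workers, gpu_cap)


# 16 GB of VRAM with roughly 3 GB per task leaves room for 5 tasks,
# so asking for 10 workers would still run 5 at a time.
print(effective_workers(10, 16, 3))
```

Under this reading, raising `--workers` past the cap changes nothing; lowering `VRAM_PER_TASK` or using a larger GPU is what moves it.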
 # Benchmarks
+Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I convert the latex to text, and compare the reference to the output of text extraction methods.  It's noisy, but at least directionally correct.
+Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).  We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
 **Speed**
 | Method | Average Score | Time per page | Time per document |
 |--------|---------------|---------------|-------------------|
+| marker | 0.613721      | 0.631991      | 58.1432           |
+| nougat | 0.406603      | 2.59702       | 238.926           |
 **Accuracy**
 First 3 are non-arXiv books, last 3 are arXiv papers.
+| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
+|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
+| marker | 0.536176        | 0.516833         | 0.70515         | 0.710657    | 0.690042     | 0.523467  |
+| nougat | 0.44009         | 0.588973         | 0.322706        | 0.401342    | 0.160842     | 0.525663  |
+Peak GPU memory usage during the benchmark is `4.2GB` for nougat, and `6.5GB` for marker.  Benchmarks were run on an A6000 Ada.
 **Throughput**
 ## Running your own benchmarks
+You can benchmark the performance of marker on your machine. Install marker manually with:
+```shell
+git clone https://github.com/VikParuchuri/marker.git
+poetry install
 ```
+Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run `benchmark.py` like this:
+```shell
 python benchmark.py data/pdfs data/references report.json --nougat
 ```
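
The README's headline speed claim moves from "10x faster than nougat" to "4x faster" in this commit. A quick check of both figures against the "Time per page" columns of the old and new Speed tables (the numbers are copied from the diff; the `speedup` helper is only for illustration):

```python
# Seconds per page, taken from the Speed tables in the old and new README.
old_times = {"marker": 0.360622, "nougat": 3.77259}
new_times = {"marker": 0.631991, "nougat": 2.59702}


def speedup(times):
    """How many times faster marker is than nougat, per page."""
    return times["nougat"] / times["marker"]


print(f"old README: {speedup(old_times):.1f}x")
print(f"new README: {speedup(new_times):.1f}x")
```

Both ratios round to the figures quoted in the prose, so the change in wording tracks the updated measurements.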

benchmark.py CHANGED

@@ -46,7 +46,7 @@ def main():
     args = parser.parse_args()
-    methods = ["naive", "marker"]
     if args.nougat:
         methods.append("nougat")

     args = parser.parse_args()
+    methods = ["marker"]
     if args.nougat:
         methods.append("nougat")

chunk_convert.sh CHANGED

@@ -35,7 +35,7 @@ for (( i=0; i<$NUM_DEVICES; i++ )); do
     export NUM_DEVICES
     export NUM_WORKERS
     echo "Running convert.py on GPU $DEVICE_NUM"
-    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM python convert.py $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
     [[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
     [[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
     eval $cmd &

     export NUM_DEVICES
     export NUM_WORKERS
     echo "Running convert.py on GPU $DEVICE_NUM"
+    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM marker $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
     [[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
     [[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
     eval $cmd &
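
For readers who prefer Python to shell, the per-device command construction in this hunk can be sketched as below. This is a rough port, not code from the repo: `build_commands` is a hypothetical name, and it assumes `DEVICE_NUM` is simply the loop index `i`, which the hunk does not show.

```python
def build_commands(input_folder, output_folder, num_devices, num_workers,
                   metadata_file=None, min_length=None):
    """Build one `marker` invocation per GPU, mirroring the shell loop."""
    commands = []
    for device_num in range(num_devices):  # assumption: DEVICE_NUM == i
        cmd = (
            f"CUDA_VISIBLE_DEVICES={device_num} marker {input_folder} {output_folder} "
            f"--num_chunks {num_devices} --chunk_idx {device_num} --workers {num_workers}"
        )
        # Optional flags are appended only when the value is set,
        # like the `[[ -n ... ]]` guards in the script.
        if metadata_file:
            cmd += f" --metadata_file {metadata_file}"
        if min_length:
            cmd += f" --min_length {min_length}"
        commands.append(cmd)
    return commands


for line in build_commands("../pdf_in", "../md_out", 2, 15, min_length=10000):
    print(line)
```

The only functional change in the commit is the executable: each command now calls the installed `marker` entry point instead of `python convert.py`.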

marker/benchmark/scoring.py CHANGED

@@ -7,12 +7,7 @@ from statistics import mean
 CHUNK_MIN_CHARS = 25
-def replace_alphanumeric(text):
-    return regex.sub(r'[\p{L}]', '', text)
-def chunk_text(text, chunk_len=50):
     chunks = [text[i:i+chunk_len] for i in range(0, len(text), chunk_len)]
     chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
     return chunks
@@ -29,7 +24,7 @@ def overlap_score(hypothesis_chunks, reference_chunks):
         chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
         for j in chunk_range:
             ref_chunk = reference_chunks[j]
-            score = fuzz.ratio(hyp_chunk, ref_chunk, score_cutoff=50) / 100
             if score > max_score:
                 max_score = score
                 total_len = len(ref_chunk)
@@ -38,8 +33,6 @@ def overlap_score(hypothesis_chunks, reference_chunks):
 def score_text(hypothesis, reference):
-    hypothesis = replace_alphanumeric(hypothesis)
-    reference = replace_alphanumeric(reference)
     # Returns a 0-1 alignment score
     hypothesis_chunks = chunk_text(hypothesis)
     reference_chunks = chunk_text(reference)

 CHUNK_MIN_CHARS = 25
+def chunk_text(text, chunk_len=500):
     chunks = [text[i:i+chunk_len] for i in range(0, len(text), chunk_len)]
     chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
     return chunks
         chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
         for j in chunk_range:
             ref_chunk = reference_chunks[j]
+            score = fuzz.ratio(hyp_chunk, ref_chunk, score_cutoff=30) / 100
             if score > max_score:
                 max_score = score
                 total_len = len(ref_chunk)
 def score_text(hypothesis, reference):
     # Returns a 0-1 alignment score
     hypothesis_chunks = chunk_text(hypothesis)
     reference_chunks = chunk_text(reference)
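
To get a feel for the `chunk_len` change from 50 to 500, here is `chunk_text` as it appears in the diff, run on a synthetic string (the sample text is made up for illustration):

```python
CHUNK_MIN_CHARS = 25


def chunk_text(text, chunk_len=500):
    # Split into fixed-width windows, then drop blank or very short pieces.
    chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
    chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
    return chunks


sample = "word " * 240  # 1200 characters of filler text

old_chunks = chunk_text(sample, chunk_len=50)   # previous default
new_chunks = chunk_text(sample, chunk_len=500)  # new default
print(len(old_chunks), len(new_chunks))

# A trailing piece of 25 characters or fewer is discarded.
print(len(chunk_text("a" * 520)))
```

Each comparison now covers ten times as much text. The `score_cutoff` drop from 50 to 30 goes in the same direction: assuming `fuzz` is rapidfuzz's module, `fuzz.ratio` returns 0 for any pair scoring below the cutoff, so a lower cutoff lets weaker matches contribute a nonzero score.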

marker/layout/layout.py CHANGED

@@ -12,7 +12,7 @@ def get_batch_size():
     if settings.LAYOUT_BATCH_SIZE is not None:
         return settings.LAYOUT_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
-        return 12
     return 6

     if settings.LAYOUT_BATCH_SIZE is not None:
         return settings.LAYOUT_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
+        return 6
     return 6

marker/ocr/detection.py CHANGED

@@ -12,7 +12,7 @@ def get_batch_size():
     if settings.DETECTOR_BATCH_SIZE is not None:
         return settings.DETECTOR_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
-        return 12
     return 6

     if settings.DETECTOR_BATCH_SIZE is not None:
         return settings.DETECTOR_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
+        return 6
     return 6
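
After this change the `cuda` branch and the fallback return the same value in both `layout.py` and `detection.py`, so only an explicit setting changes the batch size. A self-contained sketch of the layout version, with a hypothetical `Settings` dataclass standing in for marker's real settings object:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:  # hypothetical stand-in for marker's settings
    LAYOUT_BATCH_SIZE: Optional[int] = None
    TORCH_DEVICE_MODEL: str = "cpu"


def get_batch_size(settings: Settings) -> int:
    # An explicit override always wins.
    if settings.LAYOUT_BATCH_SIZE is not None:
        return settings.LAYOUT_BATCH_SIZE
    # Post-commit, CUDA no longer gets a larger default (was 12).
    elif settings.TORCH_DEVICE_MODEL == "cuda":
        return 6
    return 6


print(get_batch_size(Settings(TORCH_DEVICE_MODEL="cuda")))
print(get_batch_size(Settings(LAYOUT_BATCH_SIZE=12)))
```

Anyone who relied on the old CUDA default of 12 can presumably restore it by setting `LAYOUT_BATCH_SIZE` (and `DETECTOR_BATCH_SIZE` for detection) explicitly.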

marker/postprocessors/markdown.py CHANGED

@@ -95,7 +95,7 @@ def block_surround(text, block_type):
 def line_separator(line1, line2, block_type, is_continuation=False):
     # Should cover latin-derived languages and russian
-    lowercase_letters = r'\p{Lo}+|\d+'
     # Remove hyphen in current line if next line and current line appear to be joined
     hyphen_pattern = regex.compile(rf'.*[{lowercase_letters}][-]\s?$', regex.DOTALL)
     if line1 and hyphen_pattern.match(line1) and regex.match(rf"^\s?[{lowercase_letters}]", line2):
@@ -103,8 +103,8 @@ def line_separator(line1, line2, block_type, is_continuation=False):
         line1 = regex.split(r"[-—]\s?$", line1)[0]
         return line1.rstrip() + line2.lstrip()
-    all_letters = r'\p{L}+|\d+'
-    sentence_continuations = r',;\(\—'
     sentence_ends = r'。ๆ\.?!'
     line_end_pattern = regex.compile(rf'.*[{lowercase_letters}][{sentence_continuations}]?\s?$', regex.DOTALL)
     line_start_pattern = regex.compile(rf'^\s?[{all_letters}]', regex.DOTALL)

 def line_separator(line1, line2, block_type, is_continuation=False):
     # Should cover latin-derived languages and russian
+    lowercase_letters = r'\p{Lo}|\p{Ll}|\d'
     # Remove hyphen in current line if next line and current line appear to be joined
     hyphen_pattern = regex.compile(rf'.*[{lowercase_letters}][-]\s?$', regex.DOTALL)
     if line1 and hyphen_pattern.match(line1) and regex.match(rf"^\s?[{lowercase_letters}]", line2):
         line1 = regex.split(r"[-—]\s?$", line1)[0]
         return line1.rstrip() + line2.lstrip()
+    all_letters = r'\p{L}|\d'
+    sentence_continuations = r',;\(\—\"\''
     sentence_ends = r'。ๆ\.?!'
     line_end_pattern = regex.compile(rf'.*[{lowercase_letters}][{sentence_continuations}]?\s?$', regex.DOTALL)
     line_start_pattern = regex.compile(rf'^\s?[{all_letters}]', regex.DOTALL)
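
One detail worth noting in this hunk: `lowercase_letters` and `all_letters` are interpolated inside square brackets, where `+` and `|` are literal set members rather than operators. The sketch below illustrates that with the standard-library `re` module, using `a-z` as a stand-in for the `\p{...}` properties (which `re` does not support). It assumes the `regex` package treats these characters the same way inside a character class in its default mode.

```python
import re

# Stand-ins for the old and new fragments from the diff.
old_fragment = r'a-z+|\d+'  # mirrors r'\p{Lo}+|\d+'
new_fragment = r'a-z|\d'    # mirrors r'\p{Lo}|\p{Ll}|\d'

old_class = re.compile(rf'[{old_fragment}]')
new_class = re.compile(rf'[{new_fragment}]')

# Inside [...], '+' is just another character the class can match.
plus_in_old = old_class.fullmatch('+') is not None
plus_in_new = new_class.fullmatch('+') is not None
# The '|' separators are still literal members after the change.
pipe_in_new = new_class.fullmatch('|') is not None

print(plus_in_old, plus_in_new, pipe_in_new)
```

If that assumption holds, dropping the `+` quantifiers stops the class from matching a literal plus sign, while a literal `|` would still be accepted as a "letter" by these patterns.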

poetry.lock CHANGED

@@ -2968,26 +2968,20 @@ full = ["numpy"]
 [[package]]
 name = "ray"
-version = "2.20.0"
 description = "Ray provides a simple, universal API for building distributed applications."
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "ray-2.20.0-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:8855a5df8b3e6b8bcb5582a8491c50d0237e70751f941e8978bd6408245b7838"},
-    {file = "ray-2.20.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:c0566b28c75aad1d47b9403c3901a85db586ce7191fdc6978e07ad56e80bf82b"},
-    {file = "ray-2.20.0-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:738c68f4114754f846b3d03b730b42a6468f8b54665732da9f9108aa1d3ecbe3"},
-    {file = "ray-2.20.0-cp310-cp310-manylinux2014_x86_64.whl", hash = "sha256:2c7f8cd468cbba009d7ebd8a8da66026aeb520f7f4183dd6f49419d75bc84415"},
-    {file = "ray-2.20.0-cp310-cp310-win_amd64.whl", hash = "sha256:611d34d0c659652a38ef482a82dfc362074984617765e1d5a414337e4f914cfd"},
-    {file = "ray-2.20.0-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:f7816767e644014f65afbfceb6adfb08c15784a4227aa331b28ac90d1b757a58"},
-    {file = "ray-2.20.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8e98df29fd6dac52c87c1f5be5ad99601a8955eaabe921e5cab29b27775250ce"},
-    {file = "ray-2.20.0-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:e84ddad1521e06c91fc641f2b856d33ca2bfa314784172862c41a5184e0e760b"},
-    {file = "ray-2.20.0-cp311-cp311-manylinux2014_x86_64.whl", hash = "sha256:d9b13815fae5c9a68c9a02f21e1c49c58a5bb6565cb9ed5d48571cacce7568f2"},
-    {file = "ray-2.20.0-cp311-cp311-win_amd64.whl", hash = "sha256:6ac1dcb303ddf53d2d87bc5b719e8c38f0a5efe41e175b6ba563fb65b5f4e9a2"},
-    {file = "ray-2.20.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:1de0810f77ae4a0bf055aa2bdcb161be1d6d1b67b4095e85a5b3fbb6e0dadcd2"},
-    {file = "ray-2.20.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:3f3519dd7794ead4d3e17d4570593b2a10e8db062836907517e85b4e769dec1a"},
-    {file = "ray-2.20.0-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:5a2cb9f100bbb6351372519b03ddc21d9fa6c8716621237273a59a6e250a8204"},
-    {file = "ray-2.20.0-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:64b394a6462a2ac2401b1b004f2cc7ac31e429388abf27024072a55702f1159c"},
-    {file = "ray-2.20.0-cp39-cp39-win_amd64.whl", hash = "sha256:65938f7bd28a825d90c643465ad6b1334d97d16e381c409b19269e4dcc043341"},
 ]
 [package.dependencies]
@@ -3004,9 +2998,9 @@ requests = "*"
 [package.extras]
 air = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "fsspec", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "numpy (>=1.20)", "opencensus", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "tensorboardX (>=1.9)", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-all = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "dm-tree", "fastapi", "fsspec", "grpcio", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "gymnasium (==0.28.1)", "lz4", "memray", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "ray-cpp (==2.20.0)", "requests", "rich", "scikit-image", "scipy", "smart-open", "starlette", "tensorboardX (>=1.9)", "typer", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-client = ["grpcio"]
-cpp = ["ray-cpp (==2.20.0)"]
 data = ["fsspec", "numpy (>=1.20)", "pandas (>=1.3)", "pyarrow (>=6.0.1)"]
 default = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "virtualenv (>=20.0.24,!=20.21.1)"]
 observability = ["opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk"]
@@ -4319,4 +4313,4 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.9,<3.13,!=3.9.7"
-content-hash = "09525ba80729de3a24267505ece01a9d46cafc8b8b5d06c9df04e550da7b255b"

 [[package]]
 name = "ray"
+version = "2.21.0"
 description = "Ray provides a simple, universal API for building distributed applications."
 optional = false
 python-versions = ">=3.8"
 files = [
+    {file = "ray-2.21.0-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:17c755f2682be60713c1c5895dd1cde6be11ce66fe905ff65d1275c9fc7dfa98"},
+    {file = "ray-2.21.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e14c1ed90009438a19c80386d6f8ef5e84d85b484883893a048375e4d6af87dc"},
+    {file = "ray-2.21.0-cp310-cp310-manylinux2014_x86_64.whl", hash = "sha256:3a642e98734003be7778cea4e5089ca4f7263497e1ce3a5334069d7a791be81d"},
+    {file = "ray-2.21.0-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:f162a21e53a38b4a327d51d9a2d84b6110b6af248af3dfa8f8a2ce73d204f10d"},
+    {file = "ray-2.21.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:1521a39248fea0c5079d6ce3884316cd75f3d815d982e8da249f2e8334d2178e"},
+    {file = "ray-2.21.0-cp311-cp311-manylinux2014_x86_64.whl", hash = "sha256:b43d84c19a0081b372915f25405973275245a6b1f590a3be9ad5b367c4288352"},
+    {file = "ray-2.21.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:36cdf493dfa5a85fb0ff02d612a8a40dbeeb0ffba1c469bb89cd1fa74db4b018"},
+    {file = "ray-2.21.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:6bda45370540f2749fc355a5088cc4b0642b15e14604e6ed636dde8cb4e9e6e1"},
+    {file = "ray-2.21.0-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:a794e82656dc76f64dca8a1b78bbe5de786b62cf14ee05b678849fc8bc55c122"},
 ]
 [package.dependencies]
 [package.extras]
 air = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "fsspec", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "numpy (>=1.20)", "opencensus", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "tensorboardX (>=1.9)", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
+all = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "dm-tree", "fastapi", "fsspec", "grpcio (!=1.56.0)", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "gymnasium (==0.28.1)", "lz4", "memray", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "ray-cpp (==2.21.0)", "requests", "rich", "scikit-image", "scipy", "smart-open", "starlette", "tensorboardX (>=1.9)", "typer", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
+client = ["grpcio (!=1.56.0)"]
+cpp = ["ray-cpp (==2.21.0)"]
 data = ["fsspec", "numpy (>=1.20)", "pandas (>=1.3)", "pyarrow (>=6.0.1)"]
 default = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "virtualenv (>=20.0.24,!=20.21.1)"]
 observability = ["opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.9,<3.13,!=3.9.7"
+content-hash = "98e9ddb4d60035b0c7a8ce9acabf87ed17dfc91d0cb651c851277ab6f39aa6ca"

pyproject.toml CHANGED

@@ -14,7 +14,6 @@ include = [
     "convert.py",
     "convert_single.py",
     "chunk_convert.sh",
-    "benchmark.py",
     "chunk_convert.py",
 ]
@@ -27,14 +26,14 @@ pydantic-settings = "^2.0.3"
 transformers = "^4.36.2" # 4.36.2 needed because issues with donut models and later versions
 numpy = "^1.26.1"
 python-dotenv = "^1.0.0"
-torch = "2.2.2" # Issue with torch 2.3.0 and vision models - https://github.com/pytorch/pytorch/issues/121834
 ray = "^2.20.0"
 tqdm = "^4.66.1"
 tabulate = "^0.9.0"
 ftfy = "^6.1.1"
 texify = "^0.1.8"
 rapidfuzz = "^3.8.1"
-surya-ocr = "^0.4.0"
 filetype = "^1.2.0"
 regex = "^2024.4.28"
 pdftext = "^0.3.7"
@@ -45,7 +44,6 @@ jupyter = "^1.0.0"
 [tool.poetry.scripts]
 marker = "convert:main"
 marker_single = "convert_single:main"
-marker_benchmark = "benchmark:main"
 marker_chunk_convert = "chunk_convert:main"
 [build-system]

     "convert.py",
     "convert_single.py",
     "chunk_convert.sh",
     "chunk_convert.py",
 ]
 transformers = "^4.36.2" # 4.36.2 needed because issues with donut models and later versions
 numpy = "^1.26.1"
 python-dotenv = "^1.0.0"
+torch = "^2.2.2" # Issue with torch 2.3.0 and vision models - https://github.com/pytorch/pytorch/issues/121834
 ray = "^2.20.0"
 tqdm = "^4.66.1"
 tabulate = "^0.9.0"
 ftfy = "^6.1.1"
 texify = "^0.1.8"
 rapidfuzz = "^3.8.1"
+surya-ocr = "^0.4.3"
 filetype = "^1.2.0"
 regex = "^2024.4.28"
 pdftext = "^0.3.7"
 [tool.poetry.scripts]
 marker = "convert:main"
 marker_single = "convert_single:main"
 marker_chunk_convert = "chunk_convert:main"
 [build-system]