Vik Paruchuri committed on
Commit · e726f18
1 Parent(s): 76f26de
Update benchmarks
- .github/workflows/tests.yml +1 -1
- README.md +40 -55
- benchmark.py +1 -1
- chunk_convert.sh +1 -1
- marker/benchmark/scoring.py +2 -9
- marker/layout/layout.py +1 -1
- marker/ocr/detection.py +1 -1
- marker/postprocessors/markdown.py +3 -3
- poetry.lock +14 -20
- pyproject.toml +2 -4
.github/workflows/tests.yml
CHANGED
@@ -23,7 +23,7 @@ jobs:
           poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Download benchmark data
         run: |
-          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=
+          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
           unzip benchmark_data.zip
       - name: Run benchmark test
         run: |
README.md
CHANGED
@@ -1,12 +1,13 @@
 # Marker
 
-Marker converts PDF to markdown
+Marker converts PDF to markdown quickly and accurately.
 
--
+- Supports a wide range of documents (optimized for books and scientific papers)
+- Supports all languages
 - Removes headers/footers/other artifacts
-- Converts most equations to latex
 - Formats tables and code blocks
--
+- Extracts and saves images along with the markdown
+- Converts most equations to latex
 - Works on GPU, CPU, or MPS
 
 ## How it works
@@ -53,26 +54,20 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
 
 # Installation
 
-This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [
-
-First, clone the repo:
+This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
 
-
-- `cd marker`
+Install with:
 
-
-
-
-- `poetry install`
-- `poetry shell` to activate your poetry venv
-- Update pytorch since poetry doesn't play nicely with it
-- GPU only: run `pip install torch` to install other torch dependencies.
-- CPU only: Uninstall torch with `poetry remove torch`, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
+```shell
+pip install marker-pdf
+```
 
-
+## Optional
 
 Only needed if using `ocrmypdf` as the ocr backend.
 
+**Linux**
+
 - Run `pip install ocrmypdf`
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
 - Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
@@ -80,13 +75,7 @@ Only needed if using `ocrmypdf` as the ocr backend.
 - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
 - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 
-
-
-- Install python requirements
-- `poetry install`
-- `poetry shell` to activate your poetry venv
-
-**Optional**
+**Mac**
 
 Only needed if using `ocrmypdf` as the ocr backend.
 
@@ -98,21 +87,18 @@ Only needed if using `ocrmypdf` as the ocr backend.
 
 # Usage
 
-First, some configuration. Note that settings can be overridden with env vars
+First, some configuration. Note that settings can be overridden with env vars.
 
--
+- Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
+- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
 - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
 - Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
 - By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
-- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
-
 
 ## Convert a single file
 
-
-
-```
-python convert_single.py /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
+```shell
+marker_single /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
 ```
 
 - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
@@ -123,10 +109,8 @@ Make sure the `DEFAULT_LANG` setting is set appropriately for your document. Th
 
 ## Convert multiple files
 
-
-
-```
-python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
+```shell
+marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
 ```
 
 - `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
@@ -146,10 +130,8 @@ You can use language names or codes. The exact codes depend on the OCR engine.
 
 ## Convert multiple files on multiple GPUs
 
-
-
-```
-MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
+```shell
+MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -161,29 +143,27 @@ Note that the env variables above are specific to this script, and cannot be set
 
 # Benchmarks
 
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
 
-Benchmarks show that marker is
+Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
 
 **Speed**
 
 | Method | Average Score | Time per page | Time per document |
 |--------|---------------|---------------|-------------------|
-|
-|
-| nougat | 0.629211 | 3.77259 | 808.413 |
+| marker | 0.613721 | 0.631991 | 58.1432 |
+| nougat | 0.406603 | 2.59702 | 238.926 |
 
 **Accuracy**
 
 First 3 are non-arXiv books, last 3 are arXiv papers.
 
-| Method |
-
-|
-|
-| nougat | 0.696458 | 0.552337 | 0.735099 | 0.655002 | 0.645704 | 0.650282 |
+| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
+|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
+| marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
+| nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663 |
 
-Peak GPU memory usage during the benchmark is `
+Peak GPU memory usage during the benchmark is `4.2GB` for nougat, and `6.5GB` for marker. Benchmarks were run on an A6000 Ada.
 
 **Throughput**
 
@@ -193,11 +173,16 @@ Marker takes about 3GB of VRAM on average per task, so you can convert 16 docume
 
 ## Running your own benchmarks
 
-You can benchmark the performance of marker on your machine.
-
-Then run `benchmark.py` like this:
+You can benchmark the performance of marker on your machine. Install marker manually with:
 
+```shell
+git clone https://github.com/VikParuchuri/marker.git
+poetry install
 ```
+
+Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run `benchmark.py` like this:
+
+```shell
 python benchmark.py data/pdfs data/references report.json --nougat
 ```
 
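The README text in this diff says settings can be overridden with environment variables and that GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`. A minimal sketch of how those two statements could combine is below. The helper names `read_setting` and `effective_workers` are hypothetical, the fallback values 16 and 3 are placeholders borrowed from the README's examples rather than marker's real defaults, and the floor division is an assumption; marker's own logic may differ.

```python
import os


def read_setting(name, default, env=os.environ):
    # Environment variables win over the built-in fallback (placeholder defaults).
    return float(env.get(name, default))


def effective_workers(requested, env=os.environ):
    """Cap the requested worker count by how many tasks fit in GPU memory.

    Hypothetical helper for illustration only.
    """
    inference_ram = read_setting("INFERENCE_RAM", 16, env)
    vram_per_task = read_setting("VRAM_PER_TASK", 3, env)
    if vram_per_task <= 0:
        raise ValueError("VRAM_PER_TASK must be positive")
    fits_in_vram = int(inference_ram // vram_per_task)
    # Allow at least one worker so a small GPU can still make progress.
    return max(1, min(requested, fits_in_vram))
```

Under these assumptions, asking for 10 workers with 16 GB of `INFERENCE_RAM` and 3 GB per task yields 5 concurrent conversions.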
benchmark.py
CHANGED
@@ -46,7 +46,7 @@ def main():
 
     args = parser.parse_args()
 
-    methods = ["
+    methods = ["marker"]
     if args.nougat:
         methods.append("nougat")
 
chunk_convert.sh
CHANGED
@@ -35,7 +35,7 @@ for (( i=0; i<$NUM_DEVICES; i++ )); do
     export NUM_DEVICES
     export NUM_WORKERS
     echo "Running convert.py on GPU $DEVICE_NUM"
-    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM
+    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM marker $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
     [[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
     [[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
     eval $cmd &
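The script above launches one `marker` process per GPU and passes `--num_chunks` and `--chunk_idx` so that each process handles a different share of the input folder. The diff does not show how `convert.py` uses those flags, so the following is only a sketch of one plausible scheme (contiguous slices of a sorted file list). The function name `select_chunk` is hypothetical.

```python
import math


def select_chunk(files, num_chunks, chunk_idx):
    """Return the contiguous slice of `files` assigned to one chunk.

    Illustrative only; the real partitioning in convert.py is not shown in this diff.
    """
    if num_chunks < 1 or not 0 <= chunk_idx < num_chunks:
        raise ValueError("chunk_idx must be in [0, num_chunks)")
    # Ceiling division so every file lands in exactly one chunk.
    chunk_size = math.ceil(len(files) / num_chunks)
    start = chunk_idx * chunk_size
    return files[start:start + chunk_size]
```

With 10 files and `NUM_DEVICES=4`, this assigns 3, 3, 3, and 1 files to GPUs 0 through 3.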
marker/benchmark/scoring.py
CHANGED
@@ -7,12 +7,7 @@ from statistics import mean
 
 CHUNK_MIN_CHARS = 25
 
-
-def replace_alphanumeric(text):
-    return regex.sub(r'[\p{L}]', '', text)
-
-
-def chunk_text(text, chunk_len=50):
+def chunk_text(text, chunk_len=500):
     chunks = [text[i:i+chunk_len] for i in range(0, len(text), chunk_len)]
     chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
     return chunks
@@ -29,7 +24,7 @@ def overlap_score(hypothesis_chunks, reference_chunks):
         chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
         for j in chunk_range:
             ref_chunk = reference_chunks[j]
-            score = fuzz.ratio(hyp_chunk, ref_chunk, score_cutoff=
+            score = fuzz.ratio(hyp_chunk, ref_chunk, score_cutoff=30) / 100
             if score > max_score:
                 max_score = score
                 total_len = len(ref_chunk)
@@ -38,8 +33,6 @@ def overlap_score(hypothesis_chunks, reference_chunks):
 
 
 def score_text(hypothesis, reference):
-    hypothesis = replace_alphanumeric(hypothesis)
-    reference = replace_alphanumeric(reference)
     # Returns a 0-1 alignment score
     hypothesis_chunks = chunk_text(hypothesis)
     reference_chunks = chunk_text(reference)
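Only fragments of the scoring code appear in the hunks above. The sketch below shows the general idea of chunked fuzzy alignment in a self-contained form. It swaps in the standard library's `difflib.SequenceMatcher` for rapidfuzz's `fuzz.ratio`, assumes a `search_distance` of 2, and averages the best score per chunk without length weighting. All three are assumptions made for illustration, not the repository's actual choices.

```python
from difflib import SequenceMatcher

CHUNK_MIN_CHARS = 25


def chunk_text(text, chunk_len=500):
    # Fixed-size character windows; drop blank or very short chunks.
    chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
    return [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]


def similarity(a, b, score_cutoff=0.3):
    # Stand-in for fuzz.ratio(...) / 100; scores under the cutoff count as 0.
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return ratio if ratio >= score_cutoff else 0.0


def overlap_score(hypothesis_chunks, reference_chunks, search_distance=2):
    if not hypothesis_chunks or not reference_chunks:
        return 0.0
    length_ratio = len(reference_chunks) / len(hypothesis_chunks)
    best_scores = []
    for i, hyp_chunk in enumerate(hypothesis_chunks):
        # Map the hypothesis position onto the reference, then search nearby chunks.
        i_offset = int(i * length_ratio)
        start = max(0, i_offset - search_distance)
        stop = min(len(reference_chunks), i_offset + search_distance)
        best = max(
            (similarity(hyp_chunk, reference_chunks[j]) for j in range(start, stop)),
            default=0.0,
        )
        best_scores.append(best)
    return sum(best_scores) / len(best_scores)


def score_text(hypothesis, reference):
    # Returns a 0-1 alignment score.
    return overlap_score(chunk_text(hypothesis), chunk_text(reference))
```

Searching only a small window around the expected position keeps the comparison count linear in document length, at the cost of missing text that has been moved far from its original place.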
marker/layout/layout.py
CHANGED
@@ -12,7 +12,7 @@ def get_batch_size():
     if settings.LAYOUT_BATCH_SIZE is not None:
         return settings.LAYOUT_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
-        return
+        return 6
     return 6
 
 
marker/ocr/detection.py
CHANGED
@@ -12,7 +12,7 @@ def get_batch_size():
     if settings.DETECTOR_BATCH_SIZE is not None:
         return settings.DETECTOR_BATCH_SIZE
     elif settings.TORCH_DEVICE_MODEL == "cuda":
-        return
+        return 6
     return 6
 
 
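After this commit, both `get_batch_size` functions (layout and detection) return 6 whether or not the device is `cuda`, so the device check no longer changes the result. A sketch of the equivalent behavior is below. The parameter name `configured` and the constant `DEFAULT_BATCH_SIZE` are illustrative and do not appear in the repository.

```python
from typing import Optional

DEFAULT_BATCH_SIZE = 6  # value returned by both branches after this commit


def get_batch_size(configured: Optional[int]) -> int:
    # An explicit setting (e.g. LAYOUT_BATCH_SIZE) wins; otherwise use the default.
    return configured if configured is not None else DEFAULT_BATCH_SIZE
```

Keeping the `cuda` branch in the real code may still be useful as a placeholder if the GPU default is tuned separately later.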
marker/postprocessors/markdown.py
CHANGED
@@ -95,7 +95,7 @@ def block_surround(text, block_type):
 
 def line_separator(line1, line2, block_type, is_continuation=False):
     # Should cover latin-derived languages and russian
-    lowercase_letters = r'\p{Lo}
+    lowercase_letters = r'\p{Lo}|\p{Ll}|\d'
     # Remove hyphen in current line if next line and current line appear to be joined
     hyphen_pattern = regex.compile(rf'.*[{lowercase_letters}][-]\s?$', regex.DOTALL)
     if line1 and hyphen_pattern.match(line1) and regex.match(rf"^\s?[{lowercase_letters}]", line2):
@@ -103,8 +103,8 @@ def line_separator(line1, line2, block_type, is_continuation=False):
         line1 = regex.split(r"[-—]\s?$", line1)[0]
         return line1.rstrip() + line2.lstrip()
 
-    all_letters = r'\p{L}
-    sentence_continuations = r',;\(
+    all_letters = r'\p{L}|\d'
+    sentence_continuations = r',;\(\—\"\''
     sentence_ends = r'。ๆ\.?!'
     line_end_pattern = regex.compile(rf'.*[{lowercase_letters}][{sentence_continuations}]?\s?$', regex.DOTALL)
     line_start_pattern = regex.compile(rf'^\s?[{all_letters}]', regex.DOTALL)
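The first branch of `line_separator` rejoins a word that was hyphenated across a line break. A rough standalone sketch of that idea follows, using only the standard library. Because `re` does not support `\p{...}` classes, the check for "lowercase letter, other letter, or digit" is approximated with `str` methods. The function names are hypothetical, and the fallback of joining with a single space is an assumption, since the real function has further rules that are not shown in this hunk.

```python
def _is_lower_or_digit(ch):
    # Approximates \p{Lo}|\p{Ll}|\d: digits, or letters that are not uppercase.
    return ch.isdigit() or (ch.isalpha() and not ch.isupper())


def join_lines(line1, line2):
    """Join two lines, dropping a trailing hyphen when both sides look like one word."""
    left = line1.rstrip()
    right = line2.lstrip()
    if (
        len(left) >= 2
        and left[-1] == "-"
        and _is_lower_or_digit(left[-2])
        and right
        and _is_lower_or_digit(right[0])
    ):
        # "implemen-" + "tation" -> "implementation"
        return left[:-1] + right
    # Simplified fallback: separate the lines with one space.
    return f"{left} {right}".strip()
```

A line ending in a hyphen followed by a capitalized word is left alone, since that is more likely a genuine hyphen or dash than a word split.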
poetry.lock
CHANGED
@@ -2968,26 +2968,20 @@ full = ["numpy"]
 
 [[package]]
 name = "ray"
-version = "2.
+version = "2.21.0"
 description = "Ray provides a simple, universal API for building distributed applications."
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.
-    {file = "ray-2.20.0-cp311-cp311-win_amd64.whl", hash = "sha256:6ac1dcb303ddf53d2d87bc5b719e8c38f0a5efe41e175b6ba563fb65b5f4e9a2"},
-    {file = "ray-2.20.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:1de0810f77ae4a0bf055aa2bdcb161be1d6d1b67b4095e85a5b3fbb6e0dadcd2"},
-    {file = "ray-2.20.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:3f3519dd7794ead4d3e17d4570593b2a10e8db062836907517e85b4e769dec1a"},
-    {file = "ray-2.20.0-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:5a2cb9f100bbb6351372519b03ddc21d9fa6c8716621237273a59a6e250a8204"},
-    {file = "ray-2.20.0-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:64b394a6462a2ac2401b1b004f2cc7ac31e429388abf27024072a55702f1159c"},
-    {file = "ray-2.20.0-cp39-cp39-win_amd64.whl", hash = "sha256:65938f7bd28a825d90c643465ad6b1334d97d16e381c409b19269e4dcc043341"},
+    {file = "ray-2.21.0-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:17c755f2682be60713c1c5895dd1cde6be11ce66fe905ff65d1275c9fc7dfa98"},
+    {file = "ray-2.21.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e14c1ed90009438a19c80386d6f8ef5e84d85b484883893a048375e4d6af87dc"},
+    {file = "ray-2.21.0-cp310-cp310-manylinux2014_x86_64.whl", hash = "sha256:3a642e98734003be7778cea4e5089ca4f7263497e1ce3a5334069d7a791be81d"},
+    {file = "ray-2.21.0-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:f162a21e53a38b4a327d51d9a2d84b6110b6af248af3dfa8f8a2ce73d204f10d"},
+    {file = "ray-2.21.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:1521a39248fea0c5079d6ce3884316cd75f3d815d982e8da249f2e8334d2178e"},
+    {file = "ray-2.21.0-cp311-cp311-manylinux2014_x86_64.whl", hash = "sha256:b43d84c19a0081b372915f25405973275245a6b1f590a3be9ad5b367c4288352"},
+    {file = "ray-2.21.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:36cdf493dfa5a85fb0ff02d612a8a40dbeeb0ffba1c469bb89cd1fa74db4b018"},
+    {file = "ray-2.21.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:6bda45370540f2749fc355a5088cc4b0642b15e14604e6ed636dde8cb4e9e6e1"},
+    {file = "ray-2.21.0-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:a794e82656dc76f64dca8a1b78bbe5de786b62cf14ee05b678849fc8bc55c122"},
 ]
 
 [package.dependencies]
@@ -3004,9 +2998,9 @@ requests = "*"
 
 [package.extras]
 air = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "fsspec", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "numpy (>=1.20)", "opencensus", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "tensorboardX (>=1.9)", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-all = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "dm-tree", "fastapi", "fsspec", "grpcio", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "gymnasium (==0.28.1)", "lz4", "memray", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "ray-cpp (==2.
-client = ["grpcio"]
-cpp = ["ray-cpp (==2.
+all = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "dm-tree", "fastapi", "fsspec", "grpcio (!=1.56.0)", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "gymnasium (==0.28.1)", "lz4", "memray", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (>=6.0.1)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "ray-cpp (==2.21.0)", "requests", "rich", "scikit-image", "scipy", "smart-open", "starlette", "tensorboardX (>=1.9)", "typer", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
+client = ["grpcio (!=1.56.0)"]
+cpp = ["ray-cpp (==2.21.0)"]
 data = ["fsspec", "numpy (>=1.20)", "pandas (>=1.3)", "pyarrow (>=6.0.1)"]
 default = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "grpcio (>=1.32.0)", "grpcio (>=1.42.0)", "memray", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "virtualenv (>=20.0.24,!=20.21.1)"]
 observability = ["opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk"]
@@ -4319,4 +4313,4 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.9,<3.13,!=3.9.7"
-content-hash = "
+content-hash = "98e9ddb4d60035b0c7a8ce9acabf87ed17dfc91d0cb651c851277ab6f39aa6ca"
pyproject.toml
CHANGED
@@ -14,7 +14,6 @@ include = [
     "convert.py",
     "convert_single.py",
     "chunk_convert.sh",
-    "benchmark.py",
     "chunk_convert.py",
 ]
 
@@ -27,14 +26,14 @@ pydantic-settings = "^2.0.3"
 transformers = "^4.36.2" # 4.36.2 needed because issues with donut models and later versions
 numpy = "^1.26.1"
 python-dotenv = "^1.0.0"
-torch = "2.2.2" # Issue with torch 2.3.0 and vision models - https://github.com/pytorch/pytorch/issues/121834
+torch = "^2.2.2" # Issue with torch 2.3.0 and vision models - https://github.com/pytorch/pytorch/issues/121834
 ray = "^2.20.0"
 tqdm = "^4.66.1"
 tabulate = "^0.9.0"
 ftfy = "^6.1.1"
 texify = "^0.1.8"
 rapidfuzz = "^3.8.1"
-surya-ocr = "^0.4.
+surya-ocr = "^0.4.3"
 filetype = "^1.2.0"
 regex = "^2024.4.28"
 pdftext = "^0.3.7"
@@ -45,7 +44,6 @@ jupyter = "^1.0.0"
 [tool.poetry.scripts]
 marker = "convert:main"
 marker_single = "convert_single:main"
-marker_benchmark = "benchmark:main"
 marker_chunk_convert = "chunk_convert:main"
 
 [build-system]