# OCR Utilities

This directory contains utility modules for the Historical OCR project.

## PDF OCR Processing

The `pdf_ocr.py` module provides specialized functionality for processing PDF documents with OCR.

### Features

- **Robust PDF-to-Image Conversion**: Converts PDF documents to images using optimized settings before OCR processing
- **Multi-Page Support**: Handles multi-page documents and allows processing of specific pages or page ranges
- **Memory-Efficient Processing**: Processes PDFs in batches to prevent memory issues with large documents
- **Fallback Mechanism**: Falls back to the internal PDF processing in `structured_ocr.py` if direct conversion fails
- **Cleanup Management**: Automatically cleans up temporary files after processing
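
The batching idea can be illustrated with a small helper that splits a document into page ranges. This is a minimal sketch, not the module's actual code: the name `page_batches` and the default batch size of 5 are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document.

    Hypothetical helper: the real module may batch pages differently.
    """
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, total_pages)
```

Each range could then be passed to `pdf2image.convert_from_path(pdf_path, first_page=first, last_page=last)`, so that only one batch of page images is held in memory at a time.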

### Key Components

- **PDFOCR**: Main class for processing PDF files with OCR
- **PDFConversionResult**: Helper class that holds PDF conversion results and manages cleanup
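
To show what "holds conversion results and manages cleanup" could look like, here is a rough sketch of a result holder that owns a temporary directory. The class name `ConversionResultSketch` and its fields are hypothetical; the real `PDFConversionResult` may be structured quite differently.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ConversionResultSketch:
    """Hypothetical stand-in: tracks page images stored in a temporary directory."""

    temp_dir: Path = field(
        default_factory=lambda: Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
    )
    image_paths: List[Path] = field(default_factory=list)

    def cleanup(self) -> None:
        # Remove the temporary directory and its contents; safe to call twice.
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        self.image_paths.clear()

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Clean up even if OCR raised an exception inside the with-block.
        self.cleanup()
```

Using the holder as a context manager ties cleanup to the end of the `with` block, which is one way to make sure temporary images are removed even when processing fails.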

### Basic Usage

```python
from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
```

### Command Line Usage

The module can also be used directly from the command line:

```bash
python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
```
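
For readers curious how these flags might map onto the `process_pdf` parameters, the following is a sketch of an `argparse` setup using only the flags shown above. The function names `build_parser` and `parse_pages` are assumptions; the actual argument handling in `pdf_ocr.py` may differ.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as "1,3,5" into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the flags from the examples above.
    parser = argparse.ArgumentParser(description="Process a PDF file with OCR")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", help="Path for the JSON results")
    parser.add_argument("--max-pages", type=int, default=None)
    parser.add_argument("--pages", type=parse_pages, default=None)
    parser.add_argument("--prompt", default=None)
    # --no-vision flips use_vision from its default of True to False.
    parser.add_argument("--no-vision", dest="use_vision", action="store_false")
    return parser
```

With this layout, `args.pages` and `args.use_vision` could be passed to `process_pdf` as `custom_pages` and `use_vision` respectively.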

### How It Works

1. The module first attempts to convert the PDF to images using `pdf2image`
2. It processes the first page with the vision model (if requested) for detailed analysis
3. Additional pages are processed with the text model for efficiency
4. All text is combined into a single result with appropriate metadata
5. If the direct conversion in step 1 fails, the module falls back to `structured_ocr.py` to process the PDF
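
The steps above can be sketched as a single function. This is only an illustration of the control flow: `run_pipeline`, the injected callables, and the keys of the returned dictionary are all assumptions, not the module's real interface.

```python
from typing import Callable, Dict, List, Sequence


def run_pipeline(
    pdf_path: str,
    convert: Callable[[str], Sequence[str]],
    ocr_vision: Callable[[str], str],
    ocr_text: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    """Hypothetical outline of the five steps; all callables are supplied by the caller."""
    try:
        # Step 1: convert the PDF into page images.
        images = convert(pdf_path)
    except Exception:
        # Step 5: direct conversion failed, so hand the whole PDF to the fallback.
        return fallback(pdf_path)

    texts: List[str] = []
    for index, image in enumerate(images):
        # Steps 2-3: vision model for the first page only (if requested),
        # text model for every other page.
        if use_vision and index == 0:
            texts.append(ocr_vision(image))
        else:
            texts.append(ocr_text(image))

    # Step 4: combine all page text into one result with simple metadata.
    return {
        "text": "\n\n".join(texts),
        "pages_processed": len(texts),
        "used_fallback": False,
    }
```

Passing the converter and OCR functions in as arguments keeps the sketch independent of any particular OCR backend and makes the fallback path easy to exercise in isolation.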

### Parameters

- **pdf_path**: Path to the PDF file to process
- **use_vision**: Whether to use the vision model for improved analysis (default: True)
- **max_pages**: Maximum number of pages to process (default: all pages)
- **custom_pages**: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
- **custom_prompt**: Custom instructions for OCR processing
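
How `max_pages` and `custom_pages` interact is not spelled out here. The sketch below shows one plausible resolution, assuming that `custom_pages` takes precedence, out-of-range and duplicate page numbers are dropped, and `max_pages` then caps the result. The helper name `select_pages` and these rules are assumptions, not confirmed behavior of `process_pdf`.

```python
from typing import List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 1-based page numbers to process (hypothetical rules)."""
    if custom_pages:
        pages: List[int] = []
        seen = set()
        for page in custom_pages:
            # Keep only valid page numbers, preserving the caller's order.
            if 1 <= page <= total_pages and page not in seen:
                seen.add(page)
                pages.append(page)
    else:
        pages = list(range(1, total_pages + 1))

    if max_pages is not None:
        # Cap the number of pages; a negative cap is treated as zero.
        pages = pages[: max(max_pages, 0)]
    return pages
```

For example, under these assumptions a ten-page document with `max_pages=3` resolves to the first three pages, while `custom_pages=[1, 3, 5]` resolves to exactly those pages.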