---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
- multimodal
- multilingual
- document-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
datasets:
- Cognitive-Lab/nayanair-bench
model-index:
- name: ColNetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.637
      name: NDCG@5
    - type: recall_at_10
      value: 0.700
      name: Recall@10
    - type: map_at_10
      value: 0.610
      name: MAP@10
    - type: mrr_at_10
      value: 0.610
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.670
      name: NDCG@5
    - type: recall_at_10
      value: 0.764
      name: Recall@10
    - type: map_at_10
      value: 0.645
      name: MAP@10
    - type: mrr_at_10
      value: 0.686
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.551
      name: NDCG@5
    - type: recall_at_10
      value: 0.664
      name: Recall@10
    - type: map_at_10
      value: 0.445
      name: MAP@10
    - type: mrr_at_10
      value: 0.445
      name: MRR@10
---
# ColNetraEmbed

[Paper](https://arxiv.org/abs/2512.03514) · [GitHub](https://github.com/adithya-s-k/colpali) · [Model](https://huggingface.co/Cognitive-Lab/ColNetraEmbed) · [Blog](https://www.cognitivelab.in/blog/introducing-netraembed) · [Demo Space](https://huggingface.co/spaces/AdithyaSK/NetraEmbed) · [Inference Notebook](https://huggingface.co/Cognitive-Lab/ColNetraEmbed/blob/main/ColNetraEmbed_InferenceDemo.ipynb) · [Gradio Demo Notebook](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone with ColBERT-style multi-vector representations.
## Model Description
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).
- **Model Type:** Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
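
To make the late-interaction step concrete, here is a minimal sketch of MaxSim scoring for a single query-document pair. It is an illustrative re-implementation, not the library code; in practice the `score_multi_vector` helper shown in the Quick Start below handles batching and padding for you.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query against one document image.

    query_emb: (num_query_tokens, 128) -- one 128-d embedding per query token
    image_emb: (num_patches, 128)      -- one 128-d embedding per image patch
    """
    # Token-to-patch similarity matrix: (num_query_tokens, num_patches)
    sim = query_emb @ image_emb.T
    # Keep each query token's best-matching patch, then sum over query tokens
    return sim.max(dim=1).values.sum()
```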
## Paper
**[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**
## Installation
```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```
## Quick Start
```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```
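
For collections larger than a handful of pages, you would typically encode the corpus once and keep the multi-vector embeddings around for scoring incoming queries. The snippet below is a minimal sketch of that pattern, continuing from the Quick Start above; the `corpus_paths` list, batch size, and CPU offloading are illustrative assumptions to adapt to your setup.

```python
# Continues from the Quick Start above (reuses `model`, `processor`, `Image`, `torch`).
from torch.utils.data import DataLoader

# Hypothetical list of document-image paths to index
corpus_paths = ["doc_0001.jpg", "doc_0002.jpg", "doc_0003.jpg"]

def collate(paths):
    return processor.process_images([Image.open(p) for p in paths])

loader = DataLoader(corpus_paths, batch_size=4, collate_fn=collate)

# Encode the corpus once; keep the per-page multi-vector embeddings on CPU
corpus_embeddings = []
with torch.no_grad():
    for batch in loader:
        batch = batch.to(model.device)
        embeddings = model(**batch)  # (batch_size, num_patches, 128)
        corpus_embeddings.extend(torch.unbind(embeddings.to("cpu")))

# Score an incoming query against the whole corpus
batch_query = processor.process_queries(["Find the quarterly revenue table"]).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_query).to("cpu")

# score_multi_vector pads variable-length sequences and batches internally
scores = processor.score_multi_vector(
    qs=list(torch.unbind(query_embeddings)),
    ps=corpus_embeddings,
)  # (num_queries, num_documents)

top_k = scores[0].topk(min(3, len(corpus_paths)))
print([corpus_paths[i] for i in top_k.indices.tolist()])
```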
## Use Cases
- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Multilingual visual document retrieval
## Model Details
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (Late Interaction)
- **Similarity Function:** MaxSim (Maximum Similarity)
## Performance
ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.
### Benchmark Results
**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **ColNetraEmbed** | **0.551** | **0.664** | **0.445** | **0.445** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**
- **Strong multilingual performance** with ColBERT-style late interaction
- **124% improvement** over ColPali-v1.3 on cross-lingual tasks
- Supports **22 languages** across diverse script families
- **Fine-grained matching** through token-level MaxSim scoring

**Comparison: Multi-vector vs Single-vector**
- ColNetraEmbed (multi-vector): more interpretable, with token-level attribution (see the sketch below)
- NetraEmbed (single-vector): higher retrieval accuracy (0.716 vs 0.637 NDCG@5) and roughly 250x more storage-efficient

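Because every query token carries its own embedding, you can inspect which image patch each token matched best, which is where the interpretability of the multi-vector approach comes from. The sketch below illustrates this token-to-patch attribution; it is an illustration, not a library utility.

```python
import torch

def token_to_patch_attribution(query_emb: torch.Tensor, image_emb: torch.Tensor):
    """For one query-document pair, map each query token to its best-matching patch.

    query_emb: (num_query_tokens, 128), image_emb: (num_patches, 128)
    Returns the per-token max similarity and the index of the matched patch.
    """
    sim = query_emb @ image_emb.T          # (num_query_tokens, num_patches)
    best_sim, best_patch = sim.max(dim=1)  # best patch per query token
    return best_sim, best_patch
```

Patch indices can be mapped back to image regions to visualize similarity heatmaps, as in the ColPali line of work.
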
See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.
## Citation
```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
author={Adithya S Kolavi and Vyoman Jain},
year={2025},
eprint={2512.03514},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2512.03514}
}
```
## License
This model is released under the same license as the base Gemma3 model.
## Acknowledgments
Compute credits for training, inference, and evaluation were provided by [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the ColPali framework and Gemma3 architecture.