---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
- multimodal
- multilingual
- document-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
datasets:
- Cognitive-Lab/nayanair-bench
model-index:
- name: ColNetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.637
      name: NDCG@5
    - type: recall_at_10
      value: 0.700
      name: Recall@10
    - type: map_at_10
      value: 0.610
      name: MAP@10
    - type: mrr_at_10
      value: 0.610
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.670
      name: NDCG@5
    - type: recall_at_10
      value: 0.764
      name: Recall@10
    - type: map_at_10
      value: 0.645
      name: MAP@10
    - type: mrr_at_10
      value: 0.686
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.551
      name: NDCG@5
    - type: recall_at_10
      value: 0.664
      name: Recall@10
    - type: map_at_10
      value: 0.445
      name: MAP@10
    - type: mrr_at_10
      value: 0.445
      name: MRR@10
---
# ColNetraEmbed

[Paper](https://arxiv.org/abs/2512.03514) · [GitHub](https://github.com/adithya-s-k/colpali) · [Model](https://huggingface.co/Cognitive-Lab/ColNetraEmbed) · [Blog](https://www.cognitivelab.in/blog/introducing-netraembed) · [Demo Space](https://huggingface.co/spaces/AdithyaSK/NetraEmbed) · [Inference Notebook](https://huggingface.co/Cognitive-Lab/ColNetraEmbed/blob/main/ColNetraEmbed_InferenceDemo.ipynb) · [Gradio Demo Notebook](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone with ColBERT-style multi-vector representations.
## Model Description
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).
- **Model Type:** Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
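
To make the late-interaction step concrete, here is a minimal sketch of MaxSim scoring for a single query-document pair. It is an illustrative re-implementation, not the library code; in practice the `score_multi_vector` helper shown in the Quick Start below handles batching and padding for you.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query against one document image.

    query_emb: (num_query_tokens, 128) -- one 128-d embedding per query token
    image_emb: (num_patches, 128)      -- one 128-d embedding per image patch
    """
    # Token-to-patch similarity matrix: (num_query_tokens, num_patches)
    sim = query_emb @ image_emb.T
    # Keep each query token's best-matching patch, then sum over query tokens
    return sim.max(dim=1).values.sum()
```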
## Paper
**[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**
## Installation
```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```
## Quick Start
```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```
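
For collections larger than a handful of pages, you would typically encode the corpus once and keep the multi-vector embeddings around for scoring incoming queries. The snippet below is a minimal sketch of that pattern, continuing from the Quick Start above; the `corpus_paths` list, batch size, and CPU offloading are illustrative assumptions to adapt to your setup.

```python
# Continues from the Quick Start above (reuses `model`, `processor`, `Image`, `torch`).
from torch.utils.data import DataLoader

# Hypothetical list of document-image paths to index
corpus_paths = ["doc_0001.jpg", "doc_0002.jpg", "doc_0003.jpg"]

def collate(paths):
    return processor.process_images([Image.open(p) for p in paths])

loader = DataLoader(corpus_paths, batch_size=4, collate_fn=collate)

# Encode the corpus once; keep the per-page multi-vector embeddings on CPU
corpus_embeddings = []
with torch.no_grad():
    for batch in loader:
        batch = batch.to(model.device)
        embeddings = model(**batch)  # (batch_size, num_patches, 128)
        corpus_embeddings.extend(torch.unbind(embeddings.to("cpu")))

# Score an incoming query against the whole corpus
batch_query = processor.process_queries(["Find the quarterly revenue table"]).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_query).to("cpu")

# score_multi_vector pads variable-length sequences and batches internally
scores = processor.score_multi_vector(
    qs=list(torch.unbind(query_embeddings)),
    ps=corpus_embeddings,
)  # (num_queries, num_documents)

top_k = scores[0].topk(min(3, len(corpus_paths)))
print([corpus_paths[i] for i in top_k.indices.tolist()])
```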
## Use Cases
- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Multilingual visual document retrieval
## Model Details
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (Late Interaction)
- **Similarity Function:** MaxSim (Maximum Similarity)
## Performance
ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.
### Benchmark Results
**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **ColNetraEmbed** | **0.551** | **0.664** | **0.445** | **0.445** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**
- **Strong multilingual performance** with ColBERT-style late interaction
- **124% improvement** over ColPali-v1.3 on cross-lingual tasks
- Supports **22 languages** across diverse script families
- **Fine-grained matching** through token-level MaxSim scoring

**Comparison: Multi-vector vs Single-vector**
- ColNetraEmbed (multi-vector): more interpretable, with token-level attribution (see the sketch below)
- NetraEmbed (single-vector): higher retrieval accuracy (0.716 vs 0.637 NDCG@5) and roughly 250x more storage-efficient

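Because every query token carries its own embedding, you can inspect which image patch each token matched best, which is where the interpretability of the multi-vector approach comes from. The sketch below illustrates this token-to-patch attribution; it is an illustration, not a library utility.

```python
import torch

def token_to_patch_attribution(query_emb: torch.Tensor, image_emb: torch.Tensor):
    """For one query-document pair, map each query token to its best-matching patch.

    query_emb: (num_query_tokens, 128), image_emb: (num_patches, 128)
    Returns the per-token max similarity and the index of the matched patch.
    """
    sim = query_emb @ image_emb.T          # (num_query_tokens, num_patches)
    best_sim, best_patch = sim.max(dim=1)  # best patch per query token
    return best_sim, best_patch
```

Patch indices can be mapped back to image regions to visualize similarity heatmaps, as in the ColPali line of work.
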
See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.
## Citation
```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
author={Adithya S Kolavi and Vyoman Jain},
year={2025},
eprint={2512.03514},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2512.03514}
}
```
## License
This model is released under the same license as the base Gemma3 model.
## Acknowledgments
Compute credits for training, inference, and evaluation were provided by [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the ColPali framework and Gemma3 architecture.