AdithyaSK committed
Commit 7d37ff2 · verified
1 Parent(s): 56d5d23

Update README.md

Files changed (1)
  1. README.md +114 -2
README.md CHANGED
@@ -1,7 +1,6 @@
  ---
  language:
  - en
- - multilingual
  license: gemma
  library_name: transformers
  tags:
@@ -10,6 +9,119 @@ tags:
  - colbert
  - late-interaction
  pipeline_tag: visual-document-retrieval
+ base_model:
+ - google/gemma-3-4b-it
  ---

- # ColNetraEmbed
+ # ColNetraEmbed
+
+ **ColNetraEmbed** is a state-of-the-art multilingual, multimodal embedding model for visual document retrieval, built on a Gemma3 backbone and using ColBERT-style multi-vector representations.
+
+ ## Model Description
+
+ ColNetraEmbed encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim). A minimal sketch of this scoring step follows the list below.
+
+ - **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
+ - **Architecture:** ColPali-style late interaction with a Gemma3-4B backbone (google/gemma-3-4b-it)
+ - **Embedding Dimension:** 128 per token
+ - **Capabilities:** Multilingual, multimodal (vision + text), multi-vector late interaction
+ - **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
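+
+ To make the late-interaction idea concrete, here is a minimal sketch of MaxSim between one query and one document. It is an illustration only, not the library implementation; in practice use `processor.score_multi_vector` as shown in the Quick Start below. The function name `maxsim_score` and the toy tensor shapes are illustrative assumptions.
+
+ ```python
+ import torch
+
+ def maxsim_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
+     """Late-interaction (MaxSim) score for one query against one document.
+
+     query_embeddings: (num_query_tokens, dim), doc_embeddings: (num_patches, dim).
+     Assumes both sets of vectors are already L2-normalized.
+     """
+     sim = query_embeddings @ doc_embeddings.T  # (num_query_tokens, num_patches)
+     # Each query token keeps its best-matching patch; per-token maxima are summed.
+     return sim.max(dim=-1).values.sum()
+
+ # Toy example with random 128-dim vectors (the model's embedding dimension)
+ q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
+ d = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
+ print(maxsim_score(q, d))
+ ```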
+
+ ## Paper
+
+ 📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**
+
+ ## Installation
+
+ ```bash
+ pip install git+https://github.com/adithya-s-k/colpali.git
+ ```
+
+ ## Quick Start
+
+ ```python
+ import torch
+ from PIL import Image
+ from colpali_engine.models import ColGemma3, ColGemmaProcessor3
+
+ # Load model and processor
+ model_name = "Cognitive-Lab/ColNetraEmbed"
+ model = ColGemma3.from_pretrained(
+     model_name,
+     dtype=torch.bfloat16,
+     device_map="cuda",
+ )
+ processor = ColGemmaProcessor3.from_pretrained(model_name)
+
+ # Load your images
+ images = [
+     Image.open("document1.jpg"),
+     Image.open("document2.jpg"),
+ ]
+
+ # Define queries
+ queries = [
+     "What is the total revenue?",
+     "Show me the organizational chart",
+ ]
+
+ # Process and encode
+ batch_images = processor.process_images(images).to(model.device)
+ batch_queries = processor.process_queries(queries).to(model.device)
+
+ with torch.no_grad():
+     image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
+     query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)
+
+ # Compute similarity scores using MaxSim
+ scores = processor.score_multi_vector(
+     qs=query_embeddings,
+     ps=image_embeddings,
+ )  # Shape: (num_queries, num_images)
+
+ # Get best matches
+ for i, query in enumerate(queries):
+     best_idx = scores[i].argmax().item()
+     print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
+ ```
+
+ ## Use Cases
+
+ - **Document Retrieval:** Search through large collections of visual documents (a small retrieval sketch follows this list)
+ - **Visual Question Answering:** Retrieve the document pages needed to answer questions about their content
+ - **Document Understanding:** Extract and match information from scanned documents
+ - **Cross-lingual Document Search:** Multilingual visual document retrieval
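+
+ For the first use case, the sketch below scores one query against a small folder of page images and keeps the top-k results. It reuses the `model` and `processor` objects loaded in the Quick Start; the `pages/` directory, the example query, the batch size, and `k=3` are illustrative assumptions rather than part of the model card.
+
+ ```python
+ from pathlib import Path
+ from PIL import Image
+ import torch
+
+ # Assumes `model` and `processor` are already loaded as in the Quick Start.
+ page_paths = sorted(Path("pages").glob("*.jpg"))  # hypothetical corpus folder
+ doc_embeddings = []
+ for start in range(0, len(page_paths), 4):  # batch size of 4 is arbitrary
+     batch = [Image.open(p) for p in page_paths[start:start + 4]]
+     inputs = processor.process_images(batch).to(model.device)
+     with torch.no_grad():
+         # Keep per-page embeddings on CPU so GPU memory stays bounded
+         doc_embeddings.extend(list(model(**inputs).to("cpu")))
+
+ # Encode a query and score it against every indexed page with MaxSim
+ query_inputs = processor.process_queries(["quarterly revenue table"]).to(model.device)
+ with torch.no_grad():
+     query_embeddings = model(**query_inputs).to("cpu")
+
+ scores = processor.score_multi_vector(qs=list(query_embeddings), ps=doc_embeddings)  # (1, num_pages)
+ top_scores, top_idx = scores[0].topk(k=min(3, len(page_paths)))
+ for rank, (score, idx) in enumerate(zip(top_scores.tolist(), top_idx.tolist()), start=1):
+     print(f"{rank}. {page_paths[idx].name} (score: {score:.2f})")
+ ```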
+
+ ## Model Details
+
+ - **Base Model:** Gemma3-4B (google/gemma-3-4b-it)
+ - **Vision Encoder:** SigLIP
+ - **Training Data:** Multilingual document datasets
+ - **Embedding Strategy:** Multi-vector (Late Interaction)
+ - **Similarity Function:** MaxSim (Maximum Similarity)
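+
+ One practical consequence of the multi-vector strategy is index size: each page stores one 128-dimensional vector per image patch rather than a single pooled vector. The back-of-the-envelope estimate below uses an assumed patch count per page; the real number depends on the processor's image resolution settings.
+
+ ```python
+ # Rough index-size estimate for multi-vector storage.
+ num_pages = 100_000
+ patches_per_page = 1024   # assumption, not a documented property of this model
+ embedding_dim = 128       # per the Model Details above
+ bytes_per_value = 2       # bfloat16 / float16 storage
+
+ index_bytes = num_pages * patches_per_page * embedding_dim * bytes_per_value
+ print(f"{index_bytes / 1e9:.1f} GB")  # -> 26.2 GB under these assumptions
+ ```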
+
+ ## Performance
+
+ ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation metrics.
+
+ ## Citation
+
+ ```bibtex
+ @misc{kolavi2025m3druniversalmultilingualmultimodal,
+   title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
+   author={Adithya S Kolavi and Vyoman Jain},
+   year={2025},
+   eprint={2512.03514},
+   archivePrefix={arXiv},
+   primaryClass={cs.IR},
+   url={https://arxiv.org/abs/2512.03514}
+ }
+ ```
+
+ ## License
+
+ This model is released under the same license as the base Gemma3 model.
+
+ ## Acknowledgments
+
+ Built on top of the ColPali framework and Gemma3 architecture.