|
|
--- |
|
|
language: |
|
|
- en |
|
|
- es |
|
|
- fr |
|
|
- de |
|
|
- it |
|
|
- hi |
|
|
- mr |
|
|
- sa |
|
|
- kn |
|
|
- te |
|
|
- ta |
|
|
- ml |
|
|
- zh |
|
|
- ja |
|
|
- ko |
|
|
- ar |
|
|
- bn |
|
|
- gu |
|
|
- or |
|
|
- pa |
|
|
- ru |
|
|
- th |
|
|
license: gemma |
|
|
library_name: transformers |
|
|
tags: |
|
|
- vision-language |
|
|
- retrieval |
|
|
- colbert |
|
|
- late-interaction |
|
|
- multimodal |
|
|
- multilingual |
|
|
- document-retrieval |
|
|
- 22-languages |
|
|
pipeline_tag: visual-document-retrieval |
|
|
base_model: |
|
|
- google/gemma-3-4b-it |
|
|
|
|
|
datasets: |
|
|
- Cognitive-Lab/nayanair-bench |
|
|
model-index: |
|
|
- name: ColNetraEmbed |
|
|
results: |
|
|
- task: |
|
|
type: image-text-retrieval |
|
|
name: Cross-Lingual Document Retrieval |
|
|
dataset: |
|
|
type: Cognitive-Lab/nayanair-bench |
|
|
name: Nayana-IR Cross-Lingual |
|
|
split: test |
|
|
metrics: |
|
|
- type: ndcg_at_5 |
|
|
value: 0.637 |
|
|
name: NDCG@5 |
|
|
- type: recall_at_10 |
|
|
value: 0.700 |
|
|
name: Recall@10 |
|
|
- type: map_at_10 |
|
|
value: 0.610 |
|
|
name: MAP@10 |
|
|
- type: mrr_at_10 |
|
|
value: 0.610 |
|
|
name: MRR@10 |
|
|
- task: |
|
|
type: image-text-retrieval |
|
|
name: Monolingual Document Retrieval |
|
|
dataset: |
|
|
type: Cognitive-Lab/nayanair-bench |
|
|
name: Nayana-IR Monolingual |
|
|
split: test |
|
|
metrics: |
|
|
- type: ndcg_at_5 |
|
|
value: 0.670 |
|
|
name: NDCG@5 |
|
|
- type: recall_at_10 |
|
|
value: 0.764 |
|
|
name: Recall@10 |
|
|
- type: map_at_10 |
|
|
value: 0.645 |
|
|
name: MAP@10 |
|
|
- type: mrr_at_10 |
|
|
value: 0.686 |
|
|
name: MRR@10 |
|
|
- task: |
|
|
type: image-text-retrieval |
|
|
name: English Document Retrieval |
|
|
dataset: |
|
|
type: vidore/vidore-benchmark |
|
|
name: ViDoRe v2 |
|
|
split: test |
|
|
metrics: |
|
|
- type: ndcg_at_5 |
|
|
value: 0.551 |
|
|
name: NDCG@5 |
|
|
- type: recall_at_10 |
|
|
value: 0.664 |
|
|
name: Recall@10 |
|
|
- type: map_at_10 |
|
|
value: 0.445 |
|
|
name: MAP@10 |
|
|
- type: mrr_at_10 |
|
|
value: 0.445 |
|
|
name: MRR@10 |
|
|
--- |
|
|
# ColNetraEmbed |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
[Paper](https://arxiv.org/abs/2512.03514) · [GitHub](https://github.com/adithya-s-k/colpali) · [Model](https://huggingface.co/Cognitive-Lab/ColNetraEmbed) · [Blog](https://www.cognitivelab.in/blog/introducing-netraembed) · [Demo](https://huggingface.co/spaces/AdithyaSK/NetraEmbed)
|
|
|
|
|
|
|
|
**ColNetraEmbed** is a state-of-the-art multilingual, multimodal embedding model for visual document retrieval, built on a Gemma 3 backbone with ColBERT-style multi-vector representations.
|
|
|
|
|
## Model Description |
|
|
|
|
|
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim). |
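Concretely, the late-interaction score between a query $q$ and a page $d$ is the standard ColBERT MaxSim: each query-token embedding is matched to its most similar patch embedding, and these maxima are summed:

$$
s(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \langle \mathbf{q}_i, \mathbf{d}_j \rangle
$$

where $\mathbf{q}_i$ and $\mathbf{d}_j$ are the 128-dimensional query-token and image-patch embeddings.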
|
|
|
|
|
- **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
|
|
- **Architecture:** ColPali with Gemma3-4B backbone |
|
|
- **Embedding Dimension:** 128 per token |
|
|
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction |
|
|
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search |
|
|
|
|
|
## Paper |
|
|
|
|
|
**[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/adithya-s-k/colpali.git |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
from colpali_engine.models import ColGemma3, ColGemmaProcessor3 |
|
|
|
|
|
# Load model and processor |
|
|
model_name = "Cognitive-Lab/ColNetraEmbed" |
|
|
model = ColGemma3.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="cuda", |
|
|
) |
|
|
processor = ColGemmaProcessor3.from_pretrained(model_name) |
|
|
|
|
|
# Load your images |
|
|
images = [ |
|
|
Image.open("document1.jpg"), |
|
|
Image.open("document2.jpg"), |
|
|
] |
|
|
|
|
|
# Define queries |
|
|
queries = [ |
|
|
"What is the total revenue?", |
|
|
"Show me the organizational chart", |
|
|
] |
|
|
|
|
|
# Process and encode |
|
|
batch_images = processor.process_images(images).to(model.device) |
|
|
batch_queries = processor.process_queries(queries).to(model.device) |
|
|
|
|
|
with torch.no_grad(): |
|
|
image_embeddings = model(**batch_images) # Shape: (num_images, num_patches, 128) |
|
|
query_embeddings = model(**batch_queries) # Shape: (num_queries, num_tokens, 128) |
|
|
|
|
|
# Compute similarity scores using MaxSim |
|
|
scores = processor.score_multi_vector( |
|
|
qs=query_embeddings, |
|
|
ps=image_embeddings, |
|
|
) # Shape: (num_queries, num_images) |
|
|
|
|
|
# Get best matches |
|
|
for i, query in enumerate(queries): |
|
|
best_idx = scores[i].argmax().item() |
|
|
print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})") |
|
|
``` |
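For collections larger than a single batch, you can encode pages in chunks using the same calls shown above and keep the per-page embeddings for later scoring. Below is a minimal sketch; `encode_corpus` is a hypothetical helper for illustration, not part of `colpali_engine`:

```python
import torch

def encode_corpus(images, model, processor, batch_size=4):
    """Encode PIL images into per-page multi-vector embeddings, kept on the CPU."""
    page_embeddings = []
    for start in range(0, len(images), batch_size):
        batch = processor.process_images(images[start:start + batch_size]).to(model.device)
        with torch.no_grad():
            embeddings = model(**batch)  # (batch, num_patches, 128)
        # Move embeddings off the GPU so the index can outgrow GPU memory.
        page_embeddings.extend(embeddings.to(torch.float32).cpu().unbind(0))
    return page_embeddings

# score_multi_vector also accepts lists of per-page tensors:
# scores = processor.score_multi_vector(qs=query_embeddings, ps=page_embeddings)
```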
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Document Retrieval:** Search through large collections of visual documents |
|
|
- **Visual Question Answering:** Answer questions about document content |
|
|
- **Document Understanding:** Extract and match information from scanned documents |
|
|
- **Cross-lingual Document Search:** Multilingual visual document retrieval |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it) |
|
|
- **Vision Encoder:** SigLIP |
|
|
- **Training Data:** Multilingual document datasets |
|
|
- **Embedding Strategy:** Multi-vector (Late Interaction) |
|
|
- **Similarity Function:** MaxSim (maximum similarity), sketched in the reference implementation below
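As a reference for what MaxSim computes, here is a simplified PyTorch re-implementation of the scoring that `processor.score_multi_vector` performs (padding and batching details omitted); it is a sketch, not the library code:

```python
import torch

def maxsim(query_embeddings: torch.Tensor, image_embeddings: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    query_embeddings: (num_queries, num_query_tokens, dim)
    image_embeddings: (num_images, num_patches, dim)
    Returns a (num_queries, num_images) score matrix.
    """
    # Similarity of every query token to every image patch:
    # (num_queries, num_images, num_query_tokens, num_patches)
    sim = torch.einsum("qtd,ipd->qitp", query_embeddings, image_embeddings)
    # Keep each query token's best-matching patch, then sum over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)
```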
|
|
|
|
|
## Performance |
|
|
|
|
|
ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. It was evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and on ViDoRe v2.
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
**Nayana-IR Cross-Lingual** |
|
|
|
|
|
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
|
|
|-------|:------:|:---------:|:------:|:------:| |
|
|
| **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** | |
|
|
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 | |
|
|
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 | |
|
|
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 | |
|
|
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 | |
|
|
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 | |
|
|
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 | |
|
|
|
|
|
**Nayana-IR Monolingual** |
|
|
|
|
|
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
|
|
|-------|:------:|:---------:|:------:|:------:| |
|
|
| **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** | |
|
|
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 | |
|
|
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 | |
|
|
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 | |
|
|
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 | |
|
|
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 | |
|
|
|
|
|
**ViDoRe v2** |
|
|
|
|
|
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
|
|
|-------|:------:|:---------:|:------:|:------:| |
|
|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 | |
|
|
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - | |
|
|
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 | |
|
|
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 | |
|
|
| **ColNetraEmbed** | 0.551 | 0.664 | 0.445 | 0.445 |
|
|
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 | |
|
|
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 | |
|
|
|
|
|
**Key Results:** |
|
|
- **Strong multilingual performance** with ColBERT-style late interaction
- **124% relative improvement** in cross-lingual NDCG@5 over ColPali-v1.3 (0.637 vs. 0.284)
- Supports **22 languages** across diverse script families
- **Fine-grained matching** through token-level MaxSim scoring
|
|
|
|
|
**Comparison: Multi-vector vs. Single-vector**

- ColNetraEmbed (multi-vector): more interpretable, offering token-level attribution through MaxSim
- NetraEmbed (single-vector): higher cross-lingual accuracy (0.716 vs. 0.637 NDCG@5) and roughly 250x smaller storage footprint (see the back-of-envelope estimate below)
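A back-of-envelope illustration of where a ratio of that magnitude can come from; the patch count and the equal-dimension assumption here are illustrative, not measured figures:

```python
# Hypothetical page: ~250 patch vectors of dim 128 vs. one pooled vector of dim 128.
dim, bytes_per_value = 128, 2                                   # bfloat16 storage
patches_per_page = 250                                          # illustrative round number
multi_vector_bytes = patches_per_page * dim * bytes_per_value   # ~64 KB per page
single_vector_bytes = dim * bytes_per_value                     # 256 B per page
print(multi_vector_bytes // single_vector_bytes)                # 250
```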
|
|
|
|
|
See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{kolavi2025m3druniversalmultilingualmultimodal, |
|
|
title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, |
|
|
author={Adithya S Kolavi and Vyoman Jain}, |
|
|
year={2025}, |
|
|
eprint={2512.03514}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.IR}, |
|
|
url={https://arxiv.org/abs/2512.03514} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Gemma license](https://ai.google.dev/gemma/terms), inherited from the Gemma 3 base model.
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Compute credits for training, inference, and evaluation were provided by our compute sponsor, [Modal](https://modal.com). Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).
|
|
|
|
|
Built on top of the ColPali framework and the Gemma 3 architecture.
|
|
|