SPLADE (Sparse Lexical AnD Expansion) model fine-tuned for Portuguese text retrieval. Based on BERTimbau and trained on Portuguese question-answering datasets.
GitHub Repository: https://github.com/AxelPCG/SPLADE-PT-BR
SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% of dimensions are zero) that are interpretable (each active dimension corresponds to a vocabulary term) and compatible with classical inverted-index search.
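Because queries and documents live in the same sparse vocabulary space, relevance is a dot product over the few dimensions they share, which an inverted index can compute efficiently. A minimal toy sketch (the term ids and weights below are invented purely for illustration):

```python
from collections import defaultdict

# Toy sparse vectors: {vocab_index: weight}, standing in for SPLADE outputs.
# Real vectors live in a ~30k-dimensional vocabulary space.
docs = {
    "d1": {101: 1.2, 205: 0.7},
    "d2": {333: 0.9, 410: 0.4},
}

# Inverted index: term id -> [(doc id, weight), ...]
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

def search(query_vec):
    """Score documents with a sparse dot product, touching only shared terms."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])

# The query overlaps only with d1: score = 0.8*1.2 + 0.5*0.7 = 1.31
print(search({101: 0.8, 205: 0.5}))
```

Only documents sharing at least one active term with the query are ever scored, which is why high sparsity translates directly into retrieval speed.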
Base Model: neuralmind/bert-base-portuguese-cased (BERTimbau)
Datasets: unicamp-dl/mmarco, unicamp-dl/mrobust
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
FLOPS regularization is applied during training to push average activations toward zero, enforcing sparsity in the learned representations.
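The FLOPS regularizer (Paria et al., 2020) penalizes the squared mean activation of each vocabulary dimension across a batch. The sketch below is illustrative only; the regularization weights actually used in training are not stated in this card.

```python
import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over vocabulary dimensions of the squared
    mean activation across the batch. Dimensions that are active for many
    examples are penalized most, driving representations toward sparsity."""
    return (reps.mean(dim=0) ** 2).sum()

# Toy batch of 2 representations over a 3-term vocabulary.
batch = torch.tensor([[0.5, 0.0, 1.0],
                      [0.3, 0.0, 0.0]])
# Per-dim means: [0.4, 0.0, 0.5] -> loss = 0.16 + 0.0 + 0.25 = 0.41
print(flops_loss(batch).item())
```

In SPLADE training this term is typically added to the ranking loss with separate coefficients for the query and document encoders.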
Evaluation Dataset: mRobust (528k documents, 250 queries)
| Metric | Score |
|---|---|
| MRR@10 | 0.453 |
```bash
pip install torch transformers
```
Option 1: Using HuggingFace Hub (Recommended)
```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade  # shipped with the model repository

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```
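A useful way to inspect a SPLADE vector is to map its non-zero dimensions back to vocabulary tokens, exposing the expansion terms. The sketch below uses a toy vocabulary and hand-made weights so it runs standalone; with the real model you would pass the non-zero indices to tokenizer.convert_ids_to_tokens instead.

```python
import torch

# Toy stand-ins: a tiny vocabulary and a hand-made sparse vector.
# With the real model, query_vec comes from the encoder and the vocabulary
# from the tokenizer.
vocab = ["[PAD]", "capital", "brasil", "cidade", "governo"]
query_vec = torch.tensor([0.0, 1.7, 1.2, 0.0, 0.4])

# Non-zero dimensions and their weights, sorted by weight descending.
indices = torch.nonzero(query_vec).squeeze(-1)
weights = query_vec[indices]
order = torch.argsort(weights, descending=True)
expansion = [(vocab[int(indices[i])], round(weights[i].item(), 2)) for i in order]
print(expansion)  # [('capital', 1.7), ('brasil', 1.2), ('governo', 0.4)]
```

Terms with the highest weights dominate both matching and ranking, so listing them is a quick sanity check that the model expanded the query sensibly.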
Option 2: Using SPLADE Library
```python
# Requires the SPLADE library from https://github.com/naver/splade
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer

# Load the model by pointing to the HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```
Citation:

```bibtex
@misc{splade-pt-br-2025,
  author    = {Axel Chepanski},
  title     = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```
License: Apache 2.0