potion-code-16M Model Card

Overview

potion-code-16M is a fast static code embedding model optimized for code retrieval tasks. It is distilled from nomic-ai/CodeRankEmbed and trained on the CornStack code corpus using Tokenlearn and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.

Installation

pip install model2vec

Usage

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
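The embeddings returned by `encode` are plain numpy arrays, so retrieval reduces to a nearest-neighbor search over cosine similarity. A minimal sketch of the ranking step, using dummy 256-dimensional vectors in place of real `model.encode` output (the `cosine_rank` helper is illustrative, not part of the model2vec API):

```python
import numpy as np

def cosine_rank(query_emb, doc_embs):
    """Rank documents by cosine similarity to a single query vector."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                # one similarity score per document
    order = np.argsort(-scores)   # highest similarity first
    return order, scores

# Dummy embeddings standing in for model.encode(...) output
rng = np.random.default_rng(0)
query = rng.normal(size=256)
docs = rng.normal(size=(3, 256))
docs[1] = query + 0.01 * rng.normal(size=256)  # doc 1 nearly matches the query

order, scores = cosine_rank(query, docs)
print(order[0])  # 1
```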

How it works

potion-code-16M is created using the following pipeline:

  1. Vocabulary mining: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
  2. Distillation: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
  3. Tokenlearn: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
  4. Contrastive fine-tuning: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
  5. Post-SIF re-regularization: token weights are re-regularized using SIF weighting after each training stage
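Step 5 refers to SIF (smooth inverse frequency) weighting, in which each token embedding is scaled by a / (a + p(token)), so frequent tokens contribute less to sentence vectors. A rough sketch of the idea; the toy counts, a = 1e-3, and array layout are illustrative assumptions, not the exact procedure used in training:

```python
import numpy as np

def sif_reweight(embeddings, counts, a=1e-3):
    """Scale each token's embedding row by the SIF weight a / (a + p(token))."""
    p = counts / counts.sum()   # empirical token probabilities
    weights = a / (a + p)       # frequent tokens get smaller weights
    return embeddings * weights[:, None], weights

# Toy vocabulary: token 0 is very frequent, token 2 is rare
emb = np.ones((3, 4))
counts = np.array([9000.0, 900.0, 100.0])
reweighted, w = sif_reweight(emb, counts)
print(w)  # weight grows as frequency drops
```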

Results

Results on the CoIR benchmark (NDCG@10, mteb>=2.10):

| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| potion-code-16M + Hybrid | 16M | 40.41 | 5.23 | 34.03 | 51.23 | 64.26 | 33.22 | 52.67 | 31.14 | 21.63 | 69.65 | 41.03 |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| potion-code-16M | 16M | 37.05 | 3.97 | 42.99 | 36.26 | 50.27 | 43.40 | 39.76 | 31.72 | 21.37 | 57.47 | 43.34 |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |

CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, CosQA and CodeFeedback (ST/MT) are the most relevant tasks. Others are less so: COIRCodeSearchNetRetrieval retrieves text given a code query (the reverse direction), and the CodeTransOcean tasks target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha=0.5).
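The hybrid combination described above can be sketched as follows. The scores below are made up for illustration; only the min-max normalization and equal weighting (alpha=0.5) come from the description:

```python
import numpy as np

def hybrid_scores(dense, bm25, alpha=0.5):
    """Min-max normalize each score list, then mix with weight alpha."""
    def minmax(x):
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Toy scores for 4 candidate documents
dense = np.array([0.82, 0.40, 0.65, 0.10])  # cosine similarities
bm25 = np.array([12.0, 30.0, 5.0, 18.0])    # unbounded BM25 scores
combined = hybrid_scores(dense, bm25)
print(combined.argmax())  # 1
```

Normalizing each score list to [0, 1] before mixing matters because raw BM25 scores are unbounded and would otherwise dominate the bounded cosine similarities.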

Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static model; no practical limit) |

Reproducibility

The full training pipeline (distill → tokenlearn → contrastive) is in train.py. It requires minishlab/tokenlearn-cornstack-docs-coderankembed and minishlab/tokenlearn-cornstack-queries-coderankembed (20k samples per language are used).

pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py

Citation

@software{minishlab2024model2vec,
  author       = {Stephan Tulkens and {van Dongen}, Thomas},
  title        = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17270888},
  url          = {https://github.com/MinishLab/model2vec},
  license      = {MIT}
}