potion-code-16M Model Card

Overview

potion-code-16M is a fast static code embedding model optimized for code retrieval tasks. It is distilled from nomic-ai/CodeRankEmbed and trained on the CornStack code corpus using Tokenlearn and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.

Installation

pip install model2vec

Usage

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
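The embeddings returned by `encode` are plain numpy arrays, so retrieval reduces to a nearest-neighbor search over cosine similarity. A minimal sketch of the ranking step, using dummy 256-dimensional vectors in place of real `model.encode` output (the `cosine_rank` helper is illustrative, not part of the model2vec API):

```python
import numpy as np

def cosine_rank(query_emb, doc_embs):
    """Rank documents by cosine similarity to a single query vector."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                # one similarity score per document
    order = np.argsort(-scores)   # highest similarity first
    return order, scores

# Dummy embeddings standing in for model.encode(...) output
rng = np.random.default_rng(0)
query = rng.normal(size=256)
docs = rng.normal(size=(3, 256))
docs[1] = query + 0.01 * rng.normal(size=256)  # doc 1 nearly matches the query

order, scores = cosine_rank(query, docs)
print(order[0])  # 1
```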

How it works

potion-code-16M is created using the following pipeline:

  1. Vocabulary mining: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
  2. Distillation: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
  3. Tokenlearn: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
  4. Contrastive fine-tuning: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
  5. Post-SIF re-regularization: token weights are re-regularized using SIF weighting after each training stage
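Step 5 refers to SIF (smooth inverse frequency) weighting, in which each token embedding is scaled by a / (a + p(token)), so frequent tokens contribute less to sentence vectors. A rough sketch of the idea; the toy counts, a = 1e-3, and array layout are illustrative assumptions, not the exact procedure used in training:

```python
import numpy as np

def sif_reweight(embeddings, counts, a=1e-3):
    """Scale each token's embedding row by the SIF weight a / (a + p(token))."""
    p = counts / counts.sum()   # empirical token probabilities
    weights = a / (a + p)       # frequent tokens get smaller weights
    return embeddings * weights[:, None], weights

# Toy vocabulary: token 0 is very frequent, token 2 is rare
emb = np.ones((3, 4))
counts = np.array([9000.0, 900.0, 100.0])
reweighted, w = sif_reweight(emb, counts)
print(w)  # weight grows as frequency drops
```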

Results

Results on the CoIR benchmark (NDCG@10, mteb>=2.10):

| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| potion-code-16M + Hybrid | 16M | 40.41 | 5.23 | 34.03 | 51.23 | 64.26 | 33.22 | 52.67 | 31.14 | 21.63 | 69.65 | 41.03 |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| potion-code-16M | 16M | 37.05 | 3.97 | 42.99 | 36.26 | 50.27 | 43.40 | 39.76 | 31.72 | 21.37 | 57.47 | 43.34 |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |

CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, CosQA and CodeFeedback (ST/MT) are the most relevant tasks. Others are less so: COIRCodeSearchNetRetrieval retrieves text given a code query (the reverse direction), and the CodeTransOcean tasks target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha=0.5).
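The hybrid combination described above can be sketched as follows. The scores below are made up for illustration; only the min-max normalization and equal weighting (alpha=0.5) come from the description:

```python
import numpy as np

def hybrid_scores(dense, bm25, alpha=0.5):
    """Min-max normalize each score list, then mix with weight alpha."""
    def minmax(x):
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Toy scores for 4 candidate documents
dense = np.array([0.82, 0.40, 0.65, 0.10])  # cosine similarities
bm25 = np.array([12.0, 30.0, 5.0, 18.0])    # unbounded BM25 scores
combined = hybrid_scores(dense, bm25)
print(combined.argmax())  # 1
```

Normalizing each score list to [0, 1] before mixing matters because raw BM25 scores are unbounded and would otherwise dominate the bounded cosine similarities.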

Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static model; no practical limit) |

Reproducibility

The full training pipeline (distill → tokenlearn → contrastive) is in train.py. It requires minishlab/tokenlearn-cornstack-docs-coderankembed and minishlab/tokenlearn-cornstack-queries-coderankembed (20k samples per language are used).

pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py

Citation

@software{minishlab2024model2vec,
  author       = {Stephan Tulkens and {van Dongen}, Thomas},
  title        = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17270888},
  url          = {https://github.com/MinishLab/model2vec},
  license      = {MIT}
}