---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
library_name: sentence-transformers
license: mit
---

# CodeRankEmbed

`CodeRankEmbed` is a 137M-parameter bi-encoder supporting an 8192-token context length for code retrieval. It significantly outperforms open-source and proprietary code embedding models across a range of code retrieval tasks.

Check out our [blog post](https://gangiswag.github.io/cornstack/) and [paper](https://arxiv.org/pdf/2412.01007) for more details!

Combine `CodeRankEmbed` with our re-ranker [`CodeRankLLM`](https://huggingface.co/cornstack/CodeRankLLM) for even higher quality code retrieval.

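As an illustration, a two-stage retrieve-then-rerank pipeline could look like the sketch below. The first stage uses `CodeRankEmbed` as in the Usage section; the `rerank` call is only a hypothetical placeholder (see the `CodeRankLLM` model card for its actual usage):

```python
from sentence_transformers import SentenceTransformer

# Stage 1: dense retrieval with CodeRankEmbed over a candidate corpus.
retriever = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)

query = "Represent this query for searching relevant code: Calculate the n-th factorial"
corpus = [
    "def fact(n):\n    return 1 if n == 0 else n * fact(n - 1)",
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
]

query_emb = retriever.encode([query])
corpus_emb = retriever.encode(corpus)
scores = retriever.similarity(query_emb, corpus_emb)[0]  # one similarity score per snippet
candidates = [corpus[i] for i in scores.argsort(descending=True).tolist()]

# Stage 2: pass the ranked candidates to the CodeRankLLM re-ranker.
# `rerank` is a hypothetical placeholder, not an API provided by this repository.
# reranked = rerank(query, candidates)
```
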
# Performance Benchmarks

| Name | Parameters | CSN (MRR) | CoIR (NDCG@10) |
| :---: | :--- | :--- | :---: |
| **CodeRankEmbed** | 137M | **77.9** | **60.1** |
| Arctic-Embed-M-Long | 137M | 53.4 | 43.0 |
| CodeSage-Small | 130M | 64.9 | 54.4 |
| CodeSage-Base | 356M | 68.7 | 57.5 |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| OpenAI-Ada-002 | 110M | 71.3 | 45.6 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

We release the scripts to evaluate our model's performance [here](https://github.com/gangiswag/cornstack).

# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)

# Queries must carry the task instruction prefix; code snippets are encoded as-is.
queries = ['Represent this query for searching relevant code: Calculate the n-th factorial']
codes = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```

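To rank code snippets against a query from these embeddings, you can use the model's built-in similarity scoring (cosine similarity by default in recent `sentence-transformers` releases). This short continuation of the snippet above is a sketch:

```python
# Continuing from the snippet above: score each code snippet against each query.
scores = model.similarity(query_embeddings, code_embeddings)
print(scores)  # shape (num_queries, num_codes); higher means more relevant
```
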
## Training

We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoders. The retriever is contrastively fine-tuned with the InfoNCE loss on [CoRNStack](https://gangiswag.github.io/cornstack/), a high-quality dataset of 21 million examples that we curated. Our encoder is initialized from [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M-parameter text encoder supporting an extended context length of 8,192 tokens.

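For intuition, the core contrastive objective can be sketched as a standard InfoNCE loss with in-batch negatives; this is an illustrative sketch rather than our training code, and the temperature value and embedding dimension are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, code_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th query's positive is the i-th code snippet;
    every other code snippet in the batch serves as a negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.T / temperature                 # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Example with random embeddings standing in for the shared encoder's outputs.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```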
# Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
```