CodeCompass-Embed
CodeCompass-Embed is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens in total. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, and it delivers strong results on the CoIR code retrieval benchmark, including the highest CodeSearchNet-Python score among the models compared below.
Model Highlights
- Code search from natural language — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, PHP
- Compact yet competitive — 494M params and 896-dim embeddings, holding its own against both smaller and larger baselines on CoIR
- Bidirectional attention — all 24 layers converted from causal for better embedding quality
- Lightweight — runs on consumer GPUs, trained at 512 tokens with RoPE extrapolation for longer inputs
- Versatile — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
Benchmark Results (CoIR)
Evaluated on the CoIR Benchmark (ACL 2025). All scores are NDCG@10. Sorted by CSN-Python.
| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|---|---|---|---|---|---|---|---|---|
| CodeCompass-Embed (ours) | 494M | 0.979 | 0.286 | 0.736 | 0.834 | 0.814 | 0.349 | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 | 0.679 |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |
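For readers unfamiliar with the metric, the sketch below illustrates NDCG@10 under binary relevance. It is a simplified illustration, not the CoIR evaluation harness.

```python
import math

def ndcg_at_10(ranked_ids, relevant_ids):
    """ranked_ids: retrieved doc ids in ranked order; relevant_ids: set of gold ids."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position i is discounted by log2(i + 2)
        for rank, doc_id in enumerate(ranked_ids[:10])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), 10)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# With a single gold document, NDCG@10 is 1.0 if it is ranked first,
# about 0.63 if it is ranked second, and 0.0 if it falls outside the top 10.
print(ndcg_at_10(["d3", "d1", "d7"], {"d1"}))  # ≈ 0.63
```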
Multi-Language Code Search (CodeSearchNet)
| Language | NDCG@10 | MRR@10 |
|---|---|---|
| Python | 0.979 | 0.976 |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |
Full Results (All 12 Tasks)
| Task | NDCG@10 | MRR@10 |
|---|---|---|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| Average (12 tasks) | 0.623 | 0.577 |
Usage
With Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False
model.eval()

def encode(texts, is_query=False):
    # Add the instruction prefix for queries; corpus texts are encoded as-is
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize so dot products are cosine similarities
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

# Example: code search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n return sorted(lst)",
    "def add_numbers(a, b):\n return a + b",
    "def reverse_string(s):\n return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute cosine similarities (embeddings are L2-normalized)
similarities = (query_emb @ code_embs.T).squeeze()

print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```
Instruction Templates
For optimal performance, use these instruction prefixes for queries:
| Task | Instruction Template |
|---|---|
| NL → Code | Instruct: Find the most relevant code snippet given the following query:\nQuery: {query} |
| Code → Code | Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query} |
| Tech Q&A | Instruct: Find the most relevant answer given the following question:\nQuery: {query} |
| Text → SQL | Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query} |
Note: Document/corpus texts do NOT need instruction prefixes.
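A convenient way to apply these templates is a small lookup keyed by task. The `INSTRUCTIONS` dict and `format_query` helper below are illustrative, not part of the model's API.

```python
# Hypothetical helper: map each retrieval task to its instruction prefix
INSTRUCTIONS = {
    "nl2code": "Find the most relevant code snippet given the following query:",
    "code2code": "Find an equivalent code snippet given the following code snippet:",
    "qa": "Find the most relevant answer given the following question:",
    "text2sql": "Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(query: str, task: str = "nl2code") -> str:
    """Prepend the task-specific instruction; corpus documents are encoded as-is."""
    return f"Instruct: {INSTRUCTIONS[task]}\nQuery: {query}"

print(format_query("sort a list of tuples by the second element"))
```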
Training Details
Training followed a two-stage approach:
Stage 1 — Embedding Conversion (8.8M samples): Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.
Stage 2 — Hard Negative Refinement (100K samples): Continued fine-tuning on a curated 100K-sample subset with hard negatives.
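The mining pipeline itself is not published here. As a rough illustration, hard negatives are typically obtained by retrieving the highest-scoring corpus entries that are not the gold document, along these lines (`mine_hard_negatives` and its `encode_fn` argument are hypothetical names):

```python
import torch

def mine_hard_negatives(queries, positives, corpus, encode_fn, k=10):
    """For each (query, positive) pair, return the top-k corpus entries that the
    retriever ranks highly but that are not the gold positive."""
    q_embs = encode_fn(queries, is_query=True)    # (Q, d), L2-normalized
    c_embs = encode_fn(corpus, is_query=False)    # (C, d), L2-normalized
    scores = q_embs @ c_embs.T                    # cosine similarities
    hard_negatives = []
    for qi, pos in enumerate(positives):
        ranked = torch.argsort(scores[qi], descending=True).tolist()
        hard_negatives.append([corpus[j] for j in ranked if corpus[j] != pos][:k])
    return hard_negatives
```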
- Base Model: Qwen2.5-Coder-0.5B
- Architecture: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
- Loss: InfoNCE with temperature τ=0.05 (see the sketch after this list)
- Effective Batch Size: 1024 (via GradCache)
- Hardware: NVIDIA H100 (95GB)
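As a reference for the loss above, here is a minimal sketch of in-batch InfoNCE with temperature τ=0.05, assuming L2-normalized query and document embeddings; the actual training code, GradCache chunking, and explicit hard negatives are not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(query_embs, doc_embs, temperature=0.05):
    """query_embs[i] and doc_embs[i] form a positive pair; every other
    document in the batch serves as an in-batch negative."""
    logits = (query_embs @ doc_embs.T) / temperature   # (B, B) similarity matrix
    labels = torch.arange(query_embs.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)             # positives sit on the diagonal

# Example with random (normalized) embeddings
q = F.normalize(torch.randn(8, 896), dim=-1)
d = F.normalize(torch.randn(8, 896), dim=-1)
print(info_nce(q, d).item())
```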
Limitations
- Strongest on Python; other languages show lower but competitive performance
- Weaker on competitive programming tasks (APPS) due to long solution lengths vs. 512 training context
- May not generalize to low-resource programming languages not seen in training
Citation
```bibtex
@misc{codecompass2026,
  author    = {Faisal Mumtaz},
  title     = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```
License
Apache 2.0