CodeCompass-Embed

CodeCompass-Embed is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens in total. It produces 896-dimensional embeddings optimized for matching natural-language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, and achieves strong results on the CoIR code retrieval benchmark, including the top CSN-Python score among the models compared below.

Model Highlights

  • Code search from natural language — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, PHP
  • Competitive across model sizes — at 494M params and 896-dim embeddings, it matches or beats both smaller and larger models on CoIR
  • Bidirectional attention — all 24 layers converted from causal for better embedding quality
  • Lightweight — runs on consumer GPUs, trained at 512 tokens with RoPE extrapolation for longer inputs
  • Versatile — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
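
The mean pooling and L2 normalization listed above can be sketched as follows (a minimal illustration on dummy tensors; in practice the hidden states come from the model, as shown in the Usage section):

```python
import torch
import torch.nn.functional as F

batch, seq_len, dim = 2, 8, 896  # 896 matches the embedding dimension above
hidden = torch.randn(batch, seq_len, dim)              # stand-in for the last hidden states
attention_mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])  # second sequence is padded

# Mean pooling over non-padding tokens only
mask = attention_mask.unsqueeze(-1).float()
emb = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# L2 normalization, so dot products equal cosine similarities
emb = F.normalize(emb, p=2, dim=-1)
print(emb.shape)         # (2, 896)
print(emb.norm(dim=-1))  # each row has unit norm
```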

Benchmark Results (CoIR)

Evaluated on the CoIR Benchmark (ACL 2025). All scores are NDCG@10. Sorted by CSN-Python.

| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|---|---|---|---|---|---|---|---|---|
| CodeCompass-Embed (ours) | 494M | 0.979 | 0.286 | 0.736 | 0.834 | 0.814 | 0.349 | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 | 0.679 |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

Multi-Language Code Search (CodeSearchNet)

| Language | NDCG@10 | MRR@10 |
|---|---|---|
| Python | 0.979 | 0.976 |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |

Full Results (All 12 Tasks)

| Task | NDCG@10 | MRR@10 |
|---|---|---|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| Average (12 tasks) | 0.623 | 0.577 |

Usage

With Transformers

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
        
        # Mean pooling
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        
        # L2 normalize
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
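
Because the embeddings are L2-normalized, those dot products are cosine similarities, and the same scores feed directly into ranked retrieval over a larger corpus. A minimal top-k sketch on dummy embeddings (stand-ins for the outputs of `encode`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for encode() outputs: one query, five corpus embeddings (unit-normalized)
query_emb = F.normalize(torch.randn(1, 896), dim=-1)
corpus_embs = F.normalize(torch.randn(5, 896), dim=-1)

# Dot product == cosine similarity for unit vectors
scores = (query_emb @ corpus_embs.T).squeeze(0)

# Rank the corpus and keep the 3 best matches
top = torch.topk(scores, k=3)
for rank, (idx, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    print(f"{rank}. doc {idx}  score={score:.4f}")
```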

Instruction Templates

For optimal performance, use these instruction prefixes for queries:

| Task | Instruction Template |
|---|---|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

Note: Document/corpus texts do NOT need instruction prefixes.
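
A small helper can apply these prefixes consistently before encoding (a sketch; the task keys and the `format_query` name are illustrative, not part of the released code):

```python
# Instruction templates from the table above; corpus texts are encoded without a prefix
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}",
    "qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {query}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}",
}

def format_query(query: str, task: str = "nl2code") -> str:
    """Prefix a query with the instruction template for its task."""
    return TEMPLATES[task].format(query=query)

print(format_query("How to sort a list in Python"))
```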

Training Details

Training followed a two-stage approach:

Stage 1 — Embedding Conversion (8.8M samples): Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.

Stage 2 — Hard Negative Refinement (100K samples): Continued fine-tuning on a curated 100K-sample subset with hard negatives.

  • Base Model: Qwen2.5-Coder-0.5B
  • Architecture: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
  • Loss: InfoNCE with temperature τ = 0.05
  • Effective Batch Size: 1024 (via GradCache)
  • Hardware: NVIDIA H100 (95GB)

Limitations

  • Strongest on Python; other languages show lower but competitive performance
  • Weaker on competitive programming tasks (APPS), where solutions often exceed the 512-token training context
  • May not generalize to low-resource programming languages not seen in training

Citation

@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}

License

Apache 2.0
