CodeCompass-Embed

CodeCompass-Embed is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens in total. It produces 896-dimensional embeddings optimized for matching natural-language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP, and achieves strong results on the CoIR code retrieval benchmark, including the top CSN-Python score among the models compared below.

Model Highlights

  • Code search from natural language — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, PHP
  • Competitive across model sizes — at 494M params and 896-dim embeddings, it matches or beats both smaller and larger models on CoIR
  • Bidirectional attention — all 24 layers converted from causal for better embedding quality
  • Lightweight — runs on consumer GPUs, trained at 512 tokens with RoPE extrapolation for longer inputs
  • Versatile — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
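
The mean pooling and L2 normalization listed above can be sketched as follows (a minimal illustration on dummy tensors; in practice the hidden states come from the model, as shown in the Usage section):

```python
import torch
import torch.nn.functional as F

batch, seq_len, dim = 2, 8, 896  # 896 matches the embedding dimension above
hidden = torch.randn(batch, seq_len, dim)              # stand-in for the last hidden states
attention_mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])  # second sequence is padded

# Mean pooling over non-padding tokens only
mask = attention_mask.unsqueeze(-1).float()
emb = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# L2 normalization, so dot products equal cosine similarities
emb = F.normalize(emb, p=2, dim=-1)
print(emb.shape)         # (2, 896)
print(emb.norm(dim=-1))  # each row has unit norm
```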

Benchmark Results (CoIR)

Evaluated on the CoIR Benchmark (ACL 2025). All scores are NDCG@10. Sorted by CSN-Python.

| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|---|---|---|---|---|---|---|---|---|
| CodeCompass-Embed (ours) | 494M | 0.979 | 0.286 | 0.736 | 0.834 | 0.814 | 0.349 | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 | 0.679 |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

Multi-Language Code Search (CodeSearchNet)

| Language | NDCG@10 | MRR@10 |
|---|---|---|
| Python | 0.979 | 0.976 |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |

Full Results (All 12 Tasks)

| Task | NDCG@10 | MRR@10 |
|---|---|---|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| Average (12 tasks) | 0.623 | 0.577 |

Usage

With Transformers

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
        
        # Mean pooling
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        
        # L2 normalize
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
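
Because the embeddings are L2-normalized, those dot products are cosine similarities, and the same scores feed directly into ranked retrieval over a larger corpus. A minimal top-k sketch on dummy embeddings (stand-ins for the outputs of `encode`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for encode() outputs: one query, five corpus embeddings (unit-normalized)
query_emb = F.normalize(torch.randn(1, 896), dim=-1)
corpus_embs = F.normalize(torch.randn(5, 896), dim=-1)

# Dot product == cosine similarity for unit vectors
scores = (query_emb @ corpus_embs.T).squeeze(0)

# Rank the corpus and keep the 3 best matches
top = torch.topk(scores, k=3)
for rank, (idx, score) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    print(f"{rank}. doc {idx}  score={score:.4f}")
```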

Instruction Templates

For optimal performance, use these instruction prefixes for queries:

| Task | Instruction Template |
|---|---|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

Note: Document/corpus texts do NOT need instruction prefixes.
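
A small helper can apply these prefixes consistently before encoding (a sketch; the task keys and the `format_query` name are illustrative, not part of the released code):

```python
# Instruction templates from the table above; corpus texts are encoded without a prefix
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}",
    "qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {query}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}",
}

def format_query(query: str, task: str = "nl2code") -> str:
    """Prefix a query with the instruction template for its task."""
    return TEMPLATES[task].format(query=query)

print(format_query("How to sort a list in Python"))
```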

Training Details

Training followed a two-stage approach:

Stage 1 — Embedding Conversion (8.8M samples): Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.

Stage 2 — Hard Negative Refinement (100K samples): Continued fine-tuning on a curated 100K-sample subset with hard negatives.

  • Base Model: Qwen2.5-Coder-0.5B
  • Architecture: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
  • Loss: InfoNCE with temperature τ = 0.05
  • Effective Batch Size: 1024 (via GradCache)
  • Hardware: NVIDIA H100 (95GB)

Limitations

  • Strongest on Python; other languages show lower but competitive performance
  • Weaker on competitive programming tasks (APPS), where solutions often exceed the 512-token training context
  • May not generalize to low-resource programming languages not seen in training

Citation

@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}

License

Apache 2.0
