NamanAgnih0tri committed on
Commit 73a27bc · verified · 1 Parent(s): 809884f

Upload folder using huggingface_hub

config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_name_or_path": "cross-encoder/ms-marco-MiniLM-L-6-v2",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "sbert_ce_default_activation_function": "torch.nn.modules.linear.Identity",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44fac56139daae40c68cd68721c226ab6260d31fcb6ccfc3c9cad89cb5fbbcff
+ size 90866412
readme.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ tags:
+ - reranker
+ - code search
+ - cross-encoder
+ - MiniLM
+ - staqc
+ - information retrieval
+ - MRR
+ - code understanding
+ - python
+ - stack-overflow
+ library_name: sentence-transformers
+ pipeline_tag: text-classification
+ license: apache-2.0
+ model-index:
+ - name: code-reranker-miniLM-staqc
+   results:
+   - task:
+       type: text-classification
+       name: Code Reranking
+     dataset:
+       name: StaQC (Stack Overflow Question-Code)
+       type: custom
+     metrics:
+     - name: MRR
+       type: mean_reciprocal_rank
+       value: 0.9380
+     - name: Top-1 Accuracy
+       type: accuracy
+       value: 0.9100
+ ---
+
+ # code-reranker-miniLM-staqc
+
+ **A fine-tuned cross-encoder based on `cross-encoder/ms-marco-MiniLM-L-6-v2` for reranking Python code snippets against natural language queries from Stack Overflow.**
+
+ ## Model Description
+
+ This model is a cross-encoder trained on the StaQC dataset (Stack Overflow Question-Code pairs) to rank candidate Python code snippets by relevance to a programming question or natural language intent. It is fine-tuned specifically for Python code search and retrieval tasks where accurate relevance scoring matters.
+
+ * **Architecture**: Cross-Encoder based on MiniLM-L6
+ * **Base model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
+ * **Fine-tuned on**: StaQC SCA (Stack Overflow Question-Code) dataset
+ * **Task**: Python code snippet reranking for natural language queries
+ * **Language**: Python code snippets
+
+ ## Use Cases
+
+ * Python code search engines
+ * Developer assistants for Python programming
+ * AI coding agents with natural language interfaces
+ * Evaluation modules in RAG pipelines for Python programming use cases
+ * Code recommendation systems
+
+ ## Evaluation Results
+
+ The model was evaluated on 500 query–code candidate sets from the CoNaLa curated dataset.
+
+ | Metric | Value |
+ | -------------- | ----- |
+ | MRR | 0.938 |
+ | Top-1 Accuracy | 0.910 |
+
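+ For reference, here is how these two metrics can be computed; a minimal sketch assuming each evaluation example holds one query, its candidate snippets, and the index of the gold snippet (this data layout is illustrative, not the exact evaluation harness):
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ model = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")
+
+ def evaluate(examples):
+     """Compute MRR and Top-1 accuracy over (query, candidates, gold_index) triples."""
+     reciprocal_ranks, top1_hits = [], 0
+     for query, candidates, gold_index in examples:
+         scores = model.predict([[query, code] for code in candidates])
+         # Candidate indices, ordered from highest to lowest relevance score
+         ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
+         rank = ranking.index(gold_index) + 1  # 1-based rank of the gold snippet
+         reciprocal_ranks.append(1.0 / rank)
+         top1_hits += int(rank == 1)
+     return sum(reciprocal_ranks) / len(examples), top1_hits / len(examples)
+ ```
+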
+ ## How to Use
+
+ ### Using sentence-transformers
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # Load the model
+ model = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")
+
+ # Sample input
+ query = "How to convert a string to int in Python?"
+ code_snippet = "int_value = int('123')"
+
+ # Get relevance score
+ score = model.predict([query, code_snippet])
+ print(f"Relevance Score: {score:.4f}")
+ ```
+
+ ### Using transformers directly
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("NamanAgnih0tri/code-reranker-miniLM-staqc")
+ model = AutoModelForSequenceClassification.from_pretrained("NamanAgnih0tri/code-reranker-miniLM-staqc")
+
+ # Sample input
+ query = "How to reverse a string in Python?"
+ code_snippet = "def reverse_string(s):\n return s[::-1]"
+
+ # Tokenize and predict relevance
+ inputs = tokenizer(query, code_snippet, return_tensors="pt", truncation=True, max_length=512)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ score = logits[0].item()
+
+ print(f"Relevance Score: {score:.4f}")
+ ```
+
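+ Note that the scores from both snippets are raw logits: the bundled `config.json` sets `sbert_ce_default_activation_function` to `torch.nn.modules.linear.Identity`, so outputs are unbounded. If a score in [0, 1] is preferred, a sigmoid can be applied on top (a small illustrative addition, not part of the model itself):
+
+ ```python
+ import torch
+
+ # `score` is the raw logit produced by either snippet above
+ probability = torch.sigmoid(torch.tensor(score)).item()
+ print(f"Probability-like score: {probability:.4f}")
+ ```
+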
+ ### Code Ranking Example
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ model = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")
+
+ def rank_code_snippets(query, candidates):
+     """Rank code snippets by relevance to the query."""
+     pairs = [[query, code] for code in candidates]
+     scores = model.predict(pairs)
+     ranked_results = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
+     return ranked_results
+
+ # Example usage
+ query = "How to reverse a string in Python?"
+ candidates = [
+     "def reverse_string(s):\n return s[::-1]",
+     "print('hello'[::-1])",
+     "def add(a,b):\n return a + b",
+     "list = [1,2,3,4]"
+ ]
+
+ ranked_results = rank_code_snippets(query, candidates)
+ for rank, (code, score) in enumerate(ranked_results, 1):
+     print(f"{rank}. Score: {score:.4f}\n{code}\n")
+ ```
+
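+ In a retrieval pipeline (e.g., the RAG use case above), this reranker would typically sit behind a fast first-stage retriever. Below is a minimal sketch assuming a bi-encoder such as `sentence-transformers/all-MiniLM-L6-v2` as that first stage; the retriever choice and the tiny corpus are illustrative:
+
+ ```python
+ from sentence_transformers import CrossEncoder, SentenceTransformer, util
+
+ retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+ reranker = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")
+
+ corpus = [
+     "def reverse_string(s):\n return s[::-1]",
+     "int_value = int('123')",
+     "squares = [x**2 for x in range(10)]",
+ ]
+ corpus_embeddings = retriever.encode(corpus, convert_to_tensor=True)
+
+ query = "How to reverse a string in Python?"
+ query_embedding = retriever.encode(query, convert_to_tensor=True)
+
+ # Stage 1: cheap vector search narrows the corpus to top-k candidates
+ hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
+ candidates = [corpus[hit["corpus_id"]] for hit in hits]
+
+ # Stage 2: the cross-encoder rescores the survivors more precisely
+ scores = reranker.predict([[query, code] for code in candidates])
+ best_code, best_score = max(zip(candidates, scores), key=lambda x: x[1])
+ print(f"Best match ({best_score:.4f}):\n{best_code}")
+ ```
+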
+ ## Dataset
+
+ * **StaQC SCA (Stack Overflow Question-Code pairs)**
+ * Each pair consists of a natural language programming question and a corresponding Python code snippet
+ * Positive and negative pairs were used for contrastive fine-tuning (see the sketch below)
+ * The dataset contains 85,294 training examples
+
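+ A sketch of how such pairs can be expressed for training, assuming the `InputExample` format from `sentence-transformers`; the helper name and the random sampling of negatives are illustrative, not the original preprocessing:
+
+ ```python
+ import random
+ from sentence_transformers import InputExample
+
+ def build_pairs(qc_pairs):
+     """Turn (question, code) pairs into positive and negative InputExamples."""
+     examples = []
+     for i, (question, code) in enumerate(qc_pairs):
+         # Positive: the question with its own code snippet
+         examples.append(InputExample(texts=[question, code], label=1.0))
+         # Negative: the question with a code snippet from another example
+         j = random.choice([k for k in range(len(qc_pairs)) if k != i])
+         examples.append(InputExample(texts=[question, qc_pairs[j][1]], label=0.0))
+     return examples
+ ```
+
+ One negative per positive also matches the ratio between the 85,294 raw pairs here and the 170,588 training samples reported below.
+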
+ ## Training Details
+
+ * **Base Model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
+ * **Optimizer**: AdamW
+ * **Epochs**: 3
+ * **Batch size**: 8
+ * **Learning rate**: 2e-5
+ * **Loss**: Cosine Similarity Loss
+ * **Training samples**: 170,588 (including negative samples)
+ * **Warmup steps**: 10% of total training steps
+
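+ With these settings, the fine-tuning loop can be approximated as follows; a minimal sketch assuming the classic `CrossEncoder.fit` API and reusing the hypothetical `build_pairs` helper above (a custom loss such as the one listed would be supplied via `loss_fct`, omitted here for brevity):
+
+ ```python
+ import math
+ from torch.utils.data import DataLoader
+ from sentence_transformers import CrossEncoder
+
+ model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
+
+ train_examples = build_pairs(qc_pairs)  # qc_pairs: the 85,294 StaQC (question, code) pairs
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
+
+ total_steps = len(train_dataloader) * 3        # 3 epochs
+ warmup_steps = math.ceil(total_steps * 0.10)   # 10% of total training steps
+
+ model.fit(
+     train_dataloader=train_dataloader,
+     epochs=3,
+     warmup_steps=warmup_steps,
+     optimizer_params={"lr": 2e-5},  # AdamW is the default optimizer class
+     output_path="code-reranker-miniLM-staqc",
+ )
+ ```
+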
+ ## Model Performance Comparison
+
+ | Model | MRR | Top-1 Accuracy |
+ |-------|-----|----------------|
+ | **code-reranker-miniLM-staqc** | **0.938** | **0.910** |
+ | cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.895 | 0.844 |
+ | cross-encoder/ms-marco-TinyBERT-L-2-v2 | 0.823 | 0.756 |
+
+ ## Limitations
+
+ * Trained specifically on Python code snippets; may not generalize well to other programming languages
+ * Relatively small model; performance may lag behind larger rerankers on complex queries
+ * Fine-tuned on Stack Overflow-style questions; may not generalize to code from other domains
+ * Limited to text-based code snippets; does not model complex code structures or dependencies
+
+ ## Citation
+
+ If you use this model in your work, please cite it as:
+
+ ```bibtex
+ @misc{code-reranker-miniLM-staqc,
+   title={Code Reranker using MiniLM and StaQC for Python Code Search},
+   author={Naman Agnihotri},
+   year={2025},
+   howpublished={\url{https://huggingface.co/NamanAgnih0tri/code-reranker-miniLM-staqc}}
+ }
+ ```
+
+ ## Author
+
+ * **Name**: Naman Agnihotri
+ * **Contact**: [LinkedIn](https://www.linkedin.com/in/namanagnihotri)
+ * **GitHub**: [NamanAgnih0tri](https://github.com/NamanAgnih0tri)
+
+ ## License
+
+ This model is licensed under the Apache 2.0 License.
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff