# StarCoder2 15B - SecureCode Edition

*The most powerful multi-language security model in the SecureCode collection, covering 600+ programming languages*
## What is This?
This is StarCoder2 15B Instruct fine-tuned on the SecureCode v2.0 dataset. The base model is among the most comprehensive multi-language code models available, trained on roughly 4 trillion tokens spanning 600+ programming languages, and is now enhanced with production-grade security knowledge.
StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:
- **Unprecedented language coverage** - security awareness across 600+ languages
- **State-of-the-art code generation** - best open-source model performance
- **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis
- **Production-ready quality** - trained on The Stack v2 with rigorous data curation
The Result: The most powerful and versatile security-aware code model in the SecureCode collection.
Why StarCoder2 15B? This model offers:
- **600+ languages** - from mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
- **SOTA performance** - best open-source code model
- **Complex reasoning** - 15B parameters for sophisticated security analysis
- **Research-grade** - built on The Stack v2 with extensive curation
- **Community-driven** - BigCode initiative backed by ServiceNow + Hugging Face
## The Problem This Solves
AI coding assistants produce vulnerable code in 45% of security-relevant scenarios (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.
Multi-language security challenges:
- Solidity smart contracts: $3+ billion stolen in Web3 exploits (2021-2024)
- Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
- Legacy systems (COBOL/Fortran): Undocumented security flaws
- Emerging languages (Rust/Zig): New security patterns needed
StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.
## Key Features
### Unmatched Language Coverage
StarCoder2 15B was trained on 600+ programming languages:
- Mainstream: Python, JavaScript, Java, C++, Go, Rust
- Web3: Solidity, Vyper, Cairo, Move
- Mobile: Kotlin, Swift, Dart
- Systems: C, Rust, Zig, Assembly
- Functional: Haskell, OCaml, Scala, Elixir
- Legacy: COBOL, Fortran, Pascal
- And 580+ more...
Now enhanced with 1,209 security-focused examples covering OWASP Top 10:2025.
### State-of-the-Art Performance
StarCoder2 15B delivers cutting-edge results:
- HumanEval: 72.6% pass@1 (best open-source at release)
- MultiPL-E: 52.3% average across languages
- Leading performance on long-context code tasks
- Trained on The Stack v2 (4T tokens)
### Comprehensive Security Training
Trained on real-world security incidents:
- 224 examples of Broken Access Control
- 199 examples of Authentication Failures
- 125 examples of Injection attacks
- 115 examples of Cryptographic Failures
- Complete OWASP Top 10:2025 coverage
### Advanced Security Analysis
Every response includes:
- Multi-language vulnerability patterns
- Secure implementations with language-specific best practices
- Attack demonstrations with realistic exploits
- Cross-language security guidance - patterns that apply across languages
## Training Details
| Parameter | Value |
|---|---|
| Base Model | bigcode/starcoder2-15b-instruct-v0.1 |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Training Dataset | SecureCode v2.0 |
| Dataset Size | 841 training examples |
| Training Epochs | 3 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| Learning Rate | 2e-4 |
| Quantization | 4-bit (bitsandbytes) |
| Trainable Parameters | ~78M (0.52% of 15B total) |
| Total Parameters | 15B |
| Context Window | 16K tokens |
| GPU Used | NVIDIA A100 40GB |
| Training Time | ~125 minutes (estimated) |
### Training Methodology
LoRA fine-tuning preserves StarCoder2's exceptional multi-language capabilities:
- Trains only 0.52% of parameters
- Maintains SOTA code generation quality
- Adds cross-language security understanding
- Efficient deployment for 15B model
4-bit quantization enables deployment on 24GB+ GPUs while maintaining quality.
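For reference, the adapter setup above roughly corresponds to the following configuration. This is a minimal sketch under stated assumptions, not the actual training script: the `target_modules`, dropout value, and data pipeline are assumptions, and the optimization loop (3 epochs at a 2e-4 learning rate, for example with an SFT trainer) is omitted.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit, as described in the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
base = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
base = prepare_model_for_kbit_training(base)

# LoRA hyperparameters from the training table (r=16, alpha=32).
# target_modules and dropout are assumptions, not documented values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# Reports the trainable fraction (roughly half a percent, in line with the table;
# the exact figure depends on which modules are targeted)
model.print_trainable_parameters()
```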
## Usage
### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

# Generate a secure Solidity smart contract
prompt = """### User:
Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.
### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
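The adapter expects the plain `### User:` / `### Assistant:` prompt format used above. A small helper can keep that format consistent across requests; `build_prompt` is a hypothetical convenience function, not part of the released code:

```python
def build_prompt(user_message: str) -> str:
    """Wrap a request in the ### User / ### Assistant format shown above."""
    return f"### User:\n{user_message}\n### Assistant:\n"

prompt = build_prompt(
    "Review this Python login handler for SQL injection and suggest a parameterized fix."
)
```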
### Multi-Language Security Analysis

````python
# Analyze Rust code for memory safety issues
rust_prompt = """### User:
Review this Rust web server code for security vulnerabilities:

```rust
use actix_web::{web, App, HttpResponse, HttpServer};

async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
    let result = execute_query(&query).await;
    HttpResponse::Ok().json(result)
}
```
### Assistant:
"""

# Analyze Kotlin Android code
kotlin_prompt = """### User:
Identify authentication vulnerabilities in this Kotlin Android app:

```kotlin
class LoginActivity : AppCompatActivity() {
    fun login(username: String, password: String) {
        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
        prefs.edit().putString("token", generateToken(username, password)).apply()
    }
}
```
### Assistant:
"""
````
### Production Deployment (4-bit Quantization)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on 24GB+ GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
```
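For long security reviews from the quantized 15B model, streaming tokens as they are generated is often more practical than waiting for the full completion. A minimal sketch using `transformers`' `TextStreamer`; the prompt text is just an example:

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "### User:\nAudit this Solidity staking contract for reentrancy and access control issues.\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Tokens are printed to stdout as they are produced
model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, streamer=streamer)
```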
## Use Cases
### 1. Web3/Blockchain Security

Analyze smart contracts across multiple chains:

```
Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
```
### 2. Multi-Language Codebase Security

Review polyglot applications:

```
Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
```
### 3. Mobile App Security

Secure iOS and Android apps:

```
Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
```
### 4. Legacy System Modernization

Secure legacy code:

```
Identify security flaws in this COBOL mainframe application and provide modernization guidance
```
### 5. Emerging Language Security

Security for new languages:

```
Write a secure Zig HTTP server with memory safety and input validation
```
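Each use case is just a prompt. Assuming the `build_prompt` and `analyze` helpers sketched earlier, they can be run in a simple loop; the prompt keys and file names below are illustrative only:

```python
use_cases = {
    "web3_audit": "Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues",
    "mobile_review": "Review this Swift iOS app for authentication bypass and data exposure vulnerabilities",
    "legacy_modernization": "Identify security flaws in this COBOL mainframe application and provide modernization guidance",
}

for name, request in use_cases.items():
    report = analyze(build_prompt(request))
    # Write each security review to its own Markdown file
    with open(f"{name}.md", "w") as f:
        f.write(report)
```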
## Limitations
### What This Model Does Well

- Multi-language security analysis (600+ languages)
- State-of-the-art code generation
- Complex security reasoning
- Cross-language pattern recognition

### What This Model Doesn't Do

- Not a smart contract auditing firm
- Cannot guarantee bug-free code
- Not legal/compliance advice
- Not a replacement for security experts
### Resource Requirements
- Larger model - Requires 24GB+ GPU for optimal performance
- Higher memory - 40GB+ RAM recommended
- Longer inference - Slower than smaller models
## Performance Benchmarks
### Hardware Requirements
**Minimum:**
- 40GB RAM
- 24GB GPU VRAM (with 4-bit quantization)

**Recommended:**
- 64GB RAM
- 40GB+ GPU (A100, RTX 6000 Ada)

**Inference Speed** (on A100 40GB; see the timing sketch below to reproduce these numbers):
- ~60 tokens/second (4-bit quantization)
- ~85 tokens/second (bfloat16)
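Throughput varies with quantization, prompt length, and generation settings. A rough wall-clock measurement (a simple sketch, not a rigorous benchmark) can be taken like this:

```python
import time

prompt = "### User:\nWrite a secure password hashing utility in Python.\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
```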
### Code Generation (Base Model Scores)
| Benchmark | Score | Rank |
|---|---|---|
| HumanEval | 72.6% | Best open-source |
| MultiPL-E | 52.3% | Top 3 overall |
| Long context | SOTA | #1 |
## Dataset Information
Trained on SecureCode v2.0:
- 1,209 examples with real CVE grounding
- 100% incident validation
- OWASP Top 10:2025 complete coverage
- Multi-language security patterns
## License
Model: Apache 2.0 | Dataset: CC BY-NC-SA 4.0
The base StarCoder2 model is released under the BigCode OpenRAIL-M license.
## Citation
```bibtex
@misc{thornton2025securecode-starcoder2,
  title={StarCoder2 15B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```
## Acknowledgments
- BigCode Project (ServiceNow + Hugging Face) for StarCoder2
- The Stack v2 contributors for dataset curation
- OWASP Foundation for vulnerability taxonomy
- Web3 security community for blockchain vulnerability research
## Related Models
- llama-3.2-3b-securecode - Most accessible (3B)
- qwen-coder-7b-securecode - Best code model (7B)
- deepseek-coder-6.7b-securecode - Security-optimized (6.7B)
- codellama-13b-securecode - Enterprise trusted (13B)
*Built with ❤️ for secure multi-language software development*