StarCoder2 15B - SecureCode Edition


The most powerful multi-language security model - 600+ programming languages

🤗 Model Card | 📊 Dataset | 💻 perfecXion.ai


🎯 What is This?

This is StarCoder2 15B Instruct fine-tuned on the SecureCode v2.0 dataset. The base model was trained on ~4 trillion tokens spanning 600+ programming languages, making it one of the most comprehensive multi-language code models available; this edition adds production-grade security knowledge on top.

StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:

✅ Unprecedented language coverage - Security awareness across 600+ languages
✅ State-of-the-art code generation - Best open-source model performance
✅ Complex security reasoning - 15B parameters for sophisticated vulnerability analysis
✅ Production-ready quality - Trained on The Stack v2 with rigorous data curation

The Result: The most powerful and versatile security-aware code model in the SecureCode collection.

Why StarCoder2 15B? This model offers:

  • 🌍 600+ languages - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
  • 🏆 SOTA performance - Best open-source code model
  • 🧠 Complex reasoning - 15B parameters for sophisticated security analysis
  • 🔬 Research-grade - Built on The Stack v2 with extensive curation
  • 🌟 Community-driven - BigCode initiative backed by ServiceNow + Hugging Face

🚨 The Problem This Solves

AI coding assistants produce vulnerable code in 45% of security-relevant scenarios (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.

Multi-language security challenges:

  • Solidity smart contracts: $3+ billion stolen in Web3 exploits (2021-2024)
  • Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
  • Legacy systems (COBOL/Fortran): Undocumented security flaws
  • Emerging languages (Rust/Zig): New security patterns needed

StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.


💡 Key Features

🌍 Unmatched Language Coverage

StarCoder2 15B was trained on 600+ programming languages:

  • Mainstream: Python, JavaScript, Java, C++, Go, Rust
  • Web3: Solidity, Vyper, Cairo, Move
  • Mobile: Kotlin, Swift, Dart
  • Systems: C, Rust, Zig, Assembly
  • Functional: Haskell, OCaml, Scala, Elixir
  • Legacy: COBOL, Fortran, Pascal
  • And 580+ more...

Now enhanced with 1,209 security-focused examples covering OWASP Top 10:2025.

πŸ† State-of-the-Art Performance

StarCoder2 15B delivers cutting-edge results:

  • HumanEval: 72.6% pass@1 (best open-source at release)
  • MultiPL-E: 52.3% average across languages
  • Leading performance on long-context code tasks
  • Trained on The Stack v2 (4T tokens)

πŸ” Comprehensive Security Training

Trained on real-world security incidents:

  • 224 examples of Broken Access Control
  • 199 examples of Authentication Failures
  • 125 examples of Injection attacks
  • 115 examples of Cryptographic Failures
  • Complete OWASP Top 10:2025 coverage

📋 Advanced Security Analysis

Every response includes:

  1. Multi-language vulnerability patterns
  2. Secure implementations with language-specific best practices
  3. Attack demonstrations with realistic exploits
  4. Cross-language security guidance - patterns that apply across languages

📊 Training Details

| Parameter | Value |
|---|---|
| Base Model | bigcode/starcoder2-15b-instruct-v0.1 |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Training Dataset | SecureCode v2.0 |
| Dataset Size | 841 training examples |
| Training Epochs | 3 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| Learning Rate | 2e-4 |
| Quantization | 4-bit (bitsandbytes) |
| Trainable Parameters | ~78M (0.52% of 15B total) |
| Total Parameters | 15B |
| Context Window | 16K tokens |
| GPU Used | NVIDIA A100 40GB |
| Training Time | ~125 minutes (estimated) |

Training Methodology

LoRA fine-tuning preserves StarCoder2's exceptional multi-language capabilities:

  • Trains only 0.52% of parameters
  • Maintains SOTA code generation quality
  • Adds cross-language security understanding
  • Efficient deployment for 15B model

4-bit quantization enables deployment on 24GB+ GPUs while maintaining quality.
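
For reference, the hyperparameters in the table above correspond roughly to a PEFT setup like the following. This is an illustrative sketch, not the exact training script; the dropout value and target modules are assumptions that are not stated in the table.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model during fine-tuning (per the table)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the reported rank and alpha
lora_config = LoraConfig(
    r=16,                      # LoRA Rank (r) from the table
    lora_alpha=32,             # LoRA Alpha from the table
    lora_dropout=0.05,         # assumption: not reported in the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
```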


🚀 Usage

Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

# Generate a secure Solidity smart contract
prompt = """### User:
Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Multi-Language Security Analysis

```python
# Analyze Rust code for memory safety issues
rust_prompt = """### User:
Review this Rust web server code for security vulnerabilities:

use actix_web::{web, App, HttpResponse, HttpServer};

async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
    let result = execute_query(&query).await;
    HttpResponse::Ok().json(result)
}

### Assistant:
"""

# Analyze Kotlin Android code
kotlin_prompt = """### User:
Identify authentication vulnerabilities in this Kotlin Android app:

class LoginActivity : AppCompatActivity() {
    fun login(username: String, password: String) {
        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
        prefs.edit().putString("token", generateToken(username, password)).apply()
    }
}

### Assistant:
"""
```


### Production Deployment (4-bit Quantization)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on 24GB+ GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
```
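
As a quick sanity check after loading, the quantized model's approximate memory footprint can be printed; with 4-bit weights it should fit within a 24GB-class GPU, though exact numbers vary by hardware and library versions:

```python
# Rough check of the quantized model size in GB
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```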

🎯 Use Cases

1. Web3/Blockchain Security

Analyze smart contracts across multiple chains:

Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues

2. Multi-Language Codebase Security

Review polyglot applications:

Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities

3. Mobile App Security

Secure iOS and Android apps:

Review this Swift iOS app for authentication bypass and data exposure vulnerabilities

4. Legacy System Modernization

Secure legacy code:

Identify security flaws in this COBOL mainframe application and provide modernization guidance

5. Emerging Language Security

Security for new languages:

Write a secure Zig HTTP server with memory safety and input validation
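
Each of these use cases follows the same pattern: wrap the code under review in the ### User: / ### Assistant: template and generate. A minimal helper sketch, assuming the model and tokenizer are loaded as in the Usage section (the function name, defaults, and file path are illustrative, not part of the model's API):

```python
def security_review(code: str, language: str, focus: str = "OWASP Top 10") -> str:
    """Ask the SecureCode model to review a snippet in any supported language."""
    prompt = (
        "### User:\n"
        f"Review this {language} code for {focus} vulnerabilities and suggest secure fixes:\n\n"
        f"{code}\n\n"
        "### Assistant:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example: use case 3 (mobile app security)
swift_snippet = open("LoginViewController.swift").read()  # hypothetical file path
print(security_review(swift_snippet, "Swift", focus="authentication bypass and data exposure"))
```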

⚠️ Limitations

What This Model Does Well

✅ Multi-language security analysis (600+ languages)
✅ State-of-the-art code generation
✅ Complex security reasoning
✅ Cross-language pattern recognition

What This Model Doesn't Do

❌ Not a substitute for a professional smart contract audit
❌ Cannot guarantee bug-free code
❌ Not legal/compliance advice
❌ Not a replacement for security experts

Resource Requirements

  • Larger model - Requires 24GB+ GPU for optimal performance
  • Higher memory - 40GB+ RAM recommended
  • Longer inference - Slower than smaller models

📈 Performance Benchmarks

Hardware Requirements

Minimum:

  • 40GB RAM
  • 24GB GPU VRAM (with 4-bit quantization)

Recommended:

  • 64GB RAM
  • 40GB+ GPU (A100, RTX 6000 Ada)

Inference Speed (on A100 40GB; see the timing sketch below):

  • ~60 tokens/second (4-bit quantization)
  • ~85 tokens/second (bfloat16)
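
The throughput figures above can be reproduced approximately with a simple timing loop. A rough sketch, assuming the model and tokenizer are already loaded; actual numbers depend on hardware, quantization, and generation settings:

```python
import time

prompt = "### User:\nWrite a secure password hashing function in Python.\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/second")
```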

Code Generation (Base Model Scores)

| Benchmark | Score | Rank |
|---|---|---|
| HumanEval | 72.6% | Best open-source |
| MultiPL-E | 52.3% | Top 3 overall |
| Long context | SOTA | #1 |

🔬 Dataset Information

Trained on SecureCode v2.0:

  • 1,209 examples with real CVE grounding
  • 100% incident validation
  • OWASP Top 10:2025 complete coverage
  • Multi-language security patterns

📄 License

Model: Apache 2.0 | Dataset: CC BY-NC-SA 4.0

The base StarCoder2 model is released under the BigCode OpenRAIL-M license.


📚 Citation

```bibtex
@misc{thornton2025securecode-starcoder2,
  title={StarCoder2 15B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

πŸ™ Acknowledgments

  • BigCode Project (ServiceNow + Hugging Face) for StarCoder2
  • The Stack v2 contributors for dataset curation
  • OWASP Foundation for vulnerability taxonomy
  • Web3 security community for blockchain vulnerability research

🔗 Related Models

View Collection


Built with ❤️ for secure multi-language software development

perfecXion.ai | Contact
