---
title: SQL Assistant
emoji: 🏃
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
models:
- manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant
---
# SQL Assistant 🚀
<div align="center">
**A specialized AI assistant for generating SQL queries from natural language questions**
[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow)](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
[![Model](https://img.shields.io/badge/Model-Qwen2.5--1.5B--SQL--Assistant-blue)](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
[![License](https://img.shields.io/badge/License-Open%20Source-green)](https://github.com/MANU-de/SQL-Assistant)
*Fine-tuned with QLoRA, a parameter-efficient fine-tuning method, for accurate, schema-aware SQL generation*
</div>
---
## 🎯 Overview
**SQL Assistant** is a fine-tuned language model specifically designed to convert natural language questions into syntactically correct SQL queries. Built on **Qwen2.5-1.5B-Instruct** and fine-tuned using **QLoRA** (Quantized LoRA) on the `b-mc2/sql-create-context` dataset, this model excels at generating clean, executable SQL queries while strictly adhering to provided database schemas.
### Key Features
- ✅ **Schema-Aware Generation**: Strictly adheres to provided CREATE TABLE statements, reducing hallucination
- ✅ **Clean SQL Output**: Produces executable SQL queries without explanations or markdown formatting
- ✅ **Parameter-Efficient**: Uses only ~1% additional parameters (16M LoRA adapters) over the base model
- ✅ **Memory Efficient**: 4-bit quantization enables deployment on consumer hardware
- ✅ **Fast Inference**: Optimized for real-time SQL generation
- ✅ **Production-Ready**: Suitable for integration into database tools and applications
---
## 🏗️ Architecture & Methodology
### Base Model
- **Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Parameters**: 1.5 billion
- **Architecture**: Transformer-based causal language model
- **Context Window**: 32k tokens
- **Specialization**: Instruction-tuned for structured outputs
### Fine-Tuning Approach
The model was fine-tuned using **QLoRA** (Quantized LoRA), a state-of-the-art parameter-efficient fine-tuning technique; the key settings are listed below and gathered into a configuration sketch after the training table:
#### Quantization Configuration
- **Method**: 4-bit NF4 (Normal Float 4) quantization
- **Memory Reduction**: ~75% reduction in VRAM usage
- **Compute Dtype**: float16 for efficient computation
#### LoRA Configuration
- **Rank (r)**: 16
- **LoRA Alpha**: 16
- **LoRA Dropout**: 0.05
- **Target Modules**: `["q_proj", "k_proj", "v_proj", "o_proj"]` (attention layers)
- **Trainable Parameters**: ~16M (1.1% of base model)
- **Adapter Size**: ~65MB
### Training Details
| Hyperparameter | Value |
|----------------|-------|
| **Dataset** | b-mc2/sql-create-context (1,000 samples) |
| **Training Samples** | 1,000 |
| **Epochs** | 1 |
| **Batch Size** | 4 per device |
| **Gradient Accumulation** | 2 steps (effective batch size: 8) |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 tokens |
| **Optimizer** | paged_adamw_32bit |
| **Mixed Precision** | FP16 |
| **Training Time** | ~30 minutes (NVIDIA T4 GPU) |
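For reference, the quantization, LoRA, and training settings above correspond roughly to the following configuration objects. This is a minimal sketch assembled from the tables, not the exact training script; names such as `output_dir` are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization with float16 compute (see "Quantization Configuration")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on the attention projections (see "LoRA Configuration")
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Hyperparameters from the table above; output_dir is a placeholder name
training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sql-assistant",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    fp16=True,
)
```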
### Dataset
- **Source**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **Total Size**: ~78,600 examples
- **Training Subset**: 1,000 samples (for rapid prototyping)
- **Coverage**: Simple SELECT, JOINs, aggregations, GROUP BY, subqueries, nested structures
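To reproduce a comparable 1,000-sample subset, something along these lines should work with the `datasets` library. The exact sampling used for training is not documented here, and the field names (`context`, `question`, `answer`) should be double-checked against the dataset card.

```python
from datasets import load_dataset

# Take 1,000 rows of the training split for rapid prototyping
ds = load_dataset("b-mc2/sql-create-context", split="train[:1000]")

def to_chat(example):
    # Mirror the inference-time prompt: schema context plus natural-language question
    return {
        "messages": [
            {"role": "system", "content": "You are a SQL expert."},
            {"role": "user", "content": f"{example['context']}\nQuestion: {example['question']}"},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

ds = ds.map(to_chat, remove_columns=ds.column_names)
print(ds[0]["messages"][1]["content"])
```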
---
## 💻 Usage
### Interactive Demo
Try the model directly in your browser using the [Hugging Face Space](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant).
### Python API
#### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
# Load base model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_id = "manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant"
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Prepare input
context = """CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
role VARCHAR(255),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
)"""
question = "Which employees report to the manager 'Julia König'?"
# Format using Qwen chat template
messages = [
    {"role": "system", "content": "You are a SQL expert."},
    {"role": "user", "content": f"{context}\nQuestion: {question}"}
]
# Tokenize and generate
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
# Decode output
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```
#### Expected Output
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
### Input Format
The model expects inputs in the following format:
1. **Context**: SQL `CREATE TABLE` statement(s) defining the database schema
2. **Question**: Natural language question about the database
**Example Input:**
```
Context: CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)
Question: List the names of students in grade 10 who study Math.
```
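A small helper can keep this format consistent when calling the model programmatically. This is an illustrative wrapper around the Basic Usage snippet above; `build_messages` is not part of the released code.

```python
def build_messages(context: str, question: str) -> list:
    """Assemble the chat messages expected by the model:
    CREATE TABLE context followed by the natural-language question."""
    return [
        {"role": "system", "content": "You are a SQL expert."},
        {"role": "user", "content": f"{context}\nQuestion: {question}"},
    ]

messages = build_messages(
    "CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)",
    "List the names of students in grade 10 who study Math.",
)
```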
---
## 📊 Performance & Evaluation
### Quantitative Metrics
| Metric | Base Model | Fine-Tuned Model | Improvement |
|--------|------------|------------------|-------------|
| **Schema Adherence** | ~75% | ~95% | ✅ +20 pts |
| **Format Consistency** | ~60% | ~98% | ✅ +38 pts |
| **Syntax Validity** | ~85% | ~90% | ✅ +5 pts |
### Qualitative Improvements
#### 1. Format Consistency
- **Base Model**: Often includes explanations like "Here's the SQL query:" or markdown formatting
- **Fine-Tuned Model**: Produces clean, executable SQL without additional text
#### 2. Schema Awareness
- **Base Model**: May reference columns not in the provided schema
- **Fine-Tuned Model**: Strictly adheres to schema, significantly reducing hallucination
#### 3. Syntax Precision
- **Base Model**: Good general syntax but occasional errors in complex queries
- **Fine-Tuned Model**: More accurate SQL syntax, especially in JOINs and aggregations
### Example Comparisons
#### Example 1: Simple Query
**Input:**
```
Context: CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)
Question: Who works in Sales and earns more than 50k?
```
**Base Model Output:**
```
Here's a SQL query to find employees in Sales earning more than 50k:
SELECT name
FROM employees
WHERE dept = 'Sales' AND salary > 50000
```
**Fine-Tuned Model Output:**
```sql
SELECT name FROM employees WHERE dept = 'Sales' AND salary > 50000
```
#### Example 2: Complex Self-Join
**Input:**
```
Context: CREATE TABLE employees (employee_id INT PRIMARY KEY, name VARCHAR(255) NOT NULL, role VARCHAR(255), manager_id INT, FOREIGN KEY (manager_id) REFERENCES employees(employee_id))
Question: Which employees report to the manager "Julia König"?
```
**Base Model Output:**
```
To find employees reporting to Julia KΓΆnig, you need to join the employees table with itself:
SELECT e1.name
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
**Fine-Tuned Model Output:**
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
---
## 🔧 Technical Specifications
### Model Efficiency
| Metric | Value |
|--------|-------|
| **Base Model Parameters** | 1.5B |
| **LoRA Adapter Parameters** | ~16M (1.1%) |
| **Total Trainable Parameters** | ~16M |
| **Model Storage (Adapter Only)** | ~65MB |
| **Memory Usage (Training)** | ~4GB VRAM |
| **Memory Usage (Inference)** | ~2GB VRAM |
| **Inference Speed** | ~50-100 tokens/second |
### Supported SQL Features
- ✅ Simple SELECT queries with WHERE clauses
- ✅ JOIN operations (INNER, LEFT, self-joins)
- ✅ Aggregation functions (COUNT, SUM, AVG, MAX, MIN)
- ✅ GROUP BY and HAVING clauses
- ✅ Subqueries and nested structures
- ✅ Various data types and constraints
- ✅ Foreign key relationships
### Limitations
- ⚠️ **Context Length**: Limited to 512 tokens; very large schemas may be truncated (see the token-count sketch after this list)
- ⚠️ **Training Data**: Currently trained on 1,000 samples (subset of full dataset)
- ⚠️ **SQL Dialects**: Optimized for standard SQL; may not support all database-specific extensions
- ⚠️ **Complex Queries**: May struggle with very deeply nested subqueries or complex multi-table JOINs
- ⚠️ **Validation**: Generated queries should be validated before execution on production databases
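Given the 512-token limit noted above, it can be worth checking how much of that budget a prompt consumes before generating. A rough sketch that reuses the `tokenizer`, `context`, and `question` from the Basic Usage example:

```python
MAX_TRAINING_TOKENS = 512  # max sequence length used during fine-tuning

def prompt_token_count(tokenizer, context: str, question: str) -> int:
    # Count the tokens of the schema + question the model will receive
    prompt = f"{context}\nQuestion: {question}"
    return len(tokenizer(prompt)["input_ids"])

n_tokens = prompt_token_count(tokenizer, context, question)
if n_tokens > MAX_TRAINING_TOKENS:
    print(f"Warning: prompt uses {n_tokens} tokens; quality may degrade beyond 512.")
```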
---
## 🚀 Deployment
### Requirements
```text
torch>=2.0.0
transformers>=4.40.0
peft>=0.6.0
bitsandbytes>=0.41.0
accelerate>=0.26.0
numpy<2.0.0
```
### Installation
```bash
pip install torch transformers peft bitsandbytes accelerate "numpy<2.0"
```
### Hardware Requirements
- **Minimum**: CPU (slow inference)
- **Recommended**: NVIDIA GPU with 4GB+ VRAM
- **Optimal**: NVIDIA GPU with 8GB+ VRAM (T4, V100, RTX 3060+)
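On machines without a CUDA GPU, 4-bit `bitsandbytes` quantization is generally unavailable, so a CPU fallback can simply skip the quantization config. A sketch building on the Basic Usage snippet (`bnb_config` and `base_model_id` as defined there):

```python
import torch
from transformers import AutoModelForCausalLM

if torch.cuda.is_available():
    # GPU path: 4-bit quantization as in the Basic Usage example
    model_kwargs = {"quantization_config": bnb_config, "device_map": "auto"}
else:
    # CPU fallback: full-precision weights, noticeably slower inference
    model_kwargs = {"torch_dtype": torch.float32}

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    **model_kwargs,
)
```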
---
## 📚 Research & Methodology
For detailed information about the training methodology, evaluation metrics, and technical insights, refer to the comprehensive [Technical Publication on ReadyTensor](https://app.readytensor.ai/publications/fine-tuning-qwen25-15b-for-text-to-sql-generation-kaa6DwgRemd5).
### Key Research Contributions
1. **Parameter-Efficient Fine-Tuning**: Demonstrates effective domain specialization using only 1% additional parameters
2. **Schema-Aware Generation**: Significant improvement in schema adherence through targeted fine-tuning
3. **Resource Efficiency**: Enables deployment on consumer hardware through quantization and LoRA
### Training Monitoring
- **Weights & Biases Dashboard**: [View Training Run](https://wandb.ai/manuelaschrittwieser99-neuralstack-ms/huggingface/runs/6zvb2ezt)
---
## 🔗 Resources
### Model & Dataset Links
- **Fine-Tuned Model**: [manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
- **Base Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Dataset**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **GitHub Repository**: [SQL-Assistant](https://github.com/MANU-de/SQL-Assistant)
### Key Papers & References
1. **LoRA**: Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*.
2. **QLoRA**: Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." *arXiv preprint arXiv:2305.14314*.
3. **Text-to-SQL**: Zhong, V., et al. (2017). "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning." *arXiv preprint arXiv:1709.00103*.
---
## ⚠️ Ethical Considerations & Safety
- **Query Validation**: Always validate generated SQL queries before execution on production databases
- **Security**: Be mindful of potential SQL injection risks; use parameterized queries in production (see the sketch after this list)
- **Testing**: Test queries in a safe environment before applying to real databases
- **Data Privacy**: Ensure compliance with data privacy regulations when processing database schemas
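To illustrate the SQL-injection point: when user-supplied values end up inside a query, pass them as parameters rather than interpolating them into the SQL string. A minimal `sqlite3` example, independent of this model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")

# Unsafe: f-string interpolation lets crafted input alter the query
# query = f"SELECT name FROM employees WHERE dept = '{user_dept}'"

# Safer: parameterized query; the driver handles quoting and escaping
user_dept = "Sales"
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ?", (user_dept,)
).fetchall()
print(rows)
```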
---
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
### Future Improvements
- [ ] Full dataset training (78k+ examples)
- [ ] Multi-epoch training with validation
- [ ] Support for multiple SQL dialects
- [ ] Extended context length (1024+ tokens)
- [ ] Comprehensive benchmark evaluation (Spider, WikiSQL, BIRD)
- [ ] Execution accuracy validation
- [ ] API wrapper for easy integration
---
## 📄 License
This project is released under the Apache 2.0 license. Please also refer to the licenses of the base model ([Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)) and dataset ([b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)) for their usage terms.
---
## 🙏 Acknowledgments
- **Qwen Team** for the excellent base model (Qwen2.5-1.5B-Instruct)
- **b-mc2** for the high-quality sql-create-context dataset
- **Hugging Face** for the Transformers, PEFT, and TRL libraries
- **BitsAndBytes** team for efficient quantization support
---
## 📧 Contact
For questions, issues, or contributions:
- **GitHub Issues**: [SQL-Assistant Repository](https://github.com/MANU-de/SQL-Assistant)
- **Hugging Face**: [@manuelaschrittwieser](https://huggingface.co/manuelaschrittwieser)
---
<div align="center">
**Made with ❀️ using QLoRA and Hugging Face Transformers**
[⭐ Star on GitHub](https://github.com/MANU-de/SQL-Assistant) | [🤗 Try on Hugging Face](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
</div>