|
|
---
title: SQL Assistant
emoji: π
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
models:
  - manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant
---
|
|
|
|
|
# SQL Assistant π |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**A specialized AI assistant for generating SQL queries from natural language questions** |
|
|
|
|
|
[Hugging Face Space](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)

[Fine-Tuned Model](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)

[GitHub Repository](https://github.com/MANU-de/SQL-Assistant)
|
|
|
|
|
*Fine-tuned with QLoRA (parameter-efficient fine-tuning) for accurate, schema-aware SQL generation*
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Overview
|
|
|
|
|
**SQL Assistant** is a fine-tuned language model specifically designed to convert natural language questions into syntactically correct SQL queries. Built on **Qwen2.5-1.5B-Instruct** and fine-tuned using **QLoRA** (Quantized LoRA) on the `b-mc2/sql-create-context` dataset, this model excels at generating clean, executable SQL queries while strictly adhering to provided database schemas. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- ✅ **Schema-Aware Generation**: Strictly adheres to provided CREATE TABLE statements, reducing hallucination
- ✅ **Clean SQL Output**: Produces executable SQL queries without explanations or markdown formatting
- ✅ **Parameter-Efficient**: Uses only ~1% additional parameters (16M LoRA adapters) over the base model
- ✅ **Memory Efficient**: 4-bit quantization enables deployment on consumer hardware
- ✅ **Fast Inference**: Optimized for real-time SQL generation
- ✅ **Production-Ready**: Suitable for integration into database tools and applications
|
|
|
|
|
--- |
|
|
|
|
|
## 🏗️ Architecture & Methodology
|
|
|
|
|
### Base Model |
|
|
|
|
|
- **Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
|
|
- **Parameters**: 1.5 billion |
|
|
- **Architecture**: Transformer-based causal language model |
|
|
- **Context Window**: 32k tokens |
|
|
- **Specialization**: Instruction-tuned for structured outputs |
|
|
|
|
|
### Fine-Tuning Approach |
|
|
|
|
|
The model was fine-tuned using **QLoRA** (Quantized LoRA), a state-of-the-art parameter-efficient fine-tuning technique: |
|
|
|
|
|
#### Quantization Configuration |
|
|
- **Method**: 4-bit NF4 (Normal Float 4) quantization |
|
|
- **Memory Reduction**: ~75% reduction in VRAM usage |
|
|
- **Compute Dtype**: float16 for efficient computation |
|
|
|
|
|
#### LoRA Configuration |
|
|
- **Rank (r)**: 16 |
|
|
- **LoRA Alpha**: 16 |
|
|
- **LoRA Dropout**: 0.05 |
|
|
- **Target Modules**: `["q_proj", "k_proj", "v_proj", "o_proj"]` (attention layers) |
|
|
- **Trainable Parameters**: ~16M (1.1% of base model) |
|
|
- **Adapter Size**: ~65MB |
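
The same configuration can be expressed as a short, hedged sketch with `transformers`, `bitsandbytes`, and `peft` (variable names are illustrative; this is not the verbatim training code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with float16 compute, as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters attached only to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 16M trainable parameters (~1.1%)
```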
|
|
|
|
|
### Training Details |
|
|
|
|
|
| Hyperparameter | Value |
|----------------|-------|
| **Dataset** | b-mc2/sql-create-context (1,000 samples) |
| **Training Samples** | 1,000 |
| **Epochs** | 1 |
| **Batch Size** | 4 per device |
| **Gradient Accumulation** | 2 steps (effective batch size: 8) |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 tokens |
| **Optimizer** | paged_adamw_32bit |
| **Mixed Precision** | FP16 |
| **Training Time** | ~30 minutes (NVIDIA T4 GPU) |
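
As a rough sketch (not the verbatim training script), these settings map onto Hugging Face `TrainingArguments` as follows; the resulting arguments are then passed, together with the LoRA configuration above and a 512-token maximum sequence length, to a supervised fine-tuning trainer such as TRL's `SFTTrainer`:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sql-assistant",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,            # effective batch size of 8
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    fp16=True,
    logging_steps=10,                         # assumption; not listed in the table above
    report_to="wandb",                        # training was monitored with Weights & Biases
)
```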
|
|
|
|
|
### Dataset |
|
|
|
|
|
- **Source**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) |
|
|
- **Total Size**: ~78,600 examples |
|
|
- **Training Subset**: 1,000 samples (for rapid prototyping) |
|
|
- **Coverage**: Simple SELECT, JOINs, aggregations, GROUP BY, subqueries, nested structures |
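
A minimal sketch of pulling the same subset from the Hub; the field names (`question`, `context`, `answer`) follow the dataset card:

```python
from datasets import load_dataset

# First 1,000 examples of b-mc2/sql-create-context
dataset = load_dataset("b-mc2/sql-create-context", split="train[:1000]")

example = dataset[0]
print(example["context"])   # CREATE TABLE ... schema statement(s)
print(example["question"])  # natural-language question
print(example["answer"])    # reference SQL query
```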
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Usage
|
|
|
|
|
### Interactive Demo |
|
|
|
|
|
Try the model directly in your browser using the [Hugging Face Space](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant). |
|
|
|
|
|
### Python API |
|
|
|
|
|
#### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_id = "manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant"

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# Prepare input
context = """CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    role VARCHAR(255),
    manager_id INT,
    FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
)"""

question = "Which employees report to the manager 'Julia König'?"

# Format using Qwen chat template
messages = [
    {"role": "system", "content": "You are a SQL expert."},
    {"role": "user", "content": f"{context}\nQuestion: {question}"}
]

# Tokenize and generate
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```
|
|
|
|
|
#### Expected Output |
|
|
|
|
|
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
|
|
|
|
|
### Input Format |
|
|
|
|
|
The model expects inputs in the following format: |
|
|
|
|
|
1. **Context**: SQL `CREATE TABLE` statement(s) defining the database schema |
|
|
2. **Question**: Natural language question about the database |
|
|
|
|
|
**Example Input:** |
|
|
```
Context: CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)
Question: List the names of students in grade 10 who study Math.
```
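
For programmatic use, the same format can be assembled with a small helper that mirrors the chat-template call in the usage example above (illustrative; not part of the released package):

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Assemble chat messages in the Context + Question format the model expects."""
    return [
        {"role": "system", "content": "You are a SQL expert."},
        {"role": "user", "content": f"{context}\nQuestion: {question}"},
    ]

messages = build_messages(
    "CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)",
    "List the names of students in grade 10 who study Math.",
)
```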
|
|
|
|
|
--- |
|
|
|
|
|
## Performance & Evaluation
|
|
|
|
|
### Quantitative Metrics |
|
|
|
|
|
| Metric | Base Model | Fine-Tuned Model | Improvement |
|--------|------------|------------------|-------------|
| **Schema Adherence** | ~75% | ~95% | ✅ +20% |
| **Format Consistency** | ~60% | ~98% | ✅ +38% |
| **Syntax Validity** | ~85% | ~90% | ✅ +5% |
|
|
|
|
|
### Qualitative Improvements |
|
|
|
|
|
#### 1. Format Consistency |
|
|
- **Base Model**: Often includes explanations like "Here's the SQL query:" or markdown formatting |
|
|
- **Fine-Tuned Model**: Produces clean, executable SQL without additional text |
|
|
|
|
|
#### 2. Schema Awareness |
|
|
- **Base Model**: May reference columns not in the provided schema |
|
|
- **Fine-Tuned Model**: Strictly adheres to schema, significantly reducing hallucination |
|
|
|
|
|
#### 3. Syntax Precision |
|
|
- **Base Model**: Good general syntax but occasional errors in complex queries |
|
|
- **Fine-Tuned Model**: More accurate SQL syntax, especially in JOINs and aggregations |
|
|
|
|
|
### Example Comparisons |
|
|
|
|
|
#### Example 1: Simple Query |
|
|
|
|
|
**Input:** |
|
|
```
Context: CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)
Question: Who works in Sales and earns more than 50k?
```
|
|
|
|
|
**Base Model Output:** |
|
|
```
Here's a SQL query to find employees in Sales earning more than 50k:

SELECT name
FROM employees
WHERE dept = 'Sales' AND salary > 50000
```
|
|
|
|
|
**Fine-Tuned Model Output:** |
|
|
```sql
SELECT name FROM employees WHERE dept = 'Sales' AND salary > 50000
```
|
|
|
|
|
#### Example 2: Complex Self-Join |
|
|
|
|
|
**Input:** |
|
|
```
Context: CREATE TABLE employees (employee_id INT PRIMARY KEY, name VARCHAR(255) NOT NULL, role VARCHAR(255), manager_id INT, FOREIGN KEY (manager_id) REFERENCES employees(employee_id))
Question: Which employees report to the manager "Julia König"?
```
|
|
|
|
|
**Base Model Output:** |
|
|
```
To find employees reporting to Julia König, you need to join the employees table with itself:

SELECT e1.name
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
|
|
|
|
|
**Fine-Tuned Model Output:** |
|
|
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Technical Specifications
|
|
|
|
|
### Model Efficiency |
|
|
|
|
|
| Metric | Value |
|--------|-------|
| **Base Model Parameters** | 1.5B |
| **LoRA Adapter Parameters** | ~16M (1.1%) |
| **Total Trainable Parameters** | ~16M |
| **Model Storage (Adapter Only)** | ~65MB |
| **Memory Usage (Training)** | ~4GB VRAM |
| **Memory Usage (Inference)** | ~2GB VRAM |
| **Inference Speed** | ~50-100 tokens/second |
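
Because only the ~65MB adapter is stored separately, one deployment option is to merge the LoRA weights back into the base model for adapter-free serving. A hedged sketch (the base model is loaded in float16 here, since merging into 4-bit quantized weights is not cleanly supported):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("sql-assistant-merged")  # illustrative local path
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").save_pretrained("sql-assistant-merged")
```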
|
|
|
|
|
### Supported SQL Features |
|
|
|
|
|
- ✅ Simple SELECT queries with WHERE clauses
- ✅ JOIN operations (INNER, LEFT, self-joins)
- ✅ Aggregation functions (COUNT, SUM, AVG, MAX, MIN)
- ✅ GROUP BY and HAVING clauses
- ✅ Subqueries and nested structures
- ✅ Various data types and constraints
- ✅ Foreign key relationships
|
|
|
|
|
### Limitations |
|
|
|
|
|
- ⚠️ **Context Length**: Limited to 512 tokens (may truncate very large schemas)
- ⚠️ **Training Data**: Currently trained on 1,000 samples (subset of full dataset)
- ⚠️ **SQL Dialects**: Optimized for standard SQL; may not support all database-specific extensions
- ⚠️ **Complex Queries**: May struggle with very deeply nested subqueries or complex multi-table JOINs
- ⚠️ **Validation**: Generated queries should be validated before execution on production databases
|
|
|
|
|
--- |
|
|
|
|
|
## Deployment
|
|
|
|
|
### Requirements |
|
|
|
|
|
```text
torch>=2.0.0
transformers>=4.40.0
peft>=0.6.0
bitsandbytes>=0.41.0
accelerate>=0.26.0
numpy<2.0.0
```
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install torch transformers peft bitsandbytes accelerate "numpy<2.0"
```
|
|
|
|
|
### Hardware Requirements |
|
|
|
|
|
- **Minimum**: CPU (slow inference) |
|
|
- **Recommended**: NVIDIA GPU with 4GB+ VRAM |
|
|
- **Optimal**: NVIDIA GPU with 8GB+ VRAM (T4, V100, RTX 3060+) |
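
The hosted demo is a Gradio app (`app.py`); a minimal sketch of how such a wrapper can be wired up is shown below. This is not the actual Space code: the `generate_sql` body is a stub where the model call from the Python API example would go.

```python
import gradio as gr

def generate_sql(context: str, question: str) -> str:
    # Stub: in a real app, call the fine-tuned model here
    # (see the Python API example above) and return the generated SQL.
    return "SELECT ..."

demo = gr.Interface(
    fn=generate_sql,
    inputs=[
        gr.Textbox(label="Schema (CREATE TABLE ...)", lines=6),
        gr.Textbox(label="Question"),
    ],
    outputs=gr.Textbox(label="Generated SQL"),
    title="SQL Assistant",
)

if __name__ == "__main__":
    demo.launch()
```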
|
|
|
|
|
--- |
|
|
|
|
|
## Research & Methodology
|
|
|
|
|
For detailed information about the training methodology, evaluation metrics, and technical insights, refer to the comprehensive [Technical Publication on ReadyTensor](https://app.readytensor.ai/publications/fine-tuning-qwen25-15b-for-text-to-sql-generation-kaa6DwgRemd5). |
|
|
|
|
|
### Key Research Contributions |
|
|
|
|
|
1. **Parameter-Efficient Fine-Tuning**: Demonstrates effective domain specialization using only 1% additional parameters |
|
|
2. **Schema-Aware Generation**: Significant improvement in schema adherence through targeted fine-tuning |
|
|
3. **Resource Efficiency**: Enables deployment on consumer hardware through quantization and LoRA |
|
|
|
|
|
### Training Monitoring |
|
|
|
|
|
- **Weights & Biases Dashboard**: [View Training Run](https://wandb.ai/manuelaschrittwieser99-neuralstack-ms/huggingface/runs/6zvb2ezt) |
|
|
|
|
|
--- |
|
|
|
|
|
## Resources
|
|
|
|
|
### Model & Dataset Links |
|
|
|
|
|
- **Fine-Tuned Model**: [manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant) |
|
|
- **Base Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
|
|
- **Dataset**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) |
|
|
- **GitHub Repository**: [SQL-Assistant](https://github.com/MANU-de/SQL-Assistant) |
|
|
|
|
|
### Key Papers & References |
|
|
|
|
|
1. **LoRA**: Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*. |
|
|
2. **QLoRA**: Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." *arXiv preprint arXiv:2305.14314*. |
|
|
3. **Text-to-SQL**: Zhong, V., et al. (2017). "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning." *arXiv preprint arXiv:1709.00103*. |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Ethical Considerations & Safety
|
|
|
|
|
- **Query Validation**: Always validate generated SQL queries before execution on production databases |
|
|
- **Security**: Be mindful of potential SQL injection risks; use parameterized queries in production |
|
|
- **Testing**: Test queries in a safe environment before applying to real databases |
|
|
- **Data Privacy**: Ensure compliance with data privacy regulations when processing database schemas |
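
One lightweight way to act on the validation and testing points above is to dry-run a generated query against an empty in-memory SQLite database before it ever touches real data (illustrative only; SQLite will not catch every dialect-specific issue):

```python
import sqlite3

def dry_run(schema: str, query: str) -> str | None:
    """Return an error message if the query cannot be planned against the schema, else None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)                   # create the tables, no data involved
        conn.execute(f"EXPLAIN QUERY PLAN {query}")  # parse and plan without executing on real data
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

schema = "CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)"
query = "SELECT name FROM employees WHERE dept = 'Sales' AND salary > 50000"
print(dry_run(schema, query))  # None -> the query at least parses against this schema
```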
|
|
|
|
|
--- |
|
|
|
|
|
## Contributing
|
|
|
|
|
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change. |
|
|
|
|
|
### Future Improvements |
|
|
|
|
|
- [ ] Full dataset training (78k+ examples) |
|
|
- [ ] Multi-epoch training with validation |
|
|
- [ ] Support for multiple SQL dialects |
|
|
- [ ] Extended context length (1024+ tokens) |
|
|
- [ ] Comprehensive benchmark evaluation (Spider, WikiSQL, BIRD) |
|
|
- [ ] Execution accuracy validation |
|
|
- [ ] API wrapper for easy integration |
|
|
|
|
|
--- |
|
|
|
|
|
## License
|
|
|
|
|
This project is open source. Please refer to the license of the base model ([Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)) and dataset ([b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)) for usage terms. |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
- **Qwen Team** for the excellent base model (Qwen2.5-1.5B-Instruct) |
|
|
- **b-mc2** for the high-quality sql-create-context dataset |
|
|
- **Hugging Face** for the Transformers, PEFT, and TRL libraries |
|
|
- **BitsAndBytes** team for efficient quantization support |
|
|
|
|
|
--- |
|
|
|
|
|
## 📧 Contact
|
|
|
|
|
For questions, issues, or contributions: |
|
|
|
|
|
- **GitHub Issues**: [SQL-Assistant Repository](https://github.com/MANU-de/SQL-Assistant) |
|
|
- **Hugging Face**: [@manuelaschrittwieser](https://huggingface.co/manuelaschrittwieser) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ using QLoRA and Hugging Face Transformers**
|
|
|
|
|
[⭐ Star on GitHub](https://github.com/MANU-de/SQL-Assistant) | [🤗 Try on Hugging Face](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
|