---
title: SQL Assistant
emoji: 🍃
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
models:
  - manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant
---

# SQL Assistant 🚀

<div align="center">

**A specialized AI assistant for generating SQL queries from natural language questions**

[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow)](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
[![Model](https://img.shields.io/badge/Model-Qwen2.5--1.5B--SQL--Assistant-blue)](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
[![License](https://img.shields.io/badge/License-Open%20Source-green)](https://github.com/MANU-de/SQL-Assistant)

*Fine-tuned using Parameter-Efficient Fine-Tuning (QLoRA) for accurate, schema-aware SQL generation*

</div>

---

## 🎯 Overview

**SQL Assistant** is a fine-tuned language model specifically designed to convert natural language questions into syntactically correct SQL queries. Built on **Qwen2.5-1.5B-Instruct** and fine-tuned using **QLoRA** (Quantized LoRA) on the `b-mc2/sql-create-context` dataset, this model excels at generating clean, executable SQL queries while strictly adhering to provided database schemas.

### Key Features

- ✅ **Schema-Aware Generation**: Strictly adheres to provided CREATE TABLE statements, reducing hallucination
- ✅ **Clean SQL Output**: Produces executable SQL queries without explanations or markdown formatting
- ✅ **Parameter-Efficient**: Adds only ~1% trainable parameters (~16M in LoRA adapters) on top of the frozen base model
- ✅ **Memory Efficient**: 4-bit quantization enables deployment on consumer hardware
- ✅ **Fast Inference**: Optimized for real-time SQL generation
- ✅ **Production-Ready**: Suitable for integration into database tools and applications

---

## 🏗️ Architecture & Methodology

### Base Model

- **Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Parameters**: 1.5 billion
- **Architecture**: Transformer-based causal language model
- **Context Window**: 32k tokens
- **Specialization**: Instruction-tuned for structured outputs

### Fine-Tuning Approach

The model was fine-tuned using **QLoRA** (Quantized LoRA), a state-of-the-art parameter-efficient fine-tuning technique:

#### Quantization Configuration
- **Method**: 4-bit NF4 (Normal Float 4) quantization
- **Memory Reduction**: ~75% reduction in VRAM usage
- **Compute Dtype**: float16 for efficient computation

#### LoRA Configuration
- **Rank (r)**: 16
- **LoRA Alpha**: 16
- **LoRA Dropout**: 0.05
- **Target Modules**: `["q_proj", "k_proj", "v_proj", "o_proj"]` (attention layers)
- **Trainable Parameters**: ~16M (1.1% of base model)
- **Adapter Size**: ~65MB
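
As a point of reference, these settings map onto the standard `BitsAndBytesConfig` (from `transformers`) and `LoraConfig` (from `peft`) roughly as follows. This is a minimal sketch rather than the verbatim training script:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on the attention projections only
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```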

### Training Details

| Hyperparameter | Value |
|----------------|-------|
| **Dataset** | b-mc2/sql-create-context (1,000 samples) |
| **Training Samples** | 1,000 |
| **Epochs** | 1 |
| **Batch Size** | 4 per device |
| **Gradient Accumulation** | 2 steps (effective batch size: 8) |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 tokens |
| **Optimizer** | paged_adamw_32bit |
| **Mixed Precision** | FP16 |
| **Training Time** | ~30 minutes (NVIDIA T4 GPU) |
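
For orientation, the same hyperparameters could be expressed with `transformers.TrainingArguments` roughly as below. This is a hedged sketch, `output_dir` is a placeholder, and the 512-token maximum sequence length would be applied by the SFT trainer when tokenizing:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sql-assistant",  # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size: 8
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    fp16=True,
)
```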

### Dataset

- **Source**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **Total Size**: ~78,600 examples
- **Training Subset**: 1,000 samples (for rapid prototyping)
- **Coverage**: Simple SELECT, JOINs, aggregations, GROUP BY, subqueries, nested structures
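
A minimal sketch of how such a subset could be loaded and mapped into the chat format used elsewhere in this README; taking the first 1,000 rows is an assumption, and the actual preprocessing may differ:

```python
from datasets import load_dataset

# First 1,000 examples of the ~78.6k-example dataset (selection strategy assumed)
dataset = load_dataset("b-mc2/sql-create-context", split="train[:1000]")

def to_messages(example):
    """Map a (context, question, answer) row to a chat-style training sample."""
    return {
        "messages": [
            {"role": "system", "content": "You are a SQL expert."},
            {"role": "user", "content": f"{example['context']}\nQuestion: {example['question']}"},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

dataset = dataset.map(to_messages)
```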

---

## 💻 Usage

### Interactive Demo

Try the model directly in your browser using the [Hugging Face Space](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant).

### Python API

#### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_id = "manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant"

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# Prepare input
context = """CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    role VARCHAR(255),
    manager_id INT,
    FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
)"""

question = "Which employees report to the manager 'Julia König'?"

# Format using Qwen chat template
messages = [
    {"role": "system", "content": "You are a SQL expert."},
    {"role": "user", "content": f"{context}\nQuestion: {question}"}
]

# Tokenize and generate
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,      # return a dict so it can be unpacked into generate()
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode output
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

#### Expected Output

```sql
SELECT e1.name 
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```

### Input Format

The model expects inputs in the following format:

1. **Context**: SQL `CREATE TABLE` statement(s) defining the database schema
2. **Question**: Natural language question about the database

**Example Input:**
```
Context: CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)
Question: List the names of students in grade 10 who study Math.
```
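
A small helper (the function name is illustrative, not part of the released code) that turns such a context/question pair into the chat messages used in the Python example above:

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Combine a CREATE TABLE context and a natural language question into chat messages."""
    return [
        {"role": "system", "content": "You are a SQL expert."},
        {"role": "user", "content": f"{context}\nQuestion: {question}"},
    ]

messages = build_messages(
    "CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)",
    "List the names of students in grade 10 who study Math.",
)
```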

---

## 📊 Performance & Evaluation

### Quantitative Metrics

| Metric | Base Model | Fine-Tuned Model | Improvement |
|--------|------------|------------------|-------------|
| **Schema Adherence** | ~75% | ~95% | ✅ +20% |
| **Format Consistency** | ~60% | ~98% | ✅ +38% |
| **Syntax Validity** | ~85% | ~90% | ✅ +5% |
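
As an illustration of what "format consistency" measures here (the actual evaluation script is not reproduced in this README), a simple programmatic check could look like this:

```python
def is_clean_sql(output: str) -> bool:
    """Format consistency: the output should be bare SQL, with no prose or markdown fences."""
    stripped = output.strip()
    starts_like_sql = stripped.upper().startswith(("SELECT", "INSERT", "UPDATE", "DELETE", "WITH"))
    has_extra_text = "`" in stripped or stripped.lower().startswith("here")
    return starts_like_sql and not has_extra_text
```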

### Qualitative Improvements

#### 1. Format Consistency
- **Base Model**: Often includes explanations like "Here's the SQL query:" or markdown formatting
- **Fine-Tuned Model**: Produces clean, executable SQL without additional text

#### 2. Schema Awareness
- **Base Model**: May reference columns not in the provided schema
- **Fine-Tuned Model**: Strictly adheres to schema, significantly reducing hallucination

#### 3. Syntax Precision
- **Base Model**: Good general syntax but occasional errors in complex queries
- **Fine-Tuned Model**: More accurate SQL syntax, especially in JOINs and aggregations

### Example Comparisons

#### Example 1: Simple Query

**Input:**
```
Context: CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)
Question: Who works in Sales and earns more than 50k?
```

**Base Model Output:**
```
Here's a SQL query to find employees in Sales earning more than 50k:

SELECT name 
FROM employees 
WHERE dept = 'Sales' AND salary > 50000
```

**Fine-Tuned Model Output:**
```sql
SELECT name FROM employees WHERE dept = 'Sales' AND salary > 50000
```

#### Example 2: Complex Self-Join

**Input:**
```
Context: CREATE TABLE employees (employee_id INT PRIMARY KEY, name VARCHAR(255) NOT NULL, role VARCHAR(255), manager_id INT, FOREIGN KEY (manager_id) REFERENCES employees(employee_id))
Question: Which employees report to the manager "Julia König"?
```

**Base Model Output:**
```
To find employees reporting to Julia König, you need to join the employees table with itself:

SELECT e1.name 
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```

**Fine-Tuned Model Output:**
```sql
SELECT e1.name 
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```

---

## 🔧 Technical Specifications

### Model Efficiency

| Metric | Value |
|--------|-------|
| **Base Model Parameters** | 1.5B |
| **LoRA Adapter Parameters** | ~16M (1.1%) |
| **Total Trainable Parameters** | ~16M |
| **Model Storage (Adapter Only)** | ~65MB |
| **Memory Usage (Training)** | ~4GB VRAM |
| **Memory Usage (Inference)** | ~2GB VRAM |
| **Inference Speed** | ~50-100 tokens/second |

### Supported SQL Features

- ✅ Simple SELECT queries with WHERE clauses
- ✅ JOIN operations (INNER, LEFT, self-joins)
- ✅ Aggregation functions (COUNT, SUM, AVG, MAX, MIN)
- ✅ GROUP BY and HAVING clauses
- ✅ Subqueries and nested structures
- ✅ Various data types and constraints
- ✅ Foreign key relationships

### Limitations

- ⚠️ **Context Length**: Limited to 512 tokens (may truncate very large schemas)
- ⚠️ **Training Data**: Currently trained on 1,000 samples (subset of the full dataset)
- ⚠️ **SQL Dialects**: Optimized for standard SQL; may not support all database-specific extensions
- ⚠️ **Complex Queries**: May struggle with very deeply nested subqueries or complex multi-table JOINs
- ⚠️ **Validation**: Generated queries should be validated before execution on production databases

---

## 🚀 Deployment

### Requirements

```text
torch>=2.0.0
transformers>=4.40.0
peft>=0.6.0
bitsandbytes>=0.41.0
accelerate>=0.26.0
numpy<2.0.0
```

### Installation

```bash
pip install torch transformers peft bitsandbytes accelerate "numpy<2.0"
```

### Hardware Requirements

- **Minimum**: CPU (slow inference)
- **Recommended**: NVIDIA GPU with 4GB+ VRAM
- **Optimal**: NVIDIA GPU with 8GB+ VRAM (T4, V100, RTX 3060+)

---

## 📚 Research & Methodology

For detailed information about the training methodology, evaluation metrics, and technical insights, refer to the comprehensive [Technical Publication on ReadyTensor](https://app.readytensor.ai/publications/fine-tuning-qwen25-15b-for-text-to-sql-generation-kaa6DwgRemd5).

### Key Research Contributions

1. **Parameter-Efficient Fine-Tuning**: Demonstrates effective domain specialization using only 1% additional parameters
2. **Schema-Aware Generation**: Significant improvement in schema adherence through targeted fine-tuning
3. **Resource Efficiency**: Enables deployment on consumer hardware through quantization and LoRA

### Training Monitoring

- **Weights & Biases Dashboard**: [View Training Run](https://wandb.ai/manuelaschrittwieser99-neuralstack-ms/huggingface/runs/6zvb2ezt)

---

## 🔗 Resources

### Model & Dataset Links

- **Fine-Tuned Model**: [manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
- **Base Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Dataset**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **GitHub Repository**: [SQL-Assistant](https://github.com/MANU-de/SQL-Assistant)

### Key Papers & References

1. **LoRA**: Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*.
2. **QLoRA**: Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." *arXiv preprint arXiv:2305.14314*.
3. **Text-to-SQL**: Zhong, V., et al. (2017). "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning." *arXiv preprint arXiv:1709.00103*.

---

## ⚠️ Ethical Considerations & Safety

- **Query Validation**: Always validate generated SQL queries before execution on production databases
- **Security**: Be mindful of potential SQL injection risks; use parameterized queries in production
- **Testing**: Test queries in a safe environment before applying to real databases
- **Data Privacy**: Ensure compliance with data privacy regulations when processing database schemas
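
As a concrete illustration of the parameterized-query advice above, a minimal sketch using Python's built-in `sqlite3` module (standing in for whatever database driver you use; the table and values are invented for the example):

```python
import sqlite3

# Self-contained in-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)")
conn.execute("INSERT INTO employees VALUES ('Ada', 'Sales', 60000)")

# Bind user-supplied values as parameters instead of interpolating them into the SQL string
user_supplied_dept = "Sales"
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ? AND salary > ?",
    (user_supplied_dept, 50000),
).fetchall()
print(rows)  # [('Ada',)]
```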

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

### Future Improvements

- [ ] Full dataset training (78k+ examples)
- [ ] Multi-epoch training with validation
- [ ] Support for multiple SQL dialects
- [ ] Extended context length (1024+ tokens)
- [ ] Comprehensive benchmark evaluation (Spider, WikiSQL, BIRD)
- [ ] Execution accuracy validation
- [ ] API wrapper for easy integration

---

## 📄 License

This project is open source. Please refer to the license of the base model ([Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)) and dataset ([b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)) for usage terms.

---

## 🙏 Acknowledgments

- **Qwen Team** for the excellent base model (Qwen2.5-1.5B-Instruct)
- **b-mc2** for the high-quality sql-create-context dataset
- **Hugging Face** for the Transformers, PEFT, and TRL libraries
- **BitsAndBytes** team for efficient quantization support

---

## 📧 Contact

For questions, issues, or contributions:

- **GitHub Issues**: [SQL-Assistant Repository](https://github.com/MANU-de/SQL-Assistant)
- **Hugging Face**: [@manuelaschrittwieser](https://huggingface.co/manuelaschrittwieser)

---

<div align="center">

**Made with ❤️ using QLoRA and Hugging Face Transformers**

[⭐ Star on GitHub](https://github.com/MANU-de/SQL-Assistant) | [🤗 Try on Hugging Face](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)

</div>