---
title: SQL Assistant
emoji: 🏃
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
models:
- manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant
---
# SQL Assistant 🚀
<div align="center">
**A specialized AI assistant for generating SQL queries from natural language questions**
[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow)](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
[![Model](https://img.shields.io/badge/Model-Qwen2.5--1.5B--SQL--Assistant-blue)](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
[![License](https://img.shields.io/badge/License-Open%20Source-green)](https://github.com/MANU-de/SQL-Assistant)
*Fine-tuned with QLoRA, a parameter-efficient fine-tuning method, for accurate, schema-aware SQL generation*
</div>
---
## 🎯 Overview
**SQL Assistant** is a fine-tuned language model specifically designed to convert natural language questions into syntactically correct SQL queries. Built on **Qwen2.5-1.5B-Instruct** and fine-tuned using **QLoRA** (Quantized LoRA) on the `b-mc2/sql-create-context` dataset, this model excels at generating clean, executable SQL queries while strictly adhering to provided database schemas.
### Key Features
- ✅ **Schema-Aware Generation**: Strictly adheres to provided CREATE TABLE statements, reducing hallucination
- ✅ **Clean SQL Output**: Produces executable SQL queries without explanations or markdown formatting
- ✅ **Parameter-Efficient**: Uses only ~1% additional parameters (16M LoRA adapters) over the base model
- ✅ **Memory Efficient**: 4-bit quantization enables deployment on consumer hardware
- ✅ **Fast Inference**: Optimized for real-time SQL generation
- ✅ **Production-Ready**: Suitable for integration into database tools and applications
---
## 🏗️ Architecture & Methodology
### Base Model
- **Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Parameters**: 1.5 billion
- **Architecture**: Transformer-based causal language model
- **Context Window**: 32k tokens
- **Specialization**: Instruction-tuned for structured outputs
### Fine-Tuning Approach
The model was fine-tuned using **QLoRA** (Quantized LoRA), a state-of-the-art parameter-efficient fine-tuning technique; the key settings are listed below and gathered into a configuration sketch after the training table:
#### Quantization Configuration
- **Method**: 4-bit NF4 (Normal Float 4) quantization
- **Memory Reduction**: ~75% reduction in VRAM usage
- **Compute Dtype**: float16 for efficient computation
#### LoRA Configuration
- **Rank (r)**: 16
- **LoRA Alpha**: 16
- **LoRA Dropout**: 0.05
- **Target Modules**: `["q_proj", "k_proj", "v_proj", "o_proj"]` (attention layers)
- **Trainable Parameters**: ~16M (1.1% of base model)
- **Adapter Size**: ~65MB
### Training Details
| Hyperparameter | Value |
|----------------|-------|
| **Dataset** | b-mc2/sql-create-context (1,000 samples) |
| **Training Samples** | 1,000 |
| **Epochs** | 1 |
| **Batch Size** | 4 per device |
| **Gradient Accumulation** | 2 steps (effective batch size: 8) |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 tokens |
| **Optimizer** | paged_adamw_32bit |
| **Mixed Precision** | FP16 |
| **Training Time** | ~30 minutes (NVIDIA T4 GPU) |
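For reference, the quantization, LoRA, and training settings above correspond roughly to the following configuration objects. This is a minimal sketch assembled from the tables, not the exact training script; names such as `output_dir` are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization with float16 compute (see "Quantization Configuration")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on the attention projections (see "LoRA Configuration")
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Hyperparameters from the table above; output_dir is a placeholder name
training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sql-assistant",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    fp16=True,
)
```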
### Dataset
- **Source**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **Total Size**: ~78,600 examples
- **Training Subset**: 1,000 samples (for rapid prototyping)
- **Coverage**: Simple SELECT, JOINs, aggregations, GROUP BY, subqueries, nested structures
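To reproduce a comparable 1,000-sample subset, something along these lines should work with the `datasets` library. The exact sampling used for training is not documented here, and the field names (`context`, `question`, `answer`) should be double-checked against the dataset card.

```python
from datasets import load_dataset

# Take 1,000 rows of the training split for rapid prototyping
ds = load_dataset("b-mc2/sql-create-context", split="train[:1000]")

def to_chat(example):
    # Mirror the inference-time prompt: schema context plus natural-language question
    return {
        "messages": [
            {"role": "system", "content": "You are a SQL expert."},
            {"role": "user", "content": f"{example['context']}\nQuestion: {example['question']}"},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

ds = ds.map(to_chat, remove_columns=ds.column_names)
print(ds[0]["messages"][1]["content"])
```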
---
## 💻 Usage
### Interactive Demo
Try the model directly in your browser using the [Hugging Face Space](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant).
### Python API
#### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
# Load base model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_id = "manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant"
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Prepare input
context = """CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
role VARCHAR(255),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
)"""
question = "Which employees report to the manager 'Julia König'?"
# Format using Qwen chat template
messages = [
    {"role": "system", "content": "You are a SQL expert."},
    {"role": "user", "content": f"{context}\nQuestion: {question}"}
]
# Tokenize and generate
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
# Decode output
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```
#### Expected Output
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
### Input Format
The model expects inputs in the following format:
1. **Context**: SQL `CREATE TABLE` statement(s) defining the database schema
2. **Question**: Natural language question about the database
**Example Input:**
```
Context: CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)
Question: List the names of students in grade 10 who study Math.
```
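A small helper can keep this format consistent when calling the model programmatically. This is an illustrative wrapper around the Basic Usage snippet above; `build_messages` is not part of the released code.

```python
def build_messages(context: str, question: str) -> list:
    """Assemble the chat messages expected by the model:
    CREATE TABLE context followed by the natural-language question."""
    return [
        {"role": "system", "content": "You are a SQL expert."},
        {"role": "user", "content": f"{context}\nQuestion: {question}"},
    ]

messages = build_messages(
    "CREATE TABLE students (id INT, name VARCHAR, grade INT, subject VARCHAR)",
    "List the names of students in grade 10 who study Math.",
)
```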
---
## 📊 Performance & Evaluation
### Quantitative Metrics
| Metric | Base Model | Fine-Tuned Model | Improvement |
|--------|------------|------------------|-------------|
| **Schema Adherence** | ~75% | ~95% | ✅ +20 pts |
| **Format Consistency** | ~60% | ~98% | ✅ +38 pts |
| **Syntax Validity** | ~85% | ~90% | ✅ +5 pts |
### Qualitative Improvements
#### 1. Format Consistency
- **Base Model**: Often includes explanations like "Here's the SQL query:" or markdown formatting
- **Fine-Tuned Model**: Produces clean, executable SQL without additional text
#### 2. Schema Awareness
- **Base Model**: May reference columns not in the provided schema
- **Fine-Tuned Model**: Strictly adheres to schema, significantly reducing hallucination
#### 3. Syntax Precision
- **Base Model**: Good general syntax but occasional errors in complex queries
- **Fine-Tuned Model**: More accurate SQL syntax, especially in JOINs and aggregations
### Example Comparisons
#### Example 1: Simple Query
**Input:**
```
Context: CREATE TABLE employees (name VARCHAR, dept VARCHAR, salary INT)
Question: Who works in Sales and earns more than 50k?
```
**Base Model Output:**
```
Here's a SQL query to find employees in Sales earning more than 50k:
SELECT name
FROM employees
WHERE dept = 'Sales' AND salary > 50000
```
**Fine-Tuned Model Output:**
```sql
SELECT name FROM employees WHERE dept = 'Sales' AND salary > 50000
```
#### Example 2: Complex Self-Join
**Input:**
```
Context: CREATE TABLE employees (employee_id INT PRIMARY KEY, name VARCHAR(255) NOT NULL, role VARCHAR(255), manager_id INT, FOREIGN KEY (manager_id) REFERENCES employees(employee_id))
Question: Which employees report to the manager "Julia König"?
```
**Base Model Output:**
```
To find employees reporting to Julia KΓΆnig, you need to join the employees table with itself:
SELECT e1.name
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
**Fine-Tuned Model Output:**
```sql
SELECT e1.name
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
WHERE e2.name = 'Julia König'
```
---
## 🔧 Technical Specifications
### Model Efficiency
| Metric | Value |
|--------|-------|
| **Base Model Parameters** | 1.5B |
| **LoRA Adapter Parameters** | ~16M (1.1%) |
| **Total Trainable Parameters** | ~16M |
| **Model Storage (Adapter Only)** | ~65MB |
| **Memory Usage (Training)** | ~4GB VRAM |
| **Memory Usage (Inference)** | ~2GB VRAM |
| **Inference Speed** | ~50-100 tokens/second |
### Supported SQL Features
- ✅ Simple SELECT queries with WHERE clauses
- ✅ JOIN operations (INNER, LEFT, self-joins)
- ✅ Aggregation functions (COUNT, SUM, AVG, MAX, MIN)
- ✅ GROUP BY and HAVING clauses
- ✅ Subqueries and nested structures
- ✅ Various data types and constraints
- ✅ Foreign key relationships
### Limitations
- ⚠️ **Context Length**: Limited to 512 tokens; very large schemas may be truncated (see the token-count sketch after this list)
- ⚠️ **Training Data**: Currently trained on 1,000 samples (subset of full dataset)
- ⚠️ **SQL Dialects**: Optimized for standard SQL; may not support all database-specific extensions
- ⚠️ **Complex Queries**: May struggle with very deeply nested subqueries or complex multi-table JOINs
- ⚠️ **Validation**: Generated queries should be validated before execution on production databases
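Given the 512-token limit noted above, it can be worth checking how much of that budget a prompt consumes before generating. A rough sketch that reuses the `tokenizer`, `context`, and `question` from the Basic Usage example:

```python
MAX_TRAINING_TOKENS = 512  # max sequence length used during fine-tuning

def prompt_token_count(tokenizer, context: str, question: str) -> int:
    # Count the tokens of the schema + question the model will receive
    prompt = f"{context}\nQuestion: {question}"
    return len(tokenizer(prompt)["input_ids"])

n_tokens = prompt_token_count(tokenizer, context, question)
if n_tokens > MAX_TRAINING_TOKENS:
    print(f"Warning: prompt uses {n_tokens} tokens; quality may degrade beyond 512.")
```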
---
## 🚀 Deployment
### Requirements
```text
torch>=2.0.0
transformers>=4.40.0
peft>=0.6.0
bitsandbytes>=0.41.0
accelerate>=0.26.0
numpy<2.0.0
```
### Installation
```bash
pip install torch transformers peft bitsandbytes accelerate "numpy<2.0"
```
### Hardware Requirements
- **Minimum**: CPU (slow inference)
- **Recommended**: NVIDIA GPU with 4GB+ VRAM
- **Optimal**: NVIDIA GPU with 8GB+ VRAM (T4, V100, RTX 3060+)
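On machines without a CUDA GPU, 4-bit `bitsandbytes` quantization is generally unavailable, so a CPU fallback can simply skip the quantization config. A sketch building on the Basic Usage snippet (`bnb_config` and `base_model_id` as defined there):

```python
import torch
from transformers import AutoModelForCausalLM

if torch.cuda.is_available():
    # GPU path: 4-bit quantization as in the Basic Usage example
    model_kwargs = {"quantization_config": bnb_config, "device_map": "auto"}
else:
    # CPU fallback: full-precision weights, noticeably slower inference
    model_kwargs = {"torch_dtype": torch.float32}

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    **model_kwargs,
)
```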
---
## 📚 Research & Methodology
For detailed information about the training methodology, evaluation metrics, and technical insights, refer to the comprehensive [Technical Publication on ReadyTensor](https://app.readytensor.ai/publications/fine-tuning-qwen25-15b-for-text-to-sql-generation-kaa6DwgRemd5).
### Key Research Contributions
1. **Parameter-Efficient Fine-Tuning**: Demonstrates effective domain specialization using only 1% additional parameters
2. **Schema-Aware Generation**: Significant improvement in schema adherence through targeted fine-tuning
3. **Resource Efficiency**: Enables deployment on consumer hardware through quantization and LoRA
### Training Monitoring
- **Weights & Biases Dashboard**: [View Training Run](https://wandb.ai/manuelaschrittwieser99-neuralstack-ms/huggingface/runs/6zvb2ezt)
---
## 🔗 Resources
### Model & Dataset Links
- **Fine-Tuned Model**: [manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant](https://huggingface.co/manuelaschrittwieser/Qwen2.5-1.5B-SQL-Assistant)
- **Base Model**: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
- **Dataset**: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- **GitHub Repository**: [SQL-Assistant](https://github.com/MANU-de/SQL-Assistant)
### Key Papers & References
1. **LoRA**: Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*.
2. **QLoRA**: Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." *arXiv preprint arXiv:2305.14314*.
3. **Text-to-SQL**: Zhong, V., et al. (2017). "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning." *arXiv preprint arXiv:1709.00103*.
---
## ⚠️ Ethical Considerations & Safety
- **Query Validation**: Always validate generated SQL queries before execution on production databases
- **Security**: Be mindful of potential SQL injection risks; use parameterized queries in production (see the sketch after this list)
- **Testing**: Test queries in a safe environment before applying to real databases
- **Data Privacy**: Ensure compliance with data privacy regulations when processing database schemas
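To illustrate the SQL-injection point: when user-supplied values end up inside a query, pass them as parameters rather than interpolating them into the SQL string. A minimal `sqlite3` example, independent of this model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")

# Unsafe: f-string interpolation lets crafted input alter the query
# query = f"SELECT name FROM employees WHERE dept = '{user_dept}'"

# Safer: parameterized query; the driver handles quoting and escaping
user_dept = "Sales"
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ?", (user_dept,)
).fetchall()
print(rows)
```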
---
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
### Future Improvements
- [ ] Full dataset training (78k+ examples)
- [ ] Multi-epoch training with validation
- [ ] Support for multiple SQL dialects
- [ ] Extended context length (1024+ tokens)
- [ ] Comprehensive benchmark evaluation (Spider, WikiSQL, BIRD)
- [ ] Execution accuracy validation
- [ ] API wrapper for easy integration
---
## 📄 License
This project is released under the Apache 2.0 license. Please also refer to the licenses of the base model ([Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)) and dataset ([b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)) for their usage terms.
---
## 🙏 Acknowledgments
- **Qwen Team** for the excellent base model (Qwen2.5-1.5B-Instruct)
- **b-mc2** for the high-quality sql-create-context dataset
- **Hugging Face** for the Transformers, PEFT, and TRL libraries
- **BitsAndBytes** team for efficient quantization support
---
## 📧 Contact
For questions, issues, or contributions:
- **GitHub Issues**: [SQL-Assistant Repository](https://github.com/MANU-de/SQL-Assistant)
- **Hugging Face**: [@manuelaschrittwieser](https://huggingface.co/manuelaschrittwieser)
---
<div align="center">
**Made with ❀️ using QLoRA and Hugging Face Transformers**
[⭐ Star on GitHub](https://github.com/MANU-de/SQL-Assistant) | [🤗 Try on Hugging Face](https://huggingface.co/spaces/manuelaschrittwieser/SQL-Assistant)
</div>