| --- |
| language: |
| - en |
| license: apache-2.0 |
| library_name: transformers |
| tags: |
| - sql |
| - forensics |
| - text-to-sql |
| - llama |
| - fine-tuned |
| base_model: unsloth/Llama-3.2-3B-Instruct |
| datasets: |
| - pawlaszc/mobile-forensics-sql |
| metrics: |
| - accuracy |
| model-index: |
| - name: ForensicSQL-Llama-3.2-3B |
| results: |
| - task: |
| type: text-to-sql |
| name: Text-to-SQL Generation |
| dataset: |
| type: mobile-forensics |
| name: Mobile Forensics SQL Dataset |
| metrics: |
| - type: accuracy |
| value: 91.0 |
| name: Overall Accuracy |
| - type: accuracy |
| value: 95.1 |
| name: Easy Queries Accuracy |
| - type: accuracy |
| value: 87.5 |
| name: Medium Queries Accuracy |
| - type: accuracy |
| value: 88.9 |
| name: Hard Queries Accuracy |
| --- |
| |
| # ForensicSQL-Llama-3.2-3B |
|
|
| ## Model Description |
|
|
| **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized |
| for generating SQLite queries from natural language requests against mobile forensic databases. |
| The model converts investigative questions into executable SQL queries across a wide range of |
| forensic artefact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat, |
| Instagram, blockchain wallets, and many more. |
|
|
| This model was developed as part of a research project and accompanying journal paper |
| investigating LLM fine-tuning for forensic database analysis, and is integrated into |
| [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic |
| analysis tool. |
|
|
| > **Key result:** 93.0% execution accuracy on a 100-example held-out test set — within |
| > 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions |
| > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally |
| > with no internet connectivity required. |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Base Model** | meta-llama/Llama-3.2-3B-Instruct | |
| | **Fine-tuning Method** | Full fine-tune (bf16) | |
| | **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories | |
| | **Training Framework** | Hugging Face Transformers | |
| | **Best Val Loss** | 0.3043 (7 epochs) | |
| | **Model Size (bf16)** | ~6 GB | |
| | **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU | |
|
|
| ## Performance |
|
|
| ### Overall Results (fixed dataset, n=100, best configuration) |
|
|
| | Metric | Value | |
| |---|---| |
| | **Overall Accuracy** | **93.0%** (93/100) | |
| | 95% CI (Wilson) | [86.3%, 96.6%] | |
| | Executable Queries | 94/100 | |
| | GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) | |
| | Base Model (no fine-tuning) | 35.0% | |
| | Improvement over base | +56 pp | |
|
|
| ### Accuracy by Query Difficulty |
|
|
| | Difficulty | Accuracy | n | 95% CI | vs. GPT-4o | |
| |---|---|---|---|---| |
| | Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp | |
| | Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp | |
| | Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | −3.7 pp | |
|
|
| ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap |
| is concentrated on Hard queries (complex CTEs, window functions, multi-table joins). |
|
|
| ### Accuracy by Forensic Domain |
|
|
| | Domain | Accuracy | n | 95% CI | |
| |---|---|---|---| |
| | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] | |
| | Android Artifacts | **100.0%** | 17/18 | [74.2%, 99.0%] | |
| | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] | |
| | iOS CoreData | **92.0%** | 21/25 | [65.3%, 93.6%] | |
| | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] | |
|
|
| ### Prompt Configuration Ablation |
|
|
| | Configuration | Overall | Easy | Medium | Hard | iOS | |
| |---|---|---|---|---|---| |
| | **WITHOUT App Name** ★ | **93.0%** | **95.1%** | 87.5% | **88.9%** | 92.0% | |
| | WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** | |
|
|
| ★ Primary configuration — omitting the application name from the prompt yields |
| 3 pp higher overall accuracy. Interestingly, including the app name helps iOS |
| CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary |
| configuration without app name is recommended for general use. |
|
|
| ### Post-Processing Pipeline Contribution |
|
|
| | Component | Queries saved | |
| |---|---| |
| | Execution feedback (retry) | 7 | |
| | Alias normalization | 18 | |
| | Column corrections (Levenshtein) | 2 | |
|
|
| ### Training Progression |
|
|
| | Configuration | Val Loss | Accuracy | Δ | |
| |---|---|---|---| |
| | Base model (no fine-tuning) | — | 35.0% | — | |
| | Fine-tuned, no augmentation | — | 68.0% | +33 pp | |
| | + Data augmentation (2.4×) | — | 74.0% | +6 pp | |
| | + Extended training (7 epochs) | 0.3617 | 92.0% | +10 pp | |
| | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp | |
| | + Execution feedback | 0.3617 | 90.0% | +3 pp | |
| | + Corrected training dataset (v5) | **0.3043** | **93.0%** | +1 pp | |
|
|
| ## Intended Use |
|
|
| ### Primary Use Cases |
| - Mobile forensics investigations: automated SQL query drafting against seized device databases |
| - Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows) |
| - Research in domain-specific Text-to-SQL |
| - Educational use for learning forensic database analysis |
|
|
| ### Important: This Model is a Drafting Assistant |
|
|
| > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries |
| > that require review by a practitioner with sufficient SQL knowledge before any reliance |
| > is placed on their results. The 93.0% accuracy means approximately **1 in 14 queries |
| > contains an error**. In court-admissible or case-critical work, all outputs must be |
| > independently validated. |
|
|
| ### Out-of-Scope Use |
| - Autonomous forensic decision-making without human review |
| - General-purpose SQL generation outside the forensic domain |
| - Non-SQLite databases (PostgreSQL, MySQL, etc.) |
|
|
| ## How to Use |
|
|
| ### Quick Start (Transformers) |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B" |
| tokenizer = AutoTokenizer.from_pretrained(model_name) |
| model = AutoModelForCausalLM.from_pretrained( |
| model_name, |
| torch_dtype=torch.bfloat16, |
| device_map="auto" |
| ) |
| model.eval() |
| |
| schema = """ |
| CREATE TABLE message ( |
| ROWID INTEGER PRIMARY KEY, |
| text TEXT, |
| handle_id INTEGER, |
| date INTEGER, |
| is_from_me INTEGER, |
| cache_has_attachments INTEGER |
| ); |
| CREATE TABLE handle ( |
| ROWID INTEGER PRIMARY KEY, |
| id TEXT, |
| service TEXT |
| ); |
| """ |
| |
| request = "Find all messages received in the last 7 days that contain attachments" |
| |
| # Note: do NOT use apply_chat_template — use plain-text prompt |
| prompt = f"""Generate a valid SQLite query for this forensic database request. |
| |
| Database Schema: |
| {schema} |
| |
| Request: {request} |
| |
| SQLite Query: |
| """ |
| |
| inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048) |
| inputs = {k: v.to(model.device) for k, v in inputs.items()} |
| |
| with torch.no_grad(): |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=300, |
| do_sample=False, # greedy decoding — do not change |
| ) |
| |
| input_length = inputs['input_ids'].shape[1] |
| sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True) |
| print(sql.strip()) |
| ``` |
|
|
| > **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`). |
| > The model was trained and evaluated with a plain-text prompt format. |
| > Use `do_sample=False` (greedy decoding) for reproducible results. |
| |
| ### Python Helper Class |
| |
| ```python |
| class ForensicSQLGenerator: |
| def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"): |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| self.tokenizer = AutoTokenizer.from_pretrained(model_name) |
| self.model = AutoModelForCausalLM.from_pretrained( |
| model_name, |
| torch_dtype=torch.bfloat16, |
| device_map="auto" |
| ) |
| self.model.eval() |
| |
| def generate_sql(self, schema: str, request: str) -> str: |
| prompt = ( |
| "Generate a valid SQLite query for this forensic database request.\n\n" |
| f"Database Schema:\n{schema}\n\n" |
| f"Request: {request}\n\n" |
| "SQLite Query:\n" |
| ) |
| inputs = self.tokenizer( |
| prompt, return_tensors="pt", truncation=True, max_length=4096 |
| ) |
| inputs = {k: v.to(self.model.device) for k, v in inputs.items()} |
| input_length = inputs["input_ids"].shape[1] |
| |
| with torch.no_grad(): |
| outputs = self.model.generate( |
| **inputs, max_new_tokens=300, do_sample=False |
| ) |
| |
| sql = self.tokenizer.decode( |
| outputs[0][input_length:], skip_special_tokens=True |
| ) |
| # Return first statement only, normalized |
| return sql.strip().split("\n")[0].strip().rstrip(";") + ";" |
| |
|
|
| # Usage |
| generator = ForensicSQLGenerator() |
| sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours") |
| print(sql) |
| ``` |
| |
| ### With Ollama / llama.cpp (GGUF) |
| |
| ```bash |
| # With llama.cpp |
| ./llama-cli -m forensic-sql-q4_k_m.gguf \ |
| --temp 0 \ |
| -p "Generate a valid SQLite query for this forensic database request. |
| |
| Database Schema: |
| CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER); |
|
|
| Request: Find all messages sent after midnight |
|
|
| SQLite Query:" |
|
|
| # With Ollama — create a Modelfile |
| cat > Modelfile << 'EOF' |
| FROM ./forensic-sql-q4_k_m.gguf |
| PARAMETER temperature 0 |
| PARAMETER num_predict 300 |
| EOF |
| |
| ollama create forensic-sql -f Modelfile |
| ollama run forensic-sql |
| ``` |
| |
| ## Training Details |
| |
| ### Dataset — SQLiteDS |
| |
| - **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42 |
| - **Forensic artifact categories:** 191 |
| - **Reference query validation:** All 1,000 reference queries validated for execution |
| correctness against in-memory SQLite; 50 queries (5%) corrected before final training |
| - **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering, |
| and LIMIT injection — augmented examples confined to training split only |
| - **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql) |
| - **License:** CC BY 4.0 |
| |
| ### Hyperparameters |
| |
| | Parameter | Value | |
| |---|---| |
| | Training method | Full fine-tune (no LoRA) | |
| | Precision | bfloat16 | |
| | Epochs | 7 | |
| | Learning rate | 2e-5 (peak) | |
| | LR scheduler | Cosine with warmup | |
| | Batch size | 1 + gradient accumulation 4 | |
| | Max sequence length | 4096 | |
| | Optimizer | AdamW | |
| | Hardware | Apple M-series, 16 GB unified memory | |
| | Training time | ~17.6 hours | |
| | Best val loss | 0.3043 (epoch 7) | |
| |
| ## Limitations |
| |
| ### Known Issues |
| |
| 1. **iOS CoreData Schemas (92.0%):** The Z-prefix column naming convention |
| (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column |
| names alone, making these schemas harder to reason about. |
| 2. **Hard Queries — 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries, |
| and window functions are the primary remaining challenge. |
| 3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are |
| wide. Interpret with caution. |
| 4. **~1 in 11 error rate:** Approximately 9% of generated queries will contain |
| errors. Expert review of all outputs is required before use in investigations. |
| |
| ### When Human Review is Especially Important |
| - Complex multi-table queries with CTEs or window functions |
| - Case-critical or court-admissible investigations |
| - Any query that will be used to draw conclusions about a suspect |
| - Queries involving rare or unusual forensic artifact schemas |
| |
| ## Evaluation |
| |
| - **Test set:** 100 examples, held-out, seed=42, non-augmented |
| - **Metric:** Execution accuracy — query is correct iff it executes without error |
| AND returns a result set identical to the reference query |
| - **Reference validation:** All reference queries validated for execution correctness |
| before evaluation; 5 broken queries in the test set were corrected |
| - **Evaluation script:** Available in the dataset repository on Zenodo ([DOI]) |
| |
| ## Citation |
| |
| If you use this model or the SQLiteDS dataset in your research, please cite: |
| |
| ```bibtex |
| @article{pawlaszczyk2026forsqlitelm, |
| author = {Dirk Pawlaszczyk}, |
| title = {AI-Based Automated SQL Query Generation for SQLite Databases |
| in Mobile Forensics}, |
| journal = {Forensic Science International: Digital Investigation}, |
| year = {2026}, |
| note = {FSIDI-D-26-00029} |
| } |
| ``` |
| |
| ## License |
| |
| Apache 2.0 — following the base Llama 3.2 license terms. |
| |
| ## Acknowledgments |
| |
| - Base model: Meta's Llama 3.2-3B-Instruct |
| - Training framework: Hugging Face Transformers |
| - Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite) |
| - Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses) |
| |
| ## Additional Resources |
| |
| - **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication] |
| - **Dataset (HuggingFace):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql) |
| - **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite) |
| - **Paper:** FSIDI-D-26-00029 (under review) |
| |
| --- |
| |
| **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use. |
| All generated SQL queries must be reviewed by a qualified practitioner before |
| execution in live forensic investigations. The authors accept no liability for |
| incorrect conclusions drawn from unvalidated model outputs. |
| |