Greek Dialect LoRA — Krikri-8B Adapter
LoRA adapter trained by the Computational Linguistics & Language Technology (CLLT) Lab, University of Crete, for generating text in the Pontic, Cretan, Northern Greek, and Cypriot dialects. The adapter is trained on top of ilsp/Llama-Krikri-8B-Base using dialect-only data formatted as natural Greek prompts.
Project website: https://stergioscha.github.io/CLLT/
Model Details
- Developer: CLLT Lab, University of Crete
- Base model: ilsp/Llama-Krikri-8B-Base
- Adapter type: LoRA via PEFT (r=16, α=32, dropout=0.1, applied to the q/k/v/o/gate/up/down projections); see the configuration sketch after this list
- Trained parameters: 41.9M (≈0.51% of the base model)
- Dataset: 23,421 natural-prompt examples derived from the Greek Regional Dialects Dataset (GRDD)
- Languages: Greek dialectal varieties (Pontic, Cretan, Northern, Cypriot)
- License: Research purposes only (respect the base model’s license)
- Funding / compute: AWS resources provided by GRNET and funded by the EU Recovery & Resilience Facility
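For reference, a PEFT configuration matching the settings listed above would look roughly like the following sketch (the actual training script may differ in details):

from peft import LoraConfig

# Sketch of the reported LoRA settings; the actual training script may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)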
Model Sources
- GitHub: https://github.com/StergiosCha/krikri_dialectal
- Dataset description: https://github.com/StergiosCha/Greek_dialect_corpus
- Project site: https://stergioscha.github.io/CLLT/
Intended Use
Direct use
- Dialectal text generation for cultural heritage, education, and research
- Conversational agents that must answer in a specific Greek dialect
- Prompt-based experimentation with dialect-specific stylistics
Downstream use
- Integrate the adapter into chatbots or RAG pipelines that need dialectal answers (a merging sketch follows this list)
- Build evaluation suites for low-resource Greek varieties
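For deployment inside such pipelines, a common pattern is to merge the LoRA weights into the base model so the serving stack loads a single checkpoint. A minimal sketch (the output directory name is hypothetical):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Merge the dialect adapter into the base weights and save a standalone model.
base = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "Stergios/krikri-8b-base-lora").merge_and_unload()
merged.save_pretrained("krikri-8b-dialect-merged")  # hypothetical output directory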
Out-of-scope / limitations
- Standard Modern Greek generation (training data excluded it)
- High-stakes domains (medical, legal, safety-critical) without human oversight
- Automatic dialect classification or translation between dialects
How to Use
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model in bfloat16 and attach the dialect LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model = PeftModel.from_pretrained(base_model, "Stergios/krikri-8b-base-lora")

# "Write in the Pontic dialect: Good morning, how are you?"
prompt = "Γράψε στην ποντιακή διάλεκτο: Καλημέρα, πώς είσαι;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature to take effect.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
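Prompts for the other varieties presumably follow the same natural-instruction pattern, e.g. replacing "ποντιακή" (Pontic) with "κρητική" (Cretan) or "κυπριακή" (Cypriot). Note that temperature only takes effect when sampling is enabled, hence do_sample=True above; omit both arguments for greedy decoding.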
Training Data & Procedure
- Source: GRDD + GRDD+, filtered to keep only dialect-tagged entries
- Conversion: artificial tags (<po>, <cr>, <no>, <cy>) mapped to natural Greek instructions using convert_to_natural_prompts_dialects_only.py (an illustrative mapping is sketched after this list)
- Split: 95% train / 5% validation (shuffled)
- Tokenization: inputs truncated to 512 tokens; labels set equal to the input IDs (causal-LM objective)
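The exact wording of the natural prompts is produced by convert_to_natural_prompts_dialects_only.py. Purely as an illustration (the real instructions may be phrased differently), the tag-to-instruction mapping could look like this:

# Illustrative sketch only; the prompt wording actually used by
# convert_to_natural_prompts_dialects_only.py may differ.
TAG_TO_INSTRUCTION = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο:",        # Pontic
    "<cr>": "Γράψε στην κρητική διάλεκτο:",         # Cretan
    "<no>": "Γράψε στη βόρεια ελληνική διάλεκτο:",  # Northern Greek
    "<cy>": "Γράψε στην κυπριακή διάλεκτο:",        # Cypriot
}

def to_natural_prompt(tagged_text: str) -> str:
    # Replace a leading dialect tag with a natural Greek instruction.
    for tag, instruction in TAG_TO_INSTRUCTION.items():
        if tagged_text.startswith(tag):
            return f"{instruction} {tagged_text[len(tag):].strip()}"
    return tagged_text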
Hyperparameters
- Epochs: 3
- Per-device batch size: 2 (grad accumulation 8 ⇒ effective 16)
- Learning rate: 3e-4 with 100 warmup steps
- Precision: bfloat16
- Checkpoints saved and evaluated every 200 steps; the best checkpoint is selected automatically (see the TrainingArguments sketch below)
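These settings correspond roughly to the following Hugging Face TrainingArguments; this is a reconstruction for reference, not the exact training script, and the output directory is hypothetical:

from transformers import TrainingArguments

# Approximate reconstruction of the reported hyperparameters; the original
# script may set additional options. eval_strategy is called
# evaluation_strategy in older transformers releases.
training_args = TrainingArguments(
    output_dir="krikri-dialect-lora",       # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,          # effective batch size 16
    learning_rate=3e-4,
    warmup_steps=100,
    bf16=True,
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,            # best checkpoint by validation loss
)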
Evaluation
- Validation loss monitored during training (best checkpoint selected)
- Human evaluation by native speakers is recommended to assess dialect fidelity and cultural appropriateness
Limitations & Risks
- Coverage limited to four dialect families; sub-dialect nuances may be missing
- The model can still hallucinate or drift toward Standard Modern Greek when the prompt does not strongly request a dialect
- Training data might encode stylistic or topical biases present in GRDD
- Outputs should always be reviewed by fluent speakers before publication
Acknowledgments
- Compute: National Infrastructures for Research and Technology (GRNET)
- Funding: EU Recovery and Resilience Facility
- Base model: ILSP (Llama-Krikri-8B-Base)
Contact
For questions or issues, open an issue on the GitHub repository or contact the CLLT Lab (University of Crete).