Greek Dialect LoRA — Krikri-8B Adapter

LoRA adapter trained by the Computational Linguistics & Language Technology (CLLT) Lab, University of Crete, for generating text in the Pontic, Cretan, Northern Greek, and Cypriot dialects. The adapter fine-tunes ilsp/Llama-Krikri-8B-Base on dialect-only data reformulated as natural Greek instructions.

Project website: https://stergioscha.github.io/CLLT/

Model Details

  • Developer: CLLT Lab, University of Crete
  • Base model: ilsp/Llama-Krikri-8B-Base
  • Adapter type: LoRA via PEFT (r=16, α=32, dropout=0.1, q/k/v/o/gate/up/down projections; see the configuration sketch after this list)
  • Trained parameters: 41.9M (≈0.51% of the base model)
  • Dataset: 23,421 natural-prompt examples derived from the Greek Regional Dialects Dataset (GRDD)
  • Languages: Greek dialectal varieties (Pontic, Cretan, Northern, Cypriot)
  • License: Research purposes only (respect the base model’s license)
  • Funding / compute: AWS resources provided by GRNET and funded by the EU Recovery & Resilience Facility
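
The adapter shape above maps directly onto a PEFT LoraConfig. The following is a minimal sketch assuming standard peft/transformers usage, not the lab's actual training script:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model in bfloat16, as in the usage example below.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA configuration matching the values listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # ~41.9M trainable params (≈0.51%)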

Intended Use

Direct use

  • Dialectal text generation for cultural heritage, education, and research
  • Conversational agents that must answer in a specific Greek dialect
  • Prompt-based experimentation with dialect-specific stylistics

Downstream use

  • Integrate the adapter inside chatbots or RAG pipelines that need dialectal answers
  • Build evaluation suites for low-resource Greek varieties

Out-of-scope / limitations

  • Standard Modern Greek generation (training data excluded it)
  • High-stakes domains (medical, legal, safety-critical) without human oversight
  • Automatic dialect classification or translation between dialects

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the dialect adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model = PeftModel.from_pretrained(base_model, "Stergios/krikri-8b-base-lora")
model.eval()

# "Write in the Pontic dialect: Good morning, how are you?"
prompt = "Γράψε στην ποντιακή διάλεκτο: Καλημέρα, πώς είσαι;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,   # sampling must be enabled for temperature to take effect
        temperature=0.7,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
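
For deployment without the PEFT wrapper, the adapter can optionally be merged into the base weights; the output directory name below is only an example:

# Optional: fold the LoRA weights into the base model and save a
# standalone checkpoint (directory name is illustrative).
merged = model.merge_and_unload()
merged.save_pretrained("krikri-8b-dialect-merged")
tokenizer.save_pretrained("krikri-8b-dialect-merged")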

Training Data & Procedure

  • Source: GRDD + GRDD+, filtered to keep only dialect-tagged entries
  • Conversion: artificial tags (<po>, <cr>, <no>, <cy>) mapped to natural Greek instructions via convert_to_natural_prompts_dialects_only.py
  • Split: 95% train / 5% validation (shuffled)
  • Tokenization: 512-token truncation, with labels set equal to the input IDs (a preprocessing sketch follows this list)
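
The conversion script itself is not reproduced here; the sketch below only illustrates the shape of the preprocessing. The instruction wording and helper names are assumptions, not the script's actual mapping.

# Illustrative preprocessing; the real logic lives in
# convert_to_natural_prompts_dialects_only.py. The instruction wording
# below is hypothetical.
TAG_TO_INSTRUCTION = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο:",   # write in the Pontic dialect
    "<cr>": "Γράψε στην κρητική διάλεκτο:",    # write in the Cretan dialect
    "<no>": "Γράψε στη βόρεια διάλεκτο:",      # write in the Northern dialect
    "<cy>": "Γράψε στην κυπριακή διάλεκτο:",   # write in the Cypriot dialect
}

def to_natural_prompt(tag: str, text: str) -> str:
    """Replace an artificial dialect tag with a natural Greek instruction."""
    return f"{TAG_TO_INSTRUCTION[tag]} {text}"

def tokenize(example: dict, tokenizer) -> dict:
    """512-token truncation; labels mirror the input IDs (causal LM)."""
    enc = tokenizer(example["text"], truncation=True, max_length=512)
    enc["labels"] = enc["input_ids"].copy()
    return enc

# 95/5 shuffled split, e.g. with Hugging Face datasets:
# splits = dataset.train_test_split(test_size=0.05, shuffle=True, seed=42)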

Hyperparameters

  • Epochs: 3
  • Per-device batch size: 2 (grad accumulation 8 ⇒ effective 16)
  • Learning rate: 3e-4 with 100 warmup steps
  • Precision: bfloat16
  • Save/eval every 200 steps; the best checkpoint is selected automatically (see the sketch below)
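
A reconstruction of these settings as Hugging Face TrainingArguments; values not listed above (output path, metric name, scheduler) are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="krikri-8b-dialect-lora",   # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size 16
    learning_rate=3e-4,
    warmup_steps=100,
    bf16=True,
    eval_strategy="steps",                 # "evaluation_strategy" on older transformers
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,           # best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)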

Evaluation

  • Validation loss monitored during training (best checkpoint selected); a quick post-hoc perplexity check is sketched after this list
  • Human evaluation by native speakers is recommended for dialect fidelity and cultural appropriateness
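
For a rough post-hoc check, perplexity on held-out dialect text can be computed with the model loaded in "How to Use"; a minimal sketch (the example sentence is arbitrary):

import math
import torch

# Quick perplexity probe, reusing `model` and `tokenizer` from above.
text = "Καλημέρα, πώς είσαι;"  # arbitrary held-out example
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")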

Limitations & Risks

  • Coverage limited to four dialect families; sub-dialect nuances may be missing
  • Model can still hallucinate or drift toward Standard Greek without strong prompts
  • Training data might encode stylistic or topical biases present in GRDD
  • Outputs should always be reviewed by fluent speakers before publication

Acknowledgments

  • Compute: National Infrastructures for Research and Technology (GRNET)
  • Funding: EU Recovery and Resilience Facility
  • Base model: ILSP (Llama-Krikri-8B-Base)

Contact

For questions or issues, open an issue on the GitHub repository or contact the CLLT Lab (University of Crete).
