Greek Dialect LoRA — Krikri-8B Adapter

LoRA adapter trained by the Computational Linguistics & Language Technology (CLLT) Lab, University of Crete, for generating text in the Pontic, Cretan, Northern Greek, and Cypriot dialects. The adapter fine-tunes ilsp/Llama-Krikri-8B-Base on dialect-only data reformulated as natural Greek instructions.

Project website: https://stergioscha.github.io/CLLT/

Model Details

  • Developer: CLLT Lab, University of Crete
  • Base model: ilsp/Llama-Krikri-8B-Base
  • Adapter type: LoRA via PEFT (r=16, α=32, dropout=0.1, q/k/v/o/gate/up/down projections; see the configuration sketch after this list)
  • Trained parameters: 41.9M (≈0.51% of the base model)
  • Dataset: 23,421 natural-prompt examples derived from the Greek Regional Dialects Dataset (GRDD)
  • Languages: Greek dialectal varieties (Pontic, Cretan, Northern, Cypriot)
  • License: Research purposes only (respect the base model’s license)
  • Funding / compute: AWS resources provided by GRNET and funded by the EU Recovery & Resilience Facility
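
The adapter shape above maps directly onto a PEFT LoraConfig. The following is a minimal sketch assuming standard peft/transformers usage, not the lab's actual training script:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model in bfloat16, as in the usage example below.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA configuration matching the values listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # ~41.9M trainable params (≈0.51%)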

Intended Use

Direct use

  • Dialectal text generation for cultural heritage, education, and research
  • Conversational agents that must answer in a specific Greek dialect
  • Prompt-based experimentation with dialect-specific stylistics

Downstream use

  • Integrate the adapter inside chatbots or RAG pipelines that need dialectal answers
  • Build evaluation suites for low-resource Greek varieties

Out-of-scope / limitations

  • Standard Modern Greek generation (training data excluded it)
  • High-stakes domains (medical, legal, safety-critical) without human oversight
  • Automatic dialect classification or translation between dialects

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the dialect adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model = PeftModel.from_pretrained(base_model, "Stergios/krikri-8b-base-lora")
model.eval()

# "Write in the Pontic dialect: Good morning, how are you?"
prompt = "Γράψε στην ποντιακή διάλεκτο: Καλημέρα, πώς είσαι;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,   # sampling must be enabled for temperature to take effect
        temperature=0.7,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
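
For deployment without the PEFT wrapper, the adapter can optionally be merged into the base weights; the output directory name below is only an example:

# Optional: fold the LoRA weights into the base model and save a
# standalone checkpoint (directory name is illustrative).
merged = model.merge_and_unload()
merged.save_pretrained("krikri-8b-dialect-merged")
tokenizer.save_pretrained("krikri-8b-dialect-merged")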

Training Data & Procedure

  • Source: GRDD + GRDD+, filtered to keep only dialect-tagged entries
  • Conversion: artificial tags (<po>, <cr>, <no>, <cy>) mapped to natural Greek instructions via convert_to_natural_prompts_dialects_only.py
  • Split: 95% train / 5% validation (shuffled)
  • Tokenization: 512-token truncation, with labels set equal to the input IDs (a preprocessing sketch follows this list)
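
The conversion script itself is not reproduced here; the sketch below only illustrates the shape of the preprocessing. The instruction wording and helper names are assumptions, not the script's actual mapping.

# Illustrative preprocessing; the real logic lives in
# convert_to_natural_prompts_dialects_only.py. The instruction wording
# below is hypothetical.
TAG_TO_INSTRUCTION = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο:",   # write in the Pontic dialect
    "<cr>": "Γράψε στην κρητική διάλεκτο:",    # write in the Cretan dialect
    "<no>": "Γράψε στη βόρεια διάλεκτο:",      # write in the Northern dialect
    "<cy>": "Γράψε στην κυπριακή διάλεκτο:",   # write in the Cypriot dialect
}

def to_natural_prompt(tag: str, text: str) -> str:
    """Replace an artificial dialect tag with a natural Greek instruction."""
    return f"{TAG_TO_INSTRUCTION[tag]} {text}"

def tokenize(example: dict, tokenizer) -> dict:
    """512-token truncation; labels mirror the input IDs (causal LM)."""
    enc = tokenizer(example["text"], truncation=True, max_length=512)
    enc["labels"] = enc["input_ids"].copy()
    return enc

# 95/5 shuffled split, e.g. with Hugging Face datasets:
# splits = dataset.train_test_split(test_size=0.05, shuffle=True, seed=42)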

Hyperparameters

  • Epochs: 3
  • Per-device batch size: 2 (grad accumulation 8 ⇒ effective 16)
  • Learning rate: 3e-4 with 100 warmup steps
  • Precision: bfloat16
  • Save/eval every 200 steps; the best checkpoint is selected automatically (see the sketch below)
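
A reconstruction of these settings as Hugging Face TrainingArguments; values not listed above (output path, metric name, scheduler) are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="krikri-8b-dialect-lora",   # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size 16
    learning_rate=3e-4,
    warmup_steps=100,
    bf16=True,
    eval_strategy="steps",                 # "evaluation_strategy" on older transformers
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,           # best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)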

Evaluation

  • Validation loss monitored during training (best checkpoint selected); a quick post-hoc perplexity check is sketched after this list
  • Human evaluation by native speakers is recommended for dialect fidelity and cultural appropriateness
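
For a rough post-hoc check, perplexity on held-out dialect text can be computed with the model loaded in "How to Use"; a minimal sketch (the example sentence is arbitrary):

import math
import torch

# Quick perplexity probe, reusing `model` and `tokenizer` from above.
text = "Καλημέρα, πώς είσαι;"  # arbitrary held-out example
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")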

Limitations & Risks

  • Coverage limited to four dialect families; sub-dialect nuances may be missing
  • Model can still hallucinate or drift toward Standard Greek without strong prompts
  • Training data might encode stylistic or topical biases present in GRDD
  • Outputs should always be reviewed by fluent speakers before publication

Acknowledgments

  • Compute: National Infrastructures for Research and Technology (GRNET)
  • Funding: EU Recovery and Resilience Facility
  • Base model: ILSP (Llama-Krikri-8B-Base)

Contact

For questions or issues, open an issue on the GitHub repository or contact the CLLT Lab (University of Crete).
