Russian-Kabardian Translation Model (ru-kbd-opus)
Fine-tuned MarianMT model for Russian to Kabardian (East Circassian) translation.
- Developed by: Eduard Emkuzhev
- Model type: Neural Machine Translation (Marian Transformer)
- Language pair: Russian (ru) → Kabardian (kbd)
- License: CC BY-NC 4.0
- Base model: Helsinki-NLP/opus-mt-ru-uk (Apache 2.0)
- Training data: adiga-ai/circassian-parallel-corpus (CC BY 4.0)
Model Description
This model translates from Russian to Kabardian, an endangered Northwest Caucasian language with approximately 500,000 speakers. Kabardian combines polysynthetic morphology, an inventory of more than 50 consonant phonemes, and ergative-absolutive alignment.
Intended Use
Primary uses:
- Language preservation and revitalization
- Educational materials translation
- Research in low-resource machine translation
- Cultural heritage digitization
- Supporting Kabardian language learners
Limitations:
- Non-commercial use only (CC BY-NC 4.0)
- Best performance on everyday language
- May struggle with highly technical or literary texts
- Requires proper handling of the Kabardian-specific character Ӏ (palochka)
Training Data
Dataset: adiga-ai/circassian-parallel-corpus
- Subset: `ru_kbd` (Russian → Kabardian)
- Total training examples: ~100K parallel sentence pairs
- Dataset license: CC BY 4.0
- Dataset author: Anzor Qunash (adiga.ai)
- Content: Dictionary entries, folklore, proverbs, everyday expressions
Training Procedure
Base Model
- Architecture: Marian Transformer (transformer-align)
- Base: Helsinki-NLP/opus-mt-ru-uk (Russian-Ukrainian translation)
- Transfer learning: Adapted from Russian-Ukrainian to Russian-Kabardian
Hyperparameters
```yaml
base_model: Helsinki-NLP/opus-mt-ru-uk
training_examples: 200000
epochs: 7
batch_size: 32
learning_rate: 3e-4
optimizer: AdamW
max_sequence_length: 128
warmup_steps: 500
weight_decay: 0.01
framework: transformers 4.36.0
```
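For intuition, the listed hyperparameters imply the following optimizer-schedule arithmetic (the step counts below are derived here for illustration; the card itself only lists the raw values):

```python
# Training-schedule arithmetic derived from the hyperparameters above.
# These step counts are illustrative, not stated in the model card.
examples = 200_000
batch_size = 32
epochs = 7
warmup_steps = 500

steps_per_epoch = examples // batch_size      # optimizer updates per epoch
total_steps = steps_per_epoch * epochs        # updates over the full run
warmup_fraction = warmup_steps / total_steps  # share of the run spent warming up

print(steps_per_epoch, total_steps, round(warmup_fraction, 4))
# → 6250 43750 0.0114
```

So the 500 warmup steps cover only about 1% of the run, a typical proportion for fine-tuning from a strong base model.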
Special Preprocessing
The model uses a special character mapping for training:
- Kabardian Ӏ (palochka) → I (Latin I) during training
- I → Ӏ restored during inference
This ensures better tokenization compatibility with the MarianMT tokenizer.
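A minimal sketch of this round-trip mapping (the function names are illustrative, not part of the released code):

```python
# Hypothetical helpers illustrating the palochka <-> Latin I mapping
# described above; the model card specifies only the mapping itself.
PALOCHKA = "\u04C0"  # Ӏ (Cyrillic letter palochka)

def to_training_form(text: str) -> str:
    """Before tokenization: replace palochka with Latin I."""
    return text.replace(PALOCHKA, "I")

def restore_palochka(text: str) -> str:
    """After generation: map Latin I back to palochka."""
    return text.replace("I", PALOCHKA)

print(restore_palochka(to_training_form("ИтӀанэ")))  # round-trips to ИтӀанэ
```

Note that the restore step also converts any genuine Latin "I" in the output; this is harmless for Kabardian targets, which use only Cyrillic letters.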
Performance
Benchmark Results
Tested on 1,000 examples from adiga-ai/circassian-parallel-corpus:
| Metric | Score |
|---|---|
| BLEU | 18.65 |
| chrF | 52.66 |
| TER | 67.57 |
| Exact Match | 4.2% |
| Speed | 6.9 examples/sec |
| Avg Time | 74ms/example |
Test Configuration:
- Test size: 1,000 examples
- Sampling: every 50th sentence from the corpus
- Generation: beam_search (num_beams=4)
- Device: Apple M-series (MPS)
- Seed: 42 (reproducible)
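The sampling and exact-match parts of this protocol can be sketched in plain Python (corpus-level BLEU, chrF, and TER would instead come from a library such as sacrebleu; the function names here are illustrative):

```python
# Illustrative sketch of the test-set sampling and exact-match scoring
# described above. BLEU/chrF/TER are typically computed with sacrebleu
# rather than by hand.
def sample_every_nth(pairs, n=50, limit=1000):
    """Take every n-th item, capped at `limit` examples."""
    return pairs[::n][:limit]

def exact_match_rate(hypotheses, references):
    """Fraction of hypotheses identical to their reference."""
    hits = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return hits / len(references)

# Toy demonstration: a 100K-pair corpus yields a 1,000-example test set.
corpus = [(f"ru-{i}", f"kbd-{i}") for i in range(100_000)]
print(len(sample_every_nth(corpus)))  # 1000
```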
Translation Examples
| Russian (Input) | Kabardian (Output) |
|---|---|
| Затем он счастлив. | ИтӀанэ ар насыпыфӀэщ |
| Старых друзей отца не покидай, старых дорог отца не оставляй. | Адэм и ныбжьэгъужьыр умыбгынэж, и адэжь гъуэгужьхэр къыумыгъанэ. |
| придаточное предложение цели | мурадым и псалъэухъа къызэрыгуэкӏ |
| внушать детям уважение к старшим | сабийхэм нэхъыжьхэм пщӏэ хуащӏын |
| Для того, кто не знает обычаев, почетное место становится проходным двором. | Хабзэ зымыщӀэм дежкӀэ жьантӀэр пхыкӀыпӀэ мэхъу |
Note: The model handles morphologically complex Kabardian words and preserves the special palochka character (Ӏ).
How to Use
Installation
```bash
pip install transformers torch sentencepiece
```
Basic Usage
```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "kubataba/ru-kbd-opus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translation function
def translate_ru_to_kbd(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Restore the Kabardian palochka (mapped to Latin I during training)
    return translation.replace('I', 'Ӏ')

# Example
russian_text = "Здравствуйте!"
kabardian_text = translate_ru_to_kbd(russian_text)
print(f"RU: {russian_text}")
print(f"KBD: {kabardian_text}")
```
Batch Translation
```python
texts = [
    "Добрый день!",
    "Как вас зовут?",
    "Спасибо за помощь",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translations = [
    tokenizer.decode(out, skip_special_tokens=True).replace('I', 'Ӏ')
    for out in outputs
]

for src, tgt in zip(texts, translations):
    print(f"RU: {src} → KBD: {tgt}")
```
Limitations and Bias
- Domain specificity: Best performance on conversational and dictionary-style text
- Technical vocabulary: May struggle with modern technical terms not in training data
- Literary style: Not optimized for highly stylized or poetic translations
- Character encoding: Requires proper UTF-8 support for Kabardian Cyrillic + palochka (Ӏ)
- Low-resource challenges: Limited by available parallel data (~189K pairs)
Ethical Considerations
This model supports the preservation of Kabardian, a language listed in the UNESCO Atlas of the World's Languages in Danger.
Responsible use:
- Verify translations for critical applications with native speakers
- Use as a tool to support, not replace, human translators
- Respect cultural context when translating culturally significant content
- Acknowledge the model's limitations in formal or official contexts
About Kabardian Language
Kabardian (Adyghe-Kabardian, East Circassian) is a Northwest Caucasian language spoken by approximately 500,000 people, primarily in:
- Kabardino-Balkaria (Russia)
- Karachay-Cherkessia (Russia)
- Turkey (diaspora)
- Middle East (diaspora)
Linguistic features:
- 50+ consonant phonemes (one of the world's most complex phonological systems)
- Polysynthetic morphology
- Ergative-absolutive alignment
- Distinctive character: Ӏ (palochka), which represents the glottal stop and, in digraphs, marks ejective consonants
License and Attribution
This Model
- License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Author: Eduard Emkuzhev
- Year: 2025
Base Model
- Model: Helsinki-NLP/opus-mt-ru-uk
- License: Apache 2.0
- Authors: Language Technology Research Group at the University of Helsinki
- Link: https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk
Training Dataset
- Dataset: Circassian-Russian Parallel Corpus v1.0
- License: CC BY 4.0
- Author: Anzor Qunash (adiga.ai)
- Link: https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus
Citation
If you use this model in your research, please cite:
```bibtex
@misc{emkuzhev2025rukbd,
  author       = {Eduard Emkuzhev},
  title        = {Russian-Kabardian Neural Machine Translation Model},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kubataba/ru-kbd-opus}}
}
```
Please also cite the base model and dataset:
```bibtex
@misc{helsinki-nlp-opus-ru-uk,
  author       = {Language Technology Research Group at the University of Helsinki},
  title        = {OPUS-MT Russian-Ukrainian Translation Model},
  year         = {2020},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk}}
}

@dataset{qunash2025circassian,
  author    = {Anzor Qunash},
  title     = {Circassian-Russian Parallel Text Corpus v1.0},
  year      = {2025},
  publisher = {adiga.ai},
  url       = {https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus}
}
```
Acknowledgments
- Helsinki-NLP team for the excellent OPUS-MT base models
- Anzor Qunash (adiga.ai) for creating and publishing the Circassian-Russian Parallel Corpus
- Kabardian language community for preserving and promoting their language
- All contributors to Circassian language digitization efforts
Related Models
- Reverse direction: kubataba/kbd-ru-opus - Kabardian to Russian
- Base model: Helsinki-NLP/opus-mt-ru-uk
- Dataset: adiga-ai/circassian-parallel-corpus
Technical Details
- Framework: PyTorch + Transformers
- Model size: ~300MB
- Vocabulary size: ~57K tokens
- Parameters: ~74M
- Inference: CPU and GPU compatible
- Optimal device: GPU or Apple Silicon (MPS)
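Device selection for inference can be expressed as a small helper (the helper name is illustrative; the actual `torch` availability checks appear in the comment):

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Preferred torch device string: CUDA GPU, then Apple MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# In practice (assuming torch is installed):
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   model = model.to(device)
#   inputs = {k: v.to(device) for k, v in inputs.items()}
print(pick_device(False, True))  # mps
```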
Contact
- Author: Eduard Emkuzhev
- Email: info@copperline.info
- GitHub: https://github.com/kubataba
- Issues: open an issue on GitHub
For commercial licensing inquiries, please contact via email.
Model Card Authors: Eduard Emkuzhev
Last Updated: December 2025
Version: 1.0.1