Russian-Kabardian Translation Model (ru-kbd-opus)

Fine-tuned MarianMT model for Russian to Kabardian (East Circassian) translation.

Model Description

This model translates from Russian to Kabardian, an endangered Northwest Caucasian language with approximately 500,000 speakers. Kabardian features complex polysynthetic morphology with 50+ consonants and ergative-absolutive alignment.

Intended Use

Primary uses:

  • Language preservation and revitalization
  • Educational materials translation
  • Research in low-resource machine translation
  • Cultural heritage digitization
  • Supporting Kabardian language learners

Limitations:

  • Non-commercial use only (CC BY-NC 4.0)
  • Best performance on everyday language
  • May struggle with highly technical or literary texts
  • Requires proper handling of Kabardian-specific character Ӏ (palochka)

Training Data

Dataset: adiga-ai/circassian-parallel-corpus

  • Subset: ru_kbd (Russian → Kabardian)
  • Total training examples: ~100K parallel sentence pairs
  • Dataset license: CC BY 4.0
  • Dataset author: Anzor Qunash (adiga.ai)
  • Content: Dictionary entries, folklore, proverbs, everyday expressions

Training Procedure

Base Model

  • Architecture: Marian Transformer (transformer-align)
  • Base: Helsinki-NLP/opus-mt-ru-uk (Russian-Ukrainian translation)
  • Transfer learning: Adapted from Russian-Ukrainian to Russian-Kabardian

Hyperparameters

base_model: Helsinki-NLP/opus-mt-ru-uk
training_examples: 200,000
epochs: 7
batch_size: 32
learning_rate: 3e-4
optimizer: AdamW
max_sequence_length: 128
warmup_steps: 500
weight_decay: 0.01
framework: transformers 4.36.0
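A fine-tuning configuration matching the hyperparameters above might look like the following sketch (argument names follow the `Seq2SeqTrainingArguments` API in transformers 4.36; `output_dir` is a placeholder, and the optimizer is the library's AdamW default):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above. max_sequence_length (128)
# is applied at tokenization time, not here.
args = Seq2SeqTrainingArguments(
    output_dir="ru-kbd-opus",          # placeholder path
    num_train_epochs=7,
    per_device_train_batch_size=32,
    learning_rate=3e-4,
    warmup_steps=500,
    weight_decay=0.01,
    predict_with_generate=True,
)
```

The tokenized ru→kbd pairs (truncated to 128 tokens) would then be passed, together with the Helsinki-NLP/opus-mt-ru-uk base model, to a `Seq2SeqTrainer` for training.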

Special Preprocessing

The model uses a special character mapping for training:

  • Kabardian Ӏ (palochka) → I (Latin I) during training
  • I → Ӏ restored during inference

This ensures better tokenization compatibility with the MarianMT tokenizer.
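The mapping can be sketched as a pair of helper functions (a minimal sketch; the actual training pipeline may implement this differently):

```python
# Uppercase palochka (U+04C0) is mapped to Latin "I" before tokenization
# and restored after generation, per the preprocessing note above.
PALOCHKA = "\u04c0"  # Ӏ

def encode_palochka(text: str) -> str:
    """Replace Kabardian palochka with Latin I before tokenization."""
    return text.replace(PALOCHKA, "I")

def decode_palochka(text: str) -> str:
    """Restore palochka in model output. Assumes the output contains no
    legitimate Latin 'I', which holds for Kabardian Cyrillic text."""
    return text.replace("I", PALOCHKA)
```

Note that the round trip is only safe because Latin "I" does not otherwise occur in Kabardian Cyrillic output.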

Performance

Benchmark Results

Tested on 1,000 examples from adiga-ai/circassian-parallel-corpus:

Metric        Score
BLEU          18.65
chrF          52.66
TER           67.57
Exact match   4.2%
Speed         6.9 examples/sec
Avg. time     74 ms/example
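For intuition, chrF (often the most informative metric for morphologically rich languages like Kabardian) scores character n-gram overlap rather than word overlap. The reported score was presumably computed with a standard implementation such as sacreBLEU; a simplified sentence-level version looks like this:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: character n-gram F-beta score, averaged over
    n-gram orders 1..max_n, with whitespace removed."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # sentence shorter than n characters
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because it matches sub-word character sequences, chrF gives partial credit when the model gets the stem right but an affix wrong, which is why it can sit well above BLEU on this benchmark.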

Test Configuration:

  • Test size: 1,000 examples
  • Sampling: Every 50th sentence from corpus
  • Generation: beam_search (num_beams=4)
  • Device: Apple M-series (MPS)
  • Seed: 42 (reproducible)
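The sampling step described above is deterministic and can be reproduced with a one-liner (the seed affects generation, not this slice; `corpus` and the helper name are illustrative):

```python
def sample_test_set(corpus, step=50, size=1000):
    """Take every `step`-th sentence pair, capped at `size` examples.
    Deterministic, so the benchmark can be re-run on the same subset."""
    return corpus[::step][:size]

# Illustrative corpus of 100,000 pairs -> 2,000 candidates, capped at 1,000
pairs = [f"pair-{i}" for i in range(100_000)]
test_set = sample_test_set(pairs)
```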

Translation Examples

Russian (input) → Kabardian (output):

RU: Затем он счастлив. ("Then he is happy.")
KBD: ИтӀанэ ар насыпыфӀэщ

RU: Старых друзей отца не покидай, старых дорог отца не оставляй. ("Do not abandon your father's old friends; do not leave your father's old roads.")
KBD: Адэм и ныбжьэгъужьыр умыбгынэж, и адэжь гъуэгужьхэр къыумыгъанэ.

RU: придаточное предложение цели ("subordinate clause of purpose")
KBD: мурадым и псалъэухъа къызэрыгуэкӏ

RU: внушать детям уважение к старшим ("to instill in children respect for their elders")
KBD: сабийхэм нэхъыжьхэм пщӏэ хуащӏын

RU: Для того, кто не знает обычаев, почетное место становится проходным двором. ("For one who does not know the customs, the seat of honor becomes a thoroughfare.")
KBD: Хабзэ зымыщӀэм дежкӀэ жьантӀэр пхыкӀыпӀэ мэхъу

Note: The model handles morphologically complex Kabardian words and preserves the special palochka character (Ӏ).

How to Use

Installation

pip install transformers torch sentencepiece

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "kubataba/ru-kbd-opus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translation function
def translate_ru_to_kbd(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Restore Kabardian palochka
    return translation.replace('I', 'Ӏ')

# Example
russian_text = "Здравствуйте!"
kabardian_text = translate_ru_to_kbd(russian_text)
print(f"RU: {russian_text}")
print(f"KBD: {kabardian_text}")

Batch Translation

texts = [
    "Добрый день!",
    "Как вас зовут?",
    "Спасибо за помощь"
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translations = [
    tokenizer.decode(out, skip_special_tokens=True).replace('I', 'Ӏ') 
    for out in outputs
]

for src, tgt in zip(texts, translations):
    print(f"RU: {src} → KBD: {tgt}")

Limitations and Bias

  • Domain specificity: Best performance on conversational and dictionary-style text
  • Technical vocabulary: May struggle with modern technical terms not in training data
  • Literary style: Not optimized for highly stylized or poetic translations
  • Character encoding: Requires proper UTF-8 support for Kabardian Cyrillic + palochka (Ӏ)
  • Low-resource challenges: Limited by available parallel data (~189K pairs)
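On the character-encoding point: palochka is easy to confuse with visually near-identical characters, which silently breaks matching and evaluation. A small stdlib check (codepoint names per the Unicode character database; the helper name is illustrative) can catch mix-ups in input or output text:

```python
import unicodedata

# Three visually similar characters with distinct codepoints:
CHARS = {
    "\u04c0": "CYRILLIC LETTER PALOCHKA",        # Ӏ (correct for Kabardian)
    "\u04cf": "CYRILLIC SMALL LETTER PALOCHKA",  # ӏ (lowercase variant)
    "I": "LATIN CAPITAL LETTER I",               # common informal substitute
}

def audit_palochka(text: str) -> dict:
    """Count occurrences of each palochka look-alike in `text`."""
    return {name: text.count(ch) for ch, name in CHARS.items()}

# Sanity-check that the codepoints above are what we claim:
for ch, name in CHARS.items():
    assert unicodedata.name(ch) == name
```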

Ethical Considerations

This model supports the preservation of Kabardian, an endangered language listed in UNESCO's Atlas of the World's Languages in Danger.

Responsible use:

  • Verify translations for critical applications with native speakers
  • Use as a tool to support, not replace, human translators
  • Respect cultural context when translating culturally significant content
  • Acknowledge the model's limitations in formal or official contexts

About Kabardian Language

Kabardian (East Circassian, also known as Kabardino-Cherkess) is a Northwest Caucasian language spoken by approximately 500,000 people, primarily in:

  • Kabardino-Balkaria (Russia)
  • Karachay-Cherkessia (Russia)
  • Turkey (diaspora)
  • Middle East (diaspora)

Linguistic features:

  • 50+ consonant phonemes (one of the world's most complex phonological systems)
  • Polysynthetic morphology
  • Ergative-absolutive alignment
  • Distinctive character: Ӏ (palochka), used to mark ejective consonants and the glottal stop

License and Attribution

This Model

  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
  • Author: Eduard Emkuzhev
  • Year: 2025

Base Model

  • Helsinki-NLP/opus-mt-ru-uk (OPUS-MT project, Language Technology Research Group at the University of Helsinki)

Training Dataset

  • adiga-ai/circassian-parallel-corpus by Anzor Qunash (adiga.ai), released under CC BY 4.0

Citation

If you use this model in your research, please cite:

@misc{emkuzhev2025rukbd,
  author = {Eduard Emkuzhev},
  title = {Russian-Kabardian Neural Machine Translation Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kubataba/ru-kbd-opus}}
}

Please also cite the base model and dataset:

@misc{helsinki-nlp-opus-ru-uk,
  author = {Language Technology Research Group at the University of Helsinki},
  title = {OPUS-MT Russian-Ukrainian Translation Model},
  year = {2020},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk}}
}

@dataset{qunash2025circassian,
  author = {Anzor Qunash},
  title = {Circassian-Russian Parallel Text Corpus v1.0},
  year = {2025},
  publisher = {adiga.ai},
  url = {https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus}
}

Acknowledgments

  • Helsinki-NLP team for the excellent OPUS-MT base models
  • Anzor Qunash (adiga.ai) for creating and publishing the Circassian-Russian Parallel Corpus
  • Kabardian language community for preserving and promoting their language
  • All contributors to Circassian language digitization efforts


Technical Details

  • Framework: PyTorch + Transformers
  • Model size: ~300MB
  • Vocabulary size: ~57K tokens
  • Parameters: ~74M
  • Inference: CPU and GPU compatible
  • Optimal device: GPU or Apple Silicon (MPS)

Contact

For commercial licensing inquiries, please contact via email.


Model Card Authors: Eduard Emkuzhev

Last Updated: December 2025

Version: 1.0.1
