Russian-Kabardian Translation Model (ru-kbd-opus)

Fine-tuned MarianMT model for Russian to Kabardian (East Circassian) translation.

Model Description

This model translates from Russian to Kabardian, an endangered Northwest Caucasian language with approximately 500,000 speakers. Kabardian features complex polysynthetic morphology with 50+ consonants and ergative-absolutive alignment.

Intended Use

Primary uses:

  • Language preservation and revitalization
  • Educational materials translation
  • Research in low-resource machine translation
  • Cultural heritage digitization
  • Supporting Kabardian language learners

Limitations:

  • Non-commercial use only (CC BY-NC 4.0)
  • Best performance on everyday language
  • May struggle with highly technical or literary texts
  • Requires proper handling of Kabardian-specific character Ӏ (palochka)

Training Data

Dataset: adiga-ai/circassian-parallel-corpus

  • Subset: ru_kbd (Russian → Kabardian)
  • Total training examples: ~100K parallel sentence pairs
  • Dataset license: CC BY 4.0
  • Dataset author: Anzor Qunash (adiga.ai)
  • Content: Dictionary entries, folklore, proverbs, everyday expressions

Training Procedure

Base Model

  • Architecture: Marian Transformer (transformer-align)
  • Base: Helsinki-NLP/opus-mt-ru-uk (Russian-Ukrainian translation)
  • Transfer learning: Adapted from Russian-Ukrainian to Russian-Kabardian

Hyperparameters

base_model: Helsinki-NLP/opus-mt-ru-uk
training_examples: 200,000
epochs: 7
batch_size: 32
learning_rate: 3e-4
optimizer: AdamW
max_sequence_length: 128
warmup_steps: 500
weight_decay: 0.01
framework: transformers 4.36.0
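A fine-tuning configuration matching the hyperparameters above might look like the following sketch (argument names follow the `Seq2SeqTrainingArguments` API in transformers 4.36; `output_dir` is a placeholder, and the optimizer is the library's AdamW default):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above. max_sequence_length (128)
# is applied at tokenization time, not here.
args = Seq2SeqTrainingArguments(
    output_dir="ru-kbd-opus",          # placeholder path
    num_train_epochs=7,
    per_device_train_batch_size=32,
    learning_rate=3e-4,
    warmup_steps=500,
    weight_decay=0.01,
    predict_with_generate=True,
)
```

The tokenized ru→kbd pairs (truncated to 128 tokens) would then be passed, together with the Helsinki-NLP/opus-mt-ru-uk base model, to a `Seq2SeqTrainer` for training.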

Special Preprocessing

The model uses a special character mapping for training:

  • Kabardian Ӏ (palochka) → I (Latin I) during training
  • I → Ӏ restored during inference

This ensures better tokenization compatibility with the MarianMT tokenizer.
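The mapping can be sketched as a pair of helper functions (a minimal sketch; the actual training pipeline may implement this differently):

```python
# Uppercase palochka (U+04C0) is mapped to Latin "I" before tokenization
# and restored after generation, per the preprocessing note above.
PALOCHKA = "\u04c0"  # Ӏ

def encode_palochka(text: str) -> str:
    """Replace Kabardian palochka with Latin I before tokenization."""
    return text.replace(PALOCHKA, "I")

def decode_palochka(text: str) -> str:
    """Restore palochka in model output. Assumes the output contains no
    legitimate Latin 'I', which holds for Kabardian Cyrillic text."""
    return text.replace("I", PALOCHKA)
```

Note that the round trip is only safe because Latin "I" does not otherwise occur in Kabardian Cyrillic output.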

Performance

Benchmark Results

Tested on 1,000 examples from adiga-ai/circassian-parallel-corpus:

Metric        Score
BLEU          18.65
chrF          52.66
TER           67.57
Exact match   4.2%
Speed         6.9 examples/sec
Avg. time     74 ms/example
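For intuition, chrF (often the most informative metric for morphologically rich languages like Kabardian) scores character n-gram overlap rather than word overlap. The reported score was presumably computed with a standard implementation such as sacreBLEU; a simplified sentence-level version looks like this:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: character n-gram F-beta score, averaged over
    n-gram orders 1..max_n, with whitespace removed."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # sentence shorter than n characters
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because it matches sub-word character sequences, chrF gives partial credit when the model gets the stem right but an affix wrong, which is why it can sit well above BLEU on this benchmark.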

Test Configuration:

  • Test size: 1,000 examples
  • Sampling: Every 50th sentence from corpus
  • Generation: beam_search (num_beams=4)
  • Device: Apple M-series (MPS)
  • Seed: 42 (reproducible)
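The sampling step described above is deterministic and can be reproduced with a one-liner (the seed affects generation, not this slice; `corpus` and the helper name are illustrative):

```python
def sample_test_set(corpus, step=50, size=1000):
    """Take every `step`-th sentence pair, capped at `size` examples.
    Deterministic, so the benchmark can be re-run on the same subset."""
    return corpus[::step][:size]

# Illustrative corpus of 100,000 pairs -> 2,000 candidates, capped at 1,000
pairs = [f"pair-{i}" for i in range(100_000)]
test_set = sample_test_set(pairs)
```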

Translation Examples

Russian (input) → Kabardian (output):

RU: Затем он счастлив. ("Then he is happy.")
KBD: ИтӀанэ ар насыпыфӀэщ

RU: Старых друзей отца не покидай, старых дорог отца не оставляй. ("Do not abandon your father's old friends; do not leave your father's old roads.")
KBD: Адэм и ныбжьэгъужьыр умыбгынэж, и адэжь гъуэгужьхэр къыумыгъанэ.

RU: придаточное предложение цели ("subordinate clause of purpose")
KBD: мурадым и псалъэухъа къызэрыгуэкӏ

RU: внушать детям уважение к старшим ("to instill in children respect for their elders")
KBD: сабийхэм нэхъыжьхэм пщӏэ хуащӏын

RU: Для того, кто не знает обычаев, почетное место становится проходным двором. ("For one who does not know the customs, the seat of honor becomes a thoroughfare.")
KBD: Хабзэ зымыщӀэм дежкӀэ жьантӀэр пхыкӀыпӀэ мэхъу

Note: The model handles morphologically complex Kabardian words and preserves the special palochka character (Ӏ).

How to Use

Installation

pip install transformers torch sentencepiece

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "kubataba/ru-kbd-opus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translation function
def translate_ru_to_kbd(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Restore Kabardian palochka
    return translation.replace('I', 'Ӏ')

# Example
russian_text = "Здравствуйте!"
kabardian_text = translate_ru_to_kbd(russian_text)
print(f"RU: {russian_text}")
print(f"KBD: {kabardian_text}")

Batch Translation

texts = [
    "Добрый день!",
    "Как вас зовут?",
    "Спасибо за помощь"
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translations = [
    tokenizer.decode(out, skip_special_tokens=True).replace('I', 'Ӏ') 
    for out in outputs
]

for src, tgt in zip(texts, translations):
    print(f"RU: {src} → KBD: {tgt}")

Limitations and Bias

  • Domain specificity: Best performance on conversational and dictionary-style text
  • Technical vocabulary: May struggle with modern technical terms not in training data
  • Literary style: Not optimized for highly stylized or poetic translations
  • Character encoding: Requires proper UTF-8 support for Kabardian Cyrillic + palochka (Ӏ)
  • Low-resource challenges: Limited by available parallel data (~189K pairs)
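On the character-encoding point: palochka is easy to confuse with visually near-identical characters, which silently breaks matching and evaluation. A small stdlib check (codepoint names per the Unicode character database; the helper name is illustrative) can catch mix-ups in input or output text:

```python
import unicodedata

# Three visually similar characters with distinct codepoints:
CHARS = {
    "\u04c0": "CYRILLIC LETTER PALOCHKA",        # Ӏ (correct for Kabardian)
    "\u04cf": "CYRILLIC SMALL LETTER PALOCHKA",  # ӏ (lowercase variant)
    "I": "LATIN CAPITAL LETTER I",               # common informal substitute
}

def audit_palochka(text: str) -> dict:
    """Count occurrences of each palochka look-alike in `text`."""
    return {name: text.count(ch) for ch, name in CHARS.items()}

# Sanity-check that the codepoints above are what we claim:
for ch, name in CHARS.items():
    assert unicodedata.name(ch) == name
```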

Ethical Considerations

This model supports the preservation of Kabardian, an endangered language listed in UNESCO's Atlas of the World's Languages in Danger.

Responsible use:

  • Verify translations for critical applications with native speakers
  • Use as a tool to support, not replace, human translators
  • Respect cultural context when translating culturally significant content
  • Acknowledge the model's limitations in formal or official contexts

About Kabardian Language

Kabardian (East Circassian, also known as Kabardino-Cherkess) is a Northwest Caucasian language spoken by approximately 500,000 people, primarily in:

  • Kabardino-Balkaria (Russia)
  • Karachay-Cherkessia (Russia)
  • Turkey (diaspora)
  • Middle East (diaspora)

Linguistic features:

  • 50+ consonant phonemes (one of the world's most complex phonological systems)
  • Polysynthetic morphology
  • Ergative-absolutive alignment
  • Distinctive character: Ӏ (palochka), used to mark ejective consonants and the glottal stop

License and Attribution

This Model

  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
  • Author: Eduard Emkuzhev
  • Year: 2025

Base Model

  • Helsinki-NLP/opus-mt-ru-uk (OPUS-MT project, Language Technology Research Group at the University of Helsinki)

Training Dataset

  • adiga-ai/circassian-parallel-corpus by Anzor Qunash (adiga.ai), released under CC BY 4.0

Citation

If you use this model in your research, please cite:

@misc{emkuzhev2025rukbd,
  author = {Eduard Emkuzhev},
  title = {Russian-Kabardian Neural Machine Translation Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kubataba/ru-kbd-opus}}
}

Please also cite the base model and dataset:

@misc{helsinki-nlp-opus-ru-uk,
  author = {Language Technology Research Group at the University of Helsinki},
  title = {OPUS-MT Russian-Ukrainian Translation Model},
  year = {2020},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Helsinki-NLP/opus-mt-ru-uk}}
}

@dataset{qunash2025circassian,
  author = {Anzor Qunash},
  title = {Circassian-Russian Parallel Text Corpus v1.0},
  year = {2025},
  publisher = {adiga.ai},
  url = {https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus}
}

Acknowledgments

  • Helsinki-NLP team for the excellent OPUS-MT base models
  • Anzor Qunash (adiga.ai) for creating and publishing the Circassian-Russian Parallel Corpus
  • Kabardian language community for preserving and promoting their language
  • All contributors to Circassian language digitization efforts


Technical Details

  • Framework: PyTorch + Transformers
  • Model size: ~300MB
  • Vocabulary size: ~57K tokens
  • Parameters: ~74M
  • Inference: CPU and GPU compatible
  • Optimal device: GPU or Apple Silicon (MPS)

Contact

For commercial licensing inquiries, please contact via email.


Model Card Authors: Eduard Emkuzhev

Last Updated: December 2025

Version: 1.0.1
