AI & ML interests
Tajik language, Persian NLP, low-resource languages, machine translation, language models, linguistic resources
TajikNLPWorld – Persian NLP & Low‑Resource Languages Research Hub
TajikNLPWorld is a collaborative research initiative dedicated to advancing natural language processing for Tajik, Persian/Dari, and low‑resource languages in the Iranian language family. We develop state‑of‑the‑art language models, machine translation systems, linguistic resources, and educational tools to empower under‑represented languages in the digital age.
🎯 Our Mission
- Build open‑source language models for Tajik and other Persian varieties (Dari, Hazaragi).
- Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
- Advance machine translation between Tajik, Persian, Dari, and major world languages.
- Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
- Foster a community of researchers, developers, and native speakers working together on language technology.
🚀 Interactive Demos
Explore our live Hugging Face Spaces and try out our models directly in your browser:
🔤 Language Models
- TajikGPT Playground – Generate and analyze Tajik text with our latest causal LM.
- PersianBERT Explorer – Masked language modelling for Tajik, Persian, and Dari.
- Multilingual Embeddings – Compare word/sentence vectors across Iranian languages.
🌐 Machine Translation
- Tajik ↔ Persian Translator – Neural translation between Tajik and Persian (Farsi).
- Tajik ↔ Russian Translator – Translation demo for the Central Asian context.
- Persian Multi-Dialect Translation – Translate between Tajik, Dari, and Iranian Persian.
📚 Linguistic Tools
- Tajik Morphological Analyzer – Interactive segmentation and POS tagging (Cyrillic & Perso-Arabic scripts).
- Named Entity Recognition for Tajik – Identify persons, locations, organizations.
- Script Converter – Convert between Cyrillic Tajik and Perso-Arabic script.
📊 Data & Benchmarks
- Tajik Corpus Explorer – Browse and query our curated text collections.
- Persian NLP Leaderboard – Compare model performance on Tajik, Persian, and Dari tasks.
- Annotation Tools – Help us improve datasets with your feedback.
Click on any demo to start experimenting – no installation required!
🧠 Research Focus Areas
🏔️ Tajik Language Technologies
- Creation of the first large‑scale pretrained models for Tajik (both Cyrillic and Perso-Arabic scripts).
- Morphological disambiguation and syntactic parsing for Tajik.
- Speech recognition and synthesis for Tajik (coming soon).
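Because the models above target Tajik in both Cyrillic and Perso-Arabic scripts, a preprocessing step typically has to decide which script a given sample uses. The following is a minimal sketch of such a check using only Python's standard library; the function name `detect_script` and the label strings are illustrative assumptions, not part of any TajikNLPWorld release.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Return the dominant script of `text` by counting its letters.

    Labels ("cyrillic", "perso-arabic", "other", "unknown") are
    illustrative choices, not an established convention.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("ARABIC"):
            # Persian-specific letters also live in the Unicode Arabic blocks
            counts["perso-arabic"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "unknown"  # no letters at all
    return counts.most_common(1)[0][0]
```

For example, `detect_script("Забони тоҷикӣ")` returns `"cyrillic"` and `detect_script("زبان فارسی")` returns `"perso-arabic"`. A majority vote over letters is a deliberately simple heuristic; mixed-script documents would likely need per-sentence or per-token handling.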
📜 Persian & Iranian NLP
- Cross‑dialectal transfer learning among Tajik, Dari, and Iranian Persian.
- Unified tokenization and subword models for the Persian language continuum.
- Machine translation between Persian varieties and major languages.
📉 Low‑Resource NLP
- Data augmentation and semi‑supervised learning techniques.
- Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
- Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.
🤖 Language Models
- Pretraining from scratch and continued pretraining on Persian/Tajik corpora.
- Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
- Evaluation and bias analysis of Persian language models.
📖 Linguistic Resources
- Corpora: News, literature, web‑crawled texts, social media.
- Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
- Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.
📦 Models & Datasets
We release all our models and datasets on the Hugging Face Hub under open licenses.
| Model / Dataset | Description | Link |
|---|---|---|
| TajikBERT | BERT‑base model pretrained on Tajik Cyrillic and Perso-Arabic texts | 🤗 Hub |
| Persian‑mT5 | Multilingual T5 fine‑tuned on Tajik, Persian, and Dari | 🤗 Hub |
| Tajik‑MT‑TgFa | Transformer‑based translation model (Tajik ↔ Persian/Farsi) | 🤗 Hub |
| Tajik‑NER | Named entity recognition model for Tajik | 🤗 Hub |
| TajikCorpus v1.0 | 150M-token corpus from news, books, and websites (Cyrillic and Perso-Arabic scripts) | 🤗 Dataset |
| Persian‑Dialect‑Bench | Parallel sentences for Tajik, Dari, and Iranian Persian | 🤗 Dataset |
More models and datasets are added regularly. Follow our organization page for updates.
📚 Educational Resources
We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.
- Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
- Video Lectures – Recorded talks on Persian/Tajik NLP, data collection, and model training.
- Course Materials – Slides, readings, and assignments from our university courses.
- Blog Posts – Deep dives into challenges and solutions for Tajik and Persian languages.
📝 Selected Publications
- "TajikBERT: A Pretrained Language Model for Tajik in Cyrillic and Perso-Arabic Scripts" – LREC 2024
- "Bridging Dialects: Machine Translation for Tajik, Dari, and Persian" – WMT 2023
- "Building a Named Entity Recognition Dataset for Tajik" – IranNLP 2023
- "Multilingual Representations for Iranian Languages: A Comparative Study" – EMNLP 2022
- "Tajik Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2022
The full list, with links to PDFs, is available on our Publications Page.
🤝 Get Involved
We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.
For Researchers
- Use our models and datasets in your work (and cite us!).
- Collaborate on joint papers and grant proposals.
- Contribute new benchmarks or evaluation tasks.
For Developers
- Integrate our models into your applications.
- Report bugs or suggest improvements via GitHub Issues.
- Submit pull requests to our open‑source repositories.
For Native Speakers & Linguists
- Help us validate translations and annotations.
- Share texts or corpora (with permission) to enrich our data.
- Provide feedback on model outputs to reduce errors.
For Students
- Use our demos and tutorials for learning.
- Participate in our mentorship program or summer schools.
- Start your own research project with our support.
🌐 Connect With Us
- 🤗 Hugging Face: TajikNLPWorld – Models, datasets, and spaces.
- 💻 GitHub: TajikNLPWorld – Source code, development, and issue tracking.
- 📧 Email: contact@tajiknlp.world – General inquiries and collaboration.
- 🐦 Twitter/X: @TajikNLP – News and updates.
- 📝 Blog: Medium/TajikNLPWorld – In‑depth articles.
🔄 Ecosystem Integration
Our work is integrated with the broader Hugging Face ecosystem:
- Models on the Hub with easy‑to‑use pipelines.
- Datasets with streaming and evaluation scripts.
- Spaces for interactive demos and educational tools.
- Gradio apps for user‑friendly interfaces.
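As a rough illustration of the pipeline-based usage listed above, the sketch below wraps a `transformers` fill-mask pipeline. The repository id `TajikNLPWorld/TajikBERT` is an assumption inferred from the organization and model names on this page, and the helpers `load_fill_mask` and `top_predictions` are hypothetical names introduced here; check the organization page for the actual identifiers.

```python
from typing import Any, List, Tuple

# Hypothetical repository id, used for illustration only; the real
# model name on the Hub may differ.
MODEL_ID = "TajikNLPWorld/TajikBERT"


def load_fill_mask(model_id: str = MODEL_ID) -> Any:
    """Build a fill-mask pipeline (downloads the model, needs network access)."""
    from transformers import pipeline  # lazy import: optional dependency
    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask: Any, template: str, k: int = 3) -> List[Tuple[str, float]]:
    """Fill the single `{mask}` slot in `template`; return (token, score) pairs."""
    # Use the tokenizer's own mask token instead of hard-coding "[MASK]".
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Example usage (requires `pip install transformers torch` and a valid model id):
#   fill = load_fill_mask()
#   for token, score in top_predictions(fill, "Душанбе пойтахти {mask} аст."):
#       print(f"{token}\t{score:.3f}")
```

Keeping the model loading separate from the prediction helper means the helper can be exercised with any object that behaves like a fill-mask pipeline, which is convenient for offline testing.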
Empowering Tajik and Persian languages through open science and community collaboration.