Tajik NLP Community / Natural Language Processing for Tajik Language

AI & ML interests

Tajik language, Persian NLP, low-resource languages, machine translation, language models, linguistic resources

TajikNLPWorld – Persian NLP & Low‑Resource Languages Research Hub

TajikNLPWorld is a collaborative research initiative dedicated to advancing natural language processing for Tajik, Persian/Dari, and other low‑resource languages in the Iranian language family. We develop language models, machine translation systems, linguistic resources, and educational tools to support under‑represented languages in the digital age.

🎯 Our Mission

Build open‑source language models for Tajik and other Persian varieties (Dari, Hazaragi).
Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
Advance machine translation between Tajik, Persian, Dari, and major world languages.
Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
Foster a community of researchers, developers, and native speakers working together on language technology.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

TajikGPT Playground – Generate and analyze Tajik text with our latest causal LM.
PersianBERT Explorer – Masked language modelling for Tajik, Persian, and Dari.
Multilingual Embeddings – Compare word/sentence vectors across Iranian languages.

🌐 Machine Translation

Tajik ↔ Persian Translator – Neural translation between Tajik and Persian (Farsi).
Tajik ↔ Russian Translator – Translation demo for the Central Asian context.
Persian Multi-Dialect Translation – Translate between Tajik, Dari, and Iranian Persian.

📚 Linguistic Tools

Tajik Morphological Analyzer – Interactive segmentation and POS tagging (Cyrillic & Perso-Arabic scripts).
Named Entity Recognition for Tajik – Identify persons, locations, organizations.
Script Converter – Convert between Cyrillic Tajik and Perso-Arabic script.
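
Any tool that accepts both Cyrillic and Perso-Arabic input first has to tell which script it was given. The sketch below shows one simple way to do that with Python's standard library. It is an illustration only, not the code behind the demos above; the function name `detect_script` and its return labels are hypothetical.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Return 'cyrillic', 'arabic', 'mixed', or 'unknown' for a string.

    Hypothetical helper: classifies letters by their Unicode character name.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, whitespace, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("ARABIC"):
            counts["arabic"] += 1
        else:
            counts["other"] += 1

    cyr, ara = counts["cyrillic"], counts["arabic"]
    if cyr == 0 and ara == 0:
        return "unknown"
    if cyr and ara:
        return "mixed"
    return "cyrillic" if cyr else "arabic"
```

Calling `detect_script("Забони тоҷикӣ")` returns `"cyrillic"`, and a string with no Cyrillic or Arabic letters returns `"unknown"`. A real converter would need far more than this, since Perso-Arabic script usually omits short vowels that Cyrillic Tajik writes out.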

📊 Data & Benchmarks

Tajik Corpus Explorer – Browse and query our curated text collections.
Persian NLP Leaderboard – Compare model performance on Tajik, Persian, and Dari tasks.
Annotation Tools – Help us improve datasets with your feedback.

Click on any demo to start experimenting – no installation required!

🧠 Research Focus Areas

🏔️ Tajik Language Technologies

Creation of the first large‑scale pretrained models for Tajik (both Cyrillic and Perso-Arabic scripts).
Morphological disambiguation and syntactic parsing for Tajik.
Speech recognition and synthesis for Tajik (coming soon).

📜 Persian & Iranian NLP

Cross‑dialectal transfer learning among Tajik, Dari, and Iranian Persian.
Unified tokenization and subword models for the Persian language continuum.
Machine translation between Persian varieties and major languages.
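
A shared subword vocabulary across Persian varieties usually depends on consistent character encoding, because the same word can be typed with Arabic or Persian code points for yeh and kaf. The sketch below shows a commonly used normalization step. It is an assumption about a typical preprocessing pipeline rather than a description of our tokenizer, and `normalize_persian` is a hypothetical name.

```python
import unicodedata

# Arabic-specific code points mapped to their Persian counterparts.
_ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_persian(text: str) -> str:
    """Apply NFC, unify yeh/kaf variants, and drop tatweel (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_ARABIC_TO_PERSIAN)
    # U+0640 ARABIC TATWEEL only stretches a word visually; it carries no meaning.
    return text.replace("\u0640", "")
```

Without a step like this, a tokenizer treats the two encodings of one word as different strings and splits the already small training data between them. Cyrillic Tajik text passes through unchanged apart from NFC composition.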

📉 Low‑Resource NLP

Data augmentation and semi‑supervised learning techniques.
Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.

🤖 Language Models

Pretraining from scratch and continued pretraining on Persian/Tajik corpora.
Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
Evaluation and bias analysis of Persian language models.

📖 Linguistic Resources

Corpora: News, literature, web‑crawled texts, social media.
Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.

📦 Models & Datasets

We release all our models and datasets on the Hugging Face Hub under open licenses.

Model / Dataset	Description	Link
TajikBERT	BERT‑base model pretrained on Tajik Cyrillic and Perso-Arabic texts	🤗 Hub
Persian‑mT5	Multilingual T5 fine‑tuned on Tajik, Persian, and Dari	🤗 Hub
Tajik‑MT‑TgFa	Transformer‑based translation model (Tajik ↔ Persian/Farsi)	🤗 Hub
Tajik‑NER	Named entity recognition model for Tajik	🤗 Hub
TajikCorpus v1.0	150M‑token corpus from news, books, and websites (Cyrillic and Perso-Arabic scripts)	🤗 Dataset
Persian‑Dialect‑Bench	Parallel sentences for Tajik, Dari, and Iranian Persian	🤗 Dataset

More models and datasets are added regularly. Follow our organization page for updates.

📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
Video Lectures – Recorded talks on Persian/Tajik NLP, data collection, and model training.
Course Materials – Slides, readings, and assignments from our university courses.
Blog Posts – Deep dives into challenges and solutions for Tajik and Persian languages.

📝 Selected Publications

"TajikBERT: A Pretrained Language Model for Tajik in Cyrillic and Perso-Arabic Scripts" – LREC 2024
"Bridging Dialects: Machine Translation for Tajik, Dari, and Persian" – WMT 2023
"Building a Named Entity Recognition Dataset for Tajik" – IranNLP 2023
"Multilingual Representations for Iranian Languages: A Comparative Study" – EMNLP 2022
"Tajik Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2022

The full list, with links to PDFs, is available on our Publications Page.

🤝 Get Involved

We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

For Researchers

Use our models and datasets in your work (and cite us!).
Collaborate on joint papers and grant proposals.
Contribute new benchmarks or evaluation tasks.

For Developers

Integrate our models into your applications.
Report bugs or suggest improvements via GitHub Issues.
Submit pull requests to our open‑source repositories.

For Native Speakers & Linguists

Help us validate translations and annotations.
Share texts or corpora (with permission) to enrich our data.
Provide feedback on model outputs to reduce errors.

For Students

Use our demos and tutorials for learning.
Participate in our mentorship program or summer schools.
Start your own research project with our support.

🌐 Connect With Us

🤗 Hugging Face: TajikNLPWorld – Models, datasets, and spaces.
💻 GitHub: TajikNLPWorld – Source code, development, and issue tracking.
📧 Email: contact@tajiknlp.world – General inquiries and collaboration.
🐦 Twitter/X: @TajikNLP – News and updates.
📝 Blog: Medium/TajikNLPWorld – In‑depth articles.

🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:

Models on the Hub with easy‑to‑use pipelines.
Datasets with streaming and evaluation scripts.
Spaces for interactive demos and educational tools.
Gradio apps for user‑friendly interfaces.

Empowering Tajik and Persian languages through open science and community collaboration.
