🏷️ Urdu Turn Detection Model (DistilBERT)
This model detects End-of-Turn (EoT) in Urdu speech transcripts, classifying each sentence as:
- Complete → The speaker has finished their turn
- Incomplete → The speaker is pausing, trailing off, or not yet done
While this may appear similar to Voice Activity Detection (VAD), it solves a different and more linguistically complex problem:
VAD vs. Turn Detection (Key Difference)
VAD only detects whether sound is present or absent, i.e.:
➡️ “Is the speaker currently making noise or not?”
It cannot determine whether a sentence is logically complete, because it relies only on raw audio features (energy, frequency, silence gaps).
Turn Detection, however, is a semantic task:
➡️ “Has the speaker finished their thought, or are they about to continue?”
This requires understanding grammar, syntax, pause structures, and sentence completeness, which VAD does not and cannot evaluate.
Example:
| Utterance | VAD Output | Turn Detection Output |
|---|---|---|
| "Ψ§Ϊ―Ψ± ΨͺΩ ΩΩΨͺ ΩΎΨ± Ψ’ΨͺΫ ΨͺΩ..." | Speech present | Incomplete (thought not finished) |
| "Ω ΫΪΊ Ϊ―ΪΎΨ± Ψ¬Ψ§ Ψ±ΫΨ§ ΫΩΪΊΫ" | Speech present | Complete |
| 1 sec silence | Silence | Not applicable |
Thus, this model complements VAD:
- VAD → detects audio boundaries
- Turn Detection → detects linguistic boundaries
It is fine-tuned on a 10,000-sample Urdu dataset and optimized for real-time deployment in conversational AI, ASR pipelines, and voice assistants.
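To make the complementary roles concrete, the sketch below shows the linguistic half of such a pipeline: once VAD (or any segmenter) closes an audio segment and ASR returns a transcript, the model decides whether to respond or keep listening. The helper name `is_turn_complete` and the 0.5 threshold are illustrative choices, not part of the released code; the label index follows the usage example further down this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "PuristanLabs1/urdu-turn-detection-distilbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def is_turn_complete(transcript: str, threshold: float = 0.5) -> bool:
    """Decide whether an ASR transcript looks like a finished turn.

    In a live pipeline, `transcript` would come from your ASR system for the
    speech segment that VAD just closed; this model then supplies the
    linguistic (rather than acoustic) end-of-turn decision.
    """
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0][1].item() > threshold  # assumes index 1 = "Complete", as in the usage section

# After VAD detects a pause and ASR returns the text:
print(is_turn_complete("میں گھر جا رہا ہوں"))   # expected: True  (complete turn)
print(is_turn_complete("اگر تم وقت پر آتے تو"))  # expected: False (trailing thought)
```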
Model Variants
| Variant | Description | Size | Latency (CPU) | F1 Score |
|---|---|---|---|---|
| Base | Fine-tuned distilbert-base-multilingual-cased | ~516 MB | ~40 ms | 97.2% |
| Quantized | Dynamic INT8 quantization | ~393 MB | ~14 ms | 95.9% |
Performance
Evaluation on a held-out test set of 1,000 Urdu samples:
| Metric | Base Model | Quantized (INT8) |
|---|---|---|
| Accuracy | 97.2% | 95.9% |
| F1 Score | 97.2% | 95.9% |
| Precision | 96.9% | 96.8% |
| Recall | 97.6% | 95.1% |
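The held-out test set is not bundled with the model, so these numbers cannot be reproduced directly from this repository. To compute the same four metrics on your own labelled Urdu sentences, a minimal scikit-learn sketch (with placeholder `texts`/`labels`, assuming label 1 = Complete and 0 = Incomplete) could look like this:

```python
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "PuristanLabs1/urdu-turn-detection-distilbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Placeholder test data: replace with your own labelled sentences.
texts = ["میں گھر جا رہا ہوں", "اگر تم وقت پر آتے تو"]
labels = [1, 0]  # assumed convention: 1 = Complete, 0 = Incomplete

preds = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        preds.append(int(model(**inputs).logits.argmax(dim=-1)))

print("Accuracy :", accuracy_score(labels, preds))
print("F1       :", f1_score(labels, preds))
print("Precision:", precision_score(labels, preds))
print("Recall   :", recall_score(labels, preds))
```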
Dataset Details
- Total Samples: 10,000
- Label Balance: 50% Complete, 50% Incomplete
- Sources:
- 2,825 validated real-world Urdu EoT samples
- 7,175 synthetic Urdu samples
- Script: 100% Urdu (Nastaliq/Arabic), no Roman Urdu
- Domains: everyday conversational patterns, trailing constructs, pauses
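The dataset files themselves are not published with the model, so the exact storage format is not documented here; conceptually, each example is just one sentence paired with a binary label, along the lines of this hypothetical illustration:

```python
# Hypothetical illustration of individual examples; the actual dataset files
# and field names are not published with this model.
examples = [
    {"text": "میں گھر جا رہا ہوں", "label": "Complete"},    # finished turn
    {"text": "اگر تم وقت پر آتے تو", "label": "Incomplete"},  # trailing / unfinished thought
]
```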
Technical Specifications
Training Configuration
| Parameter | Value |
|---|---|
| Base Model | distilbert-base-multilingual-cased |
| Fine-tuning Method | Full Fine-tuning |
| Dataset | Urdu Turn Detection (~10,000 examples) |
| Language | Urdu (Ψ§Ψ±Ψ―Ω) |
| Learning Rate | 2e-5 |
| Scheduler | Linear Decay |
| Batch Size | 128 per device |
| Gradient Accumulation | 1 step |
| Max Sequence Length | 128 tokens |
| Optimizer | AdamW |
| Training Epochs | 5 |
| Total Steps | ~315 steps (63 steps/epoch) |
| Floating Point | FP16 (Mixed Precision) |
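The configuration above maps almost directly onto Hugging Face `TrainingArguments`; the sketch below shows how such a run could be set up under those settings. It is an approximation: argument names follow current `transformers` releases, the output directory is arbitrary, and the dataset loading is left as a placeholder because the training data is not released.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

base_model = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

training_args = TrainingArguments(
    output_dir="urdu-turn-detection-distilbert",  # arbitrary output path
    learning_rate=2e-5,                 # Learning Rate
    lr_scheduler_type="linear",         # Scheduler: Linear Decay
    per_device_train_batch_size=128,    # Batch Size per device
    gradient_accumulation_steps=1,      # Gradient Accumulation
    num_train_epochs=5,                 # Training Epochs
    fp16=True,                          # Mixed precision (FP16)
)

# Sequences would be tokenized with truncation to 128 tokens (Max Sequence Length).
# train_dataset / eval_dataset are placeholders: the ~10,000-example Urdu
# dataset used for this model is not published alongside it.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```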
Dataset Statistics
| Feature | Details |
|---|---|
| Total Examples | 10,000 |
| Train Set | 8,000 examples (80%) |
| Validation Set | 1,000 examples (10%) |
| Test Set | 1,000 examples (10%) |
| Balance | 50% Complete / 50% Incomplete |
| Source | Real-world validation (28%) + Quality Synthetic (72%) |
Training Metrics (Approximate)
| Metric | Value |
|---|---|
| Final Test Loss | ~0.08 |
| Validation F1 | 97.2% |
| Test F1 | 97.2% (Base) / 95.9% (Quantized) |
Trainable Parameters
- Total Parameters: ~135,000,000
- Trainable Parameters: ~135,000,000 (100%)
- Model Size: ~516 MB (FP32) -> ~393 MB (INT8)
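These counts can be checked directly against the published checkpoint; the snippet below prints the parameter totals and an estimated FP32 footprint (the 4-bytes-per-parameter size estimate is an approximation):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "PuristanLabs1/urdu-turn-detection-distilbert"
)

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
fp32_mb = total * 4 / (1024 ** 2)  # 4 bytes per FP32 parameter

print(f"Total parameters    : {total:,}")
print(f"Trainable parameters: {trainable:,}")
print(f"Approx. FP32 size   : {fp32_mb:.0f} MB")
```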
Usage
1️⃣ Using the Base Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PuristanLabs1/urdu-turn-detection-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "میں گھر جا رہا ہوں"  # Complete
# text = "اگر تم وقت پر آتے تو"  # Incomplete

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
score = probs[0][1].item()  # probability of the "Complete" class
label = "Complete" if score > 0.5 else "Incomplete"
print(f"Prediction: {label} ({score:.2f})")
```
2️⃣ Using the Quantized Model (Fast Inference)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "PuristanLabs1/urdu-turn-detection-distilbert-quantized"

# Load architecture + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Apply dynamic INT8 quantization to the Linear layers
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Load the pre-quantized weights
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/PuristanLabs1/urdu-turn-detection-distilbert-quantized/resolve/main/quantized_model.pt",
    map_location="cpu",
)
model.load_state_dict(state_dict)

# Inference is the same as with the base model.
```
🎯 Intended Use Cases
- Voice Assistants
- Dialogue Systems
- Real-time ASR segmentation
- Call Center AI / IVR turn-taking models
⚠️ Limitations
- No acoustic/prosodic features (text-only model)
- Short ambiguous utterances may require context
- Should not be used alone in safety-critical systems
License
MIT License