🏷️ Urdu Turn Detection Model (DistilBERT)


This model detects End-of-Turn (EoT) in Urdu speech transcripts, classifying each sentence as:

  • Complete → The speaker has finished their turn
  • Incomplete → The speaker is pausing / trailing off / not yet done

While this may appear similar to Voice Activity Detection (VAD), it solves a different and more linguistically complex problem:

πŸ” VAD vs. Turn Detection (Key Difference)

VAD only detects whether sound is present or absent, i.e.,
➡️ "Is the speaker currently making noise or not?"

It cannot determine whether a sentence is logically complete, because it relies only on raw audio features (energy, frequency, silence gaps).

Turn Detection, however, is a semantic task:
➡️ "Has the speaker finished their thought, or are they about to continue?"

This requires understanding grammar, syntax, pause structures, and sentence completeness, which VAD does not and cannot evaluate.

Example:

| Utterance | VAD Output | Turn Detection Output |
|---|---|---|
| "اگر تم وقت پر آتے تو..." | Speech present | Incomplete (thought not finished) |
| "میں گھر جا رہا ہوں۔" | Speech present | Complete |
| (1 sec of silence) | Silence | Not applicable |

Thus, this model complements VAD:

  • VAD → detects audio boundaries
  • Turn Detection → detects linguistic boundaries

It is fine-tuned on a 10,000-sample Urdu dataset and optimized for real-time deployment in conversational AI, ASR pipelines, and voice assistants.
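
In a streaming pipeline, one simple way to combine the two signals is to end a turn only when both agree: VAD reports silence and the classifier judges the transcript complete. The sketch below is illustrative only; vad_is_silent and transcript_so_far are hypothetical inputs from your own VAD/ASR stack, and the label convention (index 1 = Complete) follows the usage example later in this card.

import torch

def is_end_of_turn(vad_is_silent, transcript_so_far, tokenizer, model, threshold=0.5):
    # Acoustic gate: if the speaker is still audibly talking, the turn stays open.
    if not vad_is_silent:
        return False
    # Linguistic gate: is the transcript semantically complete?
    inputs = tokenizer(transcript_so_far, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0][1].item() > threshold  # index 1 = "Complete"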


🚀 Model Variants

| Variant | Description | Size | Latency (CPU) | F1 Score |
|---|---|---|---|---|
| Base | Fine-tuned distilbert-base-multilingual-cased | ~516 MB | ~40 ms | 97.2% |
| Quantized | Dynamic INT8 Quantization | ~393 MB | ~14 ms | 95.9% |
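
Latency depends heavily on hardware, so treat the figures above as indicative. A rough way to measure single-utterance CPU latency yourself (a sketch, assuming model and tokenizer are loaded as in the Usage section below):

import time
import torch

def mean_latency_ms(model, tokenizer, text, runs=100):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)  # warm-up pass, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
    return (time.perf_counter() - start) / runs * 1000  # average ms per call

print(mean_latency_ms(model, tokenizer, "میں گھر جا رہا ہوں"))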

📊 Performance

Evaluation on a held-out test set of 1,000 Urdu samples:

| Metric | Base Model | Quantized (INT8) |
|---|---|---|
| Accuracy | 97.2% | 95.9% |
| F1 Score | 97.2% | 95.9% |
| Precision | 96.9% | 96.8% |
| Recall | 97.6% | 95.1% |
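
The held-out test set is not bundled with the model, but given any labeled Urdu sentences you can recompute these metrics with scikit-learn. The snippet below is a generic sketch in which texts and labels (0 = Incomplete, 1 = Complete) stand in for your own data.

import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(model, tokenizer, texts, labels):
    preds = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            preds.append(model(**inputs).logits.argmax(dim=-1).item())
    acc = accuracy_score(labels, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    print(f"Accuracy {acc:.3f} | Precision {prec:.3f} | Recall {rec:.3f} | F1 {f1:.3f}")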

📚 Dataset Details

  • Total Samples: 10,000
  • Label Balance: 50% Complete, 50% Incomplete
  • Sources:
    • 2,825 validated real-world Urdu EoT samples
    • 7,175 synthetic Urdu samples
  • Script: 100% Urdu (Nastaliq/Arabic), no Roman Urdu
  • Domains: everyday conversational patterns, trailing constructs, pauses

Technical Specifications

Training Configuration

| Parameter | Value |
|---|---|
| Base Model | distilbert-base-multilingual-cased |
| Fine-tuning Method | Full fine-tuning |
| Dataset | Urdu Turn Detection (~10,000 examples) |
| Language | Urdu (اردو) |
| Learning Rate | 2e-5 |
| Scheduler | Linear decay |
| Batch Size | 128 per device |
| Gradient Accumulation | 1 step |
| Max Sequence Length | 128 tokens |
| Optimizer | AdamW |
| Training Epochs | 5 |
| Total Steps | ~315 (63 steps/epoch) |
| Floating Point | FP16 (mixed precision) |
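
Expressed as Hugging Face TrainingArguments, the configuration above corresponds roughly to the following sketch (not the exact training script; dataset loading and the Trainer call are omitted, and max_length=128 is applied at tokenization time):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="urdu-turn-detection",   # hypothetical output path
    learning_rate=2e-5,
    lr_scheduler_type="linear",         # linear decay
    per_device_train_batch_size=128,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
    fp16=True,                          # mixed-precision training
)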

Dataset Statistics

| Feature | Details |
|---|---|
| Total Examples | 10,000 |
| Train Set | 8,000 examples (80%) |
| Validation Set | 1,000 examples (10%) |
| Test Set | 1,000 examples (10%) |
| Balance | 50% Complete / 50% Incomplete |
| Source | Real-world validation (28%) + quality synthetic (72%) |
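
A stratified 80/10/10 split like this can be reproduced with the datasets library. A sketch, assuming the 10,000 labeled examples live in a Dataset named ds with a ClassLabel column called "label" (stratify_by_column requires a ClassLabel feature):

# First carve off 20%, then split that half-and-half into validation and test.
splits = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
held_out = splits["test"].train_test_split(test_size=0.5, seed=42, stratify_by_column="label")
train_ds, val_ds, test_ds = splits["train"], held_out["train"], held_out["test"]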

Training Metrics (Approximate)

| Metric | Value |
|---|---|
| Final Test Loss | ~0.08 |
| Validation F1 | 97.2% |
| Test F1 | 97.2% (Base) / 95.9% (Quantized) |

Trainable Parameters

  • Total Parameters: ~135,000,000
  • Trainable Parameters: ~135,000,000 (100%)
  • Model Size: ~516 MB (FP32) → ~393 MB (INT8)
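
Both figures are easy to verify after loading the base model (a quick sketch, reusing model from the Usage section):

n_params = sum(p.numel() for p in model.parameters())
fp32_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
print(f"{n_params / 1e6:.0f}M parameters, ~{fp32_mb:.0f} MB in FP32")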

🔧 Usage

1️⃣ Using the Base Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PuristanLabs1/urdu-turn-detection-distilbert" 

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "میں گھر جا رہا ہوں"  # Complete
# text = "اگر تم وقت پر آتے تو"  # Incomplete

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    score = probs[0][1].item()  # probability of class index 1 ("Complete")

label = "Complete" if score > 0.5 else "Incomplete"
print(f"Prediction: {label} ({score:.2f})")

2️⃣ Using the Quantized Model (Fast Inference)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "PuristanLabs1/urdu-turn-detection-distilbert-quantized" 

# Load architecture + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Apply dynamic quantization
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Load quantized weights
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/PuristanLabs1/urdu-turn-detection-distilbert-quantized/resolve/main/quantized_model.pt",
    map_location="cpu"
)
model.load_state_dict(state_dict)

# Inference same as base model
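
For completeness, inference then mirrors the base-model example, reusing the tokenizer and the quantized model from above:

text = "میں گھر جا رہا ہوں"  # expected: Complete
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print("Complete" if probs[0][1].item() > 0.5 else "Incomplete")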

🎯 Intended Use Cases

  • Voice Assistants
  • Dialogue Systems
  • Real-time ASR segmentation
  • Call Center AI / IVR turn-taking models

⚠️ Limitations

  • No acoustic/prosodic features (text-only model)
  • Short ambiguous utterances may require context
  • Should not be used alone in safety-critical systems

πŸ“ License

MIT License
