SmolLM2-135M Arabic End-of-Utterance Detector

Fine-tuned SmolLM2-135M model for detecting end-of-utterance (EOU) in Arabic conversations.

Model Description

This model predicts when an Arabic speaker has finished their turn in a conversation based on transcribed speech. It's designed for real-time voice assistants, LiveKit agents, and conversational AI systems.

Key Features:

  • 🎯 High Accuracy: F1-Score of 0.913
  • 🌍 Multi-Dialect: Supports Levantine, Egyptian, and Gulf Arabic
  • ⚡ Fast Inference: <50ms per prediction on GPU
  • 🔄 Context-Aware: Can use previous utterances for better predictions
  • 🎙️ Production-Ready: Integrated with LiveKit for real-time use

Performance

Metric      Score
---------   -----
F1 Score    0.913
Accuracy    0.913
Precision   0.906
Recall      0.921
AUC-ROC     0.958
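
For reference, these metrics can be recomputed from test-set predictions with scikit-learn. A minimal sketch, assuming y_true (gold labels), y_pred (argmax predictions), and y_prob (EOU-class probabilities) are arrays collected from the test split:

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def report(y_true, y_pred, y_prob):
    # y_prob holds the probability of the positive (EOU) class
    print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred):.3f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
    print(f"AUC-ROC:   {roc_auc_score(y_true, y_prob):.3f}")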

Inference Speed:

  • CPU: 30-50ms per prediction
  • GPU (RTX 4070): 10-20ms per prediction
  • Batch (32 samples): 3-6ms per prediction
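
These numbers can be sanity-checked on your own hardware with a simple timing loop. A minimal sketch; the warm-up and iteration counts here are arbitrary choices, not the original benchmark:

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("Reverb/smollm2-135m-arabic-eou")
tokenizer = AutoTokenizer.from_pretrained("Reverb/smollm2-135m-arabic-eou")
model.eval()

inputs = tokenizer("شو رأيك نروح نتغدا؟", return_tensors="pt")

with torch.no_grad():
    # Warm up so one-time initialization cost does not skew the timing
    for _ in range(5):
        model(**inputs)
    # Average over 100 single-utterance predictions
    start = time.perf_counter()
    for _ in range(100):
        model(**inputs)
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000

print(f"Mean latency: {elapsed_ms:.1f} ms per prediction")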

Training Details

Training Data

  • Dataset: Reverb/arabic-eou-conversations
  • Total Examples: 11,660 (balanced 50/50 EOU/NOT_EOU)
  • Dialects:
    • Levantine (شامي)
    • Egyptian (مصري)
    • Gulf (خليجي)
  • Split: 80% train, 10% validation, 10% test
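
An 80/10/10 split like the one above can be reproduced with the datasets library. A sketch assuming the dataset exposes a single train split; the seed and split method are illustrative, not the original recipe:

from datasets import load_dataset

ds = load_dataset("Reverb/arabic-eou-conversations", split="train")

# Carve off 20%, then halve it into validation and test
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))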

Training Configuration

  • Base Model: HuggingFaceTB/SmolLM2-135M
  • Parameters: 135 million
  • Hardware: NVIDIA RTX 4070 (8GB VRAM)
  • Batch Size: 32 (effective: 64 with gradient accumulation)
  • Learning Rate: 2e-5
  • Epochs: 5
  • Optimizer: AdamW
  • Mixed Precision: FP16
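
Mapped onto Hugging Face TrainingArguments, the configuration above looks roughly like this; a reconstruction from the listed values, with everything else left at defaults:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="smollm2-135m-arabic-eou",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,  # 32 x 2 = effective batch size 64
    learning_rate=2e-5,
    num_train_epochs=5,
    optim="adamw_torch",            # AdamW
    fp16=True,                      # mixed precision
)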

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)

# Predict
text = "شو رأيك نروح نتغدا؟"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][prediction].item()

print(f"EOU: {prediction == 1}, Confidence: {confidence:.3f}")
# Output: EOU: True, Confidence: 0.952

With Context

# Using previous utterance as context
context = "كيف حالك؟"
current = "الحمد لله بخير"
text_with_context = f"{context} [SEP] {current}"

inputs = tokenizer(text_with_context, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    is_eou = torch.argmax(probs, dim=-1).item() == 1
    confidence = probs[0][1 if is_eou else 0].item()

print(f"EOU: {is_eou}, Confidence: {confidence:.3f}")

Batch Prediction

texts = [
    "شو رأيك",           # Partial - NOT_EOU
    "شو رأيك نروح نتغدا؟"  # Complete - EOU
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1)

for text, pred, prob in zip(texts, predictions, probs):
    is_eou = pred.item() == 1
    conf = prob[pred].item()
    print(f"'{text}' → {'EOU' if is_eou else 'NOT_EOU'} ({conf:.3f})")

Intended Use

Primary Use Cases

  • Voice Assistants: Detect when users finish speaking
  • LiveKit Agents: Real-time turn detection in voice conversations
  • Dialogue Systems: Turn-taking in conversational AI
  • Transcription Systems: Add turn boundaries to speech transcripts
  • Conversation Analysis: Analyze turn-taking patterns

Example Applications

  1. Real-time Voice Agent

    # Process an STT transcription; detect_eou is sketched after this list
    is_eou, confidence = detect_eou(transcription)
    if is_eou and confidence > 0.7:
        # User finished speaking; hand off to your response generator
        agent_response = generate_response(transcription)
    
  2. LiveKit Integration

    from livekit_eou_sdk import ArabicEOUTurnDetector
    
    detector = ArabicEOUTurnDetector(threshold=0.7)
    is_eou, conf = await detector.process_transcription(text, is_final=True)
    
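The detect_eou helper in the first example is not part of this repository; a minimal sketch that wraps the Quick Start code (the function name and signature are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("Reverb/smollm2-135m-arabic-eou")
tokenizer = AutoTokenizer.from_pretrained("Reverb/smollm2-135m-arabic-eou")
model.eval()

def detect_eou(text):
    # Return (is_eou, confidence) for a transcribed utterance
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    pred = torch.argmax(probs, dim=-1).item()
    return pred == 1, probs[0][pred].item()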

Limitations

  • Dialect Coverage: Optimized for Levantine, Egyptian, and Gulf dialects. May not perform as well on other Arabic dialects.
  • Formal Arabic: Designed for conversational/colloquial Arabic. Performance on Modern Standard Arabic (MSA) or Classical Arabic may vary.
  • Domain: Trained on general conversational data. May require fine-tuning for specialized domains (medical, legal, etc.).
  • Context: Best results when using conversation context. Single utterances without context may have lower accuracy.
  • Spoken Language: Designed for transcribed spoken language, not written text.

Bias and Fairness

  • The model was trained on balanced data across three major Arabic dialects
  • Performance is consistent across all three dialects (Levantine, Egyptian, Gulf)
  • May have reduced performance on underrepresented dialects or regional variations
  • No demographic or gender-based biases were intentionally introduced

Model Architecture

  • Type: Sequence Classification (Binary)
  • Base: LlamaForSequenceClassification (SmolLM2-135M)
  • Input: Arabic text (max 256 tokens)
  • Output: Binary classification (0=NOT_EOU, 1=EOU)
  • Classes: 2 (NOT_EOU, EOU)
  • Model Size: ~270MB
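
The architecture and label mapping can be verified from the checkpoint's config; the printed values should match the list above (the exact id2label strings are whatever was saved with the model):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Reverb/smollm2-135m-arabic-eou")
print(config.architectures)  # ['LlamaForSequenceClassification']
print(config.num_labels)     # 2
print(config.id2label)       # e.g. {0: 'NOT_EOU', 1: 'EOU'}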

Citation

@misc{arabic-eou-detector-2025,
  author = {Reverb},
  title = {Arabic End-of-Utterance Detector},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Reverb/smollm2-135m-arabic-eou}},
  note = {Fine-tuned SmolLM2-135M for Arabic EOU detection}
}

License

MIT License

Contact

For questions or issues, please open an issue on the model repository.

Model Card Version: 1.0
Last Updated: December 2025
