---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- mbert
model_name: mBERT-Algerian-Darija
base_model: bert-base-multilingual-cased
---

# mBERT — Algerian Darija Misinformation Detection

Fine-tuned **BERT-base-multilingual-cased** for detecting misinformation in **Algerian Darija** text.

- **Base model**: `bert-base-multilingual-cased` (170M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 75.42%
- **Macro F1**: 64.48%
- **Weighted F1**: 75.70%

**Per-class F1**:

- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%

The evaluation sketch at the end of this card shows how such numbers can be recomputed from a labeled test split.

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (with early stopping)
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Loss**: class-weighted cross-entropy (see the fine-tuning sketch below)
- **Seed**: 42 (for reproducibility)

---

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

# Class indices as produced by the classification head.
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    # Darija: "They say they are going to cancel the BAC exam this year."
    "قالك بلي رايحين ينحو الباك هذا العام",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=1)[0]
    pred_id = probs.argmax().item()
    confidence = probs[pred_id].item()
    label = LABEL_MAP[pred_id]

    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}\n")
```
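For quick experiments, the same checkpoint can also be called through the `pipeline` API. This is a minimal sketch; note that the label strings it returns come from the `id2label` mapping stored in the model config, so they may appear as `LABEL_0` to `LABEL_4` if readable names were not saved:

```python
from transformers import pipeline

# top_k=None returns the scores for all 5 classes instead of only the best one.
clf = pipeline("text-classification", model="Rahilgh/model4_1", top_k=None)

print(clf("قالك بلي رايحين ينحو الباك هذا العام"))
```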
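---

## Evaluation

The test-set metrics above can be recomputed with scikit-learn. A minimal sketch, assuming you have a labeled test split available as `test_texts` (a list of strings) and `test_labels` (gold class indices using the 0-4 mapping from the usage example); these names are placeholders, not artifacts shipped with the model:

```python
import torch
from sklearn.metrics import classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_1"

def evaluate(test_texts, test_labels, batch_size=32):
    """Predict class ids for test_texts and print per-class and aggregate F1."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

    preds = []
    with torch.no_grad():
        for i in range(0, len(test_texts), batch_size):
            batch = tokenizer(
                test_texts[i : i + batch_size],
                return_tensors="pt",
                max_length=128,
                truncation=True,
                padding=True,
            )
            preds.extend(model(**batch).logits.argmax(dim=1).tolist())

    # classification_report covers accuracy plus macro/weighted and per-class F1.
    print(classification_report(
        test_labels, preds, target_names=["F", "R", "N", "M", "S"], digits=4
    ))
```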
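---

## Fine-tuning Sketch

The exact training script is not published with this card. The snippet below only illustrates the class-weighted cross-entropy setup listed in the training summary: the `WeightedTrainer` name, the weight values, and the dataset handling are assumptions, while the epoch count, batch size, learning rate, and seed match the card.

```python
import torch
from torch import nn
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical class weights (e.g. inverse class frequencies); the values
# actually used to train this model are not published.
CLASS_WEIGHTS = torch.tensor([1.0, 1.2, 1.1, 1.8, 4.0])

class WeightedTrainer(Trainer):
    """Trainer subclass that swaps the default loss for weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=CLASS_WEIGHTS.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, 5), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=5
)

args = TrainingArguments(
    output_dir="mbert-darija",       # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
)

# train_dataset / eval_dataset are assumed to be pre-tokenized HF datasets:
# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=train_dataset,
#                           eval_dataset=eval_dataset)
# trainer.train()
```

Early stopping would additionally require `transformers.EarlyStoppingCallback` together with periodic evaluation and `load_best_model_at_end=True`; those details are omitted here.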