---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- mbert
model_name: mBERT-Algerian-Darija
base_model: bert-base-multilingual-cased
---
# mBERT — Algerian Darija Misinformation Detection
This model is **BERT-base-multilingual-cased** fine-tuned to detect misinformation in **Algerian Darija** text.
- **Base model**: `bert-base-multilingual-cased` (170M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)
---
## Performance (Test set: 3,344 samples)
- **Accuracy**: 75.42%
- **Macro F1**: 64.48%
- **Weighted F1**: 75.70%
**Per-class F1**:
- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%
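
As a quick sanity check on the numbers above, macro F1 is the unweighted mean of the per-class F1 scores, so it can be recomputed from the five values listed. The sketch below does only that arithmetic; the dictionary name `per_class_f1` is illustrative. Weighted F1 cannot be recomputed the same way because the per-class sample counts are not given in this card. The gap between macro F1 and weighted F1 is consistent with the low Satire score pulling down the unweighted mean.

```python
# Per-class F1 scores (in percent), copied from the list above.
per_class_f1 = {
    "F": 83.72,  # Factual
    "R": 76.35,  # Reporting
    "N": 81.01,  # Non-factual
    "M": 61.46,  # Misleading
    "S": 19.86,  # Satire
}

# Macro F1: every class counts equally, regardless of how many samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"Macro F1: {macro_f1:.2f}%")  # prints "Macro F1: 64.48%"
```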
---
## Training Summary
- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Loss**: Weighted CrossEntropy
- **Seed**: 42 (reproducibility)
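
To make the "Weighted CrossEntropy" entry concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy loss. It follows the weighted-mean reduction that `torch.nn.CrossEntropyLoss(weight=...)` applies by default: each sample's negative log-likelihood is scaled by the weight of its true class, and the sum is divided by the total weight. The values in `CLASS_WEIGHTS` are hypothetical and chosen only for illustration; the weights actually used in training are not stated in this card.

```python
import math

# Hypothetical class weights in the label order F, R, N, M, S.
# These are illustrative only; the real training weights are not documented here.
CLASS_WEIGHTS = [1.0, 1.0, 1.0, 2.0, 4.0]


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for logits, target in zip(batch_logits, targets):
        w = weights[target]
        total_loss += -w * log_softmax(logits)[target]
        total_weight += w
    return total_loss / total_weight


# Two samples that both favour class 0 (F): the first is truly F, the second is truly S.
batch_logits = [[5.0, 0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0, 0.0]]
targets = [0, 4]

uniform = weighted_cross_entropy(batch_logits, targets, [1.0] * 5)
weighted = weighted_cross_entropy(batch_logits, targets, CLASS_WEIGHTS)

# The misclassified Satire sample contributes more once its class is upweighted.
print(f"uniform weights: {uniform:.4f}")
print(f"class weights:   {weighted:.4f}")
```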
---
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

# Run on GPU when available; eval() disables dropout for inference.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

# Class index -> short label -> readable name.
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    # Roughly: "They say they are going to abolish the bac this year"
    "قالك بلي رايحين ينحو الباك هذا العام",
]

for text in texts:
    # Truncate to the 128-token maximum sequence length used in training.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class.
    probs = torch.softmax(outputs.logits, dim=1)[0]
    pred_id = probs.argmax().item()
    confidence = probs[pred_id].item()
    label = LABEL_MAP[pred_id]

    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) - {confidence:.2%}\n")
```