LikoKIko/OpenCensor-H1

@@ -1,109 +1,109 @@
----
-language:
-- he
-license: cc-by-sa-4.0
-tags:
-- text-classification
-- profanity-detection
-- hebrew
-- bert
-- alephbert
-library_name: transformers
-base_model: onlplab/alephbert-base
-datasets:
-- custom
-metrics:
-- accuracy
-- precision
-- recall
-- f1
----
-# OpenCensor-Hebrew
-This is a fine tuned **AlephBERT** model that finds bad words ( profanity ) in Hebrew text.
-You give the model a Hebrew sentence.
-It returns:
-- a score between **0 and 1**
-- a yes/no flag (based on a cutoff you choose)
-Meaning of the score:
-- **0 = clean**, **1 = has profanity**
-- Recommended cutoff from tests: **0.49** ( you can change it )
-![Validation F1 per Epoch](validation_f1_per_epoch_hd.png)
-![Final Test Metrics](final_test_metrics_hd.png)
-![Best Threshold](thresholds_per_epoch_hd.png)
-## How to use
-```python
-import torch
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-KModel = "LikoKIko/OpenCensor-Hebrew"
-KCutoff = 0.49 # best threshold from training
-KMaxLen = 512 # number of tokens (not characters)
-tokenizer = AutoTokenizer.from_pretrained(KModel)
-model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()
-text = "some hebrew text here"
-inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=KMaxLen)
-with torch.inference_mode():
-  score = torch.sigmoid(model(**inputs).logits).item()
-KHasProfanity = int(score >= KCutoff)
-print({"score": round(score, 4), "KHasProfanity": KHasProfanity})
-````
-Note: If the text is very long, it is cut at `KMaxLen` tokens.
-## About this model
-  - Base: `onlplab/alephbert-base`
-  - Task: binary classification (clean / profanity)
-  - Language: Hebrew
-  - Max length: 512 tokens
-  - Training:
-      - Batch size: 16
-      - Epochs: 10
-      - Learning rate: 0.00002
-      - Loss: binary cross-entropy with logits (`BCEWithLogitsLoss`). We use `pos_weight` so the model pays more attention to the rare class. This helps when the dataset is imbalanced.
-      - Scheduler: linear warmup (10%)
-### Results
-  - Test Accuracy: 0.9826
-  - Test Precision: 0.9812
-  - Test Recall: 0.9835
-  - Test F1: 0.9823
-  - Best threshold: 0.49
-## Reproduce (training code)
-This model was trained with a script that:
-  - Loads `onlplab/alephbert-base` with `num_labels=1`
-  - Tokenizes with `max_length=512` and pads to the max length
-  - Trains with AdamW, linear warmup, and mixed precision
-  - Tries cutoffs from `0.1` to `0.9` on the validation set and picks the best F1
-  - Saves the best checkpoint by validation F1, then reports test metrics
-## License
-CC-BY-SA-4.0
-## How to cite
-```
-```bibtex
-@misc{opencensor-hebrew,
-  title = {OpenCensor-Hebrew: Hebrew Profanity Detection Model},
-  author = {LikoKIko},
-  year = {2025},
-  url = {[https://huggingface.co/LikoKIko/OpenCensor-Hebrew](https://huggingface.co/LikoKIko/OpenCensor-Hebrew)}
-}
-```
 ```

+---
+language:
+- he
+license: cc-by-sa-4.0
+tags:
+- text-classification
+- profanity-detection
+- hebrew
+- bert
+- alephbert
+library_name: transformers
+base_model: onlplab/alephbert-base
+datasets:
+- custom
+metrics:
+- accuracy
+- precision
+- recall
+- f1
+---
+# OpenCensor-Hebrew
+This is a fine-tuned **AlephBERT** model that detects profanity in Hebrew text.
+You give the model a Hebrew sentence.
+It returns:
+- a score between **0 and 1**
+- a yes/no flag (based on a cutoff you choose)
+Meaning of the score:
+- **0 = clean**, **1 = has profanity**
+- Recommended cutoff from tests: **0.49** (you can change it)
+![Validation F1 per Epoch](valf1perepoch.png)
+![Final Test Metrics](testmetrics.png)
+![Best Threshold](bestthreshold.png)
+## How to use
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+KModel = "LikoKIko/OpenCensor-Hebrew"
+KCutoff = 0.49 # best threshold found on the validation set
+KMaxLen = 512 # number of tokens (not characters)
+tokenizer = AutoTokenizer.from_pretrained(KModel)
+model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()
+text = "some hebrew text here"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=KMaxLen)
+with torch.inference_mode():
+  score = torch.sigmoid(model(**inputs).logits).item()
+KHasProfanity = int(score >= KCutoff)
+print({"score": round(score, 4), "KHasProfanity": KHasProfanity})
+```
+Note: Text longer than `KMaxLen` tokens is truncated to `KMaxLen` tokens.
+## About this model
+  - Base: `onlplab/alephbert-base`
+  - Task: binary classification (clean / profanity)
+  - Language: Hebrew
+  - Max length: 512 tokens
+  - Training:
+      - Batch size: 16
+      - Epochs: 10
+      - Learning rate: 0.00002
+      - Loss: binary cross-entropy with logits (`BCEWithLogitsLoss`). We use `pos_weight` so the model pays more attention to the rare class. This helps when the dataset is imbalanced.
+      - Scheduler: linear warmup (10%)
+### Results
+  - Test Accuracy: 0.9826
+  - Test Precision: 0.9812
+  - Test Recall: 0.9835
+  - Test F1: 0.9823
+  - Best threshold: 0.49
+## Reproduce (training code)
+This model was trained with a script that:
+  - Loads `onlplab/alephbert-base` with `num_labels=1`
+  - Tokenizes with `max_length=512` and pads to the max length
+  - Trains with AdamW, linear warmup, and mixed precision
+  - Tries cutoffs from `0.1` to `0.9` on the validation set and picks the best F1
+  - Saves the best checkpoint by validation F1, then reports test metrics
+## License
+CC-BY-SA-4.0
+## How to cite
+```bibtex
+@misc{opencensor-hebrew,
+  title = {OpenCensor-Hebrew: Hebrew Profanity Detection Model},
+  author = {LikoKIko},
+  year = {2025},
+  url = {https://huggingface.co/LikoKIko/OpenCensor-Hebrew}
+}
+```
 ```
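
The "Reproduce (training code)" section says the script tries cutoffs from `0.1` to `0.9` on the validation set and keeps the one with the best F1. The training script itself is not shown, so the following is only a minimal sketch of how such a sweep might look. The function names `f1_at_threshold` and `best_threshold` and the step size of `0.01` are assumptions for illustration, not taken from the model card.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the positive (profanity) class when flagging scores >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    low: float = 0.1,
    high: float = 0.9,
    step: float = 0.01,  # assumed step size; the model card only gives the range
) -> Tuple[float, float]:
    """Scan cutoffs in [low, high] and return (cutoff, F1) for the first best F1."""
    n_steps = int(round((high - low) / step))
    best_t, best_f1 = low, -1.0
    for i in range(n_steps + 1):
        t = round(low + i * step, 4)  # rounding avoids float drift in the grid
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    # Toy validation scores (sigmoid outputs) and gold labels, for illustration only.
    val_scores = [0.05, 0.2, 0.45, 0.55, 0.8, 0.95]
    val_labels = [0, 0, 0, 1, 1, 1]
    print(best_threshold(val_scores, val_labels))
```

In practice `scores` would be the sigmoid outputs of the model on the validation split, and the selected cutoff would then be applied unchanged to the test split.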