Commit 72c039d (verified) by LikoKIko · 1 Parent(s): 09dfad0

Update README.md

Files changed (1)
  1. README.md +108 -108
README.md CHANGED
@@ -1,109 +1,109 @@
- ---
- language:
- - he
- license: cc-by-sa-4.0
- tags:
- - text-classification
- - profanity-detection
- - hebrew
- - bert
- - alephbert
- library_name: transformers
- base_model: onlplab/alephbert-base
- datasets:
- - custom
- metrics:
- - accuracy
- - precision
- - recall
- - f1
- ---
-
- # OpenCensor-Hebrew
-
- This is a fine tuned **AlephBERT** model that finds bad words ( profanity ) in Hebrew text.
-
- You give the model a Hebrew sentence.
- It returns:
- - a score between **0 and 1**
- - a yes/no flag (based on a cutoff you choose)
-
- Meaning of the score:
- - **0 = clean**, **1 = has profanity**
- - Recommended cutoff from tests: **0.49** ( you can change it )
-
- ![Validation F1 per Epoch](validation_f1_per_epoch_hd.png)
- ![Final Test Metrics](final_test_metrics_hd.png)
- ![Best Threshold](thresholds_per_epoch_hd.png)
-
- ## How to use
-
- ```python
- import torch
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
-
- KModel = "LikoKIko/OpenCensor-Hebrew"
- KCutoff = 0.49 # best threshold from training
- KMaxLen = 512 # number of tokens (not characters)
-
- tokenizer = AutoTokenizer.from_pretrained(KModel)
- model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()
-
- text = "some hebrew text here"
- inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=KMaxLen)
-
- with torch.inference_mode():
-     score = torch.sigmoid(model(**inputs).logits).item()
- KHasProfanity = int(score >= KCutoff)
-
- print({"score": round(score, 4), "KHasProfanity": KHasProfanity})
- ````
-
- Note: If the text is very long, it is cut at `KMaxLen` tokens.
-
- ## About this model
-
- - Base: `onlplab/alephbert-base`
- - Task: binary classification (clean / profanity)
- - Language: Hebrew
- - Max length: 512 tokens
- - Training:
- - Batch size: 16
- - Epochs: 10
- - Learning rate: 0.00002
- - Loss: binary cross-entropy with logits (`BCEWithLogitsLoss`). We use `pos_weight` so the model pays more attention to the rare class. This helps when the dataset is imbalanced.
- - Scheduler: linear warmup (10%)
-
- ### Results
-
- - Test Accuracy: 0.9826
- - Test Precision: 0.9812
- - Test Recall: 0.9835
- - Test F1: 0.9823
- - Best threshold: 0.49
-
- ## Reproduce (training code)
-
- This model was trained with a script that:
-
- - Loads `onlplab/alephbert-base` with `num_labels=1`
- - Tokenizes with `max_length=512` and pads to the max length
- - Trains with AdamW, linear warmup, and mixed precision
- - Tries cutoffs from `0.1` to `0.9` on the validation set and picks the best F1
- - Saves the best checkpoint by validation F1, then reports test metrics
-
- ## License
-
- CC-BY-SA-4.0
-
- ## How to cite
- ```
- ```bibtex
- @misc{opencensor-hebrew,
- title = {OpenCensor-Hebrew: Hebrew Profanity Detection Model},
- author = {LikoKIko},
- year = {2025},
- url = {[https://huggingface.co/LikoKIko/OpenCensor-Hebrew](https://huggingface.co/LikoKIko/OpenCensor-Hebrew)}
- }
- ```
+ ---
+ language:
+ - he
+ license: cc-by-sa-4.0
+ tags:
+ - text-classification
+ - profanity-detection
+ - hebrew
+ - bert
+ - alephbert
+ library_name: transformers
+ base_model: onlplab/alephbert-base
+ datasets:
+ - custom
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ ---
+
+ # OpenCensor-Hebrew
+
+ This is a fine tuned **AlephBERT** model that finds bad words ( profanity ) in Hebrew text.
+
+ You give the model a Hebrew sentence.
+ It returns:
+ - a score between **0 and 1**
+ - a yes/no flag (based on a cutoff you choose)
+
+ Meaning of the score:
+ - **0 = clean**, **1 = has profanity**
+ - Recommended cutoff from tests: **0.49** ( you can change it )
+
+ ![Validation F1 per Epoch](valf1perepoch.png)
+ ![Final Test Metrics](testmetrics.png)
+ ![Best Threshold](bestthreshold.png)
+
+ ## How to use
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ KModel = "LikoKIko/OpenCensor-Hebrew"
+ KCutoff = 0.49 # best threshold from training
+ KMaxLen = 512 # number of tokens (not characters)
+
+ tokenizer = AutoTokenizer.from_pretrained(KModel)
+ model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()
+
+ text = "some hebrew text here"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=KMaxLen)
+
+ with torch.inference_mode():
+     score = torch.sigmoid(model(**inputs).logits).item()
+ KHasProfanity = int(score >= KCutoff)
+
+ print({"score": round(score, 4), "KHasProfanity": KHasProfanity})
+ ````
+
+ Note: If the text is very long, it is cut at `KMaxLen` tokens.
+
+ ## About this model
+
+ - Base: `onlplab/alephbert-base`
+ - Task: binary classification (clean / profanity)
+ - Language: Hebrew
+ - Max length: 512 tokens
+ - Training:
+ - Batch size: 16
+ - Epochs: 10
+ - Learning rate: 0.00002
+ - Loss: binary cross-entropy with logits (`BCEWithLogitsLoss`). We use `pos_weight` so the model pays more attention to the rare class. This helps when the dataset is imbalanced.
+ - Scheduler: linear warmup (10%)
+
+ ### Results
+
+ - Test Accuracy: 0.9826
+ - Test Precision: 0.9812
+ - Test Recall: 0.9835
+ - Test F1: 0.9823
+ - Best threshold: 0.49
+
+ ## Reproduce (training code)
+
+ This model was trained with a script that:
+
+ - Loads `onlplab/alephbert-base` with `num_labels=1`
+ - Tokenizes with `max_length=512` and pads to the max length
+ - Trains with AdamW, linear warmup, and mixed precision
+ - Tries cutoffs from `0.1` to `0.9` on the validation set and picks the best F1
+ - Saves the best checkpoint by validation F1, then reports test metrics
+
+ ## License
+
+ CC-BY-SA-4.0
+
+ ## How to cite
+ ```
+ ```bibtex
+ @misc{opencensor-hebrew,
+ title = {OpenCensor-Hebrew: Hebrew Profanity Detection Model},
+ author = {LikoKIko},
+ year = {2025},
+ url = {[https://huggingface.co/LikoKIko/OpenCensor-Hebrew](https://huggingface.co/LikoKIko/OpenCensor-Hebrew)}
+ }
+ ```
  ```
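
Review note: the README in this commit says the training script tries cutoffs from `0.1` to `0.9` on the validation set and picks the one with the best F1, but the script itself is not part of the change. The following is only a minimal sketch of what such a sweep might look like. The function names `f1_at_cutoff` and `best_cutoff`, the `0.01` step, the tie-breaking rule, and the toy data are all assumptions made for illustration, not the author's code.

```python
def f1_at_cutoff(scores, labels, cutoff):
    """F1 of the positive (profanity) class when score >= cutoff is flagged."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_cutoff(scores, labels, low=0.10, high=0.90, step=0.01):
    """Scan cutoffs in [low, high] and return (cutoff, f1) with the highest F1.

    The step size is an assumption; the README only states the 0.1 to 0.9 range.
    """
    n_steps = int(round((high - low) / step))
    best = (low, -1.0)
    for i in range(n_steps + 1):
        cutoff = round(low + i * step, 4)
        f1 = f1_at_cutoff(scores, labels, cutoff)
        if f1 > best[1]:  # strict ">" keeps the lowest cutoff on ties
            best = (cutoff, f1)
    return best


# Toy validation data, made up for illustration (not model output).
val_scores = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]
val_labels = [0, 0, 0, 1, 1, 1]
cutoff, f1 = best_cutoff(val_scores, val_labels)
print({"cutoff": cutoff, "f1": round(f1, 4)})
```

In a real run, `val_scores` would be the sigmoid outputs of the model on the validation split, and the chosen cutoff would then be applied unchanged to the test split.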