---
language: en
tags:
- toxic-content
- text-classification
- keras
- tensorflow
- deep-learning
- safety
- multiclass
license: mit
datasets:
- custom
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: Toxic_Classification
  results: []
---
# Toxic-Predict

Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories such as "Safe", "Violent Crimes", "Non-Violent Crimes", and "Unsafe". The project uses deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking against modern transformer models to build and evaluate a robust multi-class toxic content classifier.

---
| | ## 🚩 Project Context |
| |
|
| | This project is part of the **Cellula Internship** proposal: |
| | **"Safe and Responsible Multi-Modal Toxic Content Moderation"** |
| | The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/Deep Learning) for nuanced, policy-compliant moderation. |
| |
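For illustration, here is a minimal sketch of the dual-stage flow. The two helpers are hypothetical stubs, not APIs from this repo: in the real pipeline, Stage 1 would query Llama Guard and Stage 2 would run the trained classifier.

```python
# Minimal sketch of the dual-stage moderation flow (hypothetical helper stubs).

def hard_guardrail(text: str, image_desc: str) -> bool:
    """Placeholder hard filter; a real implementation would call Llama Guard."""
    return False  # stub: never blocks

def soft_classify(text: str, image_desc: str) -> tuple:
    """Placeholder soft classifier; a real implementation would run the Keras model."""
    return "Safe", 1.0  # stub prediction

def moderate(text: str, image_desc: str = "") -> dict:
    # Stage 1: the hard guardrail blocks clear policy violations outright.
    if hard_guardrail(text, image_desc):
        return {"action": "block", "reason": "hard guardrail"}
    # Stage 2: the soft classifier assigns one of the nine toxicity categories.
    label, score = soft_classify(text, image_desc)
    return {"action": "allow" if label == "Safe" else "review",
            "label": label, "score": score}

print(moderate("example post", "example image description"))
```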
---
## Features

- Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
- Data cleaning, preprocessing, and label encoding
- Tokenization and sequence padding for text data
- Deep learning and transformer-based models for multi-class toxicity classification
- Evaluation metrics: classification report and confusion matrix
- Jupyter notebooks for data exploration and model development
- Streamlit web app for demo and deployment

---
## Usage

- **Preprocessing and Tokenization:**
  See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
- **Model Training:**
  Model architecture and training code are in `models/model.py`.
- **Inference:**
  Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples, as sketched below.
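A minimal inference sketch using the files named above. The `MAXLEN` value is an assumption; it must match the padding length used during training (see the preprocessing notebook for the real value).

```python
import pickle
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained classifier and the fitted Keras tokenizer.
model = keras.models.load_model("models/toxic_classifier.h5")
with open("data/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

MAXLEN = 100  # assumed; must match the training-time padding length

def predict_category(text: str) -> int:
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=MAXLEN, padding="post")
    probs = model.predict(padded, verbose=0)[0]
    return int(np.argmax(probs))  # encoded class; map back via the label encoding

print(predict_category("example query text"))
```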
---

## Data

- CSV files with columns: `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded` (the integer-encoded label; see the sketch below).
- Data splits: `train.csv`, `eval.csv`, `test.csv`, plus `cleaned.csv` for the processed data.
- 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm.
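To illustrate how `Toxic Category Encoded` can be derived from `Toxic Category`, here is a sketch using pandas and scikit-learn; the actual encoding in the notebooks may differ, so treat this as an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the raw split and combine the two text fields into one model input.
df = pd.read_csv("train.csv")
df["input_text"] = df["query"].fillna("") + " " + df["image descriptions"].fillna("")

# Encode the 9 category names into integer labels (0..8).
encoder = LabelEncoder()
df["Toxic Category Encoded"] = encoder.fit_transform(df["Toxic Category"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```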
---

## Model

- Deep learning model built with Keras (TensorFlow backend); a minimal illustrative baseline is sketched below.
- Multi-class classification with label encoding for toxicity categories.
- Benchmarking with PEFT-LoRA DistilBERT and baseline CNN/LSTM models.
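A minimal baseline in the spirit of the CNN/LSTM models. The vocabulary size, layer widths, and sequence length are illustrative assumptions, not the architecture in `models/model.py`.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed tokenizer vocabulary size
NUM_CLASSES = 9       # the nine toxicity categories

# Illustrative Embedding -> BiLSTM -> Dense softmax baseline.
model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```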
---

## Evaluation

- A classification report and confusion matrix are generated for model evaluation (sketched below).
- The evaluation steps are in `notebooks/Preprocessing.ipynb`.
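A minimal scikit-learn sketch of the two reported metrics; `y_true` and `y_pred` are stand-ins for the test labels and the model's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in arrays; in practice these come from test.csv and model.predict().
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```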
---
| | ## 🤗 Hugging Face Inference |
| |
|
| | This model is available on the Hugging Face Hub: [NightPrince/Toxic_Classification](https://huggingface.co/NightPrince/Toxic_Classification) |
| |
|
| | ### Inference API Usage |
| |
|
| | You can use the Hugging Face Inference API or widget with two fields: |
| |
|
| | - `text`: The main query or post text |
| | - `image_desc`: The image description (if any) |
| |
|
**Example (Python):**

Because the custom pipeline takes a JSON payload with both fields, the request is easiest to send as a raw POST to the Inference API:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/NightPrince/Toxic_Classification"
headers = {"Authorization": "Bearer <your_hf_token>"}  # replace with your HF token

# `image_desc` may be an empty string if there is no image.
payload = {"inputs": {"text": "This is a dangerous post",
                      "image_desc": "Knife shown in the image"}}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())  # e.g. {'label': 'Violent Crimes', 'score': 0.98}
```
### Custom Pipeline Details

- The model uses a custom `pipeline.py` for multi-input inference (a local-loading sketch follows the file list below).
- The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
- Class names are mapped using `label_map.json`.

**Files in the repo:**

- `pipeline.py` (custom inference logic)
- `tokenizer.json` (Keras tokenizer)
- `label_map.json` (class code to name mapping)
- TensorFlow SavedModel files (`saved_model.pb`, `variables/`)
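For local use, the repo files can be loaded directly. This is a sketch under several assumptions: a TF2/Keras 2 environment where `load_model` accepts a SavedModel directory, a tokenizer saved with Keras's `to_json()`, a `label_map.json` keyed by string class indices, and an assumed `maxlen`. `pipeline.py` is the authoritative logic.

```python
import json
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Rebuild the tokenizer and label map from the repo files.
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())
with open("label_map.json") as f:
    label_map = json.load(f)

# Load the exported SavedModel (directory containing saved_model.pb).
model = tf.keras.models.load_model(".")

def classify(text: str, image_desc: str = "") -> dict:
    # Both fields are concatenated into one input sequence (an assumption).
    seq = tokenizer.texts_to_sequences([text + " " + image_desc])
    padded = pad_sequences(seq, maxlen=100, padding="post")  # maxlen is assumed
    probs = model.predict(padded, verbose=0)[0]
    idx = int(np.argmax(probs))
    return {"label": label_map[str(idx)], "score": float(probs[idx])}
```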
**Requirements:**

```
tensorflow
keras
numpy
```
---
| | ## 📚 Resources |
| |
|
| | - [Cellula Internship Project Proposal](#) |
| | - [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP) |
| | - [Llama Guard](https://llama.meta.com/llama-guard/) |
| | - [DistilBERT](https://huggingface.co/distilbert-base-uncased) |
| | - [Streamlit](https://streamlit.io/) |
| |
|
| | --- |
| |
## License

MIT License

---

**Author:** Yahya Muhammad Alnwsany
**Contact:** yahyaalnwsany39@gmail.com
**Portfolio:** https://nightprincey.github.io/Portfolio/