BERT Torrent Classifier

A fine-tuned BERT-tiny model for classifying torrent content into media types.

Model Details

Base model: prajjwal1/bert-tiny
Task: Multi-class text classification
Labels: audio, video, software, book, other
Format: ONNX (with embedded weights)
Size: ~17MB
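
As a rough illustration of the multi-class output, the sketch below turns one raw logit per label into a label and probability. It assumes the model emits five logits in the same order as the label list above; that ordering is an assumption and should be checked against config.json. `LABELS`, `softmax`, and `decode` are hypothetical names, not part of mimmo.

```rust
/// Label names in the order listed in this card.
/// ASSUMPTION: this order matches the model's output indices (verify in config.json).
const LABELS: [&str; 5] = ["audio", "video", "software", "book", "other"];

/// Numerically stable softmax over raw logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Returns the most probable label and its probability, or None if the
/// slice does not contain exactly one logit per label.
fn decode(logits: &[f32]) -> Option<(&'static str, f32)> {
    if logits.len() != LABELS.len() {
        return None;
    }
    let probs = softmax(logits);
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    Some((LABELS[idx], p))
}
```

Calling `decode(&[0.1, 3.0, 0.2, -1.0, 0.0])` would select the second label, since it has the largest logit.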

Training

Training data: ~10k torrent names, labeled by consensus voting across four LLMs
LLM ensemble: qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
Consensus rules: 4-0 agreement = high confidence; 3-1 = majority label is used; 2-2 = sample discarded
Accuracy: ~92% on held-out test set
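
The consensus rules above can be sketched as a small voting function. This is a minimal illustration, not the actual labeling pipeline: `Consensus` and `consensus` are hypothetical names, and the handling of splits the card does not mention (2-1-1 and 1-1-1-1) is an assumption; here they are discarded like a 2-2 tie.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Consensus<'a> {
    /// All four models agree.
    HighConfidence(&'a str),
    /// Three models agree, one dissents.
    Majority(&'a str),
    /// No label reached three votes: the 2-2 case from the card, plus
    /// (ASSUMPTION) 2-1-1 and 1-1-1-1 splits.
    Discarded,
}

/// Applies the consensus rules to the four labels proposed by the ensemble.
fn consensus<'a>(votes: [&'a str; 4]) -> Consensus<'a> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    // At most one label can hold three or more of the four votes.
    match counts.into_iter().max_by_key(|&(_, n)| n) {
        Some((label, 4)) => Consensus::HighConfidence(label),
        Some((label, 3)) => Consensus::Majority(label),
        _ => Consensus::Discarded,
    }
}
```

For example, `consensus(["video", "video", "audio", "video"])` yields `Consensus::Majority("video")`, while a 2-2 split yields `Consensus::Discarded`.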

Usage

This model is designed for use with mimmo, a Rust library for torrent content classification. The ONNX model is embedded directly in the binary at compile time.

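// The model files are downloaded automatically during the build,
// then embedded into the binary at compile time.
// Raw ONNX model bytes (weights included):
const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
// Tokenizer definition as a JSON string:
const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");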

Performance

Inference: <10ms per sample (CPU)
Used as ML fallback when pattern matching is inconclusive
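
The fallback arrangement described above might look roughly like the following. This is a sketch under stated assumptions, not mimmo's actual API: `classify_by_pattern`, `classify`, and the extension table are hypothetical, and the ML stage is passed in as a closure so the example stays self-contained.

```rust
/// HYPOTHETICAL rule-based stage: classify by file extension only.
/// Returns None when the name gives no conclusive hint.
fn classify_by_pattern(name: &str) -> Option<&'static str> {
    let lower = name.to_lowercase();
    // Take whatever follows the last dot as the extension.
    let ext = lower.rsplit_once('.')?.1;
    match ext {
        "mp3" | "flac" => Some("audio"),
        "mkv" | "mp4" | "avi" => Some("video"),
        "epub" | "pdf" => Some("book"),
        "exe" => Some("software"),
        _ => None,
    }
}

/// Tries pattern matching first and only invokes the (more expensive)
/// ML classifier when the patterns are inconclusive.
fn classify<F>(name: &str, ml_fallback: F) -> &'static str
where
    F: FnOnce(&str) -> &'static str,
{
    classify_by_pattern(name).unwrap_or_else(|| ml_fallback(name))
}
```

Because the closure is only called on a pattern miss, names with an unambiguous extension never pay the inference cost.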

Files

model_embedded.onnx - ONNX model with embedded weights
tokenizer.json - HuggingFace tokenizer
vocab.txt - Vocabulary file
config.json - Model configuration

License

MIT
