BERT Torrent Classifier

A fine-tuned BERT-tiny model for classifying torrent content into media types.

Model Details

  • Base model: prajjwal1/bert-tiny
  • Task: Multi-class text classification
  • Labels: audio, video, software, book, other
  • Format: ONNX (with embedded weights)
  • Size: ~17MB
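
As an illustration of the multi-class setup, the sketch below turns five raw output logits into a label and a probability using softmax and argmax. The `LABELS` order and the `softmax` and `predict` helpers are assumptions made for this example; the real index-to-label mapping is defined by config.json and may differ.

```rust
/// Label names in the order listed in this card. The actual index-to-label
/// mapping lives in config.json and may differ (assumption).
const LABELS: [&str; 5] = ["audio", "video", "software", "book", "other"];

/// Numerically stable softmax over raw logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical helper: pick the most probable label and its probability.
fn predict(logits: &[f32; 5]) -> (&'static str, f32) {
    let probs = softmax(logits);
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .expect("five logits are always present");
    (LABELS[idx], p)
}
```

Calling `predict(&[0.1, 3.0, -1.0, 0.2, 0.0])` would select the second entry of `LABELS`, since that logit is the largest.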

Training

  • Training data: ~10k torrent names, labeled by consensus voting across four LLMs
  • LLM ensemble: qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
  • Consensus rules: all four agree = high-confidence label, 3v1 = majority label, 2v2 = sample discarded
  • Accuracy: ~92% on held-out test set
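
The consensus rules above can be sketched as a small function over four votes. This is a minimal illustration, not the actual labeling pipeline: the `Label`, `Confidence`, and `consensus` names are hypothetical. The card does not say how splits such as 2-1-1 are handled, so this sketch assumes they are discarded along with 2v2.

```rust
use std::collections::HashMap;

/// The five labels listed under Model Details.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Label {
    Audio,
    Video,
    Software,
    Book,
    Other,
}

/// How strongly the four voters agreed on the winning label.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    High,     // all four models agree
    Majority, // three models agree, one dissents
}

/// Hypothetical helper: reduce four LLM votes to a training label.
/// Returns `None` when the sample should be discarded.
pub fn consensus(votes: [Label; 4]) -> Option<(Label, Confidence)> {
    let mut counts: HashMap<Label, usize> = HashMap::new();
    for vote in votes {
        *counts.entry(vote).or_insert(0) += 1;
    }
    // At most one label can collect three or more of the four votes.
    let (label, count) = counts.into_iter().max_by_key(|&(_, n)| n)?;
    match count {
        4 => Some((label, Confidence::High)),
        3 => Some((label, Confidence::Majority)),
        // 2v2 is discarded per the rules above. Discarding 2-1-1 and
        // 1-1-1-1 splits as well is an assumption of this sketch.
        _ => None,
    }
}
```

For example, three `Label::Video` votes and one `Label::Audio` vote yield `Some((Label::Video, Confidence::Majority))`, while a two-against-two split yields `None`.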

Usage

This model is designed for use with mimmo, a Rust library for torrent content classification. The ONNX model is embedded directly in the binary at compile time.

// The model files are downloaded automatically during the build, then embedded at compile time
const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");

Performance

  • Inference: <10ms per sample (CPU)
  • Used as ML fallback when pattern matching is inconclusive
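
The fallback arrangement can be sketched as follows: cheap pattern rules run first, and the model is consulted only when they return nothing. Everything here is hypothetical and is not mimmo's actual API: the `RULES` table, `classify_by_pattern`, and `classify` are invented for illustration, and the `ml_fallback` closure stands in for BERT inference.

```rust
/// Hypothetical extension rules; a real rule set would be larger.
const RULES: &[(&str, &[&str])] = &[
    ("video", &[".mkv", ".mp4", ".avi"]),
    ("audio", &[".mp3", ".flac"]),
    ("book", &[".epub", ".pdf"]),
    ("software", &[".exe", ".iso"]),
];

/// Rule-based step: match well-known file extensions.
/// Returns `None` when the name gives no clear signal.
fn classify_by_pattern(name: &str) -> Option<&'static str> {
    let lower = name.to_lowercase();
    RULES
        .iter()
        .find(|(_, exts)| exts.iter().any(|ext| lower.ends_with(*ext)))
        .map(|(label, _)| *label)
}

/// Try the rules first; call the model only when they are inconclusive.
fn classify(name: &str, ml_fallback: impl FnOnce(&str) -> &'static str) -> &'static str {
    classify_by_pattern(name).unwrap_or_else(|| ml_fallback(name))
}
```

Because the closure is only invoked on a miss, names with an obvious extension never pay the inference cost.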

Files

  • model_embedded.onnx - ONNX model with embedded weights
  • tokenizer.json - HuggingFace tokenizer
  • vocab.txt - Vocabulary file
  • config.json - Model configuration

License

MIT
