# DeepFake Detector V13
State-of-the-art deepfake detection ensemble with 699M parameters
## Performance Highlights
- Average Ensemble F1: 0.9313
- Best Model F1: 0.9586 (Model 13.3 - Swin-Large)
- Total Parameters: 699M (exceeds the 500M requirement)
- Training Time: ~6.1 hours on T4 GPU
## Architecture

This model consists of three large-scale CNN and transformer models trained sequentially:
| Model | Backbone | Parameters | F1 Score | Training Time |
|---|---|---|---|---|
| Model 13.1 | ConvNeXt-Large | 198M | 0.8971 | 205.7 min |
| Model 13.2 | ViT-Large | 304M | 0.9382 | 52.7 min |
| Model 13.3 | Swin-Large | 197M | 0.9586 | 106.2 min |
Total: 699M parameters
### Model Files

- `model_1.safetensors` - ConvNeXt-Large (752 MB)
- `model_2.safetensors` - ViT-Large (1159 MB)
- `model_3.safetensors` - Swin-Large (747 MB)
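If the checkpoints are not already local, they can be fetched from the Hub, for example with `huggingface_hub`; the repo id below is taken from this card's URL, adjust it if you host the files elsewhere.

```python
from huggingface_hub import hf_hub_download

# Fetch each checkpoint from the Hub (repo id taken from this model card's URL)
for filename in ["model_1.safetensors", "model_2.safetensors", "model_3.safetensors"]:
    path = hf_hub_download(repo_id="ash12321/deepfake-detector-v13", filename=filename)
    print(f"{filename} -> {path}")
```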
## Usage

### Installation

```bash
pip install torch torchvision timm safetensors pillow
```

### Quick Start - Single Model
```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Define model architecture
class DeepfakeDetector(torch.nn.Module):
    def __init__(self, backbone_name, dropout=0.3):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0)
        if hasattr(self.backbone, 'num_features'):
            feat_dim = self.backbone.num_features
        else:
            # Fall back to probing the feature dimension with a dummy forward pass
            with torch.no_grad():
                feat_dim = self.backbone(torch.randn(1, 3, 224, 224)).shape[1]
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(512, 128),
            torch.nn.BatchNorm1d(128),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout * 0.5),
            torch.nn.Linear(128, 1)
        )

    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features).squeeze(-1)

# Load the best model (Model 13.3 - Swin-Large)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DeepfakeDetector('swin_large_patch4_window7_224', dropout=0.3)
state_dict = load_file('model_3.safetensors')
model.load_state_dict(state_dict)
model = model.to(device)
model.eval()

# Preprocessing (ImageNet statistics)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Predict
image = Image.open('test_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0).to(device)
with torch.no_grad():
    logits = model(input_tensor)
    probability = torch.sigmoid(logits).item()

prediction = 'FAKE' if probability > 0.5 else 'REAL'
print(f"Prediction: {prediction}")
print(f"Confidence: {probability:.2%}")
```
### Full Ensemble (Recommended)

```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

class DeepfakeDetector(torch.nn.Module):
    def __init__(self, backbone_name, dropout=0.3):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0)
        if hasattr(self.backbone, 'num_features'):
            feat_dim = self.backbone.num_features
        else:
            with torch.no_grad():
                feat_dim = self.backbone(torch.randn(1, 3, 224, 224)).shape[1]
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(512, 128),
            torch.nn.BatchNorm1d(128),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout * 0.5),
            torch.nn.Linear(128, 1)
        )

    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features).squeeze(-1)

# Model configurations: (backbone, dropout, checkpoint file)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
configs = [
    ('convnext_large', 0.3, 'model_1.safetensors'),
    ('vit_large_patch16_224', 0.35, 'model_2.safetensors'),
    ('swin_large_patch4_window7_224', 0.3, 'model_3.safetensors')
]

# Load all models
models = []
for backbone, dropout, filename in configs:
    model = DeepfakeDetector(backbone, dropout)
    state_dict = load_file(filename)
    model.load_state_dict(state_dict)
    model = model.to(device)
    model.eval()
    models.append(model)
print(f"Loaded {len(models)} models")

# Preprocessing (ImageNet statistics)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Ensemble prediction: average the three sigmoid probabilities
def predict_ensemble(image_path):
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    predictions = []
    with torch.no_grad():
        for model in models:
            logits = model(input_tensor)
            prob = torch.sigmoid(logits).item()
            predictions.append(prob)
    # Average ensemble
    avg_prob = sum(predictions) / len(predictions)
    prediction = 'FAKE' if avg_prob > 0.5 else 'REAL'
    return {
        'prediction': prediction,
        'confidence': avg_prob,
        'individual_predictions': predictions
    }

# Use it
result = predict_ensemble('test_image.jpg')
print(f"Prediction: {result['prediction']}")
print(f"Ensemble Confidence: {result['confidence']:.2%}")
print(f"Individual Models: {[f'{p:.2%}' for p in result['individual_predictions']]}")
```
## Training Details

### Architecture Design

Each model uses:

- Backbone: a large pre-trained vision model (frozen initially, then fine-tuned)
- Classifier Head:
  - Linear(feat_dim → 512) + BatchNorm + GELU + Dropout
  - Linear(512 → 128) + BatchNorm + GELU + Dropout
  - Linear(128 → 1)
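As a quick sanity check on the parameter counts reported above, each model can be instantiated with the `DeepfakeDetector` class from the Usage section and its parameters counted; the totals should land close to the table's figures.

```python
# Reproduce the per-model parameter counts (DeepfakeDetector as defined in Usage)
for name in ['convnext_large', 'vit_large_patch16_224', 'swin_large_patch4_window7_224']:
    n_params = sum(p.numel() for p in DeepfakeDetector(name).parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```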
### Training Configuration

- Loss Function: Focal Loss with Label Smoothing (sketched after this list)
  - Alpha: 0.25
  - Gamma: 2.5
  - Label Smoothing: 0.12
- Optimizer: AdamW
  - Learning Rates: [2e-5, 1.5e-5, 1.8e-5] (one per model, in table order)
  - Weight Decay: 3e-4
- Scheduler: CosineAnnealingWarmRestarts (T_0=3, T_mult=2)
- Epochs: 10 per model
- Batch Sizes: [32, 24, 32] (one per model, in table order)
- Mixed Precision: FP16 enabled
- Gradient Accumulation: 4 steps
- Gradient Checkpointing: Enabled (for memory efficiency)
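The training code itself is not included in this repository, so the following is only a minimal sketch of binary focal loss with label smoothing using the hyperparameters listed above; the class name and the exact smoothing scheme are assumptions, not the verbatim training code.

```python
import torch
import torch.nn.functional as F

class FocalLossWithSmoothing(torch.nn.Module):
    """Hypothetical reconstruction of the documented loss: binary focal loss
    with label smoothing (alpha=0.25, gamma=2.5, smoothing=0.12)."""
    def __init__(self, alpha=0.25, gamma=2.5, smoothing=0.12):
        super().__init__()
        self.alpha, self.gamma, self.smoothing = alpha, gamma, smoothing

    def forward(self, logits, targets):
        # Smooth hard 0/1 labels toward 0.5 (assumed smoothing scheme)
        targets = targets * (1 - self.smoothing) + 0.5 * self.smoothing
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the target class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

# Optimizer and scheduler as documented above (learning rate shown for Model 13.1):
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=3e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=3, T_mult=2)
```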
### Data Augmentation

- Random Horizontal Flip (p=0.5)
- Random Rotation (±12°)
- Color Jitter (brightness, contrast, saturation: ±0.15)
- Normalization: ImageNet stats
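Put into code, a training-time transform matching the list above might look like the following; the Resize step is an assumption carried over from the inference preprocessing, and hue jitter is assumed unused since it is not listed.

```python
from torchvision import transforms

# Training-time augmentation matching the documented settings
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # assumed, matching inference preprocessing
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=12),    # plus/minus 12 degrees
    transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # ImageNet stats
])
```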
## Performance Analysis

### Model Comparison
#### Model 13.1 (ConvNeXt-Large)

- Solid baseline: F1 = 0.8971
- CNN-based architecture
- Good for local feature extraction

#### Model 13.2 (ViT-Large)

- Strong performance: F1 = 0.9382
- Fastest training (52.7 min)
- Global attention mechanism

#### Model 13.3 (Swin-Large): Best Model

- Excellent performance: F1 = 0.9586
- Hierarchical vision transformer
- Best balance of accuracy and efficiency
### Ensemble Benefits

The ensemble approach provides:

- Improved Robustness: different architectures capture different patterns
- Reduced Variance: averaging reduces prediction noise
- Better Generalization: complementary strengths minimize overfitting
- Higher Accuracy: expected ensemble F1 ≈ 0.94-0.96
## System Requirements

### Inference (Single Model)
- GPU: 4GB+ VRAM
- RAM: 8GB+
- Storage: ~1.2 GB per model
### Inference (Full Ensemble)

- GPU: 12GB+ VRAM (or run the models sequentially on a smaller GPU; see the sketch below)
- RAM: 16GB+
- Storage: ~2.7 GB total
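For GPUs below 12 GB of VRAM, a sequential variant of `predict_ensemble` can keep only one model resident at a time. This sketch reuses `DeepfakeDetector`, `configs`, `transform`, and `device` from the Full Ensemble section above; the function name is hypothetical.

```python
import torch
from PIL import Image
from safetensors.torch import load_file

def predict_ensemble_sequential(image_path):
    """Memory-friendly variant: load one model onto the GPU at a time."""
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    predictions = []
    for backbone, dropout, filename in configs:
        model = DeepfakeDetector(backbone, dropout)
        model.load_state_dict(load_file(filename))
        model = model.to(device).eval()
        with torch.no_grad():
            predictions.append(torch.sigmoid(model(input_tensor)).item())
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release VRAM before loading the next model
    avg_prob = sum(predictions) / len(predictions)
    return {'prediction': 'FAKE' if avg_prob > 0.5 else 'REAL', 'confidence': avg_prob}
```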
### Training
- GPU: T4 (16GB) or better
- RAM: 12GB+
- Storage: 8GB+ for checkpoints
## Dataset

Trained on: [ash12321/deepfake-v13-dataset](https://huggingface.co/datasets/ash12321/deepfake-v13-dataset)
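To inspect the training data, the dataset can be loaded with the `datasets` library; its splits and column names are not documented on this card, so check the printed structure before use.

```python
from datasets import load_dataset

# Load the training dataset from the Hub and inspect its structure
ds = load_dataset("ash12321/deepfake-v13-dataset")
print(ds)  # splits and columns are undocumented here; inspect before use
```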
## Related Models

- Predecessor: [ash12321/deepfake-detector-v12](https://huggingface.co/ash12321/deepfake-detector-v12)
## Citation

```bibtex
@misc{v13-deepfake-detector,
  title={DeepFake Detector V13: Large-Scale Ensemble},
  author={Ash},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/ash12321/deepfake-detector-v13}}
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgments
- Built with PyTorch, timm, and Hugging Face
- Trained on Google Colab T4 GPU
- Architectures: ConvNeXt (Meta), ViT (Google), Swin (Microsoft)
Model Version: 13.0
Last Updated: November 2024
Status: Production Ready