# CalorieCLIP: Accurate Food Calorie Estimation
CalorieCLIP is a fine-tuned CLIP model that estimates calories from food images with a mean absolute error of 51.4 calories. It outperforms all tested VLMs (including GPT-4o and Claude 3.5 Sonnet) while running entirely on-device.
## Key Results
| Metric | Value |
|---|---|
| Mean Absolute Error | 51.4 calories |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |
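The accuracy metrics above are straightforward to compute from per-image predictions. A minimal sketch (the arrays here are illustrative, not actual validation data):

```python
import numpy as np

# Hypothetical predicted vs. ground-truth calories, for illustration only
preds = np.array([310.0, 450.0, 120.0, 600.0])
targets = np.array([300.0, 520.0, 150.0, 610.0])

abs_err = np.abs(preds - targets)
mae = abs_err.mean()                       # mean absolute error in calories
within_50 = (abs_err <= 50).mean() * 100   # % of predictions within 50 cal
within_100 = (abs_err <= 100).mean() * 100 # % of predictions within 100 cal

print(f"MAE: {mae:.1f} cal, within 50: {within_50:.1f}%, within 100: {within_100:.1f}%")
```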
## Example Predictions
Real predictions from our validation set across multiple datasets:
## Quick Start

### Installation

```shell
pip install open-clip-torch torch pillow
```
### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```
### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load CLIP and apply the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: 512 -> 512 -> 256 -> 64 -> 1
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```
### Command Line

```shell
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```
## Training Progress
The model was trained for 30 epochs on a combined Nutrition5k + Food-101 dataset (see Training Data) with:
- Huber Loss for robustness to outliers
- Strong augmentation (rotation, color jitter, flips)
- Fine-tuning last 2 CLIP transformer blocks (9.4% of parameters)
- Differential learning rates (1e-5 for CLIP, 1e-3 for regression head)
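The loss and optimizer setup described above can be sketched as follows. This is an illustration, not the repo's actual training code: `clip_blocks` and `head` are placeholders for the unfrozen CLIP blocks and the regression head, and the Huber `delta` is an assumed value.

```python
import torch
import torch.nn as nn

clip_blocks = nn.Linear(512, 512)  # placeholder for the last 2 transformer blocks
head = nn.Linear(512, 1)           # placeholder for the regression head

criterion = nn.HuberLoss(delta=1.0)  # quadratic near zero, linear for outliers
optimizer = torch.optim.AdamW([
    {"params": clip_blocks.parameters(), "lr": 1e-5},  # gentle fine-tuning of CLIP
    {"params": head.parameters(), "lr": 1e-3},         # faster learning for the head
])

features = torch.randn(8, 512)    # dummy batch of image embeddings
targets = torch.rand(8, 1) * 800  # dummy calorie labels
loss = criterion(head(clip_blocks(features)), targets)
loss.backward()
optimizer.step()
```

The two parameter groups are what gives each part of the model its own learning rate.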
## Technical Details

### Architecture
```
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Food Image  │────▶│  CLIP ViT-B  │────▶│  Regression │────▶  Calories
│  (224×224)   │     │   Encoder    │     │    Head     │
└──────────────┘     │ (fine-tuned) │     │  (4 layers) │
                     └──────────────┘     └─────────────┘
                             │
                             ▼
                     512-dim features
```
### Model Specs
- Base Model: OpenAI CLIP ViT-B/32
- Fine-tuned Layers: Last 2 transformer blocks + regression head
- Trainable Parameters: 9.4% (8.5M of 90M)
- Input Size: 224×224 RGB
- Output: Single float (calories)
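The partial-freezing scheme behind the 9.4% trainable fraction can be sketched with a toy model. The 12-block `Sequential` below is a stand-in, not the real ViT (whose non-block parameters push the real fraction lower):

```python
import torch.nn as nn

# Toy stand-in: 12 "transformer blocks", freeze all but the last 2
model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(12)])

for block in model[:-2]:
    for p in block.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / total:.1%}")  # 2 of 12 blocks -> 16.7%
```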
### Comparison to VLMs
We tested multiple Vision-Language Models on the same test set:
| Model | MAE | Notes |
|---|---|---|
| CalorieCLIP (Ours) | 51.4 | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |
**Key Finding:** All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
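Mode collapse of this kind is easy to flag: check the spread of a model's predictions across the validation set. A hedged sketch (function name and threshold are illustrative):

```python
import numpy as np

def looks_collapsed(preds, min_std=10.0):
    """A model predicting (nearly) the same calories everywhere is collapsed."""
    return float(np.std(preds)) < min_std

vlm_preds = np.array([350.0, 350.0, 350.0, 350.0])   # collapsed: one value repeated
clip_preds = np.array([120.0, 480.0, 310.0, 650.0])  # healthy spread

print(looks_collapsed(vlm_preds), looks_collapsed(clip_preds))  # True False
```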
## Files

```
CalorieCLIP/
├── config.json           # Model configuration
├── calorie_clip.pt       # Model weights (PyTorch)
├── calorie_clip.py       # Inference code
├── requirements.txt      # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```
## Training Data
Trained on a combined dataset of:
- Nutrition5k: 5,006 real cafeteria food images with professional calorie measurements
- Food-101 subset: 8,000+ food images with estimated calories
- Total: 13,004 samples (11,053 train / 1,951 validation)
- Diverse foods: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, greek salads, sashimi, and more
## ⚠️ Limitations
- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice
## Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```
## License
MIT License - free for commercial and personal use.
Made with ❤️ by Haplo LLC