# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification
This is a lightweight multi-label image classifier built on `openai/clip-vit-large-patch14`, fine-tuned with LoRA (Low-Rank Adaptation) on top of an 8-bit quantized backbone (via bitsandbytes). It predicts 20 distinct image categories.
This repo contains only:

- The LoRA adapter weights (`adapter_model.safetensors`)
- The classifier head weights (`classifier_head.pt`)
- A sample loading script (in this README)
## Model Architecture

- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`
## Training Details

- LoRA was applied to the attention projection modules `q_proj`, `k_proj`, `v_proj`, `out_proj` (configuration sketched after this list)
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss with gamma_neg = 2 (sketched after this list)
- Epochs: 2 (selected via grid search over the learning rate and gamma_neg)
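The card reports the LoRA hyperparameters and the loss but not the training code. The following is a minimal sketch of how that setup could look; `lora_dropout=0.1`, `bias="none"`, and `gamma_pos=0` are assumptions that are not stated in this card.

```python
import torch
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# Attach a LoRA adapter to the attention projections of the CLIP backbone.
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
lora_cfg = LoraConfig(
    r=16,                                                        # rank, as reported above
    lora_alpha=32,                                               # scaling, as reported above
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],   # as reported above
    lora_dropout=0.1,                                            # assumption: not stated in this card
    bias="none",                                                 # assumption: not stated in this card
)
backbone = get_peft_model(base, lora_cfg)
backbone.print_trainable_parameters()


def asymmetric_focal_loss(logits, targets, gamma_neg=2.0, gamma_pos=0.0, eps=1e-8):
    """Asymmetric focal loss for multi-label classification (sketch).

    Negatives are down-weighted by probs**gamma_neg (gamma_neg=2 as reported
    above); gamma_pos=0 is an assumption.
    """
    probs = torch.sigmoid(logits)
    loss_pos = targets * (1 - probs) ** gamma_pos * torch.log(probs.clamp(min=eps))
    loss_neg = (1 - targets) * probs ** gamma_neg * torch.log((1 - probs).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()
```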
## Class Labels

The model supports 20 categories:

Class 0, Class 1, Class 2, ..., Class 19

These are placeholder names; replace them with the label names from your own dataset, for example via a simple index-to-name mapping as sketched below.
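A hypothetical mapping from class indices to names (the indices match the model's 20 outputs; the names are placeholders):

```python
# Hypothetical label names; replace with the categories from your dataset.
id2label = {i: f"class_{i}" for i in range(20)}


def decode_labels(multi_hot):
    """Turn a multi-hot prediction vector into a list of label names."""
    return [id2label[i] for i, flag in enumerate(multi_hot) if flag == 1]
```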
## How to Use
### Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate
```
### Load the model
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, BitsAndBytesConfig, CLIPProcessor
from peft import PeftModel


class CLIPForMultiLabel(nn.Module):
    """CLIP image encoder with a multi-label classification head."""

    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, pixel_values):
        image_feats = self.backbone.get_image_features(pixel_values=pixel_values)
        return self.classifier(image_feats)


# Load the 8-bit quantized CLIP backbone and attach the LoRA adapter
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", quantization_config=quant_cfg)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")

# Load the classifier head weights
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu",
)
model.classifier.load_state_dict(state_dict)
model.eval()

# Load the processor that prepares images for the model
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
### Predict on an image
```python
from PIL import Image

image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"]

with torch.no_grad():
    logits = model(pixel_values)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int().cpu().numpy()

print("Predicted multi-hot vector:", preds)
```
## License
This model is released under the MIT license.
## Citation
If you use this model in your work, please cite this repository or acknowledge it appropriately.