# CLIP-ViT-Large LoRA Adapter for Multi-Label Image Classification
This is a lightweight multi-label image classifier built on `openai/clip-vit-large-patch14`, fine-tuned with LoRA (Low-Rank Adaptation) on top of an 8-bit quantized backbone (via bitsandbytes). It predicts 20 distinct image categories.
This repo contains only:

- The LoRA adapter weights (`adapter_model.safetensors`)
- The classifier head weights (`classifier_head.pt`)
- A sample loading script (in this README)
## Model Architecture

- Backbone: `openai/clip-vit-large-patch14`
- Quantization: 8-bit (`load_in_8bit=True`)
- Fine-tuning method: LoRA (r=16, alpha=32) via `peft`
- Classification head: `LayerNorm → Dropout → Linear(num_labels=20)`
## Training Details

- LoRA was applied to the attention projection modules `q_proj`, `k_proj`, `v_proj`, `out_proj` (configuration sketched after this list)
- Optimizer: AdamW
- Loss: Asymmetric Focal Loss with gamma_neg = 2 (sketched after this list)
- Epochs: 2 (selected via grid search over the learning rate and gamma_neg)
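The card reports the LoRA hyperparameters and the loss but not the training code. The following is a minimal sketch of how that setup could look; `lora_dropout=0.1`, `bias="none"`, and `gamma_pos=0` are assumptions that are not stated in this card.

```python
import torch
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# Attach a LoRA adapter to the attention projections of the CLIP backbone.
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
lora_cfg = LoraConfig(
    r=16,                                                        # rank, as reported above
    lora_alpha=32,                                               # scaling, as reported above
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],   # as reported above
    lora_dropout=0.1,                                            # assumption: not stated in this card
    bias="none",                                                 # assumption: not stated in this card
)
backbone = get_peft_model(base, lora_cfg)
backbone.print_trainable_parameters()


def asymmetric_focal_loss(logits, targets, gamma_neg=2.0, gamma_pos=0.0, eps=1e-8):
    """Asymmetric focal loss for multi-label classification (sketch).

    Negatives are down-weighted by probs**gamma_neg (gamma_neg=2 as reported
    above); gamma_pos=0 is an assumption.
    """
    probs = torch.sigmoid(logits)
    loss_pos = targets * (1 - probs) ** gamma_pos * torch.log(probs.clamp(min=eps))
    loss_neg = (1 - targets) * probs ** gamma_neg * torch.log((1 - probs).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()
```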
## Class Labels

The model supports 20 categories:

Class 0, Class 1, Class 2, ..., Class 19

These are placeholder names; replace them with the label names from your own dataset, for example via a simple index-to-name mapping as sketched below.
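A hypothetical mapping from class indices to names (the indices match the model's 20 outputs; the names are placeholders):

```python
# Hypothetical label names; replace with the categories from your dataset.
id2label = {i: f"class_{i}" for i in range(20)}


def decode_labels(multi_hot):
    """Turn a multi-hot prediction vector into a list of label names."""
    return [id2label[i] for i, flag in enumerate(multi_hot) if flag == 1]
```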
## How to Use
### Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate
```
### Load the model
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, BitsAndBytesConfig, CLIPProcessor
from peft import PeftModel


class CLIPForMultiLabel(nn.Module):
    """CLIP image encoder with a multi-label classification head."""

    def __init__(self, backbone, num_labels=20, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.projection_dim
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, pixel_values):
        image_feats = self.backbone.get_image_features(pixel_values=pixel_values)
        return self.classifier(image_feats)


# Load the 8-bit quantized CLIP backbone and attach the LoRA adapter
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", quantization_config=quant_cfg)
backbone = PeftModel.from_pretrained(base, "YOUR_USERNAME/clip-lora-multilabel")

# Load the classifier head weights
model = CLIPForMultiLabel(backbone, num_labels=20)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/YOUR_USERNAME/clip-lora-multilabel/resolve/main/classifier_head.pt",
    map_location="cpu",
)
model.classifier.load_state_dict(state_dict)
model.eval()

# Load the processor that prepares images for the model
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
### Predict on an image
```python
from PIL import Image

image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"]

with torch.no_grad():
    logits = model(pixel_values)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int().cpu().numpy()

print("Predicted multi-hot vector:", preds)
```
## License
This model is released under the MIT license.
## Citation
If you use this model in your work, please cite this repository or acknowledge it appropriately.