# LongCLIP: Unlocking the Long-Text Capability of CLIP

## Model Description
LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from 77 to 248 tokens, enabling better understanding of detailed, long-form text descriptions. This model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.

## Key Features

- 🔥 Extended Context Length: 248 tokens, about 3.2× the original CLIP limit of 77
- 🔥 Strong Performance: +20% R@5 on long-caption retrieval and +6% on standard retrieval compared with the original CLIP
- 🔥 Plug-and-Play: Drop-in replacement for CLIP in existing workflows
- 🔥 Two Model Sizes: Base (LongCLIP-B) and Large (LongCLIP-L)

## Model Variants

| Model | Text Encoder | Vision Encoder | Params | Projection Dim |
|---|---|---|---|---|
| LongCLIP-B | 12 layers, 512d | 12 layers, 768d | ~150M | 512 |
| LongCLIP-L | 12 layers, 768d | 24 layers, 1024d | ~430M | 768 |
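
As a quick sanity check on the table above, these dimensions can be read from the published config. A minimal sketch, assuming the custom LongCLIP config mirrors the standard `CLIPConfig` layout (`projection_dim`, `text_config`, `vision_config`); swap in the LongCLIP-L repo id to check the Large variant.

```python
from transformers import AutoConfig

# Inspect the published config; assumes it mirrors the standard CLIPConfig layout.
config = AutoConfig.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
print("projection dim:", config.projection_dim)            # expected: 512 for LongCLIP-B
print("text width:", config.text_config.hidden_size)       # expected: 512
print("vision width:", config.vision_config.hidden_size)   # expected: 768
print("max text length:", config.text_config.max_position_embeddings)  # expected: 248 for the extended context
```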

## Uses

### Direct Use

LongCLIP can be used for:
- Zero-shot image classification with detailed text descriptions (see the sketch after this list)
- Image-text retrieval with long, descriptive captions
- Text-to-image generation (e.g., Stable Diffusion XL integration)
- Visual question answering with complex queries
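
For the zero-shot classification use case above, class labels can be written out as detailed descriptions rather than single words. A minimal sketch, assuming `torch`, `model`, `processor`, and `image` are set up as in the Quick Start below; the class descriptions are made up.

```python
# Zero-shot classification with detailed, long-form class descriptions.
# Assumes `torch`, `model`, `processor`, and `image` from the Quick Start below;
# the class descriptions are illustrative placeholders.
class_descriptions = [
    "A photo of a golden retriever lying on a couch in a sunlit living room.",
    "A photo of a tabby cat curled up on a wooden windowsill next to a potted plant.",
    "A photo of an empty room containing only furniture and houseplants, with no animals.",
]
inputs = processor(
    text=class_descriptions,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
best = int(probs.argmax())
print(f"Predicted: {class_descriptions[best]} (p={probs[best].item():.2f})")
```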

### Downstream Use

LongCLIP serves as a backbone for:
- Vision-language models requiring long text understanding
- Multimodal retrieval systems (see the retrieval sketch after this list)
- Content-based image search engines
- Automated image captioning evaluation
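
A minimal sketch of the retrieval pattern above: embed a small corpus of long captions, normalize the embeddings, and rank the captions for an image by cosine similarity. It assumes `torch`, `model`, `processor`, and `image` from the Quick Start below; the captions are made up.

```python
# Rank a small corpus of long captions against one image by cosine similarity.
# Assumes `torch`, `model`, `processor`, and `image` from the Quick Start below;
# the captions are illustrative placeholders.
captions = [
    "A crowded farmers market on a rainy morning, with vendors selling vegetables under striped awnings.",
    "A quiet mountain lake at sunrise, mist rising off the water and pine trees reflected on the surface.",
    "A cluttered workshop bench covered in hand tools, wood shavings, and a half-finished birdhouse.",
]

text_inputs = processor(text=captions, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Normalize so that the dot product is a cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T)[0]

for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {captions[idx]}")
```

In a real retrieval system the caption embeddings would be computed once and stored in an index, with only the query embedded at search time.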

## How to Use

### Installation

```bash
pip install "transformers[torch,torch-vision]"
```

### Quick Start

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
print("Probabilities:", probs)
```

### Advanced Usage: Feature Extraction

```python
# Extract text and image features separately
text_inputs = processor(text=texts, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# The features are unnormalized; normalize them before computing cosine similarity,
# as the original CLIP does
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
logits = image_features @ text_features.T
probs = logits.softmax(dim=-1)
```
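
Note that the model's forward pass (used in the Quick Start) also multiplies the similarities by a learned temperature before the softmax, as CLIP-style models typically do, so the probabilities it returns will be sharper than a softmax over the raw cosine similarities computed here.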

### Comparison with Original CLIP

```python
# Original CLIP: max 77 tokens
clip_text = "A cat"

# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."

# LongCLIP can handle both short and long texts effectively!
```
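
To make the comparison concrete, the sketch below counts the tokens in both captions with the tokenizer bundled in the processor; it assumes `processor`, `clip_text`, and `longclip_text` from the snippets above, and that the processor exposes its tokenizer as `processor.tokenizer`, as standard CLIP processors do.

```python
# Count tokens for both captions (the count includes the special start/end-of-text tokens).
# Assumes `processor`, `clip_text`, and `longclip_text` are defined as above.
for name, text in [("short", clip_text), ("long", longclip_text)]:
    n_tokens = len(processor.tokenizer(text)["input_ids"])
    print(f"{name} caption: {n_tokens} tokens (CLIP limit: 77, LongCLIP limit: 248)")
```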

## Citation

If you use LongCLIP in your research, please cite:
```bibtex
@inproceedings{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

## License

This model is released under the MIT License, consistent with the original CLIP model.

## Acknowledgments

- OpenAI CLIP: Foundation model and architecture
- Original Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

## Model Card Contact

For questions and feedback, please open an issue on the GitHub repository.