# LongCLIP: Unlocking the Long-Text Capability of CLIP

## Model Description
LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from 77 to 248 tokens, enabling better understanding of detailed, long-form text descriptions. This model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.

## Key Features

- 🔥 Extended Context Length: 248 tokens, about 3.2× the original CLIP limit of 77
- 🔥 Strong Performance: +20% R@5 on long-caption retrieval and +6% on standard retrieval compared with the original CLIP
- 🔥 Plug-and-Play: Drop-in replacement for CLIP in existing workflows
- 🔥 Two Model Sizes: Base (LongCLIP-B) and Large (LongCLIP-L)

## Model Variants

| Model | Text Encoder | Vision Encoder | Params | Projection Dim |
|---|---|---|---|---|
| LongCLIP-B | 12 layers, 512d | 12 layers, 768d | ~150M | 512 |
| LongCLIP-L | 12 layers, 768d | 24 layers, 1024d | ~430M | 768 |
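
As a quick sanity check on the table above, these dimensions can be read from the published config. A minimal sketch, assuming the custom LongCLIP config mirrors the standard `CLIPConfig` layout (`projection_dim`, `text_config`, `vision_config`); swap in the LongCLIP-L repo id to check the Large variant.

```python
from transformers import AutoConfig

# Inspect the published config; assumes it mirrors the standard CLIPConfig layout.
config = AutoConfig.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
print("projection dim:", config.projection_dim)            # expected: 512 for LongCLIP-B
print("text width:", config.text_config.hidden_size)       # expected: 512
print("vision width:", config.vision_config.hidden_size)   # expected: 768
print("max text length:", config.text_config.max_position_embeddings)  # expected: 248 for the extended context
```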

## Uses

### Direct Use

LongCLIP can be used for:
- Zero-shot image classification with detailed text descriptions (see the sketch after this list)
- Image-text retrieval with long, descriptive captions
- Text-to-image generation (e.g., Stable Diffusion XL integration)
- Visual question answering with complex queries
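
For the zero-shot classification use case above, class labels can be written out as detailed descriptions rather than single words. A minimal sketch, assuming `torch`, `model`, `processor`, and `image` are set up as in the Quick Start below; the class descriptions are made up.

```python
# Zero-shot classification with detailed, long-form class descriptions.
# Assumes `torch`, `model`, `processor`, and `image` from the Quick Start below;
# the class descriptions are illustrative placeholders.
class_descriptions = [
    "A photo of a golden retriever lying on a couch in a sunlit living room.",
    "A photo of a tabby cat curled up on a wooden windowsill next to a potted plant.",
    "A photo of an empty room containing only furniture and houseplants, with no animals.",
]
inputs = processor(
    text=class_descriptions,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
best = int(probs.argmax())
print(f"Predicted: {class_descriptions[best]} (p={probs[best].item():.2f})")
```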

### Downstream Use

LongCLIP serves as a backbone for:
- Vision-language models requiring long text understanding
- Multimodal retrieval systems (see the retrieval sketch after this list)
- Content-based image search engines
- Automated image captioning evaluation
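
A minimal sketch of the retrieval pattern above: embed a small corpus of long captions, normalize the embeddings, and rank the captions for an image by cosine similarity. It assumes `torch`, `model`, `processor`, and `image` from the Quick Start below; the captions are made up.

```python
# Rank a small corpus of long captions against one image by cosine similarity.
# Assumes `torch`, `model`, `processor`, and `image` from the Quick Start below;
# the captions are illustrative placeholders.
captions = [
    "A crowded farmers market on a rainy morning, with vendors selling vegetables under striped awnings.",
    "A quiet mountain lake at sunrise, mist rising off the water and pine trees reflected on the surface.",
    "A cluttered workshop bench covered in hand tools, wood shavings, and a half-finished birdhouse.",
]

text_inputs = processor(text=captions, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Normalize so that the dot product is a cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T)[0]

for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {captions[idx]}")
```

In a real retrieval system the caption embeddings would be computed once and stored in an index, with only the query embedded at search time.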

## How to Use

### Installation

```bash
pip install "transformers[torch,torch-vision]"
```

### Quick Start

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
print("Probabilities:", probs)
```

### Advanced Usage: Feature Extraction

```python
# Extract text and image features separately
text_inputs = processor(text=texts, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# The features are unnormalized; normalize them before computing cosine similarity,
# as the original CLIP does
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
logits = image_features @ text_features.T
probs = logits.softmax(dim=-1)
```
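
Note that the model's forward pass (used in the Quick Start) also multiplies the similarities by a learned temperature before the softmax, as CLIP-style models typically do, so the probabilities it returns will be sharper than a softmax over the raw cosine similarities computed here.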

### Comparison with Original CLIP

```python
# Original CLIP: max 77 tokens
clip_text = "A cat"

# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."

# LongCLIP can handle both short and long texts effectively!
```
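
To make the comparison concrete, the sketch below counts the tokens in both captions with the tokenizer bundled in the processor; it assumes `processor`, `clip_text`, and `longclip_text` from the snippets above, and that the processor exposes its tokenizer as `processor.tokenizer`, as standard CLIP processors do.

```python
# Count tokens for both captions (the count includes the special start/end-of-text tokens).
# Assumes `processor`, `clip_text`, and `longclip_text` are defined as above.
for name, text in [("short", clip_text), ("long", longclip_text)]:
    n_tokens = len(processor.tokenizer(text)["input_ids"])
    print(f"{name} caption: {n_tokens} tokens (CLIP limit: 77, LongCLIP limit: 248)")
```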

## Citation

If you use LongCLIP in your research, please cite:
```bibtex
@inproceedings{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

## License

This model is released under the MIT License, consistent with the original CLIP model.

## Acknowledgments

- OpenAI CLIP: Foundation model and architecture
- Original Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

## Model Card Contact

For questions and feedback, please open an issue on the GitHub repository.