LongCLIP: Unlocking the Long-Text Capability of CLIP

Paper Conference GitHub

Model Description

LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from 77 to 248 tokens, enabling better understanding of detailed, long-form text descriptions. This model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.

Key Features

  • 🔥 Extended Context Length: 248 tokens (3.2× longer than original CLIP)
  • 🔥 Strong Performance: +20% R@5 on long-caption retrieval, +6% on standard retrieval
  • 🔥 Plug-and-Play: Drop-in replacement for CLIP in existing workflows
  • 🔥 Two Model Sizes: Base (LongCLIP-B) and Large (LongCLIP-L)

Model Variants

Model        Text Encoder      Vision Encoder     Params   Projection Dim
LongCLIP-B   12 layers, 512d   12 layers, 768d    ~150M    512
LongCLIP-L   12 layers, 768d   24 layers, 1024d   ~430M    768
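
As a quick sanity check, the projection dimension in the table is simply the size of the embeddings the model returns. Below is a minimal sketch for LongCLIP-B, reusing the loading pattern from the Quick Start further down; the printed shape is an expectation based on the table above, not an official guarantee.

from transformers import AutoModel, AutoProcessor
import torch

# Load LongCLIP-B (see "How to Use" below for the full walkthrough)
model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

# Encode a short caption and inspect the embedding dimension
inputs = processor(text=["a test caption"], return_tensors="pt", max_length=248, padding="max_length")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)

print(text_features.shape)  # expected: torch.Size([1, 512]) for LongCLIP-B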

Uses

Direct Use

LongCLIP can be used for:

  • Zero-shot image classification with detailed text descriptions
  • Image-text retrieval with long, descriptive captions
  • Text-to-image generation (e.g., Stable Diffusion XL integration)
  • Visual question answering with complex queries
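
For instance, zero-shot classification with detailed descriptions amounts to scoring one long description per class and taking the argmax. The sketch below reuses the loading pattern from the Quick Start further down; the class descriptions and image path are illustrative placeholders.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

# One detailed description per class (placeholders for illustration)
class_descriptions = [
    "A photo of a golden retriever lying on a lawn in the afternoon sun, tongue out, next to a red ball.",
    "A photo of a tabby cat curled up on a windowsill, with potted plants and city rooftops in the background.",
]

image = Image.open("your_image.jpg")
inputs = processor(
    text=class_descriptions,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image

predicted = logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print("Predicted class index:", predicted)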

Downstream Use

LongCLIP serves as a backbone for:

  • Vision-language models requiring long text understanding
  • Multimodal retrieval systems
  • Content-based image search engines
  • Automated image captioning evaluation
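
As a small sketch of the retrieval direction (one long text query against several candidate images), you can encode a gallery and rank it by cosine similarity. The image paths below are placeholders, and the model is loaded as in the Quick Start further down.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

query = (
    "A crowded farmers market on a rainy morning, with vendors under striped awnings "
    "selling vegetables and flowers while people walk by holding umbrellas."
)
gallery = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]  # placeholder paths

text_inputs = processor(text=[query], return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=gallery, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# Normalize and rank the gallery by cosine similarity to the query
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
scores = (text_features @ image_features.T).squeeze(0)
ranking = scores.argsort(descending=True)
print("Gallery ranking (best match first):", ranking.tolist())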

How to Use

Installation

pip install "transformers[torch,torch-vision]"

Quick Start

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene."
]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length"
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=-1)

print("Probabilities:", probs)

Advanced Usage: Feature Extraction

# Extract features separately (returned unnormalized)
text_inputs = processor(text=texts, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

    # Normalize before computing cosine similarity, as original CLIP does
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each text; the model's forward pass
    # additionally scales these logits by its learned temperature before the
    # softmax shown in the Quick Start
    logits = image_features @ text_features.T
    probs = logits.softmax(dim=-1)

Comparison with Original CLIP

# Original CLIP: max 77 tokens
clip_text = "A cat"

# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."

# LongCLIP can handle both short and long texts effectively!
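
To make the difference concrete, you can count tokens with the processor's tokenizer. This small sketch continues from the Quick Start and the snippet above; it assumes the processor exposes its tokenizer as processor.tokenizer, as the standard CLIPProcessor does.

# Count tokens for each caption: the long caption exceeds CLIP's 77-token limit
# but fits comfortably within LongCLIP's 248.
short_ids = processor.tokenizer(clip_text)["input_ids"]
long_ids = processor.tokenizer(longclip_text)["input_ids"]
print(f"Short caption: {len(short_ids)} tokens")
print(f"Long caption:  {len(long_ids)} tokens")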

Citation

If you use LongCLIP in your research, please cite:

@inproceedings{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}

License

This model is released under the MIT License, consistent with the original CLIP model.

Acknowledgments

  • OpenAI CLIP: Foundation model and architecture
  • Original Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

Model Card Contact

For questions and feedback, please open an issue on the GitHub repository.
