- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2407.15841

- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
  Paper • 2407.15841 • Published • 40
- Stable Audio Open
  Paper • 2407.14358 • Published • 26
- PlacidDreamer: Advancing Harmony in Text-to-3D Generation
  Paper • 2407.13976 • Published • 5
- Efficient Audio Captioning with Encoder-Level Knowledge Distillation
  Paper • 2407.14329 • Published • 5

- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
  Paper • 2407.12679 • Published • 8
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
  Paper • 2407.15841 • Published • 40
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 52

- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 29
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 75
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 51
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
  Paper • 2405.21075 • Published • 26

- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
  Paper • 2411.14796 • Published
- LLaVAction: evaluating and training multi-modal large language models for action recognition
  Paper • 2503.18712 • Published • 3
- FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
  Paper • 2402.03241 • Published
- Leveraging Temporal Contextualization for Video Action Recognition
  Paper • 2404.09490 • Published

- Distilling Vision-Language Models on Millions of Videos
  Paper • 2401.06129 • Published • 17
- Koala: Key frame-conditioned long video-LLM
  Paper • 2404.04346 • Published • 7
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
  Paper • 2404.05726 • Published • 23
- OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
  Paper • 2406.07471 • Published • 1

- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 34