Collections

Discover the best community collections!

Collections including paper arxiv:2406.09246

mllm applications

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Paper • 2404.05719 • Published Apr 8, 2024 • 83
OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13, 2024 • 41

Foundation Models

OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 85
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Paper • 2403.05530 • Published Mar 8, 2024 • 66
StarCoder: may the source be with you!

Paper • 2305.06161 • Published May 9, 2023 • 31
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling

Paper • 2312.15166 • Published Dec 23, 2023 • 60

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

A collection of Audio, Video and Visual LLMs.

myshell-ai/OpenVoice

Text-to-Speech • Updated Dec 24, 2024 • 484
OpenVoice

Running • Featured • 1.12k
Generate voice from text using a reference audio
dataautogpt3/ProteusV0.3

Text-to-Image • Updated Feb 12, 2024 • 191k • 94
ByteDance/SDXL-Lightning

Text-to-Image • Updated Apr 3, 2024 • 122k • 2.11k

LEAP Hand: Low-Cost, Efficient, and Anthropomorphic Hand for Robot Learning

Paper • 2309.06440 • Published Sep 12, 2023 • 11
Robotic Table Tennis: A Case Study into a High Speed Learning System

Paper • 2309.03315 • Published Sep 6, 2023 • 7
Video Language Planning

Paper • 2310.10625 • Published Oct 16, 2023 • 11
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

Paper • 2311.01455 • Published Nov 2, 2023 • 30

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51
Training Software Engineering Agents and Verifiers with SWE-Gym

Paper • 2412.21139 • Published Dec 30, 2024 • 24
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Paper • 2412.19723 • Published Dec 27, 2024 • 87
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation

Paper • 2408.00764 • Published Aug 1, 2024 • 1

Vision Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 26
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Paper • 2404.12803 • Published Apr 19, 2024 • 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19, 2024 • 31
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 30

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Paper • 2306.17107 • Published Jun 29, 2023 • 11
On the Hidden Mystery of OCR in Large Multimodal Models

Paper • 2305.07895 • Published May 13, 2023 • 1
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 11
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29, 2024 • 53

Vision-Language

SILC: Improving Vision Language Pretraining with Self-Distillation

Paper • 2310.13355 • Published Oct 20, 2023 • 9
Woodpecker: Hallucination Correction for Multimodal Large Language Models

Paper • 2310.16045 • Published Oct 24, 2023 • 17
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Paper • 2201.12086 • Published Jan 28, 2022 • 3
ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

Paper • 2305.15028 • Published May 24, 2023 • 1
