Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents Paper • 2509.26539 • Published Sep 30 • 8
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19 • 56
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms Paper • 2410.18967 • Published Oct 24, 2024 • 1
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation Paper • 2503.10618 • Published Mar 13 • 18
STIV: Scalable Text and Image Conditioned Video Generation Paper • 2412.07730 • Published Dec 10, 2024 • 74
Improve Vision Language Model Chain-of-thought Reasoning Paper • 2410.16198 • Published Oct 21, 2024 • 26
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8, 2024 • 111
MM-Ego: Towards Building Egocentric Multimodal LLMs Paper • 2410.07177 • Published Oct 9, 2024 • 22 • 3
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3, 2024 • 54
Contrastive Localized Language-Image Pre-Training Paper • 2410.02746 • Published Oct 3, 2024 • 37 • 3