Haotian Zhang

haotiz

AI & ML interests

Vision and Language

Recent Activity

liked a dataset 2 days ago

nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams

Updated Jun 15 • 16.1k • 31

upvoted a paper 2 months ago

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Paper • 2509.26539 • Published Sep 30 • 8

authored 2 papers 3 months ago

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19 • 56

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Paper • 2410.18967 • Published Oct 24, 2024 • 1

upvoted 2 papers 3 months ago

AToken: A Unified Tokenizer for Vision

Paper • 2509.14476 • Published Sep 17 • 36

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19 • 56

liked a model 7 months ago

reducto/RolmOCR

Image-to-Text • 8B • Updated Apr 2 • 3.88k • 567

upvoted a paper 9 months ago

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Paper • 2503.10618 • Published Mar 13 • 18

upvoted a paper 12 months ago

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 74

authored a paper about 1 year ago

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

upvoted 2 papers about 1 year ago

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8, 2024 • 111

authored a paper about 1 year ago

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

upvoted 2 papers about 1 year ago

Pixtral 12B

Paper • 2410.07073 • Published Oct 9, 2024 • 68

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

commented a paper about 1 year ago

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

authored 2 papers about 1 year ago

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Paper • 2410.02740 • Published Oct 3, 2024 • 54

Contrastive Localized Language-Image Pre-Training

Paper • 2410.02746 • Published Oct 3, 2024 • 37

upvoted a paper about 1 year ago

Contrastive Localized Language-Image Pre-Training

Paper • 2410.02746 • Published Oct 3, 2024 • 37

commented a paper about 1 year ago

Contrastive Localized Language-Image Pre-Training

Paper • 2410.02746 • Published Oct 3, 2024 • 37
