-
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Paper • 2408.11812 • Published • 6 -
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper • 2307.13854 • Published • 25 -
Agent Workflow Memory
Paper • 2409.07429 • Published • 32 -
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 48
Collections
Discover the best community collections!
Collections including paper arxiv:2408.06327
-
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 71 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 132 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 34
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Paper • 2408.06292 • Published • 126 -
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper • 2408.06195 • Published • 73 -
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Paper • 2408.06327 • Published • 17 -
Imagen 3
Paper • 2408.07009 • Published • 62
-
End-to-End Goal-Driven Web Navigation
Paper • 1602.02261 • Published -
Learning Language Games through Interaction
Paper • 1606.02447 • Published -
Naturalizing a Programming Language via Interactive Learning
Paper • 1704.06956 • Published -
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Paper • 1802.08802 • Published • 1
-
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 29 -
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 15 -
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 50 -
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Paper • 2405.12981 • Published • 33
-
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Paper • 2408.11812 • Published • 6 -
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper • 2307.13854 • Published • 25 -
Agent Workflow Memory
Paper • 2409.07429 • Published • 32 -
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Paper • 2409.08264 • Published • 48
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Paper • 2408.06292 • Published • 126 -
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper • 2408.06195 • Published • 73 -
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Paper • 2408.06327 • Published • 17 -
Imagen 3
Paper • 2408.07009 • Published • 62
-
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 71 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 132 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90
-
End-to-End Goal-Driven Web Navigation
Paper • 1602.02261 • Published -
Learning Language Games through Interaction
Paper • 1606.02447 • Published -
Naturalizing a Programming Language via Interactive Learning
Paper • 1704.06956 • Published -
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Paper • 1802.08802 • Published • 1
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 34
-
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 29 -
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 15 -
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 50 -
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Paper • 2405.12981 • Published • 33
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23