TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
Abstract
TwinBrainVLA addresses the tension between semantic understanding and motor skills in robot control by coordinating a generalist vision-language model with a specialist model through an asymmetric mixture-of-transformers mechanism.
Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone end-to-end for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM, which retains universal semantic understanding, with a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which preserves robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert that generates precise continuous controls. Extensive experiments on the SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance over state-of-the-art baselines while explicitly preserving the comprehensive visual understanding of the pre-trained VLM, offering a promising direction for general-purpose robots that combine high-level semantic understanding with low-level physical dexterity.
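The abstract only names the components, so the sketch below is a rough, hypothetical PyTorch illustration of how a frozen Left Brain, a trainable Right Brain that cross-attends into it (an assumed form of the AsyMoT coupling), proprioceptive tokens, and a flow-matching action head could be wired together. All class names, dimensions, pooling choices, and the left-brain interface are assumptions and are not taken from the paper.

```python
# Hypothetical sketch of a TwinBrainVLA-style forward pass, based only on the abstract.
# The actual AsyMoT layer layout and flow-matching parameterization may differ.
import torch
import torch.nn as nn


class AsyMoTBlock(nn.Module):
    """One trainable Right-Brain block that queries frozen Left-Brain features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor, left_ctx: torch.Tensor) -> torch.Tensor:
        # Self-attention over embodied (vision + proprioception) tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Asymmetric coupling: queries come from the Right Brain only; the frozen
        # Left-Brain context serves as keys/values and its parameters get no gradients.
        x = x + self.cross_attn(self.norm2(x), left_ctx, left_ctx)[0]
        return x + self.mlp(self.norm3(x))


class TwinBrainVLASketch(nn.Module):
    def __init__(self, left_brain: nn.Module, dim: int = 1024, depth: int = 6,
                 proprio_dim: int = 32, action_dim: int = 7):
        super().__init__()
        self.left_brain = left_brain                      # frozen generalist VLM (placeholder)
        for p in self.left_brain.parameters():
            p.requires_grad_(False)
        self.proprio_proj = nn.Linear(proprio_dim, dim)
        self.right_brain = nn.ModuleList([AsyMoTBlock(dim) for _ in range(depth)])
        # Flow-matching action expert: predicts a velocity for noisy actions at time t,
        # conditioned on the fused Right-Brain features.
        self.action_expert = nn.Sequential(
            nn.Linear(dim + action_dim + 1, dim), nn.GELU(), nn.Linear(dim, action_dim))

    def forward(self, image_tokens, text_tokens, proprio, noisy_actions, t):
        with torch.no_grad():                             # Left Brain stays frozen
            left_ctx = self.left_brain(image_tokens, text_tokens)
        x = torch.cat([image_tokens, self.proprio_proj(proprio)[:, None]], dim=1)
        for blk in self.right_brain:
            x = blk(x, left_ctx)
        cond = x.mean(dim=1)                              # pooled conditioning vector
        return self.action_expert(torch.cat([cond, noisy_actions, t[:, None]], dim=-1))
```

Under this assumed wiring, only the Right-Brain blocks, the proprioception projection, and the action expert receive gradients, which mirrors the paper's claim that freezing the generalist VLM preserves its open-world understanding while the specialist branch learns sensorimotor control.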
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries (2026)
- InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2026)
- VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models (2026)
- PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence (2025)
- Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training (2025)
- CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026)
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies (2025)
