arxiv:2512.16793

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Published on Dec 18 · Submitted by lin on Dec 22 · #2 Paper of the day · DeepCybo
Authors: Bin Yu et al.

Abstract

AI-generated summary

The proposed Egocentric2Embodiment pipeline translates human egocentric videos into structured training data for robots, improving their egocentric understanding and task performance.

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
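
The page does not reproduce the paper's actual schema, so the sketch below is only a rough illustration of what schema-driven VQA supervision with evidence grounding and a temporal-consistency check could look like for one egocentric clip. All field names, levels, and the frame-rate assumption are hypothetical, not the E2E-3M format.

```python
# Minimal sketch (not the authors' code): a hypothetical record layout for
# schema-driven VQA supervision derived from an egocentric video clip.
# All field names are illustrative assumptions, not the E2E-3M schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceSpan:
    """Frames that ground an answer in the source video (evidence grounding)."""
    start_frame: int
    end_frame: int


@dataclass
class VQAItem:
    """One question-answer pair at a given supervision level, tied to explicit visual evidence."""
    level: str                 # e.g. "state_change", "contact", "planning" (illustrative)
    question: str
    answer: str
    evidence: List[EvidenceSpan] = field(default_factory=list)


@dataclass
class ClipRecord:
    """All VQA supervision generated for a single egocentric clip."""
    video_id: str
    clip_start_s: float
    clip_end_s: float
    items: List[VQAItem] = field(default_factory=list)

    def temporally_consistent(self) -> bool:
        """Toy consistency check: every evidence span must lie inside the clip
        and be well ordered (a stand-in for the paper's temporal-consistency
        enforcement, whose actual rules are not described on this page)."""
        fps = 30.0  # assumed frame rate for this sketch
        last_frame = int((self.clip_end_s - self.clip_start_s) * fps)
        return all(
            0 <= span.start_frame <= span.end_frame <= last_frame
            for item in self.items
            for span in item.evidence
        )


if __name__ == "__main__":
    record = ClipRecord(
        video_id="ego_000123",
        clip_start_s=12.0,
        clip_end_s=18.0,
        items=[
            VQAItem(
                level="state_change",
                question="What happens to the cup after the right hand grasps it?",
                answer="It is lifted off the table and tilted toward the sink.",
                evidence=[EvidenceSpan(start_frame=40, end_frame=95)],
            )
        ],
    )
    print(record.temporally_consistent())  # True
```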

Community

Paper author · Paper submitter

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/physbrain-human-egocentric-data-as-a-bridge-from-vision-language-models-to-physical-intelligence-3614-1d27535a

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.16793 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.16793 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.16793 in a Space README.md to link it from this page.

Collections including this paper 3