VideoWorld 2
Learning Transferable Knowledge from Real-world Videos

¹Beijing Jiaotong University, ²ByteDance Seed

Figure 1: (Left) VideoWorld 2 explores how to learn transferable knowledge from unlabeled real-world videos. (Right) Comparison of different frameworks. VDM (e.g., Wan2.2 14B) produces high visual fidelity but fails to learn task-relevant dynamics or long-horizon policies. VideoWorld 1 improves policy learning but suffers from poor visual quality in real-world scenarios. VideoWorld 2 learns more robust latent dynamics while also achieving significantly better visual quality, enabling generalizable long-horizon knowledge learning from videos.


Abstract

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles appearance modeling, allowing the dLDM to learn latent codes that capture compact, meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to a 70% improvement in task success rate and produces coherent videos of long task executions. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos; all code, data, and models will be open-sourced for further research.


🔥 Highlights

1. We are the first to explore how to learn transferable world knowledge for complex long-horizon tasks directly from raw real-world videos, and we reveal that disentangling action dynamics from visual appearance is essential for successful knowledge learning.

2. We propose VideoWorld 2, whose core is a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples task-relevant dynamics from visual appearance, enhancing the quality and transferability of learned knowledge.

3. We construct Video-CraftBench to address the rarely explored challenge of fine-grained, long-horizon visual reasoning through real-world handcraft tasks. This benchmark facilitates future research on learning transferable knowledge from raw videos.

Background

Current AI models primarily learn knowledge from large-scale text data. However, text alone cannot fully capture the rich information of the real visual world, including world dynamics, spatial relationships, and underlying physical laws. In contrast, animals in nature can acquire knowledge directly from visual signals and generalize it to solve tasks across diverse scenarios. For instance, a child can reproduce paper-folding skills demonstrated in a video using different paper materials, without any language instruction. Given the vast abundance of video content available on the internet, enabling AI models to learn generalizable knowledge from raw video data holds significant promise for scaling their knowledge acquisition and is fundamental to their ability to execute tasks effectively in both real-world and digital environments.

VideoWorld is among the first works to explore learning knowledge from synthetic videos. It investigates the acquisition of rules, as well as reasoning and planning capabilities, from Go game records and simulated robotics environments. The study demonstrates that models can learn such knowledge solely from visual signals using an autoregressive video generation paradigm. However, extending this paradigm beyond synthetic domains remains an open challenge. Real-world videos exhibit substantial visual diversity, complex action dynamics, and often involve long-horizon, multi-step interactions. These characteristics prevent the training approach and model design of VideoWorld from being directly applied to realistic settings. When presented with minute-long, multi-step real-world task videos, VideoWorld fails to extract the core task-solving knowledge or generalize it to novel scenarios through observation alone--even for tasks such as paper folding that are easily mastered by children. These limitations naturally lead to the following question: Can AI models learn transferable knowledge for complex, long-horizon tasks directly from unlabeled real-world videos?


Our Work

To explore this question, we consider two challenging real-world environments. The first is handcraft making, which serves as a strong testbed for learning task knowledge from raw video. These tasks demand fine-grained manipulation across varied desktop environments and object appearances, and the videos involve deformable materials, viewpoint shifts, and frequent occlusions. Furthermore, the videos span minutes and comprise multiple interdependent steps, presenting significantly higher complexity and longer horizons than entertainment-oriented video generation or typical imitation settings. In parallel, we investigate robotic manipulation by learning from the Open-X dataset, which contains real-world demonstration videos, and evaluating on the CALVIN environment to test the generalization of the learned knowledge. Together, these environments provide a comprehensive test of whether knowledge learned from raw videos can transfer across scenes, tasks, and embodiments.

Key Steps
Figure 2: Our objective is to learn transferable knowledge for complex, long-horizon tasks from real-world videos. We prioritize tasks that demand multi-step planning and feature delicate manipulations. To this end, we introduce Video-CraftBench, a dataset of first-person video tutorials covering five long-horizon handcraft tasks: folding a paper airplane, folding a paper boat, and building a tower, a horse, and a person from blocks.

Overall Architecture

Figure 3: (Left) First, a dLDM compresses future visual changes into compact and generalizable latent codes. These codes are then modeled by an autoregressive transformer. (Right) At inference time, the transformer predicts latent codes for a new, unseen environment from the input image; these codes are subsequently decoded into task execution videos.

We propose VideoWorld 2, which features a dynamic-enhanced Latent Dynamics Model (dLDM) as its core design to effectively decouple appearance modeling from action dynamics learning, thereby enabling robust knowledge acquisition. The dLDM consists of a causal VQ-VAE and a pretrained Video Diffusion Model (VDM). The VQ-VAE compresses future visual changes into discrete latent codes that capture task-relevant dynamics, while the VDM models visual appearance and produces high-fidelity reconstructions. The latent codes condition the VDM through cross-attention, and gradients from the VDM further refine these codes so they focus on concise and transferable dynamics rather than appearance details. By delegating appearance modeling to the pretrained VDM, VideoWorld 2 learns more robust and generalizable latent dynamics than prior approaches.
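To make this concrete, below is a minimal PyTorch-style sketch of a dLDM-style training step. It is not the authors' implementation: the frame resolution, codebook size, and module names (LatentCodeEncoder, CrossAttnDenoiser, dldm_training_step) are illustrative assumptions, and a toy convolutional denoiser stands in for the pretrained VDM. Only the core mechanics from the description above are kept: a VQ encoder compresses the change between the context frame and a future frame into a few discrete codes, the codes condition the denoiser through cross-attention, and the diffusion loss backpropagates through a straight-through estimator so the codes are shaped by dynamics rather than appearance.

# Minimal sketch of a dLDM-style training step (illustrative, not the authors' code).
# Assumptions: 64x64 RGB frames, a toy conv denoiser standing in for the pretrained
# VDM, and made-up module/parameter names; only the core mechanics are kept.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCodeEncoder(nn.Module):
    """VQ encoder: compresses the visual change between the context frame and a
    future frame into a handful of discrete latent codes."""
    def __init__(self, num_codes=512, code_dim=256, grid=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, code_dim, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(grid),                       # grid*grid code slots
        )
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, context, future):
        z = self.conv(torch.cat([context, future], dim=1))    # (B, D, g, g)
        z = z.flatten(2).transpose(1, 2)                      # (B, g*g, D)
        dist = (z.pow(2).sum(-1, keepdim=True)                # squared distances to codebook
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(-1)                                  # discrete code indices
        q = self.codebook(idx)
        z_q = z + (q - z).detach()                             # straight-through estimator
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        return z_q, idx, vq_loss

class CrossAttnDenoiser(nn.Module):
    """Toy stand-in for the pretrained VDM: denoises the future frame conditioned on
    the context frame and, via cross-attention, on the latent dynamics codes."""
    def __init__(self, code_dim=256, width=128):
        super().__init__()
        self.in_conv = nn.Conv2d(6, width, 3, padding=1)       # noisy future + context frame
        self.attn = nn.MultiheadAttention(width, num_heads=4, kdim=code_dim,
                                          vdim=code_dim, batch_first=True)
        self.time_mlp = nn.Sequential(nn.Linear(1, width), nn.SiLU(), nn.Linear(width, width))
        self.out_conv = nn.Conv2d(width, 3, 3, padding=1)      # predicts the added noise

    def forward(self, noisy_future, context, codes, t):
        b, _, h, w = noisy_future.shape
        x = self.in_conv(torch.cat([noisy_future, context], dim=1))
        x = x + self.time_mlp(t[:, None].float())[:, :, None, None]
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, width)
        attn_out, _ = self.attn(tokens, codes, codes)          # cross-attention to the codes
        x = (tokens + attn_out).transpose(1, 2).reshape(b, -1, h, w)
        return self.out_conv(x)

def dldm_training_step(encoder, denoiser, context, future, opt):
    """One joint step: the codes condition the denoiser, and the diffusion loss
    flows back through the codes so they capture dynamics rather than appearance."""
    codes, idx, vq_loss = encoder(context, future)
    t = torch.randint(0, 1000, (future.size(0),), device=future.device)
    alpha = (1.0 - t.float() / 1000).view(-1, 1, 1, 1)         # toy linear noise schedule
    noise = torch.randn_like(future)
    noisy_future = alpha.sqrt() * future + (1 - alpha).sqrt() * noise
    loss = F.mse_loss(denoiser(noisy_future, context, codes, t), noise) + vq_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item(), idx                                     # idx: targets for the AR model

# Usage with random tensors (batch of 2, 64x64 frames):
enc, den = LatentCodeEncoder(), CrossAttnDenoiser()
opt = torch.optim.AdamW(list(enc.parameters()) + list(den.parameters()), lr=1e-4)
ctx, fut = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
loss_value, code_ids = dldm_training_step(enc, den, ctx, fut, opt)

In the paper's setup the VDM is a pretrained video diffusion model that serves as the appearance prior; in this self-contained sketch both modules are optimized jointly on random tensors purely for brevity, and the returned code indices are what the autoregressive transformer is later trained to predict.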


dLDM

The proposed dynamic-enhanced Latent Dynamics Model (dLDM). (Left) The latent dynamics model in VideoWorld: visual changes between the first and subsequent frames are compressed into a set of latent codes. (Right) The dLDM proposed in VideoWorld 2: it employs a pre-trained VDM as an appearance prior, yielding better latent codes and facilitating high-fidelity video output.
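As a complement to the training-step sketch above, the following illustrates the inference-time rollout described in Figure 3 (right), under the same illustrative assumptions: a small decoder-only transformer (here a hypothetical CodePolicy) autoregressively predicts the next discrete dynamics code. For brevity, the conditioning on the observed input image and the diffusion-based decoding of the predicted codes into video frames are omitted.

# Minimal sketch of autoregressive code prediction at inference (illustrative only).
import torch
import torch.nn as nn

class CodePolicy(nn.Module):
    """Decoder-only transformer over dLDM code indices (sizes are assumptions)."""
    def __init__(self, num_codes=512, dim=256, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes + 1, dim)             # +1 for a BOS token
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):                                     # idx: (B, T) code indices
        t = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(t, device=idx.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        return self.head(self.blocks(x, mask=mask))             # (B, T, num_codes)

@torch.no_grad()
def rollout(policy, bos_id=512, steps=16):
    """Greedy rollout: predict one dynamics code at a time for a new scene; the
    resulting codes would then be decoded into an execution video by the dLDM."""
    seq = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(steps):
        logits = policy(seq)[:, -1]                             # logits for the next code
        seq = torch.cat([seq, logits.argmax(-1, keepdim=True)], dim=1)
    return seq[:, 1:]                                            # predicted code indices

predicted_codes = rollout(CodePolicy().eval())                   # shape (1, 16)

In VideoWorld 2, the transformer is additionally conditioned on the visual observation of the new environment, and the predicted codes are handed to the dLDM's diffusion decoder to render the task execution video.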

BibTeX

@misc{ren2026videoworld2,
  title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos}, 
  author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
  year={2026},
  eprint={2602.10102},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10102}, 
}