VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Beijing Jiaotong University, University of Science and Technology of China, ByteDance Seed
(Correspondence, Project Lead)

Figure 1: VideoWorld explores learning knowledge from raw videos, ranging from task-specific rules to high-level reasoning and planning capabilities. Compared with other learning methods, such as reinforcement learning (RL), supervised learning (SL), and text-based learning, it offers three advantages: 1. better generalization through a unified visual representation for various tasks and interfaces, 2. a lower manual annotation burden, and 3. richer real-world information than text descriptions.


Figure 2: VideoWorld plays Go by generating the next board state.



Figure 3: VideoWorld controls robotic arms across different environments.

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an autoregressive video generation model trained on raw video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) increasing the compactness of visual representations significantly enhances learning efficiency. To improve both the efficiency and efficacy of knowledge learning, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models to be open-sourced for further research.



🔥Highlights

1. We explore, for the first time, whether video generation models can learn sophisticated knowledge, and observe two key findings: i) merely observing videos suffices to learn complex tasks, and ii) compact representations of visual changes greatly enhance knowledge learning.


2. We propose VideoWorld, which leverages a latent dynamics model to represent multi-step visual changes, boosting both the efficiency and effectiveness of knowledge acquisition.


3. We construct Video-GoBench, a large-scale video-based Go dataset for training and evaluation, facilitating future research on knowledge learning from pure videos.



Background

The next token prediction training paradigm has endowed large language models (LLMs) with remarkable world knowledge and intelligence, enabling them to help address complex tasks that require reasoning, planning ahead, and decision-making. However, language alone cannot fully capture all forms of knowledge or encompass the vast information present in the real world. In nature, biological organisms acquire knowledge primarily through visual information, rather than relying solely on language. For instance, gorillas and other primates learn vital skills like foraging and social interactions mainly through visual observation, mimicking adult behaviors without relying on language.

Most existing research has focused on learning knowledge from text, while relatively little attention has been given to learning from pure visual signals. Some studies, such as UniPi, have explored using video data to train models for robot manipulation, but they still rely heavily on language instructions. Moreover, these tasks are often limited to single commands, without requiring complex reasoning or planning. This raises an important question: can an AI model learn sophisticated knowledge solely from visual input, akin to how a gorilla learns from its environment?


Our Work

In this work, we take an initial step toward exploring knowledge learning from raw video data by leveraging the next token prediction paradigm. To achieve this, we construct two experimental environments to collect pure visual training data: Go and robotic manipulation.

We begin our investigation with a basic video generation model based on a VQ-VAE and an autoregressive transformer. The raw videos of task executions, collected from the environments described above, serve as our training data and represent the sole source of knowledge. We convert video frames into discrete tokens using the VQ-VAE. Similar to large language models (LLMs), we train an autoregressive transformer on these tokens, employing the next-token (or next-frame) prediction paradigm. During testing, the model generates new frames based on prior frames, and task-specific operations, such as moves in Go or robotic operations, are derived from the newly generated frames.
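For concreteness, the sketch below shows the shape of this baseline training loop in PyTorch. It assumes frames have already been mapped to discrete indices by a pretrained VQ-VAE; the vocabulary size, tokens per frame, and model dimensions are illustrative placeholders, not VideoWorld's actual configuration.

# Minimal sketch of the baseline pipeline, assuming frames have already been mapped
# to discrete indices by a pretrained VQ-VAE. Vocabulary size, tokens per frame, and
# model dimensions are illustrative placeholders, not VideoWorld's actual config.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192               # assumed VQ-VAE codebook size
TOKENS_PER_FRAME = 256     # e.g. a 16x16 grid of code indices per frame

class TinyARTransformer(nn.Module):
    def __init__(self, vocab=VOCAB, dim=512, layers=8, heads=8, max_len=4096):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, idx):                              # idx: (B, T) token indices
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.blocks(x, mask=mask)                    # causal self-attention
        return self.head(x)                              # (B, T, vocab) next-token logits

def train_step(model, frame_tokens, optimizer):
    """frame_tokens: (B, num_frames * TOKENS_PER_FRAME) indices from the VQ-VAE."""
    logits = model(frame_tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), frame_tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tokens standing in for a 4-frame VQ-VAE-encoded clip.
model = TinyARTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
fake_clip_tokens = torch.randint(0, VOCAB, (2, 4 * TOKENS_PER_FRAME))
print(train_step(model, fake_clip_tokens, opt))

At test time, the same model would be sampled token by token to produce the next frame, from which the Go move or robot action is then read off.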

Key findings

1. The model can learn basic knowledge purely from video generation. This is evidenced by its ability to master Go rules and learn fundamental robotic operations.

2. The representation of visual change is crucial for knowledge learning. While videos contain sufficient information for task completion, redundant representations of the visual changes related to key decisions and actions hinder learning efficiency. A compact representation is essential for enhancing the model's learning efficiency and knowledge acquisition.

Latent Dynamics Model

Building on the observations above, we propose the Latent Dynamics Model (LDM), which enhances both the efficiency and effectiveness of video learning while providing a mechanism to probe the model's learned knowledge. The LDM compresses future visual changes into a set of latent codes that serve as a compact representation of multi-step visual context. This allows the model to predict both video frames and latent codes during training, improving its ability to capture and reason about diverse visual information, such as object interactions and scene dynamics. In the figure below, the video generation model with LDM achieves superior training efficiency and demonstrates a 5-dan professional level of performance against RL agents.


Figure 4: "State", "Video", and "Video w/ LDM" refer to three different prediction targets: a state sequence (e.g., labeled positions of moves in Go), a raw video sequence, and a video sequence augmented with latent codes representing future visual changes (this approach is adopted by VideoWorld). "Action-Value" denotes the score for each move in the game. By combining rich video information with a compact representation of visual changes, VideoWorld enables more effective learning.
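As a rough illustration of the idea, the sketch below implements a toy latent dynamics model: for each frame, the change to each of the next H frames is compressed into one quantized latent vector, and a decoder reconstructs that future frame from the current frame plus the latent. The layer sizes, codebook size, and VQ-style quantizer here are assumptions for illustration, not the paper's exact design.

# Toy latent dynamics model: for each frame, compress the change to each of the next
# H frames into one quantized latent, and reconstruct that future frame from the
# current frame plus the latent. Layer sizes, codebook size, and the VQ-style
# quantizer are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamicsModel(nn.Module):
    def __init__(self, latent_dim=32, codebook_size=512, H=9):
        super().__init__()
        self.H = H
        # encoder sees frame t and frame t+h stacked on channels -> one latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # decoder reconstructs frame t+h from frame t plus the quantized latent
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + latent_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def quantize(self, z):
        # nearest-neighbour codebook lookup with a straight-through gradient
        dists = torch.cdist(z, self.codebook.weight)      # (B, codebook_size)
        idx = dists.argmin(dim=-1)
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx

    def forward(self, frames):                            # frames: (B, H+1, 3, h, w)
        f0 = frames[:, 0]
        recons, codes = [], []
        for h in range(1, self.H + 1):
            z = self.encoder(torch.cat([f0, frames[:, h]], dim=1))
            zq, idx = self.quantize(z)
            codes.append(idx)
            cond = zq[:, :, None, None].expand(-1, -1, f0.shape[-2], f0.shape[-1])
            recons.append(self.decoder(torch.cat([f0, cond], dim=1)))
        return torch.stack(recons, dim=1), torch.stack(codes, dim=1)

# Toy check: 2 clips of H+1 = 10 RGB frames at 64x64.
ldm = LatentDynamicsModel()
clip = torch.rand(2, 10, 3, 64, 64)
recon, codes = ldm(clip)                                  # codes: (2, 9) discrete indices
loss = F.mse_loss(recon, clip[:, 1:])                     # reconstruction objective (sketch)
print(recon.shape, codes.shape, loss.item())

The discrete indices returned by quantize are the compact codes that the autoregressive transformer is trained to predict alongside the frame tokens.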


Overall Architecture

Figure 5: Overview of the proposed VideoWorld model architecture. (Left) Overall architecture. (Right) The proposed latent dynamics model (LDM). First, LDM compresses the visual changes from each frame to its subsequent H frames into compact latent codes. Then, an auto-regressive transformer seamlessly integrates the output of LDM with the next token prediction paradigm.
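One plausible way to integrate the two streams, assuming the H discrete LDM codes for a time step are simply placed next to that step's frame tokens in a shared vocabulary, is sketched below; the exact interleaving used by VideoWorld may differ.

# One plausible sequence layout: place the H discrete LDM codes for each time step
# next to that step's frame tokens, shifted into a separate index range so a single
# transformer vocabulary covers both. The actual interleaving order used by
# VideoWorld may differ; this only illustrates the idea of a shared token stream.
import torch

VQ_VOCAB = 8192            # assumed VQ-VAE codebook size
LDM_VOCAB = 512            # assumed LDM codebook size
TOKENS_PER_FRAME = 256
H = 9

def build_sequence(frame_tokens: torch.Tensor, ldm_codes: torch.Tensor) -> torch.Tensor:
    """frame_tokens: (T, TOKENS_PER_FRAME) VQ indices; ldm_codes: (T, H) LDM indices.
    Returns a single (T * (H + TOKENS_PER_FRAME),) stream for next-token prediction."""
    ldm_shifted = ldm_codes + VQ_VOCAB                         # keep the two codebooks disjoint
    per_step = torch.cat([ldm_shifted, frame_tokens], dim=1)   # (T, H + TOKENS_PER_FRAME)
    return per_step.reshape(-1)

frame_tokens = torch.randint(0, VQ_VOCAB, (8, TOKENS_PER_FRAME))
ldm_codes = torch.randint(0, LDM_VOCAB, (8, H))
seq = build_sequence(frame_tokens, ldm_codes)
print(seq.shape, int(seq.max()) < VQ_VOCAB + LDM_VOCAB)        # torch.Size([2120]) True

Under this layout, ordinary next-token prediction over the stream trains the model to emit a step's latent codes before its frame tokens.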


Understanding Learned Knowledge with LDM

The latent representation learned in LDM provides valuable insights into the knowledge learning process of VideoWorld. Below we offer an in-depth analysis of what our model learns through latent representations.

LDM learns patterns in the training set

As shown in the figure below, the latent codes on the training set capture both short- and long-term dependencies, demonstrating the model's ability to represent knowledge at different temporal scales. In the Go scenario, salient regions in the latent codes correspond to common move patterns, indicating that the model effectively embeds multi-step strategies into a compressed space, hence aiding decision-making and reasoning. Similarly, in the robotics scenario, the clustering of latent codes across steps reveals key dynamic dependencies over various time ranges, thus benefiting diverse manipulation tasks.

Figure 6: UMAP projection of the learned latent codes on the Go (left) and CALVIN (right) training sets. Each point represents the continuous (pre-quantization) latent code generated by the LDM. In the Go examples, odd steps represent white's moves and even steps represent black's moves. We visualize the latent codes of black moves at steps 2/4/6. The legend shows examples of common patterns learned for new black moves. For clarity, these moves are highlighted on the board with added colors and lines to indicate new patterns. On the right, we visualize the latent codes of the robotic arm's movement along the X/Y/Z axes at intervals of 1, 5, and 10 frames. Points are color-coded by displacement range, with purple and red indicating the maximum displacement in opposite directions along each axis.
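The projection itself is standard; a minimal sketch using the umap-learn package is given below, with random arrays standing in for the collected pre-quantization latent vectors and their step labels.

# Sketch of the probing recipe: collect the pre-quantization latent vectors the LDM
# produces on training clips and project them to 2-D with umap-learn, colouring points
# by a label such as the prediction step. Random arrays stand in for the real data.
import numpy as np
import umap                        # pip install umap-learn
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
latents = rng.normal(size=(3000, 32))    # stand-in for (N, latent_dim) LDM latents
step = rng.integers(1, 10, size=3000)    # stand-in label, e.g. prediction step 1..9

embedding = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.1).fit_transform(latents)
plt.scatter(embedding[:, 0], embedding[:, 1], c=step, s=3, cmap="viridis")
plt.colorbar(label="prediction step")
plt.title("UMAP of LDM latent codes (illustrative)")
plt.savefig("ldm_umap.png", dpi=150)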

LDM enables forward planning during testing

We examine the role of the latent codes during inference. The visualization in the figure below shows that codes from different steps group by output positions, suggesting that VideoWorld models long-range changes progressively, similar to human forward planning. The visualization also includes the model's imagination of the opponent's moves, achieving a high average action-value of 71.2% and an action accuracy of 74.3%. This indicates that, at each step, VideoWorld considers long-term changes in the game situation within the latent space, enabling it to make strategic moves with a long-term perspective.


Figure 7: Illustration of playing against KataGo and UMAP projection of the predicted latent codes. Our model plays as black. The generated latent codes are visualized through the LDM decoder, and new stones in the visualization are marked with colors to match the legend. The visualization serves as a probe, indicating that the model shows signs of forward planning.
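The probing step can be summarized as: take the H latent-code indices the transformer predicts for the current position, look them up in the LDM codebook, and run the LDM decoder to render the anticipated future boards. The stub modules below only illustrate this flow under assumed shapes; in practice the trained LDM codebook and decoder are used.

# Stub of the inference-time probe: take the H latent-code indices predicted by the
# transformer for the current position, look them up in the LDM codebook, and run the
# LDM decoder to render the anticipated future boards/frames. The modules below are
# untrained stand-ins; in practice the trained LDM codebook and decoder are used.
import torch
import torch.nn as nn

H, LATENT_DIM, CODEBOOK = 9, 32, 512

codebook = nn.Embedding(CODEBOOK, LATENT_DIM)      # stand-in for the trained LDM codebook
decoder = nn.Sequential(                           # stand-in for the trained LDM decoder
    nn.Conv2d(3 + LATENT_DIM, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)

def probe_planning(current_frame: torch.Tensor, predicted_codes: torch.Tensor) -> torch.Tensor:
    """current_frame: (1, 3, h, w); predicted_codes: (H,) indices sampled from the transformer.
    Returns H decoded frames showing the changes the model anticipates h steps ahead."""
    imagined = []
    for h in range(H):
        z = codebook(predicted_codes[h])[None]                         # (1, LATENT_DIM)
        cond = z[:, :, None, None].expand(-1, -1, *current_frame.shape[-2:])
        imagined.append(decoder(torch.cat([current_frame, cond], dim=1)))
    return torch.cat(imagined)                                         # (H, 3, h, w)

frames = probe_planning(torch.rand(1, 3, 64, 64), torch.randint(0, CODEBOOK, (H,)))
print(frames.shape)    # torch.Size([9, 3, 64, 64])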

LDM generates causally interrelated codes

Similar findings are observed in the robotic scenario. We visualize the predicted latent codes during inference across different tasks in the figure below. Here, \(H=9\), meaning the transformer generates 9 latent codes per time step, corresponding to 9 prediction steps. As shown, the latent codes for different prediction steps are grouped by task type, indicating that they capture task-relevant dynamics. Codes for steps 1–4 show greater overlap, likely because they focus on fine-grained displacements shared across tasks. In contrast, steps 5–9 show more distinct separation by task type, highlighting the model's ability to progressively capture long-range changes specific to each task.


Figure 8: Illustration of robotic manipulation and UMAP projection of the predicted latent code during inference. Latent codes are visualized through the LDM decoder. The UMAP projection illustrates the 9 predicted latent codes (i.e. \(H=9\)) across different tasks, with each point color-coded by task type. Visualizations with a yellow background show the model's actual robotic arm control during inference, while those with a green background represent the model's next-frame predictions during training.


BibTeX

@misc{ren2025videoworldexploringknowledgelearning,
  title={VideoWorld: Exploring Knowledge Learning from Unlabeled Videos}, 
  author={Zhongwei Ren and Yunchao Wei and Xun Guo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
  year={2025},
  eprint={2501.09781},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.09781}, 
}