The latent representations learned by the LDM provide valuable insight into the knowledge-learning process of VideoWorld. Below, we offer an in-depth analysis of what our model learns through these representations.
LDM learns patterns in the training set
As shown in Figure 6, the latent codes on the training set capture both short- and long-term dependencies, demonstrating the model's ability to represent knowledge at different temporal scales. In the Go scenario, salient regions in the latent codes correspond to common move patterns, indicating that the model effectively embeds multi-step strategies into a compressed space, which aids decision-making and reasoning. Similarly, in the robotics scenario, the clustering of latent codes across steps reveals key dynamic dependencies over various time ranges, benefiting diverse manipulation tasks.
Figure 6: UMAP projection of the learned latent codes on the Go (left) and CALVIN (right) training sets. Each point represents the continuous (pre-quantization) latent code generated by the LDM. In the Go examples, odd steps represent white's moves and even steps represent black's moves. We visualize the latent codes of black's moves at steps 2/4/6. The legend shows examples of common patterns learned for new black moves; for clarity, these moves are highlighted on the board with added colors and lines to indicate new patterns. On the right, we visualize the latent codes of the robotic arm's movement along the X/Y/Z axes at intervals of 1, 5, and 10 frames. Points are color-coded by displacement range, with purple and red indicating the maximum displacement in opposite directions along each axis.
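As a rough sketch of how such a projection can be produced, the snippet below embeds pre-quantization LDM latents with UMAP; the `encode_continuous` method and the frame-pair interface are hypothetical stand-ins for the actual VideoWorld code.

```python
# Minimal sketch of the Figure 6 analysis. The `ldm.encode_continuous`
# interface is an assumption for illustration, not the released API.
import numpy as np
import umap  # pip install umap-learn

def project_latents(ldm, frame_pairs):
    """Embed continuous (pre-quantization) LDM latents into 2-D with UMAP."""
    # One latent vector per (current frame, future frame) pair.
    latents = np.stack([
        ldm.encode_continuous(cur, fut) for cur, fut in frame_pairs
    ])  # shape: (num_pairs, latent_dim)
    # Standard UMAP settings; the resulting points are then colored by
    # move step (Go) or by displacement range along each axis (CALVIN).
    return umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(latents)
```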
LDM enables forward planning during testing
We examine the role of the latent codes during inference. The visualization in Figure 7 shows that codes from different steps group by output position, suggesting that VideoWorld models long-range changes progressively, similar to human forward planning. The visualization also includes the model's imagination of the opponent's moves; these imagined moves achieve a high average action-value of 71.2\% and an action accuracy of 74.3\%. This indicates that, at each step, VideoWorld considers long-term changes in the game situation within the latent space, enabling it to make strategic moves with a long-term perspective.
Figure 7: Illustration of playing against KataGO and UMAP projection of the predicted latent codes. Our model plays as black. The generated latent codes are visualized through the LDM decoder, and new stones in the visualization are marked with colors that match the legend. The visualization serves as a probe, indicating that the model shows signs of forward planning.
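The probe itself can be sketched as follows. `generate_latents` and `decode` are hypothetical names for the transformer's autoregressive latent prediction and the LDM decoder; this illustrates the procedure rather than the actual implementation.

```python
# Hedged sketch of the inference-time probe behind Figure 7. The
# `generate_latents` and `decode` interfaces are assumed for illustration.
def probe_forward_planning(transformer, ldm_decoder, history, H=9):
    """Decode the H latent codes predicted at the current step so the
    implied future board states can be inspected."""
    # Autoregressively predict H latent codes conditioned on the
    # observed frame history.
    latents = transformer.generate_latents(history, num_codes=H)
    # Decode each code against the latest frame; decoded boards that
    # group by output position are the forward-planning signal.
    return [ldm_decoder.decode(history[-1], z) for z in latents]
```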
LDM generates causally interrelated codes
Similar findings are observed in the robotic scenario. We visualize the predicted latent codes during inference across different tasks in Figure 8. Here, \(H=9\), meaning the transformer generates 9 latent codes per time step, corresponding to 9 prediction steps. As shown, the latent codes for different prediction steps are grouped by task type, indicating that they capture task-relevant dynamics. Codes for steps 1–4 show greater overlap, likely because they encode fine-grained displacements shared across tasks. In contrast, steps 5–9 separate more distinctly by task type, highlighting the model's ability to progressively capture long-range changes specific to each task.
Figure 8: Illustration of robotic manipulation and UMAP projection of the predicted latent codes during inference. Latent codes are visualized through the LDM decoder. The UMAP projection illustrates the 9 predicted latent codes (i.e., \(H=9\)) across different tasks, with each point color-coded by task type. Visualizations with a yellow background show the model's actual robotic arm control during inference, while those with a green background represent the model's next-frame predictions during training.
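One simple way to quantify this step-wise separation, not used in the paper but consistent with the visualization, is a per-step silhouette score with task type as the cluster label; the array layout below is an assumption.

```python
# Rough quantification of Figure 8: score how well the latent codes at
# each prediction step cluster by task. The (num_rollouts, H, latent_dim)
# layout is assumed for illustration.
import numpy as np
from sklearn.metrics import silhouette_score

def per_step_separation(latents: np.ndarray, task_labels: np.ndarray):
    """Return one silhouette score per prediction step (higher = more
    distinct task clusters); Figure 8 suggests steps 5-9 score higher."""
    num_rollouts, H, _ = latents.shape
    return [silhouette_score(latents[:, h, :], task_labels) for h in range(H)]
```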