Current AI models primarily learn knowledge from large-scale text data. However, text alone cannot fully capture the rich information of the real visual world, including world dynamics, spatial relationships, and underlying physical laws. In contrast, animals in nature can acquire knowledge directly from visual signals and generalize it to solve tasks across diverse scenarios. For instance, a child can reproduce paper-folding skills demonstrated in a video using different paper materials, without any language instruction. Given the vast abundance of video content available on the internet, enabling AI models to learn generalizable knowledge from raw video data holds significant promise for scaling their knowledge acquisition and is fundamental to their ability to execute tasks effectively in both real-world and digital environments.
VideoWorld is among the first works to explore learning knowledge from synthetic videos. It investigates the acquisition of rules, as well as reasoning and planning capabilities, from Go game records and simulated robotics environments, and demonstrates that models can learn such knowledge solely from visual signals using an autoregressive video generation paradigm. However, extending this paradigm beyond synthetic domains remains an open challenge. Real-world videos exhibit substantial visual diversity, complex action dynamics, and often involve long-horizon, multi-step interactions. These characteristics prevent the training approach and model design of VideoWorld from being directly applied to realistic settings. When presented with minute-long, multi-step real-world task videos, VideoWorld fails to extract the core task-solving knowledge or generalize it to novel scenarios through observation alone, even for tasks such as paper folding that children master easily. These limitations naturally lead to the following question:
Can AI models learn transferable knowledge for complex, long-horizon tasks directly from unlabeled real-world videos?