Next world states (large world modeling) based on actions
Nature gives us an existential proof of a highly dexterous physical intelligence with minimal language capability.
The ape.
I’ve seen apes drive golf carts and change brake pads with screwdrivers like human mechanics. Their language understanding is no more than BERT or GPT-1, yet their physical skills are far beyond anything our SOTA robots can do. Apes may not have good LMs, but they surely have a robust mental picture of "what if"s: how the physical world works and reacts to their intervention.
The era of world modeling is here. It is bitter lesson-pilled. As Jitendra likes to remind us, the scaling addicts, “Supervision is the opium of the AI researcher.” The whole of YouTube and the rise of smart glasses will capture raw visual streams of our world at a scale far beyond all the texts we ever train on.
We shall see a new type of pretraining: next world states could include more than RGBs - 3D spatial motions, proprioception, and tactile sensing are just getting started.
We shall see a new type of reasoning: chain of thought in visual space rather than language space. You can solve a physical puzzle by simulating geometry and contact, imagining how pieces move and collide, without ever translating into strings. Language is a bottleneck, a scaffold, not a foundation.
We shall face a new Pandora’s box of open questions: even with perfect future simulation, how should motor actions be decoded? Is pixel reconstruction really the best objective, or shall we go into alternative latent spaces? How much robot data do we need, and is scaling teleoperation still the answer? And after all these exercises, are we finally inching towards the GPT-3 moment for robotics?
Ilya is right after all. AGI has not converged. We are back to the age of research, and nothing is more thrilling than challenging first principles.
Source: X from Jim Fan
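Open problems well worth thinking about for unified multimodal models:
- For the model, what unified definition and format of actions should we adopt? Should all the different actions be defined as text, should they be defined in structured JSON, or must we rely on multimodal action formats (motion control, for example, may need a finer-grained multimodal form)?
- Can we find a unified representation for images and video that supports both understanding and generation tasks? Should this representation be discrete or continuous?
- What kind of representation maximizes positive transfer across modalities while still meeting the sampling-speed constraints of real-world deployment?
- For images and video, should we use autoregressive models, diffusion models, or something else? In some recent work, pixel-space diffusion has shown more potential than we initially expected. Should we return to pixel diffusion, or at least revisit pixel diffusion at scale?
Among the biggest challenges for world models and video generation: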
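- long-term consistency: there are two parts here, "long-term" and "consistency". Do static elements stay static, and can dynamic elements capture their interaction with the static world? What if the scene contains dynamic parts, such as a clock, the passage of time, or a sunset? The model has to understand the concept of time (larger context length and model scale; how about efficiency? training data? memory?)
- real-time interaction (while preserving long-term consistency: interacting with the real world, reflecting, deciding, executing)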
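To make the first open problem above (text vs. structured JSON vs. a finer-grained action format) more concrete, here is a minimal sketch of one action rendered three ways. Everything in it is an assumption made for illustration: the `Action` record, its field names, the units, and the helper functions are hypothetical and do not come from any particular model or robot API.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Action:
    """Hypothetical unified action record; field names are illustrative only."""
    kind: str            # symbolic action type, e.g. "move_gripper"
    target: List[float]  # x, y, z position (metres, an assumed convention)
    grip: float          # 0.0 = fully open, 1.0 = fully closed


def to_text(a: Action) -> str:
    # Free-text form: easy for a language model to emit,
    # but the coordinates are rounded to two decimals.
    x, y, z = a.target
    return f"{a.kind} to ({x:.2f}, {y:.2f}, {z:.2f}) with grip {a.grip:.1f}"


def to_json(a: Action) -> str:
    # Structured form: machine-parseable and lossless for these fields.
    return json.dumps(asdict(a), sort_keys=True)


def from_json(s: str) -> Action:
    # Inverse of to_json, so the structured form round-trips.
    return Action(**json.loads(s))


def to_vector(a: Action) -> List[float]:
    # Continuous form: keeps full numeric precision for motor control,
    # but drops the symbolic `kind`.
    return [*a.target, a.grip]


action = Action(kind="move_gripper", target=[0.125, -0.3, 0.42], grip=1.0)
print(to_text(action))
print(to_json(action))
print(to_vector(action))
```

In this toy setup the trade-off is visible directly: the text form is the most readable but lossy, the JSON form round-trips exactly through `from_json`, and the vector form is what a low-level controller would most naturally consume, at the cost of the symbolic label. Whether any of these is the right choice for a real unified model is exactly the open question.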
Ref: source