Home

Agentic Thinking

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical. Now we are in the era of agentic thinking and reasoning: An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.

Agentic reasoning and thinking make the infrastructure harder: even on very difficult math or coding tasks, a genuinely advanced system should be able to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.

  1. Training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
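As a rough illustration of the decoupling argument above, here is a minimal producer-consumer sketch in which rollout workers (whose latency includes tool execution) feed a trajectory queue that a learner drains asynchronously, so slow environment feedback never blocks a gradient step directly. All names here are hypothetical, and `time.sleep` merely stands in for inference-plus-tool latency.

```python
import queue
import threading
import time

# Finished trajectories flow through a bounded buffer; rollout and
# learning only meet at this queue, never in lock-step.
trajectory_buffer = queue.Queue(maxsize=64)

def rollout_worker(worker_id, n_episodes):
    for ep in range(n_episodes):
        time.sleep(0.01)  # stands in for model inference + tool execution latency
        trajectory_buffer.put({"worker": worker_id, "episode": ep, "reward": 1.0})

def learner(n_updates, batch_size=4):
    updates = 0
    while updates < n_updates:
        # Block only until a full batch of completed trajectories is available.
        batch = [trajectory_buffer.get() for _ in range(batch_size)]
        # A real learner would compute advantages and apply a gradient step here.
        updates += 1
    return updates

workers = [threading.Thread(target=rollout_worker, args=(i, 8)) for i in range(4)]
for w in workers:
    w.start()
done = learner(n_updates=8)
for w in workers:
    w.join()
print(done)  # 8
```

In a real system the queue would be a distributed buffer and the workers separate inference servers, but the shape of the fix is the same: the learner's throughput is bounded by trajectory supply, not by any single environment's latency.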

  2. The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.

  3. Agent RL is prone to reward hacking: as soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization.
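One common mitigation can be sketched under simple assumptions (a logged search tool and a known ground-truth string): record every tool call during a rollout and zero the reward post hoc when the trajectory leaked the answer, so the shortcut is never reinforced. `make_logged_search` and `filter_reward` are hypothetical names, not any particular framework's API.

```python
# Hypothetical sketch: wrap the search tool so every query is logged, then
# filter rewards for rollouts whose queries contained the ground truth.

def make_logged_search(log):
    def search(query):
        log.append(query)
        return f"results for {query}"  # stand-in for a real search backend
    return search

def filter_reward(reward, tool_log, ground_truth):
    # Zero out the reward if any query leaked the answer string.
    leaked = any(ground_truth.lower() in q.lower() for q in tool_log)
    return 0.0 if leaked else reward

log = []
search = make_logged_search(log)
search("useful background material")
search("what is the answer 42 to this benchmark item")
print(filter_reward(1.0, log, "42"))  # 0.0
```

String matching is of course a weak detector; the point is only that the harness, not the model, must own the audit trail for tool use.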

  4. Thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.

Source: translated from JustinLin's X here and here

2026/4/3 07:18:27(UTC+0)

Video Post Training

Video post-training is a systematic framework rather than simply scaling reward models or applying GRPO/DPO algorithms. It is about:

  1. Pre-training capability: determines the optimization space (the upper bound for post-training). If the pretrained model can already produce videos with plausible motion, post-training simply refines it; otherwise post-training has to reconstruct that capability;

  2. Reward model: determines the optimization direction. RMs have their own preferences for specific attributes, and even with multiple RMs the combined reward feedback may reflect a local optimum. Moreover, multiple reward models bring more hyperparameters and a greater risk of hacking and training instability;

  3. Optimization strategy: determines the optimization trajectory, i.e., how to optimize in the right direction while keeping training stable;

  4. Evaluation: automatic vs. human evaluation. What is the gold standard? Human evaluation is laborious, and how do we distinguish a good video from a better one?

  5. Distributed training/implementation details: the same data and the same hyperparameters on different pre-trained models lead to totally different results (optimization efficiency, scalability, system stability).
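The multi-RM issue in item 2 can be made concrete with a small sketch, assuming a batch of videos scored by several hypothetical reward models: z-normalizing each RM's scores before the weighted sum keeps any single RM's scale from dominating the combined optimization direction. The RM names and weights below are invented for illustration.

```python
# Hypothetical sketch: combine several reward models by z-normalizing each
# RM's scores over the batch, then taking a weighted sum.

def znorm(scores):
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    std = std if std > 0 else 1.0  # guard against a constant batch
    return [(s - mean) / std for s in scores]

def combine_rewards(rm_scores, weights):
    # rm_scores: {rm_name: [score per video in the batch]}
    normed = {name: znorm(scores) for name, scores in rm_scores.items()}
    batch_size = len(next(iter(rm_scores.values())))
    return [sum(weights[n] * normed[n][i] for n in normed) for i in range(batch_size)]

# Note the raw scales differ wildly (0-1 vs. roughly 0-10) before normalization.
scores = {"motion": [0.2, 0.8, 0.5], "aesthetics": [7.0, 6.5, 9.0]}
combined = combine_rewards(scores, {"motion": 0.6, "aesthetics": 0.4})
```

The weights remain extra hyperparameters, and normalization does not by itself prevent the policy from hacking any single RM, which is exactly the instability risk noted above.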

Source: Here

2026/3/3 12:10:15(UTC+0)

Next world states (large world modeling) based on actions

Nature gives us an existence proof of a highly dexterous physical intelligence with minimal language capability.

The ape.

I’ve seen apes drive golf carts and change brake pads with screwdrivers like human mechanics. Their language understanding is no more than BERT or GPT-1, yet their physical skills are far beyond anything our SOTA robots can do. Apes may not have good LMs, but they surely have a robust mental picture of "what if"s: how the physical world works and reacts to their intervention.

The era of world modeling is here. It is bitter lesson-pilled. As Jitendra likes to remind us, the scaling addicts, “Supervision is the opium of the AI researcher.” The whole of YouTube and the rise of smart glasses will capture raw visual streams of our world at a scale far beyond all the texts we ever train on.

We shall see a new type of pretraining: next world states could include more than RGBs - 3D spatial motions, proprioception, and tactile sensing are just getting started.

We shall see a new type of reasoning: chain of thought in visual space rather than language space. You can solve a physical puzzle by simulating geometry and contact, imagining how pieces move and collide, without ever translating into strings. Language is a bottleneck, a scaffold, not a foundation.

We shall face a new Pandora’s box of open questions: even with perfect future simulation, how should motor actions be decoded? Is pixel reconstruction really the best objective, or shall we go into alternative latent spaces? How much robot data do we need, and is scaling teleoperation still the answer? And after all these exercises, are we finally inching towards the GPT-3 moment for robotics?

Ilya is right after all. AGI has not converged. We are back to the age of research, and nothing is more thrilling than challenging first principles.

Source: X from Jim Fan

2026/2/11 03:10:00(UTC+0)

Open problems well worth thinking about for unified (multimodal) models:

  1. For the model, what unified definition and format of actions should we adopt? Should all the different actions be defined as text, as structured JSON, or must they rely on multimodal action formats (motion control, for example, may require finer-grained multimodal forms)?
  2. Can we find a unified image and video representation that supports both understanding and generation tasks? Should that representation be discrete or continuous?
  3. What kind of representation maximizes positive transfer across modalities while satisfying the sampling-speed constraints of real-world deployment?
  4. For images and videos, should we use autoregression, diffusion, or something else? In some recent work, pixel-space diffusion has shown more potential than we initially expected. Should we return to pixel diffusion, or at least re-examine it at large scale?

One of the biggest challenges for world models and video generation:

  5. Long-term consistency: one part is "long-term", the other is "consistency". Does what is static stay static, and can the dynamics capture interaction with the static world? What if the scene contains dynamic parts, such as a clock, the time of day, a sunset? Making the model understand the concept of time means scaling context length and model size, but what about efficiency, training data, and memory?
  6. Real-time interaction: interacting with the real world while maintaining long-term consistency, including reflection, decision-making, and execution.
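The static-versus-dynamic consistency question in item 5 can be illustrated with a toy sketch, assuming the world state is just a plain dict: static objects should be copied through rollout steps unchanged, while dynamic elements (a clock here) advance with simulated time. Everything below is a hypothetical illustration, not any real world-model interface.

```python
# Toy sketch: a rollout loop carrying a persistent scene state, so the
# static part stays fixed while the dynamic part evolves consistently.

def step(state, dt):
    new = dict(state)
    new["clock"] = (state["clock"] + dt) % 24.0  # dynamic element advances, wraps at midnight
    # "building" is static and is copied through unchanged.
    return new

state = {"building": "present", "clock": 23.0}
for _ in range(3):
    state = step(state, dt=1.0)
print(state["clock"])     # 2.0
print(state["building"])  # present
```

A learned world model has no such explicit state dict, which is precisely why keeping static content static over long horizons (and handling memory, context length, and efficiency) is an open problem rather than a bookkeeping exercise.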

Ref: source

2026/1/24 09:21:13(UTC+0)