Zero-shot Emergent Robotic Behaviors using World Models

How Test-Time Reasoning for Robotic Intelligence Unlocks Behaviors That Were Never Trained For

The Voaige Team

Emergent Behavior (Unseen in Training)


From Neuroscience Principles to Robotic Intelligence

At Voaige, our mission is to understand the computational principles that make intelligence possible, and then build systems that exhibit intelligent behavior grounded in those principles. In our previous post, we showed what this looks like in practice: a single observation about the brain's locally random wiring led to architectures that achieved 10x efficiency gains in deep learning. That work focused on perception, specifically how a network can learn to see robustly with far less compute.

This post is about what comes next. An agent that can recognize objects but cannot reason about how to act on them, adapt when things go wrong, or compose familiar skills into novel strategies is just pattern matching, and perception alone does not solve that. The next chapter for us moves from perception into action, and the central question becomes: how do you build robotic systems that can handle the open-ended complexity of the physical world without needing to be shown every possible situation in advance?

As we will show, the answer points directly to a deeper principle: intelligence emerges at test time. This work on robotic intelligence is, at its core, a proof of concept for test-time reasoning, and the same principle is now at the heart of what we are building for LLMs.

The Bottleneck in Robotic Manipulation

Robotic manipulation spans a combinatorially large space of embodiments, tools, workspaces, and constraints. The dominant approach today, collecting teleoperation demonstrations and training a policy to imitate them, works well when the training distribution densely covers the situations the robot will encounter, but degrades sharply at the edges, precisely where robotics gets interesting.

The problem is compositional. A dual-arm robot might need to push an object clear of an obstacle with one tool, reposition it into the reachable zone of another arm, and then grasp it, a sequence that was never demonstrated end-to-end. A teleoperation pipeline would need to have captured not just each individual skill, but the specific multi-skill transitions that resolve the long-tail failure modes and mistakes of each skill's control policy, across many environment configurations. The number of required demonstrations scales roughly with the product of the number of skills, their possible handoffs, and their error-recovery strategies, making long-tail coverage prohibitively expensive.
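To make the multiplicative structure of that claim concrete, here is a back-of-the-envelope count. Every number below is invented purely for illustration; only the shape of the arithmetic is the point.

```python
# Toy coverage-gap arithmetic. All counts are hypothetical, chosen only to
# show that the demonstration budget grows multiplicatively, not additively.

skills = 6                         # individual tool/skill policies
handoffs = skills * (skills - 1)   # ordered skill-to-skill transitions
recovery_modes = 4                 # stuck states, slips, misalignments, ...
env_configs = 50                   # obstacle layouts, object placements, ...
demos_per_case = 20                # teleoperated demos per distinct case

demos_needed = handoffs * recovery_modes * env_configs * demos_per_case
print(demos_needed)  # 120000 — and each added skill grows this superlinearly
```

Adding a seventh skill under these toy numbers raises the handoff count from 30 to 42, a 40% jump in required demonstrations from a single new skill.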

This is not just an engineering inconvenience but a fundamental limitation: imitation learning encodes solutions rather than understanding. The robot learns what to do in situations it has seen, but has no mechanism to reason about what would happen under alternative actions in situations it has not seen. This is the same limitation that naive next-token prediction imposes on LLMs, and the same reason that test-time reasoning is the path forward in both domains.

What Biology Does Differently

Biological agents do not solve this problem by accumulating demonstrations. A child learning to use tools does not need to observe every possible obstacle configuration paired with the correct recovery strategy. Instead, they build internal models of how the world works, how objects move when pushed, what happens when something is blocked, which actions are available given the current state, and use those models to plan, compose, and recover on the fly.

Brains do not retrain for each new problem. They dynamically allocate compute, adjust, redirect, and coordinate at test time, which is the same principle we believe is the missing piece for AI systems broadly. This is the principle we wanted to translate: not the neural implementation details, but the computational strategy of learning how the world responds to your actions, then reasoning through that model at test time to solve new problems. Rather than expanding the demonstration dataset to cover every composite trajectory, corner case, and error-correction behavior, we learn reusable skill-level models that can be recombined and reasoned through to handle novel situations, satisfy new constraints, and resolve unexpected failures.


Test-Time Reasoning for Robotic Intelligence: The Architecture

We adopted a world model-first methodology evaluated in a deliberately constrained, small-scale robotic setting designed to stress-test compositional generalization. The core idea is that rather than training a robot on what to do, we train it on how the world responds to its actions and then let it reason at test time, which is test-time cognition applied to physical intelligence.

Environment and task design

  • Embodiment: a dual-arm system with heterogeneous end-effectors: one arm equipped with a stick-like tool (for non-prehensile manipulation) and one with a gripper (for prehensile lifting).
  • Workspace constraints: tabletop scene with immovable obstacles and non-overlapping / partial reachability, such that each arm can act only on a subset of the workspace.
  • Goal: lift a cube into free space (air), under conditions where direct grasping may be infeasible from the initial configuration, either because obstacles block the gripper or because the cube lies outside the gripper arm's reach.
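As a concrete (and entirely hypothetical) sketch, the workspace constraints above can be written down as a small configuration object. Names like `DualArmEnvConfig`, and the specific reach centers and radii, are ours for illustration, not the actual system's.

```python
from dataclasses import dataclass

@dataclass
class ArmConfig:
    tool: str            # "stick" (non-prehensile) or "gripper" (prehensile)
    reach_center: tuple  # (x, y) center of the arm's reachable zone
    reach_radius: float  # the arm can act only within this radius

@dataclass
class DualArmEnvConfig:
    arms: list           # heterogeneous end-effectors
    obstacles: list      # immovable obstacle footprints, e.g. (x, y, radius)
    goal: str = "lift the cube into free space"

    def reachable_by(self, arm, pos):
        """Partial reachability: each arm covers only part of the workspace."""
        dx, dy = pos[0] - arm.reach_center[0], pos[1] - arm.reach_center[1]
        return (dx * dx + dy * dy) ** 0.5 <= arm.reach_radius

stick = ArmConfig("stick", reach_center=(-0.2, 0.0), reach_radius=0.5)
gripper = ArmConfig("gripper", reach_center=(0.4, 0.0), reach_radius=0.35)
env = DualArmEnvConfig(arms=[stick, gripper], obstacles=[(0.0, 0.1, 0.05)])

# A cube at (-0.3, 0.0) lies in the stick's zone but outside the gripper's,
# so direct grasping is infeasible from this initial configuration.
print(env.reachable_by(stick, (-0.3, 0.0)), env.reachable_by(gripper, (-0.3, 0.0)))
```

The non-overlapping reach zones are exactly what forces composition: no single arm can both free the cube and lift it.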

System architecture

  • Skill policies: one policy per skill/tool, optimized for closed-loop execution of that skill.
  • Skill-conditioned world models: one predictive dynamics model per skill, trained to forecast the environment's evolution under that skill's action distribution.
  • Isolated training: policies and world models were trained independently per skill, purely in simulation, to capture how each tool interacts with the environment (contacts, pushes, slips, obstacle interactions), without exposing the system to multi-skill composite demonstrations.
  • Transferable state representation: world model inputs and outputs were designed to avoid relying on artificial, simulation-specific state, which allows the system to transfer more seamlessly to unseen objects and obstacles, and beyond simulation into the real world.
  • Planning stack: a higher-level planner composes skills by rolling out candidate sequences through the corresponding skill world models, selecting skill activations that advance the system toward the goal under the current constraints.
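A minimal sketch of how such a planning stack can operate, in a 1-D abstraction of the cube's position. The hand-written `stick_model` and `gripper_model` below are toy stand-ins for the learned skill world models, and every name and number is illustrative, not the actual implementation.

```python
import itertools

# Skill-conditioned world models: each predicts how the state evolves while
# that skill is active. State = (cube_x, lifted). Both models are toys.
def stick_model(state):
    cube_x, lifted = state
    # The stick pushes the cube toward the gripper's reachable zone.
    return (min(cube_x + 0.3, 1.0), lifted)

def gripper_model(state):
    cube_x, lifted = state
    # The gripper can lift only once the cube is within reach (x >= 0.5).
    return (cube_x, cube_x >= 0.5)

world_models = {"stick": stick_model, "gripper": gripper_model}

def goal_reached(state):
    return state[1]  # cube lifted into free space

def plan(state, max_horizon=3):
    """Roll every candidate skill sequence through the skill world models,
    shortest first, and return the first whose predicted rollout reaches
    the goal."""
    for depth in range(1, max_horizon + 1):
        for seq in itertools.product(world_models, repeat=depth):
            s = state
            for skill in seq:
                s = world_models[skill](s)
            if goal_reached(s):
                return seq
    return None

# The cube starts out of the gripper's reach: a composite stick -> gripper
# sequence is derived at planning time, never stored as a demonstration.
print(plan((0.0, False)))  # → ('stick', 'stick', 'gripper')
```

Even in this toy, the composite plan falls out of rollout-and-score over per-skill models; no stick-then-gripper trajectory exists anywhere as data.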

The intended outcome is a system where generalization is achieved by test-time reasoning and composition over learned predictive models, rather than by expanding the demonstration dataset to cover all composite trajectories, corner cases, and error-correction behaviors. The planner is not recalling a memorized solution; it is thinking its way to one.


Why this is hard (and why test-time reasoning is the answer)

We explicitly targeted solutions that are non-obvious: success often requires intermediate interactions (e.g., freeing an object from obstacles, moving it into a reachable region, then grasping) that are not trivially implied by the goal specification.

Key challenges:

  • Combinatorial coverage gap for teleoperation. A demonstration-based system would need to observe not just obstacle-avoidant behaviors, but also the specific multi-skill transitions that resolve each failure mode (stuck states, contact misalignments, slips) across many environment configurations. The number of required demonstrations scales roughly with the product of the skills, their possible handoffs, and their failure modes, making long-tail coverage prohibitive.
  • Constraint-driven compositionality. Because each arm has limited reach and different affordances, the optimal strategy is often a structured composition: use the stick for non-prehensile repositioning and clearance, then hand off to the gripper for lifting. These compositions must be inferred from state and constraints at test time, not learned from a fixed library of end-to-end demonstrations that explicitly includes those compositions.
  • Test-time reasoning over counterfactual futures. The planner must evaluate what would happen under alternative skill sequences, such as "attempt grasp now" vs "push to clear obstacles" vs "push into reachable region", and select a plan whose intermediate predicted states make the final goal feasible. This is active inference, not recall.
  • Self-correction via replanning under execution error. Realistic manipulation induces stochasticity (contact uncertainty, slip, inadvertent pushes). A successful system must detect when execution diverges from the plan and reassign control to the appropriate skill (for example, return to the stick when the object is displaced), repeatedly if necessary, until the goal becomes reliably achievable. This error-correcting behavior becomes prohibitively expensive to cover using demonstration data, but emerges naturally from test-time reasoning.

Overall, our emphasis was not merely on the complexity of the task, but on demonstrating that skill-specific world models plus planning enable one-shot, constraint-aware composition and self-correction at test time, without needing the system to learn the full behavior from end-to-end demonstrations.


What Emerges When a Robot Reasons at Test Time

The central result of this work is not a task solved but a class of behaviors that emerge. When a system is equipped with predictive models and the ability to reason through them at test time, it exhibits capabilities that were never demonstrated, never explicitly trained, and never anticipated by the training pipeline. Rather than retrieving a solution, the model constructs one, and that distinction is the hallmark of genuine test-time intelligence.

We consistently find that our system smoothly stitches skills together to achieve goals that no single skill can reach directly, or that require intelligent, constraint-aware subgoal discovery. It recovers from errors dynamically by handing responsibility between skills, sometimes multiple times within an episode, and it remains robust across many different environments, a wide range of obstacle complexities including moving obstacles, and different robot embodiments.

Navigating diverse and moving obstacles
Extracting the cube from cluttered spaces using the stick prior to picking
Navigating diverse and moving obstacles; a second randomization
Maze-like navigation using the stick

(1) Unseen Composite Trajectories

To successfully perform this task, our system was able to construct composite skill trajectories dynamically at test time: first use the stick to move the cube away from obstacles, then lift the cube using the gripper. No version of this full sequence existed in training; it was reasoned into existence at inference time.

Figure 1a. World model-grounded planning composes unseen skill-switching strategies under constraints.
Figure 1b. Composite skill switching on a second embodiment.

It is important to remember that the stick arm and gripper arm (each with their own skill policies and skill world models) were trained fully separately: at no point in training did the system see this full, compositional trajectory. Based on the state of the environment, the system successfully inferred that directly lifting the cube is not an option, and planned a composite trajectory using the complementary strengths of the two arms. The planner did not find this trajectory in memory; it derived it through active reasoning over its world models.

This approach contrasts sharply with teleoperation-based systems, which would have to cover such composite trajectories within the training data. As the set of individual skills grows, it becomes ever harder to augment the training set with useful composite trajectories, an issue to which test-time reasoning systems are far less susceptible.

(2) Subgoal Discovery

The system effectively inferred subgoals that had to be achieved before the final objective (lifting the cube) became feasible. Our planner infers which skills should activate at each step, and the corresponding world model predicts the resulting state. By visualizing those predictions, we can see the inferred subgoals explicitly, providing a direct window into the system's test-time reasoning process.
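The same reading-off of subgoals can be sketched in miniature: roll a planned skill sequence through (toy, hand-written) skill world models and collect the intermediate predicted states. Everything here, models and names alike, is illustrative rather than the actual system.

```python
# Toy 1-D abstraction: state = (cube_x, lifted). Hand-written stand-ins for
# the learned skill world models, for illustration only.
def stick_model(state):
    cube_x, lifted = state
    return (round(cube_x + 0.3, 2), lifted)   # push toward the gripper's zone

def gripper_model(state):
    cube_x, lifted = state
    return (cube_x, cube_x >= 0.5)            # lift only when within reach

world_models = {"stick": stick_model, "gripper": gripper_model}

def predicted_subgoals(state, skill_sequence):
    """Roll a planned skill sequence through the skill world models and
    return the intermediate predicted states: these are the subgoals the
    planner has implicitly committed to."""
    subgoals = []
    for skill in skill_sequence:
        state = world_models[skill](state)
        subgoals.append((skill, state))
    return subgoals

# The plan "push, push, grasp" implies two intermediate subgoals (cube
# cleared into the gripper's reachable region) before the final lift.
for skill, predicted in predicted_subgoals((0.0, False), ["stick", "stick", "gripper"]):
    print(skill, predicted)
```

In the real system the predicted states are full scene predictions rather than a scalar position, but the visualization principle is the same: the rollout itself exposes the subgoal structure.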

Figure 2a. World model predictions reveal intermediate subgoals inferred from state and constraints.
Figure 2b. Subgoal discovery in embodiment 2.

(3) Self-Correction and Replanning

Finally, we observed that the system dynamically hands responsibility back and forth between the two tools/skills based on the current state of the environment. This is not a pre-programmed fallback but the planner continuously re-evaluating its world model predictions against observed reality and reassigning control when the two diverge.

The clearest example is when the system first attempts to pick the cube immediately, then realizes it cannot (because of an obstacle), then switches to the stick tool until the cube is free and within range of the gripper. In some cases we saw multiple handoffs when the gripper slipped and pushed the cube away, handing the task back to the stick arm until the system was confident the gripper could complete the lift. This adaptive replanning loop is precisely the kind of behavior that test-time cognition makes possible, and it is one that no amount of additional training data could fully anticipate.
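A toy sketch of this replanning loop, with a hand-coded slip probability standing in for contact stochasticity. All models, thresholds, and names are invented for illustration; the key point is that control is reassigned from the observed state every step, so divergence from a prediction is absorbed on the next iteration rather than pre-programmed as a fallback.

```python
import random

# Toy 1-D abstraction: state = (cube_x, lifted). Hand-written stand-ins.
def stick_model(s):
    return (round(s[0] + 0.3, 2), s[1])  # predicts a push toward the gripper

def gripper_model(s):
    return (s[0], s[0] >= 0.5)           # predicts a lift iff within reach

world_models = {"stick": stick_model, "gripper": gripper_model}

def choose_skill(state):
    """Replan from the current observed state: grasp if the world model
    predicts it is feasible, otherwise fall back to the stick."""
    return "gripper" if gripper_model(state)[1] else "stick"

def execute(skill, state, rng):
    """Stochastic execution: the gripper occasionally slips and knocks
    the cube back out of reach (the invented failure mode)."""
    if skill == "gripper" and rng.random() < 0.3:
        return (round(state[0] - 0.4, 2), False)  # slip: cube pushed away
    return world_models[skill](state)

def run_episode(state=(0.0, False), seed=1, max_steps=20):
    rng = random.Random(seed)
    handoffs = []
    for _ in range(max_steps):
        skill = choose_skill(state)        # replan from the observed state
        state = execute(skill, state, rng)
        if not handoffs or handoffs[-1] != skill:
            handoffs.append(skill)
        if state[1]:                       # cube lifted: goal reached
            break
    return handoffs, state

# With this seed the first grasp slips, and control hands back to the stick
# before a second, successful grasp: push -> grasp -> push -> grasp.
print(run_episode())  # → (['stick', 'gripper', 'stick', 'gripper'], (0.5, True))
```

Nothing in `choose_skill` encodes a recovery script; the back-and-forth handoff emerges from re-evaluating feasibility against the observed state at every step.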

Figure 3a. Dynamic back-and-forth handoffs during repeated recovery.
Figure 3b. Initial failed grasp followed by tool-switch correction.

A Shift From Imitation To Understanding The World, and What Comes Next

This work encapsulates Voaige's approach to AI: rather than trying to solve robotics with ever-larger demonstration datasets, we show that principles from neuroscience can lead to systems that reason through novel situations they were never explicitly trained for.

The core insight is simple: teach robots how the world works, not just what to do. By training skill-conditioned world models independently and composing them at test time through planning, we built a system that exhibits three critical capabilities that emerge naturally from this architecture:

  • Zero-shot compositional behavior: multi-skill trajectories that were never seen in training, stitching together the complementary strengths of different tools to handle complex constraint-satisfaction problems.
  • Intelligent subgoal discovery: inferred intermediate states that make the final goal achievable.
  • Adaptive self-correction: dynamic reassignment between skills when execution drifts under contact uncertainty or environmental stochasticity, sometimes multiple times within an episode.

These behaviors emerged from the interaction between learned predictive models and test-time planning, without the full behavior being demonstrated end-to-end. This is what separates understanding from imitation: a robot that has learned how its actions affect the world can generalize to situations it has never encountered, compose skills it has never chained together, and recover from failures it has never experienced.

Critically, this principle is not specific to robotics. What we demonstrated here is that test-time reasoning over learned models unlocks emergent, adaptive intelligence, and that is the same principle we are now bringing to LLMs. Just as our robotic system discovered novel composite behaviors by reasoning at inference time rather than recalling from training, we believe the next breakthrough in language model intelligence lies in building the same kind of efficient, neuroscience-grounded cognition layer: one that reasons, composes, and adapts at test time rather than merely retrieving.

This continues our broader mission at Voaige: identify the computational principles that make intelligence possible, and build systems grounded in those principles. We showed in our previous work how neuroscience-inspired architectures could achieve 10x efficiency gains in perception. Here, we have demonstrated that the same neuroscience-first philosophy enables robots to exhibit sophisticated, adaptive behavior in action and manipulation. The next AI breakthrough will be bringing that same efficient cognition to language.