Zero-shot Emergent Robotic Behaviors using World Models

Principles from Neuroscience Give Robots Behaviors They Were Never Trained For

The Voaige Team

Video: Emergent Behavior (Unseen in Training)


From Neuroscience Principles to Robotic Intelligence

At Voaige, our mission is to understand the computational principles that make intelligence possible, and then build systems that exhibit intelligent behavior grounded in those principles. In our previous post, we showed what this looks like in practice: a single observation about the brain's locally random wiring led to architectures that achieved 10x efficiency gains in deep learning. That work focused on perception, specifically how a network can learn to see robustly with far less compute.

This post is about what comes next. Perception alone is not intelligence. An agent that can recognize objects but cannot reason about how to act on them, adapt when things go wrong, or compose familiar skills into novel strategies is just pattern matching. The next chapter for us moves from perception into action, and the central question becomes: how do you build robotic systems that can handle the open-ended complexity of the physical world without needing to be shown every possible situation in advance?

The Bottleneck in Robotic Manipulation

Robotic manipulation spans a combinatorially large space of embodiments, tools, workspaces, and constraints. The dominant approach today, collecting teleoperation demonstrations and training a policy to imitate them, works well when the training distribution densely covers the situations the robot will encounter, but degrades sharply at the edges, precisely where robotics gets interesting.

The problem is compositional. A dual-arm robot might need to push an object clear of an obstacle with one tool, reposition it into the reachable zone of another arm, and then grasp it, a sequence that was never demonstrated end-to-end. A teleoperation pipeline would need to capture not just each individual skill, but the specific multi-skill transitions that resolve the long-tail failure modes and mistakes of each skill's control policy, across many environment configurations. The number of required demonstrations scales roughly with the product of the number of skills, their possible handoffs, and the error-recovery behaviors each handoff demands, making long-tail coverage prohibitively expensive.
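To make that scaling concrete, here is a purely illustrative back-of-the-envelope count. Every number below is an assumption chosen for the example, not a measurement:

```python
# Illustrative only: all counts are assumptions, not measurements.
skills         = 6                       # distinct tool/skill policies
handoffs       = skills * (skills - 1)   # ordered skill-to-skill transitions
recovery_modes = 4                       # stuck, slip, misalignment, inadvertent push
configurations = 100                     # obstacle layouts, poses, reachability splits

demos_needed = handoffs * recovery_modes * configurations
print(demos_needed)  # 12000 composite demonstrations, before any coverage margin
```

Doubling the skill count roughly quadruples the handoff term alone, which is why demonstration collection falls behind as skill libraries grow.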

This is not just an engineering inconvenience. It reflects a fundamental limitation: imitation learning encodes solutions rather than understanding. The robot learns what to do in situations it has seen, but has no mechanism to reason about what would happen under alternative actions in situations it has not seen.

What Biology Does Differently

Biological agents do not solve this problem by accumulating demonstrations. A child learning to use tools does not need to observe every possible obstacle configuration paired with the correct recovery strategy. Instead, they build internal models of how the world works, how objects move when pushed, what happens when something is blocked, which actions are available given the current state, and use those models to plan, compose, and recover on the fly.

This is the principle we wanted to translate. Not the neural implementation details, but the computational strategy: learn how the world responds to your actions, then reason through that model at test time to solve new problems. Rather than expanding the demonstration dataset to cover every composite trajectory, corner case, and error-correction behavior, learn reusable skill-level models that can be recombined and reasoned through to handle novel situations, satisfy new constraints, and resolve unexpected failures.


Approach

We adopted a world model-first methodology evaluated in a deliberately constrained, small-scale robotic setting designed to stress-test compositional generalization.

Environment and task design

  • Embodiment: a dual-arm system with heterogeneous end-effectors: one arm equipped with a stick-like tool (for non-prehensile manipulation) and one with a gripper (for prehensile lifting).
  • Workspace constraints: a tabletop scene with immovable obstacles and partial, non-overlapping reachability, such that each arm can act on only a subset of the workspace.
  • Goal: lift a cube into free space (air), under conditions where direct grasping may be infeasible from the initial configuration, either because obstacles block the grasp or because the cube lies outside the gripper arm's reach.

System architecture

  • Skill policies: one policy per skill/tool, optimized for closed-loop execution of that skill.
  • Skill-conditioned world models: one predictive dynamics model per skill, trained to forecast the environment's evolution under that skill's action distribution.
  • Isolated training: policies and world models were trained independently per skill, purely in simulation, to capture how each tool interacts with the environment (contacts, pushes, slips, obstacle interactions), without exposing the system to multi-skill composite demonstrations.
  • Transferable state representation: world model inputs and outputs were designed to avoid leveraging artificial simulation-specific states. This allows the system to transfer more seamlessly to unseen objects, obstacles, and beyond simulation into the real world.
  • Planning stack: a higher-level planner composes skills by rolling out candidate sequences through the corresponding skill world models, selecting skill activations that advance the system toward the goal under the current constraints (a minimal sketch follows this list).
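To make the architecture concrete, here is a minimal sketch of the composition loop in Python. All names (`SkillPolicy`, `WorldModel`, `rollout`, `plan`) and the exhaustive sequence enumeration are illustrative assumptions, not our production interfaces; a real planner would prune candidates and roll out stochastically.

```python
import itertools

import numpy as np

# Hypothetical interfaces, named for illustration only.
class SkillPolicy:
    """Closed-loop controller for a single skill (e.g. stick-push, grasp)."""
    def act(self, state: np.ndarray) -> np.ndarray:
        raise NotImplementedError

class WorldModel:
    """Skill-conditioned dynamics model: forecasts the next state under the
    corresponding skill's action distribution. State here is object-centric
    (poses, contacts, reachability), not simulator internals, which is what
    lets the models transfer beyond simulation."""
    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        raise NotImplementedError

def rollout(state, skill_id, policies, models, horizon=20):
    """Imagine executing one skill for `horizon` steps inside its world model."""
    for _ in range(horizon):
        action = policies[skill_id].act(state)
        state = models[skill_id].predict(state, action)
    return state

def plan(state, policies, models, goal_cost, max_len=3):
    """Score candidate skill sequences entirely in imagination and return the
    best one; only its first skill is handed to the real controllers."""
    skills = list(policies)
    candidates = [seq for n in range(1, max_len + 1)
                  for seq in itertools.product(skills, repeat=n)]
    def imagined_cost(seq):
        s = state
        for skill_id in seq:
            s = rollout(s, skill_id, policies, models)
        return goal_cost(s)              # e.g. distance of cube to target pose
    return min(candidates, key=imagined_cost)
```

Because each model is conditioned on a single skill, composing skills reduces to chaining rollouts across models, so skill pairs that were never demonstrated together can still be scored as candidate plans.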

The intended outcome is a system where generalization is achieved by test-time reasoning and composition over learned predictive models, rather than by expanding the demonstration dataset to cover all composite trajectories, corner cases, and error-correction behaviors.


Why this is hard (and why a world-model planner helps)

We explicitly targeted solutions that are non-obvious: success often requires intermediate interactions (e.g., freeing an object from obstacles, moving it into a reachable region, then grasping) that are not trivially implied by the goal specification.

Key challenges:

  • Combinatorial coverage gap for teleoperation. A demonstration-based system would need to observe not just obstacle-avoidant behaviors, but also the specific multi-skill transitions that resolve each failure mode (stuck states, contact misalignments, slips) across many environment configurations. The number of required demonstrations scales roughly with the product of the number of skills and their possible handoffs, making long-tail coverage prohibitive.
  • Constraint-driven compositionality. Because each arm has limited reach and different affordances, the optimal strategy is often a structured composition: use the stick for non-prehensile repositioning and clearance, then hand off to the gripper for lifting. The crucial point is that these compositions must be inferred from state and constraints, not learned from a fixed library of end-to-end demonstrations that explicitly includes them.
  • Test-time reasoning over counterfactual futures. The planner must evaluate what would happen under alternative skill sequences, such as "attempt grasp now" vs "push to clear obstacles" vs "push into reachable region", and select a plan whose intermediate predicted states make the final goal feasible.
  • Self-correction via replanning under execution error. Realistic manipulation induces stochasticity (contact uncertainty, slip, inadvertent pushes). A successful system must detect when execution diverges from the plan and reassign control to the appropriate skill (for example, return to the stick when the object is displaced), repeatedly if necessary, until the goal becomes reliably achievable. Such error-correcting behavior is prohibitively expensive to cover with demonstration data; a minimal sketch of the detect-and-replan loop follows this list.
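The sketch below reuses the hypothetical `plan` helper and model interfaces from the earlier sketch. The divergence threshold, success test, and `env.step` interface are all assumptions for illustration:

```python
import numpy as np

def execute_with_replanning(env, state, policies, models, goal_cost,
                            divergence_tol=0.05, success_tol=1e-2,
                            max_steps=500, replan_every=20):
    """Model-predictive execution: follow the current plan's first skill,
    compare reality against the world model's one-step prediction, and
    replan early whenever the two diverge (e.g. after a slip)."""
    seq, steps_on_plan = plan(state, policies, models, goal_cost), 0
    for _ in range(max_steps):
        skill_id = seq[0]                        # currently active skill
        action = policies[skill_id].act(state)
        predicted = models[skill_id].predict(state, action)
        state = env.step(action)                 # assumed environment interface
        steps_on_plan += 1
        if goal_cost(state) < success_tol:
            return True                          # cube lifted; done
        diverged = np.linalg.norm(state - predicted) > divergence_tol
        if diverged or steps_on_plan >= replan_every:
            # Replanning from the *actual* state is what produces the observed
            # stick <-> gripper handoffs, repeated as often as needed.
            seq, steps_on_plan = plan(state, policies, models, goal_cost), 0
    return False
```

Note that handoffs are never scripted here: a handoff is simply what the planner returns when the post-slip state makes a different skill the best first move.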

Overall, our emphasis was not merely on the complexity of the task, but on demonstrating that skill-specific world models plus planning enable zero-shot, constraint-aware composition and self-correction at test time, without the system having to learn the full behavior from end-to-end demonstrations.


Emergent Behaviors

We consistently find that our world model-based system exhibits diverse behaviors it was never trained for, smoothly stitching skills together to achieve goals that are not directly achievable or that require intelligent, constraint-aware subgoal discovery. We also observed dynamic error recovery in which the system hands responsibility between skills, sometimes multiple times within an episode.

Furthermore, we find the system robust across many environments, a wide range of obstacle complexities, and different robot embodiments.

  • Video: Navigating diverse and moving obstacles
  • Video: Extracting the cube from cluttered spaces using the stick prior to picking
  • Video: Navigating diverse and moving obstacles (a second randomization)
  • Video: Maze-like navigation using the stick

(1) Unseen Composite Trajectories

To perform this task, the system constructed composite skill trajectories dynamically at test time: first using the stick to move the cube away from obstacles, then lifting the cube with the gripper.

Figure 1a. World model-grounded planning composes unseen skill-switching strategies under constraints.
Figure 1b. Composite skill switching on a second embodiment.

It is important to remember that the stick arm and gripper arm (each with their own skill policies and skill world models) were trained fully separately: at no point in training did the system see this full, compositional trajectory. Based on the state of the environment, the system successfully inferred that directly lifting the cube is not an option, and planned a composite trajectory using the complementary strengths of the two arms.

This approach contrasts sharply with teleoperation-based systems, which would have to cover such composite trajectories within the training data. As the set of individual skills grows, it becomes harder and harder to augment the training set with useful composite trajectories, an issue world model-based systems are much less susceptible to.

(2) Subgoal Discovery

The system was able to effectively infer subgoals that must be achieved before the final objective (lifting the cube) became feasible. Our planner infers which skills should activate at each step, and the corresponding world model predicts the resulting state. By visualizing those predictions, we can see the inferred subgoals explicitly.

Figure 2a. World model predictions reveal intermediate subgoals inferred from state and constraints.
Figure 2b. Subgoal discovery in embodiment 2.
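One way to surface these subgoals explicitly, sketched below with the same hypothetical helpers as before: record the world model's predicted state at each skill boundary of the chosen plan. The predicted cube pose at the stick-to-gripper boundary, for instance, is the inferred "clear of obstacles and within the gripper's reach" subgoal.

```python
def inferred_subgoals(state, seq, policies, models, horizon=20):
    """Collect the world model's predicted state at each skill boundary of a
    plan; these intermediate states are the planner's implicit subgoals."""
    subgoals = []
    for skill_id in seq:
        state = rollout(state, skill_id, policies, models, horizon)
        subgoals.append((skill_id, state))
    return subgoals

# Example with an assumed two-skill plan: the first entry is where the stick
# is expected to leave the cube before the gripper takes over.
# for skill_id, predicted in inferred_subgoals(s0, ("stick", "grasp"),
#                                              policies, models):
#     render(predicted)            # hypothetical visualizer
```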

(3) Self-Correction and Replanning

Finally, we observed that the system dynamically hands responsibility back and forth between the two tools/skills, based on the current state of the environment.

The clearest example is when the system first attempts to pick the cube immediately, then realizes it cannot (because of an obstacle), then switches to the stick tool until the cube is free and within range of the gripper. In some cases we saw multiple handoffs if the gripper slipped and pushed the cube away, handing the task back to the stick arm until the system was confident the gripper could complete the lift.

Figure 3a. Dynamic back-and-forth handoffs during repeated recovery.
Figure 3b. Initial failed grasp followed by tool-switch correction.

A Shift From Imitation To Understanding The World

This work encapsulates Voaige's approach to AI: rather than trying to solve robotics with ever-larger demonstration datasets, we show that principles from neuroscience can lead to systems that reason through novel situations they were never explicitly trained for.

The core insight is simple: teach robots how the world works, not just what to do. By training skill-conditioned world models independently and composing them at test time through planning, we built a system that exhibits three critical capabilities that emerge naturally from this architecture:

  • Zero-shot compositional behavior: multi-skill trajectories that were never seen in training, stitching together the complementary strengths of different tools to handle complex constraint-satisfaction problems.
  • Intelligent subgoal discovery: inferred intermediate states that make the final goal achievable.
  • Adaptive self-correction: dynamic reassignment between skills when execution drifts under contact uncertainty or environmental stochasticity, sometimes multiple times within an episode.

These behaviors emerged from the interaction between learned predictive models and test-time planning, without the full behavior being demonstrated end-to-end. This is what separates understanding from imitation: a robot that has learned how its actions affect the world can generalize to situations it has never encountered, compose skills it has never chained together, and recover from failures it has never experienced.

This continues our broader mission at Voaige: identify the computational principles that make intelligence possible, and build systems grounded in those principles. We showed in our previous work how neuroscience-inspired architectures could achieve 10x efficiency gains in perception. Here, we've demonstrated that the same neuroscience-first philosophy enables robots to exhibit sophisticated, adaptive behavior in action and manipulation.