How Test-Time Reasoning for Robotic Intelligence Unlocks Behaviors That Were Never Trained For
At Voaige, our mission is to understand the computational principles that make intelligence possible, and then build systems that exhibit intelligent behavior grounded in those principles. In our previous post, we showed what this looks like in practice: a single observation about the brain's locally random wiring led to architectures that achieved 10x efficiency gains in deep learning. That work focused on perception, specifically how a network can learn to see robustly with far less compute.
This post is about what comes next. An agent that can recognize objects but cannot reason about how to act on them, adapt when things go wrong, or compose familiar skills into novel strategies is just pattern matching, and perception alone does not solve that. The next chapter for us moves from perception into action, and the central question becomes: how do you build robotic systems that can handle the open-ended complexity of the physical world without needing to be shown every possible situation in advance?
As we will show, the answer points directly to a deeper principle: intelligence emerges at test time. This work on robotic intelligence is, at its core, a proof of concept for test-time reasoning, and the same principle is now at the heart of what we are building for LLMs.
Robotic manipulation spans a combinatorially large space of embodiments, tools, workspaces, and constraints. The dominant approach today, collecting teleoperation demonstrations and training a policy to imitate them, works well when the training distribution densely covers the situations the robot will encounter, but degrades sharply at the edges, precisely where robotics gets interesting.
The problem is compositional. A dual-arm robot might need to push an object clear of an obstacle with one tool, reposition it into the reachable zone of another arm, and then grasp it, a sequence that was never demonstrated end-to-end. A teleoperation pipeline would need to have captured not just each individual skill, but the specific multi-skill transitions that resolve the long-tail failure modes or mistakes of each skill's control policy, across many environment configurations. The number of required demonstrations scales roughly with the product of skills, their possible handoffs, and their error-recovery strategies and trajectories, making long-tail coverage prohibitively expensive.
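To make that scaling concrete, here is a back-of-envelope count with purely hypothetical numbers (a modest ten-skill library, a handful of recovery modes, a few dozen environment configurations):

```python
# Back-of-envelope demonstration count; all numbers are hypothetical.
n_skills = 10                              # individual skills in the library
n_handoffs = n_skills * (n_skills - 1)     # ordered skill-to-skill transitions
n_recovery_modes = 5                       # recovery strategies per transition
n_env_configs = 50                         # environment configurations to cover

composite_cases = n_handoffs * n_recovery_modes * n_env_configs
print(f"{composite_cases:,} composite cases to demonstrate")  # 22,500
```

Each of those cases would need multiple teleoperated demonstrations to be learned reliably, and the count grows quadratically as skills are added.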
This is not just an engineering inconvenience but a fundamental limitation: imitation learning encodes solutions rather than understanding. The robot learns what to do in situations it has seen, but has no mechanism to reason about what would happen under alternative actions in situations it has not seen. This is the same limitation that naive next-token prediction imposes on LLMs, and the same reason that test-time reasoning is the path forward in both domains.
Biological agents do not solve this problem by accumulating demonstrations. A child learning to use tools does not need to observe every possible obstacle configuration paired with the correct recovery strategy. Instead, they build internal models of how the world works, how objects move when pushed, what happens when something is blocked, which actions are available given the current state, and use those models to plan, compose, and recover on the fly.
Brains do not retrain for each new problem. They dynamically allocate compute, adjust, redirect, and coordinate at test time; we believe this same principle is the missing piece for AI systems broadly. This is the principle we wanted to translate: not the neural implementation details, but the computational strategy of learning how the world responds to your actions, then reasoning through that model at test time to solve new problems. Rather than expanding the demonstration dataset to cover every composite trajectory, corner case, and error-correction behavior, we learn reusable skill-level models that can be recombined and reasoned through to handle novel situations, satisfy new constraints, and resolve unexpected failures.
We adopted a world model-first methodology evaluated in a deliberately constrained, small-scale robotic setting designed to stress-test compositional generalization. The core idea is that rather than training a robot on what to do, we train it on how the world responds to its actions and then let it reason at test time, which is test-time cognition applied to physical intelligence.
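As a minimal sketch of what "training on how the world responds" could look like, here is an assumed setup with a low-dimensional state vector and a residual MLP per skill; the post does not describe Voaige's actual architecture, so treat every detail below as illustrative:

```python
# Minimal sketch of a skill-specific world model: a residual MLP that
# predicts the next state from the current state and action.
# Architecture and training details are illustrative assumptions.
import torch
import torch.nn as nn

class SkillWorldModel(nn.Module):
    """Learned dynamics for one skill: (state, action) -> next state."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Predicting the state delta (residual) is usually more stable
        # than predicting the absolute next state.
        return state + self.net(torch.cat([state, action], dim=-1))

def train_step(model, optimizer, batch):
    """One gradient step on (s, a, s') transitions collected while a
    single skill's policy was in control."""
    s, a, s_next = batch
    loss = nn.functional.mse_loss(model(s, a), s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property is that the stick's model and the gripper's model are each trained only on transitions gathered while their own skill was in control; nothing in training couples them.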
The intended outcome is a system where generalization is achieved by test-time reasoning and composition over learned predictive models, rather than by expanding the demonstration dataset to cover all composite trajectories, corner cases, and error-correction behaviors. The planner is not recalling a memorized solution; it is thinking its way to one.
We explicitly targeted solutions that are non-obvious: success often requires intermediate interactions (e.g., freeing an object from obstacles, moving it into a reachable region, then grasping) that are not trivially implied by the goal specification.
Key challenges:
- Composing separately trained skills into sequences that were never demonstrated end-to-end.
- Inferring intermediate subgoals, such as freeing an obstructed object or moving it into a reachable region, that are not implied by the goal specification.
- Detecting failures mid-episode and recovering by handing control between skills.
- Remaining robust across environment configurations, obstacle layouts (including moving obstacles), and robot embodiments.
Overall, our emphasis was not merely on the complexity of the task, but on demonstrating that skill-specific world models plus planning enable one-shot, constraint-aware composition and self-correction at test time, without needing the system to learn the full behavior from end-to-end demonstrations.
The central result of this work is not a task solved but a class of behaviors that emerge. When a system is equipped with predictive models and the ability to reason through them at test time, it exhibits capabilities that were never demonstrated, never explicitly trained, and never anticipated by the training pipeline. Rather than retrieving a solution, the model constructs one, and that distinction is the hallmark of genuine test-time intelligence.
We consistently find that our system smoothly stitches skills together to reach goals that no single skill can achieve directly or that require intelligent, constraint-aware subgoal discovery, with dynamic error recovery achieved by handing responsibility between skills, sometimes multiple times within an episode. We also find it robust across many different environments, a wide range of obstacle complexities including moving obstacles, and across different robot embodiments.
To perform our central evaluation task, lifting a cube that obstacles initially place out of the gripper's reach, the system constructed composite skill trajectories dynamically at test time: first use the stick to move the cube away from obstacles, then lift the cube using the gripper. No version of this full sequence existed in training; it was reasoned into existence at inference time.
It is important to remember that the stick arm and gripper arm (each with their own skill policies and skill world models) were trained fully separately: at no point in training did the system see this full, compositional trajectory. Based on the state of the environment, the system successfully inferred that directly lifting the cube is not an option, and planned a composite trajectory using the complementary strengths of the two arms. The planner did not find this trajectory in memory; it derived it through active reasoning over its world models.
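To illustrate how such a composite plan can be derived rather than recalled, here is a deliberately simple sketch: a discrete skill library where each skill pairs a control policy with its learned world model, and a planner that scores short skill sequences entirely in imagination. The exhaustive search, the fixed per-skill horizon, and all names are assumptions standing in for whatever planner is actually used:

```python
# Illustrative test-time composition over skill world models.
# `models` maps skill name -> learned world model (see sketch above);
# `policies` maps skill name -> that skill's control policy.
import itertools
import numpy as np

def imagine_step(skill, state, models, policies):
    """One imagined step: the skill's policy proposes an action, and the
    skill's world model predicts how the environment would respond."""
    return models[skill](state, policies[skill](state))

def rollout(plan, state, models, policies, steps_per_skill=20):
    """Simulate a skill sequence entirely inside the learned models."""
    for skill in plan:
        for _ in range(steps_per_skill):
            state = imagine_step(skill, state, models, policies)
    return state

def plan_skills(state, models, policies, goal_cost, max_len=3):
    """Score short skill sequences in imagination and return the cheapest,
    e.g. ('stick_push', 'gripper_lift') when a direct lift is infeasible."""
    best_plan, best_cost = None, np.inf
    for length in range(1, max_len + 1):
        for plan in itertools.product(models, repeat=length):
            cost = goal_cost(rollout(plan, state, models, policies))
            if cost < best_cost:
                best_plan, best_cost = plan, cost
    return best_plan
```

When the imagined rollout of ('gripper_lift',) never clears the obstacle, a two-step plan like ('stick_push', 'gripper_lift') scores better, and the composite behavior falls out of the search.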
This approach contrasts sharply with teleoperation-based systems, which would have to cover such composite trajectories within the training data. The larger the set of individual skills grows, the harder it becomes to augment the training set with useful composite trajectories, an issue that test-time reasoning systems are much less susceptible to.
The system effectively inferred subgoals that had to be achieved before the final objective (lifting the cube) became feasible. Our planner infers which skills should activate at each step, and the corresponding world model predicts the resulting state. By visualizing those predictions, we can see the inferred subgoals explicitly, providing a direct window into the system's test-time reasoning process.
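Continuing the sketch above, making those inferred subgoals visible is as simple as recording each skill's predicted end state along the chosen plan:

```python
def inferred_subgoals(plan, state, models, policies, steps_per_skill=20):
    """Record each skill's predicted end state along the chosen plan;
    rendering these states makes the inferred subgoals explicit."""
    subgoals = []
    for skill in plan:
        for _ in range(steps_per_skill):
            state = imagine_step(skill, state, models, policies)
        subgoals.append((skill, state))  # e.g. "cube pushed clear, in reach"
    return subgoals
```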
Finally, we observed that the system dynamically hands responsibility back and forth between the two tools/skills based on the current state of the environment. This is not a pre-programmed fallback but the planner continuously re-evaluating its world model predictions against observed reality and reassigning control when the two diverge.
The clearest example is when the system first attempts to pick the cube immediately, then realizes it cannot (because of an obstacle), then switches to the stick tool until the cube is free and within range of the gripper. In some cases we saw multiple handoffs when the gripper slipped and pushed the cube away, handing the task back to the stick arm until the system was confident the gripper could complete the lift. This adaptive replanning loop is precisely the kind of behavior that test-time cognition makes possible, and it is one that no amount of additional training data could fully anticipate.
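Continuing the same sketch, the handoff behavior needs no special machinery; a loop that compares each world-model prediction to the observed outcome and replans on divergence is enough in principle (the gym-style environment interface, the divergence test, and the threshold below are all assumptions):

```python
def run_episode(env, models, policies, goal_cost, tol=0.1, max_steps=500):
    """Act under the current plan, compare each world-model prediction to
    what actually happened, and replan (often handing control to the other
    arm's skill) whenever the two diverge."""
    state = env.reset()
    plan = plan_skills(state, models, policies, goal_cost)
    for _ in range(max_steps):
        if goal_cost(state) <= 0:
            break                                    # goal reached
        skill = plan[0]
        predicted = imagine_step(skill, state, models, policies)
        state = env.step(policies[skill](state))     # real-world step
        if np.linalg.norm(state - predicted) > tol:
            # Reality diverged (e.g. the gripper slipped and pushed the
            # cube away): hand the problem back to the planner.
            plan = plan_skills(state, models, policies, goal_cost)
    return state
```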
This work encapsulates Voaige's approach to AI: rather than trying to solve robotics with ever-larger demonstration datasets, we show that principles from neuroscience can lead to systems that reason through novel situations they were never explicitly trained for.
The core insight is simple: teach robots how the world works, not just what to do. By training skill-conditioned world models independently and composing them at test time through planning, we built a system that exhibits three critical capabilities that emerge naturally from this architecture:
- One-shot skill composition: chaining separately trained skills into sequences that never appeared in training.
- Constraint-aware subgoal discovery: inferring the intermediate states, such as freeing the cube and moving it into reach, that make the final goal feasible.
- Dynamic error recovery: detecting when reality diverges from prediction and handing responsibility between skills mid-episode.
These behaviors emerged from the interaction between learned predictive models and test-time planning, without the full behavior being demonstrated end-to-end. This is what separates understanding from imitation: a robot that has learned how its actions affect the world can generalize to situations it has never encountered, compose skills it has never chained together, and recover from failures it has never experienced.
Critically, this principle is not specific to robotics. What we demonstrated here is that test-time reasoning over learned models unlocks emergent, adaptive intelligence, and that is the same principle we are now bringing to LLMs. Just as our robotic system discovered novel composite behaviors by reasoning at inference time rather than recalling from training, we believe the next breakthrough in language model intelligence lies in building the same kind of efficient, neuroscience-grounded cognition layer: one that reasons, composes, and adapts at test time rather than merely retrieving.

This continues our broader mission at Voaige: identify the computational principles that make intelligence possible, and build systems grounded in those principles. We showed in our previous work how neuroscience-inspired architectures could achieve 10x efficiency gains in perception. Here, we have demonstrated that the same neuroscience-first philosophy enables robots to exhibit sophisticated, adaptive behavior in action and manipulation, and the next AI breakthrough will be in bringing that same efficient cognition to language.