We introduce Dynamic Compute Allocation (DCA) and evaluate it on the agentic coding tasks of Terminal-Bench 2.0, across pure and composite model configurations.
Most improvements to AI systems in recent years have come from two directions: better models (larger, longer-trained, more carefully fine-tuned with reinforcement learning) and better agents (richer prompts, retrieval, tool use, and planning scaffolds). Both approaches modify the two inputs to any model endpoint: what the model knows and what it sees. At Voaige, we study a third dimension: test-time cognition. Holding model weights and prompt context fixed, we ask whether the mechanics of inference itself (how compute is allocated across a generation, how uncertainty is handled, and how hard sub-problems are identified and treated differently from easy ones) can be made adaptive and principled rather than uniform. Our thesis is that they can, and that test-time cognition methods applied at inference can produce consistent, cost-efficient improvements in model performance without modifying the model or the agent.
Dynamic Compute Allocation (DCA) is our first published algorithm in this direction. Drawing on systems and computational neuroscience, DCA is inspired by the observation that neural systems do not allocate computation uniformly: they modulate attention, gain, and processing effort based on input uncertainty, surprise, and behavioral salience.
DCA decides when and how to intervene in an agentic trajectory on a given task. At each step, DCA reads signals from the current state of the trajectory to estimate how much additional compute would benefit that step. It then applies a lightweight form of explicit search to determine the final action to take, rather than accepting the model's first-pass output. This search is not exhaustive; it is selective, triggered by DCA's estimate of where additional deliberation has the highest expected return. It requires no changes to model weights or agent configuration.
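The step-level behaviour described above can be sketched in a few lines. Everything here is illustrative: the function names, the use of next-token entropy as the uncertainty signal, and the threshold value are our assumptions for exposition, not DCA's actual signals or implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dca_step(candidates, entropy_threshold=0.5):
    """Choose an action for one trajectory step.

    `candidates` is a list of (action, token_probs, score) tuples sampled
    from the frozen model. If the first-pass distribution is low-entropy,
    accept the first-pass action unchanged; otherwise spend extra compute
    and pick the best-scoring candidate, a stand-in for selective search.
    """
    first_action, first_probs, _ = candidates[0]
    if token_entropy(first_probs) < entropy_threshold:
        return first_action  # cheap path: the model is confident
    # expensive path: deliberate over the sampled alternatives
    return max(candidates, key=lambda c: c[2])[0]
```

The key property this sketch preserves is selectivity: the extra search branch only runs when the uncertainty signal fires, so most steps cost a single first-pass generation.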
DCA's effectiveness also depends on the reasoning ability of the underlying models. Not all models have responded equally to the current set of techniques DCA employs, and characterising the conditions under which models benefit is part of ongoing work. What follows are our first results, evaluated on agentic coding tasks from Terminal-Bench 2.0.
| | RL approaches | Voaige Cognition (DCA) |
|---|---|---|
| Requires training | Yes — fine-tuning, RL, or continued pretraining | No — operates on frozen weights |
| Domain-specific data | Often — curated datasets per task or domain | No — domain agnostic by design |
| Tuning overhead | High — full retraining cycles | Low — a few parameters per agent-model combination |
| Applies to closed-source models | Rarely — requires model access | Yes — works at the inference layer |
Terminal-Bench 2.0 is a suite of agentic coding tasks requiring multi-step reasoning, tool use, and execution in a live terminal environment. We evaluated three agent harnesses (Terminus 2.0.0, Mini-SWE-agent 2.2.1, OpenHands 1.4.0) across seven models, measuring success rate with and without DCA applied. All runs were conducted with no wall-clock timeout, to evaluate the native performance of agentic systems without the additional constraint of task time limits. All reported results are averaged over 3 trials.
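For concreteness, the reported metric is computed as sketched below; the function names are ours, not part of Terminal-Bench's API.

```python
from statistics import mean

def trial_success_rate(trial):
    """Fraction of tasks solved in a single trial (list of booleans)."""
    return sum(trial) / len(trial)

def averaged_success_rate(trials):
    """Average the per-trial success rates, as in 'averaged over 3 trials'."""
    return mean(trial_success_rate(t) for t in trials)
```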
We applied DCA v0.1 to GPT-OSS-120B and GLM-4.7-Flash on the Terminus 2.0.0 and OpenHands 1.4.0 agents respectively. Each card shows baseline success rate alongside the DCA result. The green segment is the gain over baseline.
Both configurations hold weights and agent harness fixed. The improvement in success rate reflects DCA's adaptive compute allocation at inference time: allocating more compute to harder sub-problems rather than distributing it uniformly across the generation.
DCA is tuned for each agent-model combination. This is intentional: per-combination tuning allows inference-time cognition to be calibrated to how a specific agent harness interacts with a specific model, rather than applying a one-size-fits-all inference strategy. We tested configurations across all three agents (Terminus, OpenHands, and Mini-SWE-agent) for each model. As Figures 1 and 2 show, this tuning produces gains across harnesses; magnitude varies by combination, and one isolated regression is observed. Characterising the conditions under which regressions occur is part of ongoing work.
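Since DCA exposes only "a few parameters per agent-model combination", the tuning surface can be pictured as a small lookup table. The parameter names and values below are purely hypothetical assumptions for illustration, not DCA's real interface.

```python
# Hypothetical per-(agent, model) tuning table. The keys mirror the
# harness-model pairs evaluated above; the parameters are invented.
DCA_CONFIGS = {
    ("terminus", "gpt-oss-120b"): {"entropy_threshold": 0.8, "max_branches": 4},
    ("openhands", "glm-4.7-flash"): {"entropy_threshold": 1.2, "max_branches": 2},
}

DEFAULT_CONFIG = {"entropy_threshold": 1.0, "max_branches": 3}

def dca_config(agent, model):
    """Return the tuned parameters for an agent-model pair, or defaults."""
    return DCA_CONFIGS.get((agent, model), DEFAULT_CONFIG)
```

The point of the sketch is the shape of the problem: tuning cost scales with the number of agent-model pairs, not with model size or training data, which is why the overhead stays low.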
DCA interleaves models and orchestrates inferences across them based on the input and the agent trajectory at that instant. We tested a composite of GPT-OSS-120B and GLM-4.7-Flash applied through DCA. The chart below shows individual baselines, individual DCA results, and the composite result for each agent. The composite is compared against the stronger single-model baseline.
Neither model reaches the composite result alone, nor does DCA applied to either model individually. The additional gain comes from DCA coordinating the two models at inference time, based on the trajectory state, rather than from either model's capabilities on its own.
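A minimal sketch of the orchestration idea: dispatch each step to one of the two models based on a difficulty estimate from the current trajectory. The signal, threshold, and routing rule are assumptions of ours, not Voaige's actual criteria.

```python
def route_step(difficulty, threshold=0.6):
    """Send easy steps to the cheaper model, hard ones to the stronger one."""
    return "glm-4.7-flash" if difficulty < threshold else "gpt-oss-120b"

def run_composite(step_difficulties, threshold=0.6):
    """Return the model chosen at each step of a trajectory."""
    return [route_step(d, threshold) for d in step_difficulties]
```

Under this framing, the composite can exceed both single-model results whenever the two models' failure modes differ and the routing signal separates them even imperfectly.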
We tested DCA composite configurations on GPT-5, GPT-5-mini, and GPT-5.2. Each GPT-5 series model was augmented with a smaller model in a multi-model composite. The chart below shows baseline and composite success rate for each model across all evaluated agents. The green segment is the gain over baseline.
GPT-5.2 represents a strong RL-trained baseline at 59.9%, with DCA reaching 63.3%. The composite still yields a measurable gain, consistent with DCA operating on a different axis than training-time improvements: adaptive inference allocation at test time rather than compressed search encoded during training.
Performance gains from test-time cognition are usually achieved at less than 2× the cost, often at a favourable point on the cost-accuracy Pareto frontier. The chart below plots all evaluated configurations by median task cost ($) and success rate on Mini-SWE-agent. Labels in brackets (minimal, low, medium, high) denote the reasoning-effort level of the GPT-5 model in each configuration. Note that for closed-source models, DCA operates on a subset of its full capabilities: limited access to inference mechanisms constrains which interventions can be applied. The gains shown here are therefore a lower bound on what DCA can achieve on these models.
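The frontier in such a chart can be computed directly from (median cost, success rate) pairs; the points below are placeholders, not our measured data.

```python
def pareto_frontier(points):
    """Keep configurations not dominated by a cheaper-and-better one.

    `points` is a list of (cost, success_rate) pairs. After sorting by
    cost, a point survives only if it beats every cheaper surviving
    point on success rate.
    """
    frontier = []
    for cost, acc in sorted(points):
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc))
    return frontier
```

Reading the result left to right gives, at each budget, the best achievable success rate among the evaluated configurations, which is the comparison made in the paragraph above.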
At each cost level, the Voaige configuration achieves higher accuracy than the best available baseline. The gains observed in §1–§4 are not purchased at a proportionate cost premium; the performance improvement per dollar of additional inference cost is favourable across all tested configurations.
The results above are consistent with test-time cognition being a meaningful and largely independent axis of improvement, separable from model quality and agent design. The gains from DCA are present across model families, agent harnesses, and capability levels, including frontier models where RL-trained reasoning is already strong.
This is worth pausing on. The field has converged on a shared assumption: that the primary levers for improving model performance are training-time (more data, better architectures, longer RL runs). What a model does at inference is treated as a fixed consequence of that training. These results suggest that assumption is incomplete. Holding weights and context constant, varying only how cognition is conducted at test time, produces consistent and measurable improvements on hard agentic tasks. Test-time cognition is not a downstream consequence of training. It is a degree of freedom in its own right.
These findings do not diminish the contribution of training-time improvements. RL-trained reasoning is a strong foundation, and DCA builds on top of it. But the results suggest that how a model thinks at test time (how compute is allocated, how uncertainty is handled, how multiple models are coordinated) represents headroom that training alone does not capture. As base model capability increases, this space becomes more, not less, worth exploring. Stronger models have more structured reasoning for test-time cognition to operate over.
DCA v0.1 is an early instantiation of this research direction. The space of test-time cognition methods is larger than what any single algorithm explores. What these results establish is that the space is worth exploring, and that principled methods operating within it can produce gains that are real, consistent, and additive to everything the field has built so far.
We are at the beginning of understanding what is possible when cognition at inference time is treated as a first-class research problem.
Further reading: Why Test-Time Cognition? · The Third Axis of Intelligence