Introducing Dynamic Compute Allocation

Test-Time Cognition for Agentic Tasks

We introduce Dynamic Compute Allocation (DCA) and evaluate it on the agentic coding tasks of Terminal-Bench 2.0, across single-model and composite model configurations.

Most improvements to AI systems in recent years have come from two directions: better models (larger, longer-trained, more carefully fine-tuned with reinforcement learning) and better agents (richer prompts, retrieval, tool use, and planning scaffolds). Both approaches modify the two inputs to any model endpoint: what the model knows and what it sees. At Voaige, we study a third dimension: test-time cognition. Holding model weights and prompt context fixed, we ask whether the mechanics of inference itself (how compute is allocated across a generation, how uncertainty is handled, and how hard sub-problems are identified and treated differently from easy ones) can be made adaptive and principled rather than uniform. Our thesis is that they can, and that test-time cognition methods can produce consistent, cost-efficient improvements in model performance without modifying the model or the agent.

Introducing DCA: our first algorithm for test-time cognition

Dynamic Compute Allocation (DCA) is our first published algorithm in this direction. Drawing on systems and computational neuroscience, DCA is inspired by the observation that neural systems do not allocate computation uniformly: they modulate attention, gain, and processing effort based on input uncertainty, surprise, and behavioral salience.

DCA decides when and how to intervene in an agentic trajectory on a given task. At each step, DCA reads signals from the current state of the trajectory to estimate how much additional compute would benefit that step. It then applies a lightweight form of explicit search to determine the final action to take, rather than accepting the model's first-pass output. This search is not exhaustive; it is selective, triggered by DCA's estimate of where additional deliberation has the highest expected return. It requires no changes to model weights or agent configuration.
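The idea of selective, signal-triggered search can be illustrated with a minimal sketch. This is not DCA's implementation, which the post does not describe in detail. It assumes, purely for illustration, that the uncertainty signal is the entropy of the model's distribution over its first-pass options, and that the "lightweight search" is a small best-of-N selection. The names `choose_action`, `propose`, `score`, `threshold`, and `extra_candidates` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_action(
    first_pass: str,
    first_pass_probs: Sequence[float],
    propose: Callable[[int], List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,       # hypothetical trigger level
    extra_candidates: int = 3,    # hypothetical search width
) -> str:
    """Keep the first-pass action on confident steps; search on uncertain ones.

    Assumption: entropy stands in for whatever signals a real system reads
    from the trajectory, and best-of-N stands in for the search procedure.
    """
    if entropy(first_pass_probs) < threshold:
        return first_pass  # confident step: spend no extra compute
    # Uncertain step: sample a few alternatives and keep the best-scoring one.
    candidates = [first_pass] + propose(extra_candidates)
    return max(candidates, key=score)
```

The design point is that extra compute is spent only on the steps flagged as uncertain, so most steps cost exactly what the unmodified agent would pay.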

DCA's effectiveness also relies on the reasoning ability of the underlying models. Not all models have responded equally to the current set of techniques DCA employs, and characterising the conditions under which a model responds well is part of ongoing work. What follows are our first results, evaluated on the agentic coding tasks of Terminal-Bench 2.0.

Why Voaige Cognition is different
| | RL approaches | Voaige Cognition (DCA) |
|---|---|---|
| Requires training | Yes (fine-tuning, RL, or continued pretraining) | No (operates on frozen weights) |
| Domain-specific data | Often (curated datasets per task or domain) | No (domain agnostic by design) |
| Tuning overhead | High (full retraining cycles) | Low (a few parameters per agent-model combination) |
| Applies to closed-source models | Rarely (requires model access) | Yes (works at the inference layer) |

The three layers of an agentic system
Agent Layer: prompts · RAG · tool use · planning
★ Test-Time Cognition: DCA v0.1
Model Layer: weights · architecture · training
Most improvements target the agent layer (prompts, planning, retrieval) or the model layer (architecture, training, RL). Test-time cognition is a third, largely unexplored layer between them.
Benchmark: Terminal-Bench 2.0

Terminal-Bench 2.0 is a suite of agentic coding tasks requiring multi-step reasoning, tool use, and execution in a live terminal environment. We evaluated three agent harnesses (Terminus 2.0.0, Mini-SWE-agent 2.2.1, OpenHands 1.4.0) across seven models, measuring success rate with and without DCA applied. All runs were conducted with no wall-clock timeout, to evaluate the native performance of agentic systems without the additional constraint of task time limits. All reported results are averaged over three trials.

§1 · Single Model

Inference headroom exists, and DCA finds it

We applied DCA v0.1 to GPT-OSS-120B on Terminus 2.0.0 and GLM-4.7-Flash on OpenHands 1.4.0. The endpoint prompt was not modified. The model weights were not modified. The agent harness was not modified. The only change was the introduction of a deployment-side layer that adaptively allocates compute across the generation. Both configurations show a measurable gain in success rate.

[Figure: GPT-OSS-120B + DCA v0.1. Baseline vs. DCA v0.1 · success rate (%) · Terminal-Bench 2.0. Legend: Baseline, Gain.]
Fig 1a. GPT-OSS-120B with DCA v0.1 on Terminus 2.0.0: +9.2 pp (52% relative gain).
[Figure: GLM-4.7-Flash + DCA v0.1. Baseline vs. DCA v0.1 · success rate (%) · Terminal-Bench 2.0. Legend: Baseline, Gain.]
Fig 1b. GLM-4.7-Flash with DCA v0.1 on OpenHands 1.4.0: +6.0 pp. The effect is consistent across model families.
The model already had the capacity. Uniform inference was not using it.

A 52% relative gain on GPT-OSS-120B is a significant result not because of its magnitude alone, but because of what it implies: the model, as deployed, was leaving substantial performance on the table at every inference call. DCA did not teach the model anything new. It found and allocated compute toward the steps where additional deliberation had the highest return, steps that uniform inference treats identically to every other step. The fact that this gain appears across two architecturally distinct model families rules out a model-specific artifact. It points to something more general: default inference, by treating all generation steps as equally deserving of compute, is a systematically suboptimal strategy for hard agentic tasks.
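The two ways of stating the GPT-OSS-120B gain (+9.2 percentage points and 52% relative) are linked by simple arithmetic, sketched below. The implied baseline is inferred from those two rounded figures; it is not a number reported in this post, so it should be read as approximate. The function name `implied_baseline` is hypothetical.

```python
def implied_baseline(gain_pp: float, relative_gain: float) -> float:
    """Baseline success rate (%) implied by an absolute gain in percentage
    points and the same gain expressed as a fraction of the baseline."""
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return gain_pp / relative_gain


# Figures from Fig 1a: +9.2 pp, 52% relative gain (both rounded).
baseline = implied_baseline(gain_pp=9.2, relative_gain=0.52)
with_dca = baseline + 9.2
print(f"implied baseline ~ {baseline:.1f}%, with DCA ~ {with_dca:.1f}%")
# prints "implied baseline ~ 17.7%, with DCA ~ 26.9%"
```

The distinction matters when comparing gains across models: the same absolute gain is a much larger relative gain on a low baseline than on a high one.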

§2 · Multiple Agent Harnesses

The gains are not agent-specific

A single-harness result could be dismissed as an artifact of one agent's scaffolding. To test generality, we evaluated both models across all three agent harnesses: Terminus, OpenHands, and Mini-SWE-agent. DCA is tuned per agent-model combination, which is a deliberate choice: the cognitive architecture of inference interacts with how a specific harness structures its trajectory, and a one-size-fits-all strategy would ignore that interaction. The results show gains in five of the six agent-model combinations, with one regression.

[Figure: GPT-OSS-120B + DCA v0.1. Three agent harnesses · success rate (%) · Terminal-Bench 2.0. Legend: Baseline, Gain, Regression.]
Fig 2a. GPT-OSS-120B with DCA v0.1: positive gains across all three agent harnesses.
[Figure: GLM-4.7-Flash + DCA v0.1. Three agent harnesses · success rate (%) · Terminal-Bench 2.0. Legend: Baseline, Gain, Regression.]
Fig 2b. GLM-4.7-Flash with DCA v0.1: gains on two of three agents. One isolated regression; DCA's interventions are selective rather than uniform.
Per agent-model tuning enables fine-grained cognition.

DCA is tuned for each agent-model combination, allowing inference-time cognition to be calibrated to how a specific harness interacts with a specific model. This produces gains on all three harnesses for GPT-OSS-120B and on two of three for GLM-4.7-Flash. Magnitude varies by combination; characterising the conditions under which the one regression occurs is part of ongoing work.
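Per-combination tuning with "a few parameters" can be pictured as a small lookup keyed by harness and model. The sketch below is a guess at the shape of such a configuration, not DCA's actual one: the parameter names `uncertainty_threshold` and `max_candidates` are hypothetical, and the values are placeholders rather than tuned settings.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DCAParams:
    # Hypothetical knobs; the post does not name DCA's real parameters.
    uncertainty_threshold: float = 0.5
    max_candidates: int = 3


DEFAULT = DCAParams()

# Keys are (harness, model). Values are placeholders, not tuned settings.
PARAMS: Dict[Tuple[str, str], DCAParams] = {
    ("Terminus 2.0.0", "GPT-OSS-120B"): DCAParams(0.4, 4),
    ("OpenHands 1.4.0", "GLM-4.7-Flash"): DCAParams(0.6, 2),
}


def params_for(harness: str, model: str) -> DCAParams:
    """Return the parameters for a combination, or the defaults if untuned."""
    return PARAMS.get((harness, model), DEFAULT)
```

Keeping the tuned state this small is what separates this kind of calibration from retraining: a new agent-model pairing means adding one entry, with the model weights left untouched.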

§3 · Composite Models

Multi-model composites exceed individual model performance

DCA interleaves models and orchestrates inferences across them based on the input and the agent trajectory at that instant. We tested a composite of GPT-OSS-120B and GLM-4.7-Flash applied through DCA. The chart below shows individual baselines, individual DCA results, and the composite result for each agent. The composite is compared against the stronger single-model baseline.

[Figure: GLM-4.7-Flash + GPT-OSS-120B, DCA v0.1 Composite vs. individual baselines and single-model DCA · success rate (%) · Terminal-Bench 2.0. Legend: Baseline (weaker model), Baseline (stronger model), DCA v0.1 (single model), Gain, Regression.]
Fig 3. The DCA composite exceeds both individual baselines and individual DCA configurations, by up to +14.1 pp over the stronger single-model baseline.
The composite gain is not attributable to either model individually.

Neither model reaches the composite result alone, nor does DCA applied to either model individually. The additional gain reflects DCA coordinating across models at inference time in ways that neither model achieves individually.
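Interleaving models step by step can be sketched as a per-step router. The version below is an illustration under stated assumptions, not DCA's routing logic: it assumes a hypothetical `suitability` function that scores each model for the current trajectory state, and it simply calls the highest-scoring model. The names `route_step`, `models`, and `suitability` are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def route_step(
    state: str,
    models: Dict[str, Callable[[str], str]],
    suitability: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Pick the model judged most suitable for this step and query it.

    Returns (model_name, proposed_action). How suitability is estimated is
    the hard part, and this sketch leaves it entirely to the caller.
    """
    if not models:
        raise ValueError("at least one model is required")
    best = max(models, key=lambda name: suitability(name, state))
    return best, models[best](state)
```

Because the choice is made again at every step, a trajectory can pass through several models, which is one way a composite could exceed what any single constituent reaches alone.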

§4 · Closed-Source Models

Composite gains extend to closed-source frontier models

We tested DCA composite configurations on GPT-5, GPT-5-mini, and GPT-5.2. Each GPT-5 series model was augmented with a smaller model in a multi-model composite. The chart below shows baseline and composite success rate for each model across all evaluated agents. The green segment is the gain over baseline.

[Figure: GPT-5 series, DCA v0.1 Composite. Baseline vs. composite · success rate (%) · Terminal-Bench 2.0 · all agents. Legend: Baseline, Gain, Regression.]
Fig 4. DCA composites across the GPT-5 series. GPT-5: up to +7.9 pp. GPT-5-mini: up to +7.8 pp. GPT-5.2: +3.4 pp over a 59.9% baseline. Gains are present across all three models and persist where RL-trained reasoning is already strong.
DCA gains are additive to RL-encoded reasoning improvements.

GPT-5.2 represents a strong RL-trained baseline at 59.9%, with DCA reaching 63.3%. The composite still yields a measurable gain, consistent with DCA operating on a different axis than training-time improvements: adaptive inference allocation at test time rather than compressed search encoded during training.

§5 · Pareto Front Extension · GPT-5

Across all configurations, DCA delivers more per dollar

Performance gains from test-time cognition usually come at less than 2× the cost, and often land at a favourable point relative to the baseline Pareto frontier. The chart plots all evaluated configurations by median task cost ($) and success rate on Mini-SWE-agent. Labels in brackets (minimal, low, medium, high) denote the reasoning effort level of the model. MiniMax M2.5 serves as a low-cost reference baseline. For closed-source models, DCA operates on a subset of its full capabilities; the gains shown are therefore a lower bound.

[Figure: GPT-5, MiniMax M2.5, DCA v0.1 Composite. Median task cost ($) vs. success rate · Mini-SWE-agent · Terminal-Bench 2.0. Legend: Baseline (no cognition), Voaige (cognition-augmented), Baseline Pareto frontier.]
At every cost point on the baseline frontier, DCA finds a higher-accuracy path.

At each cost level, the Voaige configuration achieves higher accuracy than the best available baseline. The DCA composite of MiniMax M2.5 and GPT-5 (medium) reaches 56.9% at $0.26, exceeding the GPT-5 (high) baseline of 51.7% at $0.44.
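The comparison above is a Pareto dominance claim: one configuration is both cheaper and more accurate than another. A minimal sketch of that check follows, using only the two data points quoted in the paragraph; the remaining points in the chart are not reproduced in the text, so they are not included. The names `Config`, `dominates`, and `pareto_frontier` are illustrative.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float      # median task cost in dollars
    accuracy: float  # success rate in percent


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and better on at least one."""
    return (a.cost <= b.cost and a.accuracy >= b.accuracy
            and (a.cost < b.cost or a.accuracy > b.accuracy))


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Configurations not dominated by any other, sorted by cost."""
    front = [c for c in configs
             if not any(dominates(o, c) for o in configs if o is not c)]
    return sorted(front, key=lambda c: c.cost)


baseline = Config("GPT-5 (high)", 0.44, 51.7)
composite = Config("DCA composite: MiniMax M2.5 + GPT-5 (medium)", 0.26, 56.9)
print(dominates(composite, baseline))  # prints True
```

Running `pareto_frontier` over the full set of baseline points would give the dashed frontier in the chart; a configuration lies above it when no baseline point dominates it.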

Fig 5. Baseline configurations (grey) define a cost-to-accuracy frontier (dashed). Voaige DCA configurations (teal) lie above this frontier at every evaluated cost point.
§6 · Accuracy & Cost · GPT-5.2

DCA achieves higher accuracy at lower cost on GPT-5.2

We evaluated GPT-5.2 across reasoning effort levels, alongside MiniMax M2.5 and Deepseek v3.2 as additional reference baselines. The three-model DCA composite (MiniMax M2.5 + Deepseek v3.2 + GPT-5.2) reaches 64.8% accuracy at $0.26, above the GPT-5.2 (medium) ceiling of 57.7% at $0.45. Each constituent model evaluated individually falls below this result, as does DCA applied to any single model alone.

[Figure: GPT-5.2, MiniMax M2.5, Deepseek v3.2, DCA v0.1 Composite. Left panel: success rate (%). Right panel: median task cost ($). Mini-SWE-agent · Terminal-Bench 2.0. Legend: Baseline, DCA composite.]
The composite result is not reachable by any single model, or by DCA on any model individually.

This is the result that warrants the closest attention. MiniMax M2.5 alone, Deepseek v3.2 alone, and GPT-5.2 alone each fall short of 64.8%. DCA applied to any one of them individually also falls short. The three-model composite reaches a level of accuracy that none of the components achieves on its own, and that DCA cannot reach when operating on a single model. What the composite result reflects is DCA coordinating inference across three models simultaneously, routing computation to whichever model is best positioned for each step of the trajectory. That coordination is the source of the gain, not any single model's capability, and not DCA operating in isolation.

Fig 6. Left: success rate by configuration. Right: median task cost by configuration. The three-model DCA composite (teal) leads on accuracy while undercutting the strongest baseline on cost.
§7 · Accuracy & Cost · Sonnet 4.5, Haiku 4.5 & Opus 4.6

DCA yields consistent accuracy gains across the Claude model family

The result that stands out here is not any individual number but the pattern across all three. Haiku 4.5, Sonnet 4.5, and Opus 4.6 span a wide range of capability and cost within the Claude family. DCA produces a gain on all three. A common assumption is that stronger models leave less room for improvement at inference time, having already compressed more reasoning into their weights during training. The data here does not support that assumption. Opus 4.6, the strongest model in this evaluation, responds to DCA at the same order of magnitude as the others.

[Figure: Sonnet 4.5, Haiku 4.5, Opus 4.6 + DCA v0.1. Left panel: success rate (%). Right panel: median task cost ($). Terminus 2.0.0 · Terminal-Bench 2.0. Legend: Baseline, DCA v0.1.]
Stronger models do not exhaust the gains available at inference time.

Haiku 4.5 is the lowest-cost model evaluated. The fact that it responds to DCA suggests that the headroom DCA exploits is not confined to large, expensive models. It appears to be a property of how inference works by default across model scales: compute is allocated uniformly, regardless of which steps in the trajectory actually need more deliberation. DCA corrects for that. The Opus 4.6 result makes the same point from the other end of the capability spectrum: a frontier-class model, already expensive and highly capable, still yields a measurable gain when inference is made adaptive. Neither scale nor capability closes the gap that uniform inference leaves open.

Fig 7. Left: success rate for baseline and DCA configurations across Sonnet 4.5, Haiku 4.5, and Opus 4.6 on Terminus 2.0.0. Right: corresponding median task cost. DCA bars in teal.
Discussion

The Unexplored Axis

The results above are consistent with test-time cognition being a meaningful and largely independent axis of improvement, separable from model quality and agent design. The gains from DCA are present across model families, agent harnesses, and capability levels, including frontier models where RL-trained reasoning is already strong.

This is worth pausing on. The field has converged on a shared assumption: that the primary levers for improving model performance are training-time (more data, better architectures, longer RL runs). What a model does at inference is treated as a fixed consequence of that training. These results suggest that assumption is incomplete. Holding weights and context constant, varying only how cognition is conducted at test time, produces consistent and measurable improvements on hard agentic tasks. Test-time cognition is not a downstream consequence of training. It is a degree of freedom in its own right.

These findings do not diminish the contribution of training-time improvements. RL-trained reasoning is a strong foundation, and DCA builds on top of it. But the results suggest that how a model thinks at test time (how compute is allocated, how uncertainty is handled, how multiple models are coordinated) represents headroom that training alone does not capture. As base model capability increases, this space becomes more, not less, worth exploring. Stronger models have more structured reasoning for test-time cognition to operate over.

DCA v0.1 is an early instantiation of this research direction. The space of test-time cognition methods is larger than what any single algorithm explores. What these results establish is that the space is worth exploring, and that principled methods operating within it can produce gains that are real, consistent, and additive to everything the field has built so far.

We are at the beginning of understanding what is possible when cognition at inference time is treated as a first-class research problem.

Further reading: Why Test-Time Cognition?  ·  The Third Axis of Intelligence