Advances in AI system performance have converged on two primary axes of optimization. The first is parameter-level: architectural improvements, dataset scaling, extended training regimes, and reinforcement learning from human feedback. The second is context-level: prompt engineering, retrieval-augmented generation, tool use, and multi-step planning. Both axes have produced substantial gains.

These two approaches share a common structure. Parameter-level optimization modifies what the model has learned. Context-level optimization modifies what the model sees at inference time. Together, they define the two inputs to any model call: the weights and the prompt. The dominant research agenda is, in essence, a joint optimization over these two variables.

At Voaige, we hold a different conviction: that intelligence does not arise from stored parameters or structured inputs alone, but also from the computational process of inference itself. This is not a marginal extension of existing methods. It is a claim that reasoning, not retrieval, is the substrate of intelligence.

The empirical basis for this comes from biology. Across species, complex nervous systems do not retrieve fixed responses from synaptic connectivity alone. They execute active, adaptive computations: dynamically allocating resources, modulating attention under uncertainty, and revising internal representations in response to novel inputs. Cognitive and systems neuroscience have produced a substantial body of work characterizing these mechanisms, and we hold that those findings constitute a principled set of priors for engineering inference-time computation in large language models.

A New Dimension of the AI Stack

Between the agent layer and the model layer lies a largely uncharacterized computational space: the cognitive architecture of inference. It is not a function of weights, context, or scaffolding. It is the active process by which a frozen model maps inputs to outputs, and the degrees of freedom within that process, spanning sampling dynamics, activation geometry, and attention structure, remain almost entirely unexplored. Our research treats inference-time computation as a tractable scientific domain with measurable levers and principled optimization targets.

Research Constraints

We hold two properties invariant and explore how cognitive architecture governs reasoning at inference time.

By fixing the two inputs that nearly all existing approaches optimize (the model's weights and the prompt entering the endpoint), we are free to explore a largely uncharacterized dimension of the AI stack: the cognitive architecture of inference. The levers we discover span the entire depth of the LLM stack, from sampling temperature to activation vectors and attention maps, and extend further to multi-model configurations, where the algorithmic space grows substantially larger.

Constraint I

The prompt and context entering the model remain unchanged. No agentic scaffolding, no RAG, no prompt manipulation.

Constraint II

The model weights remain unchanged. No fine-tuning, no pre-training modifications, no post-training interventions.

These invariants are not incidental. By occupying the space between the agentic harness and the backend model without modifying either, the TTC layer defines a structurally distinct position in the AI stack, one that complements rather than replaces existing paradigms.

The Test-Time Cognition Layer

AGENT LAYER: prompts, RAG, tool use, planning
TEST-TIME COGNITION: Voaige · cognitive architecture of inference
MODEL LAYER: weights, architecture, training

The TTC layer sits between the agent and the model, operating without modifying either. It is the site of inference-time computation: the cognitive architecture that governs how reasoning unfolds between input and output.

The Curse of Dimensionality at Inference Time

The inference-time configuration space is far richer than most researchers expect. Sampling strategies, temperature schedules, activation steering, attention manipulation, multi-model interleaving: each is a distinct dimension of the cognitive architecture of inference, with measurable effects on accuracy and cost.

This richness, however, introduces a fundamental challenge. The space of possible inference-time configurations is exponentially large, and grows with each newly identified lever. In multi-model settings, where several models interact within a single inference pipeline, the combinatorial complexity compounds further.
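To make the combinatorics concrete, here is a small sketch. The lever names and option counts below are purely illustrative assumptions, not an inventory of actual levers; the point is only how quickly a discrete configuration space compounds, especially once multiple models each carry their own configuration.

```python
from math import prod

# Hypothetical levers with illustrative option counts (assumptions, not
# a real inventory of inference-time degrees of freedom).
levers = {
    "sampling_strategy": 4,       # e.g. greedy, top-k, top-p, beam
    "temperature_schedule": 8,
    "activation_steering_dir": 16,
    "attention_mask_variant": 6,
    "num_interleaved_models": 3,
}

single_model_configs = prod(levers.values())
print(single_model_configs)  # 9216 discrete configurations for one model

# With k interacting models, each independently configured, the joint
# space grows as configs ** k: the complexity compounds multiplicatively.
for k in (1, 2, 3):
    print(k, single_model_configs ** k)
```

Even this toy grid yields billions of joint configurations at three models, before adding any continuous parameters; exhaustive evaluation is plainly off the table.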

"For training, the scaling laws are well understood: more data, larger models, longer training. For test-time cognition, however, the algorithmic space is exponentially larger, and no scaling law yet governs it."

The optimization problem

The core technical challenge is therefore not narrow hyperparameter optimization, but navigation of a high-dimensional configuration space under constraints of accuracy, cost, and generalization. Brute-force exploration of this space is computationally intractable.

For problems of this nature, there is a well-known principle: the only viable approach is to reduce the search space using principled priors. The question then becomes: where do those priors come from?
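The mechanics of prior-guided reduction can be sketched in a few lines. The grid and the specific constraints below are invented for illustration (they stand in for whatever biologically derived constraints a real system would encode); what matters is that the prior filters the space before any expensive evaluation occurs.

```python
import itertools

# Hypothetical discrete lever grid; names and values are illustrative.
grid = {
    "temperature": [0.0, 0.3, 0.7, 1.0, 1.5],
    "samples": [1, 4, 16, 64],
    "steering_strength": [0.0, 0.5, 1.0, 2.0],
}

def passes_prior(cfg):
    """Example prior: high sampling diversity should pair with multiple
    samples, and strong steering with low temperature. Both rules are
    assumptions standing in for principled, derived constraints."""
    if cfg["temperature"] > 1.0 and cfg["samples"] == 1:
        return False
    if cfg["steering_strength"] >= 1.0 and cfg["temperature"] > 0.7:
        return False
    return True

all_cfgs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
pruned = [c for c in all_cfgs if passes_prior(c)]

# The prior shrinks the candidate set before a single configuration is run.
print(len(all_cfgs), len(pruned))
```

The filtered set is then what an actual search or evaluation procedure would operate on; the better the prior, the smaller and better-positioned that set is.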

Neuroscience as a Computational Prior

The empirical basis for this prior comes from evolutionary biology. Over hundreds of millions of years, natural selection produced a class of systems exhibiting the properties most relevant to this problem: high generalization under resource constraints, efficient dynamic allocation of compute, and robust performance on novel inputs. The mammalian brain is the best-studied instance of such a system.

Our research lies at the intersection of two fields: observations from systems and cognitive neuroscience about how cognition occurs in the mammalian brain, and the computational properties of our chosen substrate, large language models. We study the brain not to replicate its structures, but to extract the computational principles behind them and implement those principles using the levers available in LLMs.

The distinction is methodological. The goal is not to simulate neural structures or map cortical organization onto transformer architectures. Rather, it is to identify the computational principles underlying those structures, such as hierarchical abstraction, dynamic resource allocation, and selective attention under uncertainty, and to instantiate those principles within the LLM substrate.

These neuroscience-derived priors serve a precise function: they constrain the search over inference configurations. Rather than exhaustively searching an exponentially large space, biologically grounded priors focus the search on regions of the configuration space most likely to yield efficient, generalizable systems.

From Principles to Systems: Dynamic Compute Allocation

This methodology (characterizing inference-time levers, deriving computational priors from neuroscience, and instantiating those priors as algorithms) has produced our first system designed for cognition at inference time. We call it Dynamic Compute Allocation, or DCA.

DCA is built on a simple observation from neuroscience: not all inputs warrant equal computational expenditure. The brain allocates metabolic resources dynamically: more attention and processing for novel or uncertain stimuli, less for familiar patterns. DCA implements this principle in the LLM stack, dynamically adjusting inference-time compute based on the difficulty and structure of the input.
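The principle can be illustrated with a deliberately minimal sketch. This is not the DCA algorithm itself; it is a toy rule, using next-token entropy as a stand-in for input difficulty, that allocates more parallel samples to uncertain steps and minimal compute to confident ones.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def allocate_compute(probs, min_samples=1, max_samples=16, threshold=1.0):
    """Toy allocation rule: low-entropy (confident) steps get minimal
    compute; high-entropy steps get more parallel samples, capped at
    max_samples. An illustrative stand-in, not the actual DCA policy."""
    h = token_entropy(probs)
    if h < threshold:
        return min_samples
    scale = min(h / threshold, math.log2(max_samples))
    return min(max_samples, max(min_samples, round(min_samples * 2 ** scale)))

confident = [0.97, 0.01, 0.01, 0.01]   # near-deterministic next token
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 tokens

print(allocate_compute(confident))  # low entropy: minimal compute
print(allocate_compute(uncertain))  # high entropy: more samples
```

A real system would of course measure difficulty with richer signals than single-step entropy and spend the extra budget on more than parallel sampling, but the familiar-versus-novel asymmetry is the same.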

We have applied DCA to several models, both individually and in multi-model configurations. Modified versions of DCA have also been applied to closed-source models, with consistent benefits across the vast majority of settings. The details of these results will appear in subsequent posts.

"Our goal is not to replicate the brain. It is to understand the computational principles behind its structures, and implement them using the levers of our chosen substrate."

The Voaige philosophy

An Expanding Research Frontier

DCA represents the first instantiation of this research program. As additional inference-time levers are identified, the configuration space expands, increasing the importance of principled, prior-guided search methods.

The research program continues along both axes of the intersection: characterizing the neuroscientific mechanisms that inform our priors, and mapping the implementable levers within the LLM substrate. The objective is an inference process that dynamically adapts computational allocation, rather than one that applies uniform resources across inputs of varying complexity.

That is the third axis of intelligence.