Nearly all improvements to the AI stack in recent years have come from two directions. The first is improving the model: better architectures, more data, longer training, reinforcement learning from human feedback, and the scaling laws that govern all of it. The second is improving the agent: prompt engineering, retrieval-augmented generation, tool use, multi-step planning harnesses. These two dimensions have driven extraordinary progress.

But notice what they share. Improving the model changes the weights. Improving the agent changes the context: the prompt, the retrieved documents, the structure of the input. Together, they define the two inputs to any model endpoint: what the model knows (weights) and what it sees (context). Virtually the entire field is optimizing one or both of these.

At Voaige, we asked a different question: what if you hold both fixed?

Two Constraints That Open a New Space

Our research begins with a deliberate pair of constraints. We do not modify the weights of a model. And we do not modify the prompt or context entering the model endpoint. This may sound limiting, but it is precisely what opens the space.

Research Constraints

We hold two properties invariant, and explore everything else.

By fixing the two inputs that nearly all existing approaches optimize (the model's weights and the prompt entering the endpoint), we are free to explore a relatively underexplored dimension of the AI stack: the mechanics of inference itself. The levers we discover span the entire depth of the LLM stack, from sampling temperature to activation vectors and attention maps.

Constraint I

The prompt and context entering the model remain unchanged. No agentic scaffolding, no RAG, no prompt manipulation.

Constraint II

The model weights remain unchanged. No fine-tuning, no pre-training modifications, no post-training interventions.

These constraints are not arbitrary. They define a boundary that separates our work from the two dominant paradigms of AI improvement. Critically, they also give our approach an architectural advantage: because we modify neither the model nor the agent, a test-time cognition layer can sit between the agentic harness and the backend model with minimal integration effort. It is a new layer in the stack, not a replacement for existing ones.

[Diagram: the three-layer stack]
Agent layer: prompts, RAG, tool use, planning
Test-time cognition: Voaige · inference mechanics
Model layer: weights, architecture, training

The Combinatorial Challenge

Once you accept these constraints and begin looking at the levers available within inference itself, a surprising landscape opens up. There are far more degrees of freedom than most researchers expect. Sampling strategies, temperature schedules, activation steering, attention manipulation, multi-model orchestration. Each is a dimension of configuration, and each affects accuracy and cost in non-trivial ways.
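To make one such lever concrete, here is a minimal sketch of an entropy-adaptive temperature schedule: the sampling temperature at each decoding step is derived from the entropy of the model's next-token distribution, so confident steps sample near-greedily while uncertain steps explore more. The function names, bounds, and linear mapping are our own illustrative choices, not a description of any particular system.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_temperature(probs, t_min=0.2, t_max=1.0):
    """Map distribution entropy to a sampling temperature.

    Low-entropy (confident) steps get a temperature near t_min;
    high-entropy (uncertain) steps approach t_max. The linear
    mapping and bounds are illustrative, not a tuned schedule.
    """
    h_max = math.log(len(probs))          # maximum possible entropy
    frac = entropy(probs) / h_max if h_max > 0 else 0.0
    return t_min + (t_max - t_min) * frac

# A peaked distribution yields a low temperature...
print(round(adaptive_temperature([0.97, 0.01, 0.01, 0.01]), 2))  # 0.3
# ...while a flat one yields the maximum temperature.
print(round(adaptive_temperature([0.25, 0.25, 0.25, 0.25]), 2))  # 1.0
```

Even this single lever already has several internal dimensions (the bounds, the mapping, the entropy estimator), which is exactly why the configuration space grows so quickly.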

But this richness creates its own problem. The space of all possible configurations for an LLM at inference is exponentially large. Each time a new lever or dimension is discovered, the space grows again. And when the system involves not one but several interacting models, the combinatorial explosion compounds further.
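A back-of-the-envelope count makes the growth concrete. The lever counts below are arbitrary, chosen purely for illustration:

```python
# Hypothetical lever counts, purely for illustration:
# e.g. candidate temperatures, schedules, steering vectors, ...
settings_per_lever = [8, 5, 6, 10, 4]

configs_one_model = 1
for n in settings_per_lever:
    configs_one_model *= n
print(configs_one_model)        # 9600 configurations for a single model

# With three interacting models, the spaces multiply:
print(configs_one_model ** 3)   # 884736000000 joint configurations
```

Five modest levers already yield thousands of configurations per model, and joint configurations of a few models reach into the hundreds of billions: far beyond what exhaustive evaluation can cover.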

"For training, the scaling laws are well understood: more data, larger models, longer training. For test-time cognition however, the algorithmic space is exponentially larger, and no scaling law yet governs it."

The optimization problem

This is the core technical challenge of our research. We are not solving a narrow optimization over a few hyperparameters. We are navigating a very high-dimensional configuration space, searching for systems that are optimal across accuracy, cost, and generalization. The challenge is not that the space is unexplored; it is that brute-force exploration is intractable.

For problems of this nature, there is a well-known principle: the only viable approach is to reduce the search space using useful heuristics and constraints. The question then becomes: where do you source those heuristics?

Neuroscience as Heuristic

The answer, for us, comes from biology. We observe that evolution has already solved a version of this problem. Over hundreds of millions of years, it produced an entire class of systems that exhibit exactly the properties we seek: high generalization under resource constraints, efficient allocation of compute, and robust performance on novel inputs. The mammalian brain, and the human brain in particular, is arguably the most complex and efficient such system ever studied.

Our research lies at the intersection of two fields: observations from systems and cognitive neuroscience about how cognition occurs in the mammalian brain, and the low-level mechanics of our chosen substrate, large language models. We study the brain not to replicate its structures, but to extract the computational principles behind them and implement those principles using the levers available in LLMs.

This distinction is critical. We are not building a neural simulation. We are not mapping cortical columns onto transformer layers. We are identifying principles like hierarchical abstraction, dynamic resource allocation, and selective attention under uncertainty, and asking how those principles can be realized in a different substrate with different mechanics. The goal is to understand why certain computational strategies evolved, and to implement the underlying logic using the tools we have.

These neuroscience-derived heuristics serve a precise function: they constrain the search over inference configurations. Instead of blindly exploring an exponentially large space, we use biologically grounded principles to focus our search on regions that are likely to yield efficient, generalizable systems.
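The shape of such a heuristic-constrained search can be sketched in a few lines. Everything here is illustrative (the lever names, the toy scoring function, and the "plausibility" predicate standing in for a biologically grounded heuristic); the point is only that the predicate prunes the Cartesian product before any expensive evaluation runs.

```python
import itertools

def heuristic_search(levers, plausible, evaluate):
    """Search a configuration space, visiting only configurations
    that a heuristic predicate admits.

    levers:    dict mapping lever name -> list of settings
    plausible: predicate over a configuration (the heuristic filter)
    evaluate:  scoring function (e.g. accuracy minus cost)
    """
    names = list(levers)
    best, best_score = None, float("-inf")
    for values in itertools.product(*(levers[n] for n in names)):
        config = dict(zip(names, values))
        if not plausible(config):       # heuristic prunes the space
            continue
        score = evaluate(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score

# Toy usage: a heuristic might insist that extra samples only pair
# with exploratory (higher-temperature) sampling.
levers = {"temperature": [0.2, 0.7, 1.0], "samples": [1, 4, 16]}
plausible = lambda c: (c["samples"] > 1) == (c["temperature"] > 0.5)
best, score = heuristic_search(
    levers, plausible,
    evaluate=lambda c: -abs(c["temperature"] - 0.7) - 0.01 * c["samples"],
)
print(best)   # {'temperature': 0.7, 'samples': 4}
```

The enumeration here is still exhaustive over what survives the filter; in a realistically large space the same idea would drive a sampled or guided search rather than a full product.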

From Principles to Systems: Dynamic Compute Allocation

This methodology (discover levers, draw heuristics from neuroscience, build algorithms) has produced our first system designed for cognition at inference time. We call it Dynamic Compute Allocation, or DCA.

DCA is built on a simple observation from neuroscience: not all inputs deserve equal computational effort. The brain allocates metabolic resources dynamically: more attention and processing for novel or uncertain stimuli, less for familiar patterns. DCA implements this principle in the LLM stack, dynamically adjusting inference-time compute based on the difficulty and structure of the input.
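The principle can be illustrated with a toy allocator that spends more inference-time compute (here, the number of sampled completions) on inputs where the model is uncertain. This is our own minimal sketch of the idea, with arbitrary thresholds and budgets; it is not Voaige's DCA.

```python
import math

def allocate_samples(next_token_probs, budget=(1, 4, 16)):
    """Toy illustration of dynamic compute allocation: choose how
    many completions to sample based on normalized entropy of the
    model's next-token distribution. Thresholds and budgets are
    arbitrary illustrative choices, not Voaige's actual DCA.
    """
    h = -sum(p * math.log(p) for p in next_token_probs if p > 0)
    h_max = math.log(len(next_token_probs))
    uncertainty = h / h_max if h_max > 0 else 0.0
    if uncertainty < 0.3:       # familiar pattern: cheap, near-greedy pass
        return budget[0]
    if uncertainty < 0.7:       # moderate uncertainty: a few samples
        return budget[1]
    return budget[2]            # novel/uncertain input: full budget

print(allocate_samples([0.97, 0.01, 0.01, 0.01]))  # 1
print(allocate_samples([0.8, 0.1, 0.05, 0.05]))    # 4
print(allocate_samples([0.4, 0.3, 0.2, 0.1]))      # 16
```

A real system would of course use a richer difficulty signal than single-step entropy, and richer compute levers than sample count, but the allocation logic (cheap passes for familiar inputs, full budget for novel ones) is the brain-inspired principle at work.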

We have applied DCA to several models, both individually and in multi-model configurations. Modified versions of DCA have also been applied to closed-source models, with consistent benefits across the vast majority of settings. The details of these results will appear in subsequent posts.

"Our goal is not to replicate the brain. It is to understand the computational principles behind its structures, and implement them using the levers of our chosen substrate."

The Voaige philosophy

An Ongoing Journey

DCA is the first output of our research program, not the last. The space of inference-time levers continues to grow as we and others discover new dimensions of the LLM stack that affect performance. Each discovery expands the configuration space, and each expansion makes principled, heuristic-guided search more important, not less.

We are continuing to deepen our understanding of both sides of the intersection: the neuroscience that provides our heuristics, and the substrate mechanics that define what is implementable. The next frontier is not just a smarter model. It is a smarter process of inference, one that adapts, allocates, and reasons in ways that neither the model nor the agent alone can provide.

That is the third axis of intelligence.