There's an old distinction in cognitive science between two modes of thought, popularized by Daniel Kahneman as System 1 and System 2. System 1 is fast, automatic, pattern-driven: the kind that lets you recognize a face in a crowd or catch a ball mid-flight. System 2 is slow, deliberate, effortful: the kind you use to solve a chess problem or plan a surgical approach.
For most of the history of modern AI, we've been building extraordinarily capable versions of the first kind. Large language models are, at their core, compression machines: they distill the patterns of billions of documents into weights, and at inference time, they recall. They are fast. They are fluent. And for problems they've seen before, they are often astonishing.
But they struggle to think. Not because they lack knowledge, but because they weren't built to search.
A Note on Difficulty: What Is a Hard Problem?
Before tracing that history, it's worth defining what we mean by a hard problem. Difficulty is not absolute. A problem that challenges one model may be trivial for another. What makes a problem hard, in a precise sense, is how out-of-distribution it is for a given model: how far it sits from the patterns encoded in its weights during training. A model that has seen thousands of similar problems will retrieve a good answer quickly. A model that hasn't must reason its way to one. That distinction, between recall and reasoning, is what separates easy problems from hard ones. It is also what makes test-time cognition necessary.
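As a toy illustration of this notion of difficulty, one crude proxy for how out-of-distribution an input is for a model is the model's own average per-token surprisal. In the sketch below, a character-level bigram model stands in for an LLM; the corpus, function names, and smoothing constant are all invented for illustration:

```python
import math
from collections import Counter

def fit_bigram(corpus):
    """Toy character-bigram 'language model' with add-one smoothing."""
    counts = Counter(zip(corpus, corpus[1:]))
    totals = Counter(corpus[:-1])
    return lambda a, b: (counts[(a, b)] + 1) / (totals[a] + 27)

def surprisal(model, text):
    """Mean negative log2-probability per transition: higher = more novel."""
    pairs = list(zip(text, text[1:]))
    return sum(-math.log2(model(a, b)) for a, b in pairs) / len(pairs)

model = fit_bigram("the cat sat on the mat the cat sat on the mat")
in_dist = surprisal(model, "the cat")   # patterns the model has seen
out_dist = surprisal(model, "qzx vbw")  # patterns it has not
```

Text far from the training distribution carries high surprisal; in the same spirit, a problem is "hard" for a model to the extent that no stored pattern covers it.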
Part I: A History of Thinking at Inference
The idea that intelligence requires search, not just memory, has a long history in AI research. It predates transformers, predates deep learning, predates the perceptron. What's changed is our ability to act on it at scale.
Classical AI as search. Early systems like chess engines were explicitly built around search trees. Intelligence, in this framing, was depth of lookahead: how far you could trace the consequences of an action. The weakness: combinatorial explosion made deep search intractable without powerful heuristics.
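That classical framing fits in a few lines. Below is a minimal minimax sketch over an invented two-ply game tree (the tree, values, and move names are made up, not from any particular engine); "intelligence" here is literally how deep the recursion goes:

```python
def minimax(node, maximizing):
    """Exhaustive lookahead over a nested-dict game tree; leaves are static evaluations."""
    if not isinstance(node, dict):  # leaf: evaluation of this position
        return node
    values = [minimax(child, not maximizing) for child in node.values()]
    return max(values) if maximizing else min(values)

# An invented 2-ply game: our move, then the opponent's best reply.
tree = {
    "a": {"a1": 3, "a2": 5},  # opponent replies with 3
    "b": {"b1": 4, "b2": 9},  # opponent replies with 4
}
best_move = max(tree, key=lambda m: minimax(tree[m], maximizing=False))
```

The combinatorial explosion is visible in the structure: each ply multiplies the number of leaves, which is exactly why deep search needed strong heuristics to prune.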
AlphaGo. DeepMind's landmark system combined learned value and policy networks with Monte Carlo Tree Search at inference. The model didn't just retrieve a move; it explored a tree of futures, guided by learned intuitions. In 2016, this approach made AlphaGo the first program to defeat a top human professional, Lee Sedol, at Go.
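The heart of that combination is the selection rule that blends learned intuition with exploration. Below is a simplified form of the PUCT-style score AlphaGo's search used; the move names and statistics are invented for illustration:

```python
import math

def puct_score(prior, value, visits, parent_visits, c_puct=1.0):
    """Mean value so far plus a prior-weighted exploration bonus (simplified PUCT)."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return value + exploration

# Three candidate moves: (policy-network prior, mean value so far, visit count).
moves = {"A": (0.6, 0.1, 10), "B": (0.3, 0.2, 2), "C": (0.1, 0.0, 0)}
parent_visits = sum(v for _, _, v in moves.values())
pick = max(moves, key=lambda m: puct_score(*moves[m], parent_visits))
```

The learned prior directs search toward promising branches, while the visit-count denominator keeps under-explored moves alive: search guided by intuition rather than replaced by it.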
Chain-of-thought prompting. Researchers at Google showed that prompting LLMs to produce intermediate reasoning steps, before answering, dramatically improved performance on complex tasks. A simple insight with large consequences: giving the model space to "think out loud" is a form of test-time compute.
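The change is small enough to show in full. A hypothetical few-shot example in the style of that work (the exemplar and question below are invented for illustration):

```python
# A chain-of-thought prompt differs from a direct prompt only in that the
# worked exemplar's answer spells out intermediate steps before concluding.
exemplar = (
    "Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n"
)
question = "Q: A train travels 60 km in 1.5 hours. What is its speed in km/h?\nA:"

direct_prompt = question             # model answers immediately
cot_prompt = exemplar + question     # model imitates the step-by-step pattern
```

The extra tokens the model emits while imitating the exemplar are test-time compute: each intermediate step conditions the next, giving the model working memory it otherwise lacks.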
Process reward models and tree-of-thought. Researchers trained models to score intermediate reasoning steps, giving RL a richer signal than final-answer correctness alone. The search happened during training: RL explored reasoning traces and distilled that exploration into weights. The result was a model that reasoned more reliably at inference, but only because the search had already been done.
OpenAI o1 and DeepSeek-R1. The first large-scale demonstration that RL training over extended reasoning traces dramatically improves performance on hard problems. The model doesn't search at inference; it replays patterns of reasoning that RL discovered during training. Powerful within distribution. The scaling axis shifted: not just parameters, but the depth of reasoning encoded into weights.
The field converges on RL-powered reasoning. Every major lab is scaling test-time compute through reinforcement learning: training models to reason longer, explore more traces, and self-verify. RL-powered reasoning is currently the only known cost-effective method for improving reasoning at test time, and the gains are real. But the approach remains fundamentally the same: RL encodes search into weights during training, and the model replays it at inference. Explicit, efficient, structured search at test time remains largely unexplored.
"The field has discovered that spending more compute at test time works. Voaige is working on when, where, and how."
What the Current Approaches Get Wrong
RL-powered reasoning is currently the only known cost-effective path to improved test-time reasoning, and it is a meaningful step forward. Its limitation is structural: the reasoning capability encoded during training is bounded by the distribution of problems explored. At inference, the model has no adaptive mechanism. It applies learned patterns; it does not adapt on the fly.
Generating many reasoning traces and picking the best one (best-of-N sampling) works. But it's expensive, and it doesn't get smarter as problems get harder; it just gets slower. There's no learned sense of where to look, when to look deeper, or how to decompose a problem so that search becomes tractable.
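Best-of-N is simple enough to sketch in a few lines, which is also what makes its limitation visible. The sampler and verifier below are invented stand-ins:

```python
# Best-of-N sampling: draw N candidates, keep the one the verifier scores
# highest. Cost grows linearly with N, and nothing directs later samples
# toward more promising regions of the answer space.
def best_of_n(sample, score, n):
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

# Invented stand-ins: a fixed stream of "sampled answers" and a verifier
# that prefers values near 7.
pool = iter([2, 9, 6, 4])
answer = best_of_n(sample=lambda: next(pool),
                   score=lambda x: -abs(x - 7),
                   n=4)
```

Every sample is drawn blind: the fourth draw knows nothing learned from the first three. A harder problem gets the same flat budget as an easy one, only with worse odds.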
The brain doesn't work this way. When a human expert encounters a hard problem, they don't enumerate possibilities randomly. They chunk the problem hierarchically. They allocate attention where uncertainty is highest. They know, from experience, which branches of reasoning are likely to be dead ends, and they prune early. This is not magic. It is a learned structure for search.
RL gives models impressive reasoning capability, but not this kind of structure. It encodes patterns of good reasoning into weights. It does not teach a model to navigate genuinely novel problems by searching intelligently at inference. That gap is not a critique of RL; it is simply what RL alone cannot provide.
Part III: The Voaige Approach
At Voaige, we believe the next frontier in AI isn't a bigger model; it's a smarter thinker. Our work is organized around a single core thesis, and a set of principled methods for pursuing it.
Reinforcement learning is memorized search. Hard problems require cognition at test time.
Hard problems, as defined above, are those that sit outside a model's training distribution: problems too novel for recall, too far from what RL explored during training. For these, memorized shortcuts are not enough. RL compresses the results of search into weights, encoding policies that approximate what a search process would have found. This is powerful within distribution, and RL remains a critical component of capable reasoning systems. But genuine novelty breaks the approximation. The model needs to actually reason at test time, and RL alone does not provide that.
The question is not whether to search; it's how to make that cognition principled, efficient, and brain-inspired.
Hard problems require decomposition. We study how models can learn to chunk problems into tractable subproblems, and search within each level of abstraction independently.
Not all tokens deserve equal effort. We develop methods for models to recognize difficulty and allocate inference compute where it matters, not uniformly across all inputs.
Efficient search requires good priors on where to look. We train models to internalize heuristics that guide search: not brute-force sampling, but directed exploration.
The brain solves allocation and abstraction through mechanisms evolution refined over millions of years. We take these seriously as engineering inspiration, not just metaphor.
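The second of these directions, difficulty-aware allocation, can be sketched with a toy rule: spend samples in proportion to the model's own uncertainty, measured as the entropy of its answer distribution. The distributions, budget, and function names below are invented for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def allocate(budget, answer_dists):
    """Split a sample budget across problems in proportion to entropy."""
    ents = [entropy(d) for d in answer_dists]
    total = sum(ents) or 1.0
    return [round(budget * e / total) for e in ents]

# Three problems over 4 candidate answers: confident, mixed, maximally uncertain.
dists = [[0.97, 0.01, 0.01, 0.01],
         [0.70, 0.20, 0.05, 0.05],
         [0.25, 0.25, 0.25, 0.25]]
plan = allocate(64, dists)  # confident problems get few samples, uncertain ones get many
```

A real system would need calibrated uncertainty and would allocate within a problem, not just across problems, but the contrast with a uniform budget is the point.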
Why now?
The science has existed for decades, in neuroscience and cognitive systems research. What has changed is the substrate. Large models, trained with RL, now have enough raw reasoning capability to respond to a new class of algorithms: ones that operate at inference, allocate compute adaptively, and search in structured, principled ways. The community has validated that RL-powered reasoning works at test time. But RL-powered reasoning is not the destination; it is simply the only cost-effective method known today. The open problem is building something more principled on top of it: a cognitive layer that goes beyond what RL alone can do.
We're building toward AI systems that don't just recall; they reason. Systems that use RL as a foundation and extend it with explicit, efficient search at test time. Not because RL has failed, but because the hardest problems demand something more: cognition that is adaptive, structured, and grounded in how intelligent systems actually work.
The next frontier is not a model that knows more. It is a model that thinks: efficiently, adaptively, and on problems it has never seen before.
That is cognition.