Optimization-Driven AI Emerges as New Path to General-Purpose Models

February 22, 2026

Harry Roberts

Researchers from the University of Illinois Urbana-Champaign and the University of Virginia have created a new model architecture that could pave the way for more resilient AI systems with enhanced reasoning power.

Named the energy-based transformer (EBT), this architecture naturally leverages inference-time scaling to address intricate challenges. For businesses, this means cost-efficient AI applications that can adapt to new scenarios without requiring specialized, fine-tuned models.

The Challenge of System 2 Thinking

In psychology, human cognition is typically split into two modes: System 1, which is quick and instinctive, and System 2, which is slower, more deliberate, and analytical. Today's large language models (LLMs) are excellent at System 1 tasks, but the AI field is now concentrating on developing System 2 thinking to handle more complex reasoning problems.

Reasoning models employ various inference-time scaling methods to boost their performance on tough questions. A common technique is reinforcement learning (RL), used in models such as DeepSeek-R1 and OpenAI’s “o-series,” where the model is rewarded for producing reasoning tokens that lead to the correct solution. Another strategy, often referred to as best-of-n, involves generating several candidate answers and using a verification mechanism to pick the best one.

However, these approaches come with notable limitations. They are usually confined to a narrow set of easily verifiable problems, like mathematics and coding, and may reduce performance on other tasks such as creative writing. Moreover, recent studies indicate that RL-based methods might not be teaching models new reasoning skills, but simply encouraging them to reuse successful reasoning patterns they already know. This restricts their capacity to solve problems that demand genuine exploration beyond their training scope.

Energy-Based Models (EBMs)

This new architecture takes a different route, building on a class of models known as energy-based models (EBMs). The central concept is straightforward: Rather than generating an answer directly, the model learns an “energy function” that serves as a verifier. This function takes an input (such as a prompt) and a candidate prediction, then assigns the pair a value, or “energy.” A low energy score suggests high compatibility, meaning the prediction fits the input well, while a high energy score indicates a poor match.
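For readers who want the relationship in symbols, the standard textbook way to connect an energy score to a probability is shown below. This is the usual EBM convention and an assumption here, not a formula quoted from the paper; the symbols $E_\theta$, $x$, $y$ and $Z_\theta$ are illustrative.

```latex
% Lower energy corresponds to higher (unnormalized) probability.
p_\theta(y \mid x) = \frac{\exp\bigl(-E_\theta(x, y)\bigr)}{Z_\theta(x)},
\qquad
Z_\theta(x) = \int \exp\bigl(-E_\theta(x, y')\bigr)\, dy'
```

Because the normalizing term $Z_\theta(x)$ is generally intractable, EBMs typically work with the unnormalized score $\exp(-E_\theta(x, y))$ and compare candidates by energy alone.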

Applying this to AI reasoning, the researchers propose in a paper that developers should consider “thinking as an optimization procedure with respect to a learned verifier, which evaluates the compatibility (unnormalized probability) between an input and candidate prediction.” The process starts with a random prediction, which is then gradually refined by minimizing its energy score and exploring possible solutions until it converges on a highly compatible answer. This method is grounded in the idea that verifying a solution is often far simpler than generating one from scratch.
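The refinement loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the hand-written `energy` function stands in for a learned verifier, and the names `energy`, `energy_grad` and `think` are hypothetical.

```python
import random

def energy(x: float, y: float) -> float:
    """Toy stand-in for a learned energy function: low when y is close to 2 * x."""
    return (y - 2.0 * x) ** 2

def energy_grad(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 2.0 * (y - 2.0 * x)

def think(x: float, steps: int = 50, step_size: float = 0.1, seed: int = 0) -> float:
    """Start from a random prediction and refine it by gradient descent on the energy."""
    rng = random.Random(seed)
    y = rng.uniform(-10.0, 10.0)  # random initial prediction
    for _ in range(steps):
        y -= step_size * energy_grad(x, y)  # move toward lower energy
    return y

prediction = think(3.0)
print(prediction, energy(3.0, prediction))
```

In a real EBT the energy would come from a trained transformer and the gradient from automatic differentiation; the loop structure is the part this sketch is meant to convey.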

This “verifier-centric” design tackles three major challenges in AI reasoning. First, it enables dynamic compute allocation, letting models “think” longer on difficult problems and less on easier ones. Second, EBMs naturally accommodate the uncertainty of real-world issues where no single clear answer exists. Third, they function as their own verifiers, removing the need for external models.
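Dynamic compute allocation can be illustrated with a simple stopping rule: keep refining until the updates become negligible, so harder inputs consume more steps. The sketch below is a hypothetical toy, not the paper's procedure; the two gradient functions are hand-picked stand-ins for an "easy" and a "hard" input, and all names are assumptions.

```python
from typing import Callable, Tuple

def refine_until_converged(
    energy_grad: Callable[[float], float],
    y0: float,
    step_size: float = 0.1,
    tol: float = 1e-6,
    max_steps: int = 10_000,
) -> Tuple[float, int]:
    """Run gradient descent until the update is smaller than tol; return (y, steps used)."""
    y = y0
    for step in range(1, max_steps + 1):
        update = step_size * energy_grad(y)
        y -= update
        if abs(update) < tol:
            return y, step
    return y, max_steps

def easy_grad(y: float) -> float:
    """Gradient of the steep toy energy 2 * (y - 1)^2."""
    return 4.0 * (y - 1.0)

def hard_grad(y: float) -> float:
    """Gradient of the shallow toy energy 0.25 * (y - 1)^2; each step makes less progress."""
    return 0.5 * (y - 1.0)

easy_y, easy_steps = refine_until_converged(easy_grad, y0=0.0)
hard_y, hard_steps = refine_until_converged(hard_grad, y0=0.0)
print("easy input steps:", easy_steps)
print("hard input steps:", hard_steps)
```

Both runs reach the same minimum at y = 1, but the shallow landscape needs many more steps, which is the sense in which the amount of "thinking" adapts to the problem.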

Unlike other systems that employ separate generators and verifiers, EBMs integrate both into a single, unified model. A major benefit of this setup is improved generalization. Since verifying a solution on new, out-of-distribution (OOD) data is typically easier than generating a correct answer, EBMs can better manage unfamiliar situations.

Despite their potential, EBMs have historically faced scalability issues. To address this, the researchers introduced EBTs, which are specialized transformer models designed for this framework. EBTs are trained to first check the compatibility between a context and a prediction, then refine predictions until they identify the lowest-energy (most compatible) output. This procedure effectively mimics a thinking process for every prediction. The team created two EBT variants: a decoder-only model inspired by the GPT architecture, and a bidirectional model similar to BERT.

Energy-based transformer (source: GitHub)

The architecture of EBTs makes them adaptable and compatible with various inference-time scaling methods. “EBTs can generate longer CoTs, self-verify, do best-of-N [or] you can sample from many EBTs,” Alexi Gladstone, a PhD student in computer science at the University of Illinois Urbana-Champaign and the paper’s lead author, told VentureBeat. “The best part is, all of these capabilities are learned during pretraining.”

EBTs in Action

The researchers tested EBTs against established architectures: the popular Transformer++ recipe for text generation (discrete modalities) and the diffusion transformer (DiT) for tasks like video prediction and image denoising (continuous modalities). They assessed the models on two main criteria: “learning scalability,” or how efficiently they train, and “thinking scalability,” which measures how performance improves with more computation during inference.

During pretraining, EBTs showed superior efficiency, achieving up to a 35% higher scaling rate than Transformer++ across data, batch size, parameters, and compute. This means EBTs can be trained faster and more affordably.

At inference, EBTs also surpassed existing models on reasoning tasks. By “thinking longer” (using more optimization steps) and performing “self-verification” (generating multiple candidates and selecting the one with the lowest energy), EBTs improved language modeling performance by 29% more than Transformer++. “This aligns with our claims that because traditional feed-forward transformers cannot dynamically allocate additional computation for each prediction being made, they are unable to improve performance for each token by thinking for longer,” the researchers explain.
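The two inference-time levers mentioned here, more optimization steps and self-verification, can be combined in a small sketch. It is a minimal toy under stated assumptions: the energy is a hand-written function with one shallow and one deep minimum rather than a learned model, and `refine` and `self_verify` are hypothetical names.

```python
import random

def energy(y: float) -> float:
    """Toy energy with a deeper minimum near y = -1 and a shallower one near y = +1."""
    return (y * y - 1.0) ** 2 + 0.3 * y

def energy_grad(y: float) -> float:
    """Analytic gradient of the toy energy."""
    return 4.0 * y * (y * y - 1.0) + 0.3

def refine(y: float, steps: int, step_size: float = 0.02) -> float:
    """'Thinking longer': more steps means more gradient descent on the energy."""
    for _ in range(steps):
        y -= step_size * energy_grad(y)
    return y

def self_verify(num_candidates: int, steps: int, seed: int = 0) -> float:
    """Refine several random starting predictions and keep the lowest-energy one."""
    rng = random.Random(seed)
    candidates = [refine(rng.uniform(-1.5, 1.5), steps) for _ in range(num_candidates)]
    return min(candidates, key=energy)

single = self_verify(num_candidates=1, steps=200)
best_of_8 = self_verify(num_candidates=8, steps=200)
print("single candidate energy:", energy(single))
print("best-of-8 energy:", energy(best_of_8))
```

A single candidate can settle in the shallower minimum, while sampling several and ranking them by energy makes it more likely that one lands in the deeper one. No external verifier is involved, because the same energy function both guides the refinement and ranks the results.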

For image denoising, EBTs delivered better results than DiTs while using 99% fewer forward passes.

Importantly, the study revealed that EBTs generalize more effectively than other architectures. Even with the same or weaker pretraining performance, EBTs outperformed existing models on downstream tasks. The performance gains from System 2 thinking were most pronounced on data that was further out-of-distribution (unlike the training data), indicating that EBTs are especially robust when confronting novel and difficult challenges.

The researchers note that “the benefits of EBTs’ thinking are not uniform across all data but scale positively with the magnitude of distributional shifts, highlighting thinking as a critical mechanism for robust generalization beyond training distributions.”

The advantages of EBTs are significant for two key reasons. First, they suggest that at the massive scale of today’s foundation models, EBTs could substantially outperform the classic transformer architecture used in LLMs. The authors observe that “at the scale of modern foundation models trained on 1,000X more data with models 1,000X larger, we expect the pretraining performance of EBTs to be significantly better than that of the Transformer++ recipe.”

Second, EBTs demonstrate much greater data efficiency. This is a crucial advantage in an era where high-quality training data is increasingly a major bottleneck for scaling AI. “As data has become one of the major limiting factors in further scaling, this makes EBTs especially appealing,” the paper states.

Despite its different inference mechanism, the EBT architecture is highly compatible with the transformer, allowing it to serve as a drop-in replacement for current LLMs.

“EBTs are very compatible with current hardware/inference frameworks,” Gladstone said, including speculative decoding using feed-forward models on both GPUs and TPUs. He added that he is confident they can run on specialized accelerators such as LPUs, work with optimization algorithms like FlashAttention-3, and be deployed through common inference frameworks such as vLLM.

For developers and enterprises, the strong reasoning and generalization abilities of EBTs could make them a powerful and dependable foundation for building the next generation of AI applications. “Thinking longer can broadly help on almost all enterprise applications, but I think the most exciting will be those requiring more important decisions, safety or applications with limited data,” Gladstone said.
