The approach, which its authors say delivers a roughly threefold speedup with minimal impact on output quality, directly addresses a major challenge for production AI systems: latency at scale.
For IT decision-makers deploying agentic AI systems, long inference delays and escalating GPU costs have become major obstacles. Agentic workloads frequently generate thousands of tokens per request, creating a performance gap that existing hardware struggles to close.
Researchers from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and TogetherAI, however, claim a threefold inference speedup on reasoning evaluations. They achieve it by fine-tuning pre-trained models so the acceleration is baked directly into their parameters, eliminating the need for speculative decoding or auxiliary draft models.
The team’s recent paper describes a multi-token prediction method that converts conventional next-token models into parallel decoders by introducing a special mask token and training with an online self-distillation objective.
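The core idea can be sketched in a few lines: append mask placeholders to the context so one forward pass fills several positions at once. The following toy sketch is illustrative only; `MASK` and `predict_masked` are stand-ins, not the authors' code.

```python
# Toy sketch of mask-token parallel decoding. A real model would fill
# the k mask slots with k token predictions in a single forward pass;
# here a deterministic stand-in plays that role.

MASK = "<mask>"

def predict_masked(prefix, k):
    # Stand-in for one forward pass over `prefix`, whose last k
    # entries are mask tokens. Returns k placeholder tokens.
    start = len(prefix) - k          # length of the real context
    return [f"t{start + i}" for i in range(k)]

def generate_parallel(prompt, total, k=4):
    """Decode until `total` tokens exist, k positions per pass."""
    out = list(prompt)
    while len(out) < total:
        # One pass fills k positions instead of one.
        block = predict_masked(out + [MASK] * k, k)
        out.extend(block[: total - len(out)])
    return out
```

Compared with next-token decoding, the loop above runs roughly `total / k` forward passes instead of `total`, which is where the latency saving comes from.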
In benchmark evaluations, the method delivered more than a threefold speedup with only a slight drop in accuracy, a trade-off likely to appeal to organizations trying to balance cost and model quality in production AI deployments.
Crucially, the resulting model is said to keep exactly the same implementation as its original pre-trained version, so it can be deployed without an extra verifier component or custom inference code.
Mechanism of the Approach
Conventional large language models (LLMs) generate just one token per forward pass, an architectural constraint that fundamentally caps throughput.
This sequential bottleneck is especially costly for reasoning models, which often emit thousands of “chain of thought” tokens even when the final answer is short. Generating several tokens per pass can significantly cut both latency and cost.
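The saving is simple arithmetic: if each forward pass takes roughly fixed time, emitting k tokens per pass divides the number of passes by about k. A toy calculation with assumed numbers (a 2,000-token chain of thought at 20 ms per pass):

```python
def passes_needed(total_tokens, tokens_per_pass):
    # Ceiling division: each pass emits up to `tokens_per_pass` tokens.
    return -(-total_tokens // tokens_per_pass)

PASS_TIME = 0.020  # assumed: 20 ms per forward pass

seq_latency = passes_needed(2000, 1) * PASS_TIME  # one token per pass
par_latency = passes_needed(2000, 4) * PASS_TIME  # four tokens per pass
```

Here `seq_latency` is 40 s and `par_latency` is 10 s. In practice the realized speedup is lower than the ideal k-fold gain, since, as described below, the decoder falls back to smaller steps when the model is uncertain.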
To keep the output coherent, the researchers use a student-teacher framework. They illustrate the risk with a zookeeper example: a model predicting multiple words independently could incoherently suggest a zookeeper gave “meat to a panda.” The teacher model therefore assesses each multi-token segment for contextual validity.
“Our proposal introduces a reinforcement learning-influenced training framework where a student model produces a sequence of concurrent token predictions,” the researchers said in the paper. “To circumvent the disadvantages of the conventional offline objective, the student’s output is judged by an LM critic/teacher, instead of being measured against a pre-defined ground-truth token sequence.”
“By contrasting the student’s forecasts with the subsequent token recommendations provided by the teacher, we generate an on-policy reward signal. This signal empowers the student to rapidly enhance the caliber of its multi-token predictions,” they added.
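The on-policy reward the researchers describe can be sketched as follows: the teacher scores each position in the student's parallel block while conditioning on the student's own earlier tokens, not a fixed ground-truth sequence. All function names here are illustrative stand-ins, not the paper's code.

```python
# Toy sketch of an on-policy self-distillation reward. `teacher_next`
# stands in for the teacher model's next-token prediction.

def teacher_next(context):
    # Stand-in: a real teacher returns its next-token choice (or
    # distribution) given the context so far.
    return f"t{len(context)}"

def block_reward(context, student_block):
    """Fraction of the student's parallel predictions that match the
    teacher's sequential choice at the same position, conditioning on
    the student's own tokens (on-policy)."""
    ctx = list(context)
    hits = 0
    for tok in student_block:
        if tok == teacher_next(ctx):
            hits += 1
        ctx.append(tok)  # extend the context with the student's token
    return hits / len(student_block)
```

Because the context grows with the student's own output, the reward evaluates the block the student actually produced rather than comparing it against a predefined token sequence, which is the distinction the researchers draw from the conventional offline objective.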
At inference time, the system uses a confidence-adaptive (ConfAdapt) decoding scheme that adjusts how many tokens are generated per pass. When the model is confident, it emits larger blocks of text; as uncertainty rises, it falls back to smaller incremental steps, preserving accuracy while retaining most of the speed gain.
On the GSM8K math reasoning benchmark, an 8-billion-parameter model ran more than three times faster with less than a 3% drop in accuracy. A smaller 4-billion-parameter model delivered comparable speed gains with a larger 7% accuracy loss, and more aggressive configurations pushed acceleration to five times at the cost of further accuracy.
Unlike speculative decoding, which requires auxiliary speculator models and bespoke inference pipelines, this method trains a single model that keeps the same implementation as its initial checkpoint, with no additional verifier needed.
Implications for Enterprise AI
According to analysts, the bigger question is whether the technique will meaningfully change how production inference stacks are architected.
“Speculative decoding endeavors to bypass that restriction by employing a draft model to suggest tokens and a target model to confirm them,” said Sanchit Vir Gogia, chief analyst at Greyhound Research. “In principle, this offers acceleration without any loss. However, in practical applications, the cost of verification, interactions during batching, and the divergence between draft and target models diminish the actual benefits achieved.”
Conversely, he explained that the multi-token strategy preserves the core autoregressive structure but transfers the optimization process to the training stage.
“The financial benefit is contingent upon the entropy distribution throughout the output,” Gogia said. “For tasks that are rich in reasoning or highly structured, predictable sequences can be generated in larger segments with minimal quality reduction. In scenarios involving higher-entropy, open-ended generation, the acceleration benefits are diminished. This represents a form of selective compression, rather than a universally applicable speed boost.”
That distinction matters for enterprise deployments.
Gogia stated that “ConfAdapt inherently responds to entropy. Its primary benefit is most pronounced in tasks that involve structured frameworks, predictable linguistic components, and guidance-oriented outputs requiring human review.”
Gogia advised businesses to treat the method as a targeted efficiency tool rather than a general-purpose accelerator.