Running large language models at scale is expensive. Serving a frontier model to millions of users requires enormous compute, and even small reductions in latency can have an outsized impact on both user experience and cost.
One of the most promising ways to improve inference efficiency is speculative decoding. In that setup, a smaller, faster draft model proposes tokens that a larger target model then verifies in parallel. When the draft model is well matched to the target, the result can be a meaningful increase in throughput without changing the final output.
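The propose-then-verify loop can be sketched in a few lines. This is a toy, greedy variant with stand-in callables for the two models, not any real serving implementation: the draft proposes k tokens, the target checks each position, and the longest agreeing prefix is accepted along with the target's token at the first disagreement.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of (greedy) speculative decoding.

    draft_next / target_next: callables mapping a token sequence to the
    next token -- toy stand-ins for the draft and target models.
    Returns the tokens accepted this round.
    """
    # Draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Target model verifies all k positions; in a real system this is a
    # single parallel forward pass, simulated here as independent calls.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)          # draft agreed with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)   # first mismatch: take target's token, stop
            break
    return accepted
```

Because the target checks every accepted token, the output is exactly what the target alone would have produced; the draft only changes how many target forward passes are needed per token.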
The problem is that draft models do not stay optimal for long. Traffic patterns shift, models get repurposed, and a speculator trained offline can drift away from what production workloads actually look like. Aurora, a new system described in a 2026 paper by researchers including contributors affiliated with Together AI and Stanford, takes a different approach: instead of retraining a speculator offline every so often, it updates the draft model continuously from live inference traces while serving remains online.
The Core Idea: Turn Serving Into Training
Aurora reframes speculative decoding as a joint learning-and-serving problem. In the system, the draft model acts as a policy, while the target model and token-verification process provide the feedback signal. Accepted tokens act as positive feedback; rejected proposals become an implicit training signal that helps the draft model better match the target over time.
That creates a practical flywheel: every production request can help improve the speculator. Instead of relying on a traditional offline pipeline with data collection, distillation, and scheduled retraining, Aurora tries to learn directly from real traffic as it happens.
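The feedback signal is easy to illustrate with a deliberately simple stand-in. In the sketch below, a counting "draft model" is updated from every verified position, using the target model's token as the label whether or not the draft agreed. This is an assumption-laden toy: Aurora trains a neural speculator from hidden-state traces, not frequency counts.

```python
from collections import defaultdict

class CountingDraft:
    """Toy draft 'model': predicts the most frequent next token seen
    per short context. A stand-in for a learned speculator, used only
    to show how verification doubles as supervision."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self, ctx):
        key = tuple(ctx[-2:])          # tiny context window for the toy
        options = self.counts.get(key)
        if not options:
            return 0                   # untrained fallback guess
        return max(options, key=options.get)

    def update(self, ctx, target_token):
        # Every verified position is a supervised example: the target
        # model's token is the label, regardless of what the draft said.
        self.counts[tuple(ctx[-2:])][target_token] += 1
```

After enough traffic with a stable pattern, the toy draft starts agreeing with the "target" it is imitating, which is the same dynamic that lets an online speculator raise its acceptance rate over time.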
How Aurora Works
Aurora splits the system into two decoupled components.
The inference server runs speculative decoding on top of an SGLang-based serving stack. The draft model proposes candidate tokens, the target model verifies them, and the system records accepted and rejected proposals together with the hidden-state information needed for training.
The training server reads from that stream asynchronously, updates a copy of the draft model, and periodically hot-swaps improved weights back into the inference system without interrupting serving. According to the authors, this lets Aurora adapt the speculator continuously while requests are still flowing through production.
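The hot-swap handoff can be sketched as a small versioned weight store: the trainer publishes, and the serving loop takes a consistent snapshot at request boundaries. This is a minimal sketch of the decoupling the paper describes, not Aurora's actual mechanism, which swaps weights inside a running SGLang server.

```python
import threading

class WeightStore:
    """Minimal hot-swap mechanism: a training process publishes new
    draft-model weights, and the serving loop picks them up between
    requests without pausing."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = 0

    def publish(self, new_weights):
        # Called by the training server after an update step.
        with self._lock:
            self._weights = new_weights
            self._version += 1

    def snapshot(self):
        # Called by the inference server at a request boundary; returns
        # a consistent (version, weights) pair so decoding never sees a
        # half-updated model.
        with self._lock:
            return self._version, self._weights
```

The key property is that serving never blocks on training: a request in flight keeps the snapshot it started with, and the next request simply sees a newer version.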
The broader point is not just speed. Aurora is designed to reduce the mismatch between how speculators are trained and how they are actually used in deployment. The paper argues that this mismatch is one of the main reasons offline speculative pipelines degrade over time.
The Reported Performance
The headline results are notable.
According to the paper, Aurora achieves a 1.5× day-0 speedup on recently released frontier models, including MiniMax M2.1 229B and Qwen3-Coder-Next 80B, while starting from an untrained speculator. On more established models such as Qwen3 and Llama3, the system reports an additional 1.25× speedup over a well-trained but static speculator when traffic distributions shift.
The more interesting claim is strategic rather than just benchmark-driven: Aurora suggests that continuous online adaptation may be more useful in production than relying exclusively on a static speculator trained offline once and then left to age. The project page also says untrained speculators can become competitive after only thousands of requests, which supports the case for day-0 deployment.
Why This Matters
For infrastructure teams, the appeal is obvious. Traditional speculative decoding setups can require expensive offline distillation workflows, careful activation collection, and retraining jobs that have to stay aligned with production behavior. Aurora’s argument is that much of that operational burden can be reduced by learning directly from live inference traffic instead.
That also opens the door to day-0 deployment. Instead of waiting for a draft model to be pretrained offline before it can help at all, Aurora is designed to start serving immediately and improve as requests accumulate. For teams that frequently ship new model versions or see rapidly changing request mixes, that could be a meaningful operational advantage.
Parts of the work are also being shared publicly. Aurora has a public project page and GitHub site, and Together-hosted model artifacts are available on Hugging Face, including a draft model card that describes training from scratch with the Aurora framework.
The Caveat
Aurora is still a promising research system, not a settled new default for LLM inference.
The paper is currently an arXiv preprint, and that distinction matters. It is more accurate to say Aurora is described in a 2026 preprint than to present it as an already established conference milestone. That does not make the result less interesting, but it does mean the claims should be read as strong research results rather than production consensus.
Even so, the idea is worth watching closely. If speculative decoding can adapt continuously to live traffic without disruptive retraining cycles, the implications for large-scale inference could be significant. Faster serving, lower costs, and less operational overhead make a combination that infrastructure teams rarely ignore for long.
