arxiv:2604.07023

MARS: Enabling Autoregressive Models Multi-Token Generation

Published on Apr 8 · Submitted by Phi on Apr 9 · #3 Paper of the day
Abstract

MARS is a fine-tuning method that enables autoregressive language models to predict multiple tokens per forward pass without architectural changes, maintaining accuracy while improving throughput and supporting dynamic speed adjustment.

AI-generated summary

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.

Community


We release MARS (Mask AutoRegression): teaching AR models to generate multiple tokens per forward pass.

  • Zero architecture changes, zero extra parameters: all it needs is an existing checkpoint and reused SFT data
  • One-token mode: matches or beats AR baseline on 6 benchmarks
  • Multi-token mode: 1.5-1.7x throughput, baseline-level accuracy
  • Real-time speed control via confidence threshold at serving time

The key insight: of the 4 gaps between AR and block diffusion, only 1 is inherent. Close the other 3, and you get multi-token prediction for free. We release all our code and checkpoints.
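The confidence-thresholding knob described above can be sketched as a simple acceptance rule. This is an illustrative approximation, not the paper's implementation: the function name, the per-token confidence values, and the always-accept-one-token fallback are assumptions made for the example.

```python
def accept_prefix(proposed_tokens, confidences, threshold):
    """Accept the longest prefix of a proposed multi-token block whose
    per-token confidence stays at or above `threshold`. The first token
    is always accepted, so raising the threshold degrades gracefully to
    ordinary one-token-per-step AR decoding."""
    accepted = [proposed_tokens[0]]  # AR fallback: keep at least one token
    for tok, conf in zip(proposed_tokens[1:], confidences[1:]):
        if conf < threshold:
            break  # stop at the first low-confidence token
        accepted.append(tok)
    return accepted

# Lower threshold -> more tokens accepted per forward pass (throughput);
# higher threshold -> fewer tokens accepted (closer to AR quality).
print(accept_prefix(["the", "cat", "sat"], [0.99, 0.95, 0.60], 0.9))
# → ['the', 'cat']
```

Because the threshold is just a scalar compared at decode time, a serving system can raise or lower it per request without swapping models or restarting, which is the "latency-quality knob" the abstract refers to.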


Get this paper in your agent:

hf papers read 2604.07023
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
