The speed gap between dense and MoE architectures has been the practical argument for routing. A 27B dense model generates at ~38 tok/s on an RTX 3090. A MoE of comparable quality, activating 3–4B parameters per token, runs at 60+ tok/s on the same hardware. The reasoning is better in the dense model (a continuous representation landscape, no routing fragmentation, no expert dropout), but you pay for it in wall-clock time.
Multi-Token Prediction changes that equation.
The numbers
After enabling MTP speculative decoding on my local Qwen3.6-27B (a Claude Opus reasoning distillation), decode throughput jumped from 38 tok/s to 53–57 tok/s in benchmarks (roughly +40–50%), and held at a sustained 40+ tok/s in real agentic workloads: tool calling, code generation, structured reasoning. That gain comes from a 1.9 GB auxiliary GGUF, with no hardware change and no quality loss, since the main model still verifies every drafted token.
For context: the RTX 5090, with roughly double the memory bandwidth of a 3090, pushes a similar 27B Q4 at ~50–60 tok/s. MTP gets there by a fundamentally different path: speculation rather than raw bandwidth. A GPU two generations old, bought second-hand, matching current-generation silicon on decode speed.
How MTP works (briefly)
The model was trained with an extra transformer block (block 64, in Qwen3.6's case) that predicts the next token from the current hidden state, before the main model has committed to it. At inference time, this block drafts up to N candidate tokens ahead. The main model then verifies all of them in a single forward pass. Because single-stream decoding is limited mainly by memory bandwidth, that verification pass costs about the same as generating one token, so every accepted draft token is close to "free": it costs a pass through the small draft block rather than another full-model pass.
The key insight: this is self-speculative decoding. The draft head is part of the model's own architecture, trained alongside it. No external draft model, no separate checkpoint to maintain, no distribution mismatch by design.
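The draft-then-verify loop can be sketched with toy stand-ins. This is a minimal illustration of greedy speculative decoding in general, not the llama.cpp or vLLM implementation: `toy_target` and `toy_draft` are hypothetical functions standing in for the main model and the MTP head, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List

Token = int
# A "model" here is any function mapping a token prefix to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, n_draft: int) -> List[Token]:
    """One draft-then-verify step under greedy decoding; returns the new tokens."""
    # 1. Draft: the cheap head proposes n_draft tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify: compare each drafted token with the target's own choice.
    #    (A real engine scores all positions in ONE forward pass; the loop
    #    here is only for readability.)
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft accepted: the verification pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, n_draft: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, n_draft))
    return out[: len(prefix) + max_new]


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees except right after a 4, where it guesses wrong.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token]) -> Token:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10
```

Whatever the draft proposes, the output is identical to what the target would have produced on its own; a bad draft only costs speed, never correctness.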
Except when there is a distribution mismatch
Here's where it gets interesting for distilled models. My main model is a reasoning distillation — the base weights were modified during distillation from Claude Opus traces, but the MTP head was preserved as-is from the original Qwen3.6 checkpoint. The head still predicts based on the original weight space, while the model it's drafting for has shifted.
The result: vanilla Qwen3.6 reports ~70–80% acceptance rates with its MTP head. My distillation sits at a lower range. Yet the throughput gain is still +40%, because even partially-accepted speculation beats sequential decoding by a wide margin. The draft head doesn't need to be perfect — it needs to be right often enough that the amortized cost of verification stays below the cost of sequential generation.
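To see why a lower acceptance rate still pays off, here is a rough model of the amortized cost. It uses the standard simplification from the speculative decoding literature, which is an assumption and not something measured on this setup: each drafted token is accepted independently with probability `alpha`, a verification pass costs the same as one normal decode step, and each draft token costs a fraction `draft_cost` of a full-model step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    alpha: per-token acceptance probability (assumed independent).
    k:     number of tokens drafted per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated throughput ratio versus plain sequential decoding.

    draft_cost: cost of drafting one token, as a fraction of a full-model step.
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)
```

Under these assumptions the speedup stays above 1 as long as the expected tokens per step exceed `1 + k * draft_cost`, which is why a cheap draft head can be wrong fairly often and still come out ahead.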
Falsifiable prediction: if someone re-trains the MTP head on the distilled distribution, acceptance should climb toward the vanilla baseline, and throughput should push further. The current setup is the floor, not the ceiling.
The VRAM tradeoff: context vs. speed
Nothing is free. On a 24 GB card, the MTP draft head costs 1.9 GB of VRAM. That's VRAM that was previously available for KV cache — roughly 60K tokens of context at q4_0 quantization.
This creates two deployment profiles on identical hardware:
| Profile | Quant | MTP | Context | tok/s | Use case |
|---|---|---|---|---|---|
| Deep context | RA3 (5.69 BPW) | off | ~184K | 38 | Long document analysis, extended sessions |
| Fast agentic | RA1 (5.47 BPW) | on | ~120K | 53–57 | Tool calling, code generation, daily driving |
The choice is architectural, not a compromise. Agentic workloads rarely exceed 30–40K tokens of actual context, so giving up a theoretical context ceiling for concrete speed gains is a net positive for roughly 90% of real usage. The remaining 10% (long document ingestion, multi-file analysis) switches to the deep context profile. Two launch scripts, same machine.
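As a back-of-envelope check, the figures above imply a per-token KV cache cost, and the profile choice reduces to a simple rule on expected context length. The sketch below uses only numbers from this section; the `Profile` class and `pick_profile` helper are hypothetical names for illustration, and treating 1.9 GB as decimal gigabytes is an assumption.

```python
from dataclasses import dataclass

MTP_HEAD_BYTES = 1.9e9      # auxiliary GGUF size from the text (decimal GB assumed)
CONTEXT_GIVEN_UP = 60_000   # approximate q4_0 KV cache tokens displaced, from the text

# Implied KV cache cost per token of context, in bytes (about 31.7 KB).
KV_BYTES_PER_TOKEN = MTP_HEAD_BYTES / CONTEXT_GIVEN_UP


@dataclass(frozen=True)
class Profile:
    name: str
    mtp: bool
    max_context: int  # approximate context ceiling in tokens, from the table


DEEP_CONTEXT = Profile("deep-context", mtp=False, max_context=184_000)
FAST_AGENTIC = Profile("fast-agentic", mtp=True, max_context=120_000)


def pick_profile(expected_context_tokens: int) -> Profile:
    """Prefer the fast profile whenever the expected context fits in it."""
    if expected_context_tokens < 0:
        raise ValueError("expected_context_tokens must be non-negative")
    if expected_context_tokens <= FAST_AGENTIC.max_context:
        return FAST_AGENTIC
    if expected_context_tokens <= DEEP_CONTEXT.max_context:
        return DEEP_CONTEXT
    raise ValueError("context exceeds what either profile can hold")
```

A typical 30–40K agentic session lands in the fast profile with room to spare; only requests beyond ~120K tokens need the deep context launch script.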
Where this sits in the ecosystem
MTP is not a niche feature anymore. As of May 2026:
- vLLM has first-class MTP support, production-stable, with documented recipes for Qwen3.x and Gemma 4.
- llama.cpp merged MTP in mid-May (PR #22673), still beta for long-running servers but delivering 1.5–2× speedups on consumer hardware.
- Ollama inherits llama.cpp MTP support and has dedicated Gemma 4 drafter integration.
Model families shipping native MTP heads: Qwen3.x, DeepSeek V3/R1, Meta MTP models, Xiaomi MiMo, Nemotron 3 Super, Gemma 4 (external drafter). The trend is clear — MTP is becoming standard training practice.
The implication for dense models is significant. The historical MoE advantage was primarily throughput — more tokens per second at comparable quality through sparse activation. MTP narrows that gap from the dense side without introducing routing complexity, expert imbalance, or the cognitive fragmentation I documented in my earlier dense vs. MoE analysis.
The real benchmark
There's a threshold where inference speed stops being the bottleneck and human reading speed takes over. At 38 tok/s, I could read at roughly the same pace the model generated. At 40+ tok/s sustained, I can't keep up. The model finishes paragraphs before I've parsed the first sentence.
This matters for agentic use. When the model is faster than the operator, the operator stops being a reader and becomes a reviewer. You don't follow the generation — you evaluate the output. That's a workflow change, not just a speed improvement.
What's next
Three open threads:
Acceptance rate characterization. The current numbers are from limited testing. A proper benchmark across task types (code, reasoning, conversation), temperatures, and context lengths would quantify where MTP gains are strongest and where they degrade.
Vanilla vs. distill comparison. Running the same MTP head against vanilla Qwen3.6 and the Opus distillation on identical hardware would isolate the exact impact of distillation on acceptance rate. If the gap is significant, it makes the case for re-training the MTP head on distilled distributions — a training problem someone will eventually solve.
vLLM for production. For on-premise AI infrastructure serving European SMEs — where data sovereignty matters — vLLM's production-ready MTP support opens a deployment path that combines the quality advantages of dense reasoning models with throughput that used to require MoE or expensive hardware.
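For the acceptance-rate characterization thread, the aggregation itself is simple once per-request draft statistics are logged. The sketch below assumes a hypothetical log format of `(task_type, drafted, accepted)` tuples; how those counts are obtained depends on the inference engine and is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record format: (task type, tokens drafted, tokens accepted).
Record = Tuple[str, int, int]


def acceptance_by_task(records: Iterable[Record]) -> Dict[str, float]:
    """Pool drafted and accepted counts per task type and return the ratio."""
    totals = defaultdict(lambda: [0, 0])  # task -> [drafted, accepted]
    for task, drafted, accepted in records:
        if accepted > drafted or drafted < 0 or accepted < 0:
            raise ValueError(f"inconsistent counts for task {task!r}")
        totals[task][0] += drafted
        totals[task][1] += accepted
    # Skip tasks with no drafted tokens to avoid dividing by zero.
    return {task: acc / dr for task, (dr, acc) in totals.items() if dr > 0}
```

Pooling counts before dividing weights long generations more heavily than short ones, which is usually what matters for throughput; averaging per-request rates instead would answer a slightly different question.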
MTP doesn't make dense models as fast as MoE in all scenarios. But it narrows the gap enough that the architectural advantages of dense (continuous representation, stable self-reference, no routing overhead) stop being a luxury you pay for in wall-clock time. On consumer hardware, with a 1.9 GB auxiliary file, benchmark decode throughput rose by roughly 40%, which cuts the speed tax on thinking well by about 30% in time per token.