Everyone talks about Mixture-of-Experts as if more experts means more intelligence. It doesn’t. I did the math, and the numbers are brutal.
This article breaks down what a MoE expert actually is at the architectural level, why 256 of them don’t add up to deeper reasoning, and why a 27B dense model on a single consumer GPU delivers better value per parameter than a 1.6 trillion parameter behemoth that requires a datacenter to run.
The architecture nobody reads
Qwen 3.5 A10B (122B total, 10B activated) is one of the few models with its full architecture published. Here’s the layout:
Hidden Layout: 12 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Read that carefully. It tells you three things:
1. It’s not a Transformer. 75% of the layers use DeltaNet (linear attention), not standard quadratic attention. Only 1 in 4 layers uses actual Gated Attention. The model the industry calls a “Transformer” is mostly something else.
2. Every layer feeds into MoE. All 48 layers, DeltaNet and Attention alike, route through a Mixture of Experts block with 256 experts. For each token, the router selects 8 of them, and 1 shared expert is always active.
3. The model doesn’t know this. When I asked Qwen 3.5 A10B about its own architecture, it confidently said “Architecture Transformer.” When corrected with the actual specs, it replied “You’re absolutely right, thank you for the correction! 🙏” — reformatting what I gave it into a neat table without any sign of having known it beforehand.
That last point matters. A model that doesn’t know what it is can’t reason about what it is.
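To make the layout line concrete, here is a minimal sketch that expands it into an explicit layer list and counts the block types. The function name `build_layout` and its parameters are my own, chosen for illustration; only the 12 × (3 + 1) structure comes from the published layout.

```python
# Expand "12 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))"
# into a flat list of layers. build_layout is a hypothetical helper.
def build_layout(groups=12, deltanet_per_group=3, attention_per_group=1):
    layers = []
    for _ in range(groups):
        layers += ["Gated DeltaNet -> MoE"] * deltanet_per_group
        layers += ["Gated Attention -> MoE"] * attention_per_group
    return layers

layers = build_layout()
total = len(layers)
deltanet = sum(1 for name in layers if name.startswith("Gated DeltaNet"))
attention = total - deltanet

print(f"total layers:     {total}")
print(f"DeltaNet layers:  {deltanet} ({deltanet / total:.0%})")
print(f"attention layers: {attention} ({attention / total:.0%})")
```

Every entry in the list ends in an MoE block, which is the second point above expressed as data.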
The math: 0.45B per expert
Each MoE expert is a gated feed-forward network. Let’s calculate its size.
The specs:
- Hidden dimension: 3072
- Expert intermediate dimension: 1024
- Gated FFN: 3 weight matrices (up, gate, down)
Per expert, per layer:
3 × (3072 × 1024) = 9,437,184 parameters ≈ 9.4M
Across all 48 MoE layers:
9.4M × 48 = ~452M per expert
Each expert is 0.45 billion parameters. Not 4B. Not 1B. Less than half a billion.
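The same arithmetic as a short script, so you can swap in other dimensions. It assumes no bias terms and counts one expert slot across all 48 layers as a single "expert", which is the convention used in this article rather than an official definition.

```python
# Expert size from the published specs. Bias terms are assumed absent.
HIDDEN_DIM = 3072            # hidden dimension
EXPERT_INTERMEDIATE = 1024   # expert intermediate dimension
MATRICES_PER_EXPERT = 3      # up, gate, down
MOE_LAYERS = 48              # every layer has an MoE block

params_per_expert_layer = MATRICES_PER_EXPERT * HIDDEN_DIM * EXPERT_INTERMEDIATE
params_per_expert_total = params_per_expert_layer * MOE_LAYERS

print(f"per expert, per layer: {params_per_expert_layer:,}")
print(f"per expert, 48 layers: {params_per_expert_total:,}")
print(f"in billions:           {params_per_expert_total / 1e9:.2f}B")
```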
And what is that 0.45B? It’s a feed-forward network. No attention heads. No context window. No ability to see what came before or after. It receives a hidden state vector, multiplies some matrices, and returns a transformed vector. That’s it.
It doesn’t “reason.” It doesn’t “know.” It pattern-matches.
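Here is what "multiplies some matrices" looks like in practice: a toy gated FFN in pure Python. The SiLU activation is an assumption on my part (the specs above only say "gated FFN"), and the dimensions are shrunk to 8 and 4 so it runs instantly; the real ones are 3072 and 1024.

```python
import math
import random

def silu(x):
    # Assumed activation; the published specs do not name one.
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gated_ffn(hidden, w_gate, w_up, w_down):
    # One vector in, one vector out. No other tokens are visible here.
    gate = [silu(g) for g in matvec(w_gate, hidden)]
    up = matvec(w_up, hidden)
    mixed = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, mixed)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
HIDDEN, INTERMEDIATE = 8, 4  # toy sizes, not the real 3072 and 1024
w_gate = random_matrix(INTERMEDIATE, HIDDEN, rng)
w_up = random_matrix(INTERMEDIATE, HIDDEN, rng)
w_down = random_matrix(HIDDEN, INTERMEDIATE, rng)

hidden_state = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
output = gated_ffn(hidden_state, w_gate, w_up, w_down)
print(f"input length {len(hidden_state)}, output length {len(output)}")
```

Note the function signature: a single hidden state and three weight matrices. There is no argument through which the rest of the sequence could enter.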
The active budget
With 9 experts activated per token (8 routed + 1 shared):
9 × 452M = ~4.1B of expert parameters active per token
The remaining ~6B is shared infrastructure — DeltaNet layers, Gated Attention layers, embeddings, layer norms, the router itself. This shared infrastructure is what provides continuity, context, and coherence. It’s the actual “brain” of the model.
So the 10B activated parameter budget breaks down as:
- ~6B of reasoning infrastructure (attention, DeltaNet, embeddings)
- ~4B of pattern-matching FFN (blind experts)
A dense 27B model uses all 27 billion parameters for reasoning on every token. Every FFN, every attention head, every layer: all engaged, all the time. That’s 4.5× the ~6B of reasoning infrastructure inside the “10B activated” budget (27B ÷ 6B), because in the MoE, roughly 40% of the activated parameters can’t reason.
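The budget split can be checked in a few lines. This sketch takes the "10B activated" headline at face value and treats everything that is not an expert FFN as shared infrastructure, which is this article's framing, not an official breakdown.

```python
# Active-parameter budget per token, using the per-expert size derived earlier.
PARAMS_PER_EXPERT = 3 * 3072 * 1024 * 48   # about 0.45B
ACTIVE_EXPERTS = 9                         # 8 routed + 1 shared
ACTIVATED_TOTAL = 10e9                     # "10B activated" headline figure
DENSE_TOTAL = 27e9                         # the dense comparison model

expert_active = ACTIVE_EXPERTS * PARAMS_PER_EXPERT
shared_infra = ACTIVATED_TOTAL - expert_active
expert_share = expert_active / ACTIVATED_TOTAL
dense_vs_infra = DENSE_TOTAL / shared_infra

print(f"expert FFN active:     {expert_active / 1e9:.2f}B")
print(f"shared infrastructure: {shared_infra / 1e9:.2f}B")
print(f"expert share of 10B:   {expert_share:.0%}")
print(f"dense 27B vs infra:    {dense_vs_infra:.1f}x")
```

With unrounded inputs the ratio lands slightly above 4.5; the 4.5× figure in the text uses the rounded ~6B.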
The shared expert: a permanent channel
One architectural detail deserves its own section: the shared expert.
In Qwen’s MoE layout, 8 experts are routed (selected by the router based on the input) and 1 is shared — always active, regardless of routing decisions. This means one expert participates in every token’s computation, in every layer, no matter what the 8 specialists are doing.
This permanent channel could serve a monitoring function. While routed experts come and go, the shared expert provides continuity across the sequence. It doesn’t have a privileged view of what other experts are doing — it sees the same hidden state as everyone else, before and after routing — but it processes every token without exception. Whether this constitutes “auditing” or is simply a baseline computation backbone is an open question.
What’s clear architecturally is that the shared expert provides something the routed experts cannot: consistency. It’s the one component whose contribution doesn’t depend on routing decisions. In a system where 8 out of 256 experts are selected per token, that permanence is structurally significant, whatever its functional role turns out to be.
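A small simulation shows the asymmetry between the shared expert and the routed ones. It rests on several assumptions: top-k selection over router logits, random logits standing in for a trained router, and a shared expert that sits outside the 256 routed ones. Qwen's actual router may differ on all three.

```python
import random

NUM_ROUTED_EXPERTS = 256
TOP_K = 8
SHARED_EXPERT = "shared"   # hypothetical label for the always-on expert
NUM_TOKENS = 1000

def active_experts(router_logits, top_k=TOP_K):
    # Pick the top-k routed experts by logit, then add the shared expert.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return [SHARED_EXPERT] + ranked[:top_k]

rng = random.Random(0)
activations = {}
for _ in range(NUM_TOKENS):
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
    for expert in active_experts(logits):
        activations[expert] = activations.get(expert, 0) + 1

shared_rate = activations[SHARED_EXPERT] / NUM_TOKENS
routed_rates = [activations.get(i, 0) / NUM_TOKENS
                for i in range(NUM_ROUTED_EXPERTS)]
avg_routed_rate = sum(routed_rates) / NUM_ROUTED_EXPERTS

print(f"shared expert active on {shared_rate:.0%} of tokens")
print(f"average routed expert active on {avg_routed_rate:.1%} of tokens")
```

Under uniform routing, a routed expert sees 8 in 256 tokens on average. The shared expert sees all of them, which is the permanence described above.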
DeepSeek V4 Pro: The case study
DeepSeek V4 Pro is the latest showcase for massive MoE scaling. The numbers:
| Spec | Value |
|---|---|
| Total parameters | 1.6 Trillion |
| Activated parameters | 49B |
| Expert precision | FP4 (quantization-aware training) |
| Attention | Hybrid CSA + HCA |
| Context | 1M tokens |
| Training data | 33T tokens |
| License | MIT |
On paper, it’s impressive. In practice, let’s compare it to Qwen 3.6 27B — a dense model that runs on a single RTX 3090.
Benchmarks (instruct vs instruct)
| Benchmark | V4 Pro (1.6T/49B) | Qwen 3.6 27B | Delta |
|---|---|---|---|
| MMLU-Pro | 87.5 | 86.2 | +1.3 |
| GPQA Diamond | 90.1 | 87.8 | +2.3 |
| SWE-bench Verified | 80.6 | 77.2 | +3.4 |
| LiveCodeBench | 93.5 | 83.9 | +9.6 |
| HLE | 37.7 | 24.0 | +13.7 |
V4 Pro wins on every benchmark. But the gains reveal something structural.
The smallest deltas — MMLU-Pro (+1.3), GPQA Diamond (+2.3), SWE-bench Verified (+3.4) — are tasks where reasoning depth matters more than pattern coverage. On these, the 27B dense model is nearly at parity despite being 60× smaller.
The largest deltas — LiveCodeBench (+9.6) and HLE (+13.7) — are tasks where breadth of pattern recognition matters. A 256-expert MoE has seen more patterns of code and problem types than a 27B dense model could absorb in a single FFN. The router finds the expert that has seen the neighborhood, and that’s enough to win the benchmark.
But this is a coverage advantage, not a depth advantage. The MoE isn’t reasoning more deeply about any given problem — it’s recognizing a wider range of problems. And coverage is something you can obtain through other means: RAG, tool use, multi-agent pipelines. You don’t need 1.6 trillion parameters for it.
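For anyone who wants to re-sort the table themselves, the deltas and the size ratio can be recomputed directly from the scores above. Nothing here is new data; the dictionary simply restates the benchmark table.

```python
# (V4 Pro score, Qwen 3.6 27B score), copied from the table above.
scores = {
    "MMLU-Pro": (87.5, 86.2),
    "GPQA Diamond": (90.1, 87.8),
    "SWE-bench Verified": (80.6, 77.2),
    "LiveCodeBench": (93.5, 83.9),
    "HLE": (37.7, 24.0),
}

deltas = {name: round(moe - dense, 1) for name, (moe, dense) in scores.items()}
ranked = sorted(deltas.items(), key=lambda item: item[1])

size_ratio = 1.6e12 / 27e9  # total parameters, MoE vs dense

for name, delta in ranked:
    print(f"{name:<20} +{delta}")
print(f"total parameter ratio: {size_ratio:.0f}x")
```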
The cost of those points
Here’s where it gets real.
Qwen 3.6 27B in Q4_K_M: ~18GB. One RTX 3090, ~€700 second-hand. Running at 50+ tokens/second with MTP.
DeepSeek V4 Pro in Q4_K_M: 955GB. The GGUF quantizations require a custom llama.cpp fork because upstream doesn’t support the architecture. The hardware compatibility checker on HuggingFace shows ❌ across the board — no quantization fits a single 3090. Even the most aggressive Q2_K is 574GB.
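A quick way to see the scale of the problem is to count how many 24GB cards the weights alone would occupy. This is a lower bound under a strong simplification: it ignores KV cache, activations, and any per-GPU overhead, and it assumes the file sizes quoted above.

```python
import math

RTX_3090_VRAM_GB = 24  # one RTX 3090

# File sizes as quoted in this article, in GB.
quant_sizes_gb = {
    "Qwen 3.6 27B Q4_K_M": 18,
    "DeepSeek V4 Pro Q4_K_M": 955,
    "DeepSeek V4 Pro Q2_K": 574,
}

cards_needed = {
    name: math.ceil(size / RTX_3090_VRAM_GB)
    for name, size in quant_sizes_gb.items()
}

for name, cards in cards_needed.items():
    fits = "fits on one card" if cards == 1 else f"needs at least {cards} cards"
    print(f"{name:<24} {quant_sizes_gb[name]:>4} GB  {fits}")
```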
To run V4 Pro locally:
- ~14 H100 80GB GPUs minimum for Q4 with context headroom
- ~€30,000 per GPU
- Total: ~€420,000 for a single inference node
- Plus NVLink/InfiniBand, servers, cooling, power, rack space
- Realistically: €500,000–600,000 all-in
The ratio: roughly 700:1 in cost for a lead of 1.3 to 13.7 benchmark points.
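The cost figures above reduce to a few multiplications. The sketch below uses only the prices quoted in this article; the 14-GPU count is the article's estimate with context headroom, while the weights-only minimum is computed from the 955GB file size.

```python
import math

Q4_SIZE_GB = 955
H100_VRAM_GB = 80
GPUS_WITH_HEADROOM = 14        # estimate from the list above
H100_PRICE_EUR = 30_000
RTX_3090_PRICE_EUR = 700
ALL_IN_RANGE_EUR = (500_000, 600_000)

min_gpus_weights_only = math.ceil(Q4_SIZE_GB / H100_VRAM_GB)
gpu_cost = GPUS_WITH_HEADROOM * H100_PRICE_EUR
ratio_gpu_only = gpu_cost / RTX_3090_PRICE_EUR
ratio_all_in = tuple(cost / RTX_3090_PRICE_EUR for cost in ALL_IN_RANGE_EUR)

print(f"minimum H100s for weights alone: {min_gpus_weights_only}")
print(f"GPU cost at 14 cards: EUR {gpu_cost:,}")
print(f"cost ratio, GPUs only: {ratio_gpu_only:.0f}:1")
print(f"cost ratio, all-in: {ratio_all_in[0]:.0f}:1 to {ratio_all_in[1]:.0f}:1")
```

The 700:1 headline sits at the low end of the all-in range.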
At €3.50/Mtok output via API, you’d need to run 100M tokens per month for over 100 years to break even on the hardware investment. The API is cheaper — but it means sending your data to servers in China. Pick your trade-off.
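The break-even claim is easy to verify. This sketch assumes all tokens are billed at the output rate and ignores power, maintenance, and depreciation on the hardware side, so the real break-even point would be even further out.

```python
PRICE_PER_MTOK_EUR = 3.50
TOKENS_PER_MONTH = 100_000_000

monthly_api_cost = PRICE_PER_MTOK_EUR * TOKENS_PER_MONTH / 1_000_000

def breakeven_years(hardware_cost_eur):
    # Years of API spend needed to equal the hardware purchase price.
    return hardware_cost_eur / monthly_api_cost / 12

print(f"API spend per month: EUR {monthly_api_cost:,.0f}")
for cost in (420_000, 500_000, 600_000):
    print(f"EUR {cost:,} hardware: {breakeven_years(cost):.0f} years")
```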
3,507 downloads
The HuggingFace GGUF repo for V4 Pro had 3,507 downloads last month. The model is “open-weight” but physically inaccessible to the community that builds with open-source. Nobody runs it because nobody can run it.
The gap between “open” and “accessible” has never been wider.
Diminishing returns on MoE expert stacking
Let me be precise: scaling in general isn’t dead. Training on more data helps. Better architectures help. Better post-training helps. Qwen 3.6 27B proves all of this — it’s dramatically better than any 27B model from two years ago.
What’s dead is the specific strategy of stacking more MoE experts to improve performance. Going from 256 to 512 experts won’t double the performance. The routing mechanism has a natural ceiling: beyond a certain count, most experts become redundant or over-specialized, and the 8–9 activated per token end up doing the same work a dense model does with all its parameters.
DeepSeek V4 Pro demonstrates this ceiling empirically. They went from 671B (V3) to 1.6T (V4), from 37B to 49B activated, trained on 33T tokens instead of 14.8T — and the result is, by their own admission, “3 to 6 months behind” frontier closed-source models. The model was delayed multiple times over four months due to underwhelming performance. FP4 on expert weights was a necessity, not a choice — it’s the only way to make 256 experts per layer viable at inference without absurd hardware requirements.
The alternative: Pipeline-MoE
If 256 blind FFN experts at 0.45B each deliver diminishing returns, is there another way to get expert specialization?
Instead of one massive model with tiny, blind experts routing at the token level, consider multiple instances of a smaller dense model, each specialized by system prompt:
- Instance 1: Primary reasoning — generates the base response
- Instance 2: Auditor — validates claims and catches errors
- Instance 3: RAG agent — searches and retrieves relevant context
- Instance 4: Code reviewer — specialized technical validation
Each “expert” in this pipeline is a full 27B dense model with attention, context, and autonomous reasoning capability. That’s 60× more capacity per expert than a Qwen A10B MoE FFN block. And you can change specializations by editing a system prompt — no retraining needed.
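A minimal sketch of such a pipeline follows. Everything in it is illustrative: the `generate` callable stands in for whatever inference backend you use, the role prompts are placeholders, and the call order (retrieve, draft, audit, review) is one possible arrangement rather than a prescribed one.

```python
from typing import Callable, Dict

# Hypothetical backend signature: (system_prompt, user_message) -> reply.
Generate = Callable[[str, str], str]

# Placeholder system prompts, one per specialization.
ROLES: Dict[str, str] = {
    "rag": "RAG agent: search and retrieve relevant context.",
    "primary": "Primary reasoning: generate the base response.",
    "auditor": "Auditor: validate claims and catch errors.",
    "code_reviewer": "Code reviewer: perform technical validation.",
}

def run_pipeline(question: str, generate: Generate) -> Dict[str, str]:
    # Each step is a full call to the same dense model under a different prompt.
    context = generate(ROLES["rag"], question)
    draft = generate(
        ROLES["primary"], f"Context:\n{context}\n\nQuestion:\n{question}"
    )
    audit = generate(
        ROLES["auditor"], f"Question:\n{question}\n\nDraft:\n{draft}"
    )
    review = generate(ROLES["code_reviewer"], draft)
    return {"context": context, "draft": draft, "audit": audit, "review": review}

calls = []

def stub_backend(system_prompt: str, user_message: str) -> str:
    # Stand-in for a real model so the sketch runs without a GPU.
    role = system_prompt.split(":")[0]
    calls.append(role)
    return f"[{role}] processed {len(user_message)} characters"

result = run_pipeline("Is this function thread-safe?", stub_backend)
for step, text in result.items():
    print(f"{step}: {text}")
```

Changing a specialization means editing a string in `ROLES`. The four sequential calls are also where the latency cost comes from.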
This is not a drop-in replacement for MoE. The trade-offs are real:
- Latency: 4× inference time sequentially, or 4× VRAM in parallel
- Routing: No token-level routing. Every token passes through the same model, unlike MoE where each token can activate different experts within a single forward pass
- Specialization: System-prompt specialization is shallower than weight specialization. A fine-tuned expert has the knowledge in its parameters; a prompted expert is working from instructions
But the Pipeline-MoE has one structural advantage that weight-level MoE can never match: each expert reasons. A prompted 27B can introspect on its own uncertainty, decide to search for information, change strategy on failure. A 0.45B FFN block transforms a vector and returns it. That’s not a difference of degree — it’s a difference of kind.
Your SYSTEM.md is the router. Your tool stack is the expert pool. This isn’t a replacement for MoE — it’s a complement that covers MoE’s blind spot: reasoning depth at the expert level.
Practical implications
For self-hosters: On four of the five benchmarks above, a 27B dense model on consumer hardware delivers roughly 90–98% of V4 Pro’s score (HLE, at about 64%, is the outlier). The remaining points cost roughly 700× more to capture. The economically rational move is obvious.
For European SMEs: A 1.6T model is physically out of reach. The API sends your data to China. A local 27B on a €700 GPU gives you GDPR compliance by design, data sovereignty, and performance that would have been frontier-class 18 months ago. The value proposition writes itself.
For the field: Dense models aren’t going away. They’re getting better, faster, and more efficient while MoE scaling hits diminishing returns. The future likely isn’t one giant model — it’s smaller, denser models orchestrated intelligently. Not Mixture-of-Experts in weights alone. Mixture-of-Experts in systems, where each expert can actually think.
Conclusion
In Qwen 3.5 A10B, a MoE expert is 0.45 billion parameters of blind matrix multiplication. No attention, no context, no ability to reason. Nine of them activate per token: 4 billion parameters of pattern matching riding on 6 billion parameters of actual reasoning infrastructure. The “10B activated” label hides the fact that roughly 40% of it can’t think.
DeepSeek V4 Pro scales this approach to 1.6 trillion parameters and 49B activated, but the structural problem remains: experts are still isolated FFN blocks, still blind to context, still incapable of reasoning independently, just bigger ones. And the result is a lead of at most 13.7 benchmark points over a 27B dense model, at roughly 700× the cost.
The alternative is straightforward: deploy a 27B dense model where all 27 billion parameters reason together on every token. No routing overhead. No dormant experts consuming memory. No exotic hardware requirements. No custom llama.cpp forks.
The question was never “how many experts can we fit?” It was always “how much reasoning can we deliver per watt, per dollar, per token?”
The answer, increasingly, is dense.
Previously in this series: When MTP Gives Your Dense Model MoE Speed · The Scaffolding Is the Signal · MoE: Narrowly Competent, Globally Incoherent
— Dax, Zwevegem, Belgium. June 2026.