When a Q4 Beats a Q6

A UD Q4_K_XL at 5.41 bits per weight. A plain Q6_K at 6.57 bits per weight. The Q4 is 4 GB smaller. It has 1.16 fewer bits per weight. And it matches the Q6 in perplexity to within measurement error. That's not a rounding error: it contradicts the assumption that more bits per weight means better quality.

The Numbers

Measured on wikitext-2, full test set (~1M tokens, 1952 chunks, ctx=512), RTX 3090. All four quantizations are of the same model: rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled.
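For readers less familiar with the metric, here is a minimal sketch of what a perplexity number means: the exponential of the mean negative log-likelihood per token. The function name and the uniform-model sanity check are illustrative assumptions, not the benchmark tool's code. The token count is simply the chunk count times the context size quoted above.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Token count implied by the run described above: 1952 chunks of 512 tokens.
n_tokens = 1952 * 512

# Sanity check: a model that gives every token probability 1/8 has PPL 8.
uniform = [math.log(1 / 8)] * 1000

print("tokens evaluated:", n_tokens)
print("PPL of uniform 1/8 model:", round(perplexity(uniform), 6))
```

Lower is better: a perplexity of 7.47 roughly means the model is, on average, as uncertain as if it were choosing among about 7.5 equally likely tokens.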

Quantization	PPL (lower is better)	BPW	Size
Q6_K plain (reference)	7.4694 ±0.031	6.57	21 GB
UD Q4_K_XL	7.4712 ±0.031	5.41	17 GB
UD Q5_K_XL	7.4891 ±0.031	6.04	19 GB
UD Q6_K_XL	7.4753 ±0.031	7.64	24 GB

The spread across all four models: 0.02 PPL. The ±0.031 error bars overlap completely. These are statistically indistinguishable.

The headline result: a Q4-class quantization at 17 GB produces the same perplexity as a Q6-class quantization at 21 GB. The Q4 trails by just 0.0018 PPL (7.4712 vs 7.4694), a small fraction of the ±0.031 uncertainty. The 1.16 BPW difference between the two models is, by this benchmark, free.
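A quick way to check the "within noise" claim is to express each difference in units of the combined uncertainty. The sketch below uses the numbers from the table above and assumes the two error terms are independent. That is a simplification: all runs score the same text, so the errors are likely correlated and a paired comparison would be the more rigorous test.

```python
import math

# (PPL, reported +/- uncertainty) copied from the table above.
results = {
    "Q6_K plain": (7.4694, 0.031),
    "UD Q4_K_XL": (7.4712, 0.031),
    "UD Q5_K_XL": (7.4891, 0.031),
    "UD Q6_K_XL": (7.4753, 0.031),
}

def z_score(a, b):
    """PPL difference a - b in units of combined uncertainty.
    ASSUMPTION: the two error terms are independent."""
    (ppl_a, se_a), (ppl_b, se_b) = results[a], results[b]
    return (ppl_a - ppl_b) / math.sqrt(se_a ** 2 + se_b ** 2)

ref = "Q6_K plain"
for name, (ppl, _) in results.items():
    if name != ref:
        delta = ppl - results[ref][0]
        print(f"{name}: delta={delta:+.4f} PPL, z={z_score(name, ref):+.2f}")

# Largest gap between any two rows of the table.
ppls = [p for p, _ in results.values()]
spread = max(ppls) - min(ppls)
print(f"spread: {spread:.4f} PPL")
```

Every difference comes out at well under one combined error unit, which is what "statistically indistinguishable" means here.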

Why BPW Lies

Bits-per-weight is an average. It tells you how many bits the file allocates per parameter, on average, across the entire model. What it doesn't tell you is where those bits are.

A plain Q6_K distributes its 6.57 BPW uniformly. Every weight tensor — embeddings, attention projections, feed-forward gates, SSM outputs — gets Q6_K. This is a democratic allocation. It's also a wasteful one. Some tensors tolerate aggressive quantization with almost no quality loss. Others are critically sensitive. Treating them the same means you're wasting precision on the tolerant tensors and starving the sensitive ones.

The UD (Unsloth Dynamic) recipe inverts this logic. Instead of allocating bits uniformly, it assigns quantization levels based on the architectural role of each tensor. The total bit budget is lower — 5.41 vs 6.57 — but the bits are placed where they matter.
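To make the "BPW is an average" point concrete, here is a minimal sketch that computes a size-weighted average over a mixed allocation. The per-type bit costs are the nominal GGUF block sizes. The 5%/95% split between "sensitive" and "tolerant" parameters is purely hypothetical and is not the actual composition of this model.

```python
# Nominal storage cost in bits per weight for common GGUF types
# (block scales included).
BITS = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5, "F16": 16.0}

def average_bpw(allocation):
    """allocation: list of (n_params, quant_type) pairs.
    Returns the parameter-weighted mean bits per weight."""
    total_params = sum(n for n, _ in allocation)
    total_bits = sum(n * BITS[q] for n, q in allocation)
    return total_bits / total_params

# HYPOTHETICAL split: 5% of parameters are sensitive, 95% are tolerant.
uniform = [(5, "Q6_K"), (95, "Q6_K")]    # same type everywhere
targeted = [(5, "Q8_0"), (95, "Q4_K")]   # more bits only where it matters

print("uniform :", average_bpw(uniform))
print("targeted:", average_bpw(targeted))
```

In this toy split the targeted allocation has the lower average even though its sensitive group gets more bits than under the uniform scheme. The single BPW figure cannot show that.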

The SSM Factor

Qwen 3.6-27B isn't a pure transformer. It's a hybrid: 16 full-attention layers interleaved with 48 Mamba SSM layers. This architecture has a specific vulnerability that uniform quantization ignores.

The SSM layers maintain a recurrent state — a compressed representation that gets updated at every token. Unlike attention, where errors in one head are diluted across many, quantization error in the SSM pathway compounds through the recurrence. A small precision loss in ssm_out.weight at layer N propagates through every subsequent SSM step. Over 48 layers, that accumulation is measurable.
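The compounding argument can be illustrated with a toy scalar recurrence. This is not the model's actual SSM, and both coefficients are hypothetical. It only shows the general mechanism: a small error in a value that is reapplied at every step grows with the number of steps instead of staying fixed.

```python
def run_recurrence(a, steps, x=1.0):
    """Iterate h_t = a * h_{t-1} + x from h_0 = 0 and return every state."""
    h, states = 0.0, []
    for _ in range(steps):
        h = a * h + x
        states.append(h)
    return states

A_EXACT = 0.99  # hypothetical full-precision coefficient
A_QUANT = 0.98  # the same coefficient after a hypothetical rounding error

exact = run_recurrence(A_EXACT, 100)
quant = run_recurrence(A_QUANT, 100)
errors = [abs(e - q) for e, q in zip(exact, quant)]

# Absolute state error after 1, 10 and 100 steps.
for t in (1, 10, 100):
    print(f"step {t:3d}: error = {errors[t - 1]:.3f}")
```

The error is zero after the first step and keeps growing as the recurrence runs, which is the intuition behind giving the recurrent pathway more precision than its size alone would suggest.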

A plain Q6_K puts these 48 ssm_out.weight tensors at Q6_K — the same as everything else. The UD Q4_K_XL promotes them to Q8_0. That's a targeted investment of 48 tensors at higher precision, paid for by keeping hundreds of less sensitive tensors at Q4_K. The net result: lower total BPW, better allocation where it counts.

The UD Q6_K_XL goes further: those 48 SSM outputs are kept at F16, unquantized half precision, so the recurrent mechanism sees no quantization error relative to the F16 source weights. At that level, the VRAM budget allows it.

Where the Bits Go

The UD Q4_K_XL has 851 tensors. Here's how they're distributed (the rows below account for 839 of them):

Type	Count	Tensors
Q8_0	48	`blk.N.ssm_out.weight` — all SSM output projections
Q6_K	110	`output.weight`, `attn_qkv`, `ffn_down`, `attn_v`
Q5_K	70	`attn_gate` + imatrix-promoted weights
Q4_K	258	Remaining attention/FFN weights
F32	353	Norms, biases, SSM scalars

Nearly half of the quantized weight tensors sit above Q4: 228 of the 486 listed are at Q5_K, Q6_K, or Q8_0. The name “Q4_K_XL” understates what this quantization actually is. A vanilla Q4_K_M would leave most of the ~498 weight tensors at Q4_K and raise only a few to Q6_K. This one promotes 228 of them, chosen by role. The difference is architectural awareness.
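The arithmetic behind these counts is easy to reproduce from the table. The sketch below tallies the rows as listed. Note that they sum to 839, so 12 of the 851 tensors are not broken out in the table.

```python
# Tensor counts per type, copied from the table above.
counts = {"Q8_0": 48, "Q6_K": 110, "Q5_K": 70, "Q4_K": 258, "F32": 353}

# F32 rows are norms, biases and scalars, not quantized weight matrices.
quantized = {k: v for k, v in counts.items() if k != "F32"}

listed_total = sum(counts.values())
weight_total = sum(quantized.values())
promoted = sum(v for k, v in quantized.items() if k != "Q4_K")

print("listed tensors:          ", listed_total)
print("quantized weight tensors:", weight_total)
print("promoted above Q4_K:     ", promoted)
print(f"promoted share:           {promoted / weight_total:.1%}")
```

Counting only the quantized weight tensors, a little under half are stored above Q4_K. Counting the F32 tensors as well, the share above Q4 is well over half.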

The imatrix Surprise

The UD recipe is architectural: it's based on tensor names and roles, not on the values inside the tensors. It transfers from the base model to any fine-tune of the same architecture. But the imatrix is weight-dependent. It is computed by running calibration text through the actual weights, so it captures which tensors became more or less important after fine-tuning.

When I generated the imatrix specifically on the rico03 Opus distill (rather than reusing one from the base model), something interesting appeared. Twelve mid-network ffn_gate/up tensors (blocks 12–16) that appear as IQ4_XS in Unsloth's base model recipe were promoted to Q5_K in this distill. The imatrix indicates these paths are more active in the Opus reasoning fine-tune than in the base model.

This makes architectural sense. Blocks 12–16 are the mid-network region where abstract representations begin to form — where token-level patterns start becoming reasoning-level structures. A distillation on Opus traces, which emphasize deep multi-step reasoning, would reinforce exactly these pathways. The imatrix likely detects this shift; the quantization preserves it. This remains an inference based on the observed promotion of these 12 tensors — not yet a controlled ablation.

This is the interaction effect that makes the case for combining UD with an imatrix rather than relying on either alone. UD without imatrix doesn't know that the distill changed which mid-network tensors matter. Standard quantization with imatrix doesn't structurally differentiate between SSM outputs and feed-forward layers. Together, they cover both cases.
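The two layers of decision described above can be sketched as a name-based rule table plus a per-tensor override. Everything here is hypothetical: the patterns loosely follow the table earlier in this article, the override stands in for an imatrix-driven promotion, and none of it is Unsloth's actual recipe or tooling.

```python
import re

# HYPOTHETICAL name rules, loosely following the table earlier in this
# article. First matching pattern wins. Not Unsloth's actual recipe.
NAME_RULES = [
    (re.compile(r"\.ssm_out\.weight$"), "Q8_0"),
    (re.compile(r"^output\.weight$|\.(attn_qkv|ffn_down|attn_v)\.weight$"), "Q6_K"),
    (re.compile(r"\.attn_gate\.weight$"), "Q5_K"),
]
DEFAULT_TYPE = "Q4_K"

def pick_type(tensor_name, imatrix_overrides=None):
    """Per-tensor override first, then name rules, then the default type."""
    if imatrix_overrides and tensor_name in imatrix_overrides:
        return imatrix_overrides[tensor_name]
    for pattern, qtype in NAME_RULES:
        if pattern.search(tensor_name):
            return qtype
    return DEFAULT_TYPE

# HYPOTHETICAL override standing in for an imatrix-driven promotion.
overrides = {"blk.12.ffn_gate.weight": "Q5_K"}

for name in ("blk.3.ssm_out.weight", "output.weight", "blk.12.ffn_gate.weight"):
    print(name, "->", pick_type(name), "| with overrides:", pick_type(name, overrides))
```

The name rules are the part that transfers across fine-tunes of the same architecture. The override dictionary is the part that has to be regenerated for each set of weights.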

The Full Series

This result doesn't stand alone. The DAXZEIT UD series for this model now covers three quantization levels:

Quantization	BPW	Size	Context on 24GB	Use case
UD Q4_K_XL	5.41	17 GB	212K + vision	Maximum context, long sessions
UD Q5_K_XL	6.04	19 GB	131K + vision	Daily driver, quality balanced
UD Q6_K_XL	7.64	24 GB	65K (CPU offload)	Quality reference, >24GB VRAM

All three fall within 0.02 PPL of a plain Q6_K reference. The perplexity plateau across 2.23 BPW of range (5.41 to 7.64) suggests that the UD recipe has reached the perplexity floor for this architecture at this model size. Pushing more bits doesn't help because the bottleneck isn't bit budget; it's the original F16 weights themselves. The quantization error is already below the noise floor of this benchmark.
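The file sizes quoted in this article follow directly from the BPW figures. A rough sketch, assuming a nominal 27 billion parameters (taken from the "27B" in the model name, not an exact count) and sizes expressed in GiB:

```python
GIB = 1024 ** 3
N_PARAMS = 27e9  # ASSUMPTION: nominal count from the "27B" model name

def size_gib(bpw, n_params=N_PARAMS):
    """Approximate file size in GiB for a given average bits per weight."""
    return n_params * bpw / 8 / GIB

rows = [
    ("UD Q4_K_XL", 5.41),
    ("UD Q5_K_XL", 6.04),
    ("UD Q6_K_XL", 7.64),
    ("Q6_K plain", 6.57),
]
for name, bpw in rows:
    print(f"{name}: {size_gib(bpw):.1f} GiB")
```

Rounded to whole numbers, the estimates line up with the 17, 19, 24 and 21 GB figures given above, so the BPW column is a reliable guide to size even though it is a poor guide to quality.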

What This Proves

One result, but it carries a general claim: bits-per-weight is a misleading quality metric without architecture knowledge.

This isn't unique to Qwen 3.6 or to SSM hybrids. Every architecture has tensors that are more sensitive than others. In a pure transformer, output.weight and attention projections carry more quality-critical information than mid-layer feed-forward gates. In a hybrid, the SSM recurrent pathway adds another dimension of sensitivity. In an MoE, the router weights and shared expert tensors have disproportionate impact.

BPW hides all of this behind a single number. A Q6 sounds better than a Q4. It uses more bits. It costs more VRAM. It must be higher quality. Except when the Q4 puts its bits where they matter and the Q6 spreads them uniformly, in which case the Q4 matches it on quality and wins on efficiency.

The practical implication for anyone running local inference: if a UD quantization exists for your architecture, expect it to match a plain quantization at a higher BPW, as it did here. The benefit should be most visible on architectures with heterogeneous tensor sensitivity: hybrids, MoEs, models with large embedding tables. For these models, the name on the file (Q4, Q5, Q6) tells you almost nothing about actual quality. The recipe behind it tells you everything.

Files

DAXZEIT / UD Q4_K_XL

17 GB · 5.41 BPW · 212K context + vision on 24GB VRAM

DAXZEIT / UD Q5_K_XL

19 GB · 6.04 BPW · 131K context + vision on 24GB VRAM

DAXZEIT / UD Q6_K_XL

24 GB · 7.64 BPW · Quality reference

rico03 / Base model

Qwen 3.6-27B Claude Opus Reasoning Distilled · ~14K Opus traces

Companion articles: The Gap Between Two Pipelines · Full Native 262K Context on a Single RTX 3090 · MoE: Narrowly Competent, Globally Incoherent

Published on daxzeit.eu. Built on a 14L ITX workstation in Zwevegem, Belgium.