Reasoning-Aware Quantization

Almost every GGUF quantization you’ve ever downloaded was calibrated on generic text. The imatrix (the importance matrix of activation statistics that tells the quantizer which weights matter most) is typically generated on wikitext-2, a corpus of encyclopedic text, or on a similar general-purpose mix. This holds across the public quantizers, from Bartowski’s quants to mradermacher’s automated pipeline. The assumption is that generic text activates the model representatively enough to measure tensor importance. For a reasoning distill, that assumption is wrong.

I had 14,000 Opus reasoning traces sitting on disk — the same data used to train the model I was quantizing. I used them as imatrix calibration data instead of wikitext. The results falsified my initial hypothesis, revealed where reasoning actually happens inside the network, and produced a new category of quantization: RA — Reasoning-Aware.

The Hypothesis That Died

In my previous article, I noted that 12 mid-network ffn_gate/up tensors (blocks 12–16) were promoted by the distill-specific imatrix compared to the base model. I wrote: “Blocks 12–16 are the mid-network region where abstract representations begin to form.” The implication was that these blocks were critical for reasoning.

So I tested it. I generated an imatrix on 14,000 Opus reasoning completions (9.6 MB of calibration text, 319 chunks at ctx 512) and compared it against the standard wikitext-2 imatrix. If blocks 12–16 were critical for reasoning, their importance ratio (reasoning/wikitext) should be above 1.0.

Block	Reasoning/Wikitext Ratio
blk.12 ffn_gate/up	0.665
blk.13 ffn_gate/up	0.683
blk.14 ffn_gate/up	0.705
blk.15 ffn_gate/up	0.738
blk.16 ffn_gate/up	0.758

Ratio 0.66–0.76. The reasoning imatrix gives these tensors 24–34% less importance than wikitext. My hypothesis was not just wrong; it was inverted. Reasoning activates these blocks less than encyclopedic text. Demoting them is the correct move.
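The conversion from ratio to "percent less importance" is simple arithmetic, and it is worth spelling out because the sign is easy to misread. A minimal sketch, using only the ratios from the table above:

```python
# Reasoning/wikitext importance ratios copied from the table above.
ratios = {
    "blk.12": 0.665,
    "blk.13": 0.683,
    "blk.14": 0.705,
    "blk.15": 0.738,
    "blk.16": 0.758,
}


def percent_less(ratio: float) -> float:
    """Importance lost under reasoning calibration, in percent."""
    return round((1.0 - ratio) * 100.0, 1)


reductions = {block: percent_less(r) for block, r in ratios.items()}

for block, pct in reductions.items():
    print(f"{block} ffn_gate/up: {pct}% less importance than wikitext")
```

A ratio below 1.0 means the reasoning calibration activates the tensor less than wikitext does; a ratio above 1.0 would have supported the original hypothesis.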

This is what falsification looks like. The data killed the hypothesis cleanly. The question became: if not mid-network, then where?

Where Reasoning Actually Activates

The reasoning imatrix revealed a clear gradient across the network: a shallow trough in the first 16 blocks, then a steep rise toward the output:

Zone	Avg Importance	Character
blk.00–07	1,551	Low — except two spikes
blk.08–15	1,411	Lowest zone — the trough
blk.16–23	2,068	Rising
blk.24–31	2,386	Rising
blk.32–39	3,104	Mid-high
blk.40–47	3,073	Mid-high
blk.48–55	4,309	High
blk.56–63	7,053	Dominant

The last 8 blocks carry five times the average importance of the trough (blk.08–15) and more than twice that of the blk.32–47 zones. But the real story is in the individual tensors.
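To make the shape of that gradient concrete, the zone averages can be normalised against the lowest zone. This is a small sketch over the table values only; the zone labels use ASCII hyphens for convenience.

```python
# Average importance per 8-block zone, copied from the table above.
zones = {
    "blk.00-07": 1551,
    "blk.08-15": 1411,
    "blk.16-23": 2068,
    "blk.24-31": 2386,
    "blk.32-39": 3104,
    "blk.40-47": 3073,
    "blk.48-55": 4309,
    "blk.56-63": 7053,
}

# The trough is the zone with the lowest average importance.
trough_name = min(zones, key=zones.get)
trough = zones[trough_name]

# Importance of every zone relative to the trough.
relative = {name: value / trough for name, value in zones.items()}

# Zones whose average is lower than the zone immediately before them.
names = list(zones)
dips = [b for a, b in zip(names, names[1:]) if zones[b] < zones[a]]

for name, rel in relative.items():
    print(f"{name}: {rel:.2f}x the trough")
print("zones that dip below their predecessor:", dips)
```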

The Top 12

The 12 most important tensors in the reasoning imatrix, ranked by activation importance:

Rank	Tensor	Importance
#1	`blk.63.ffn_down`	58,620
#2	`blk.62.ffn_down`	11,233
#3	`blk.3.attn_k`	8,913
#4	`blk.3.attn_q`	8,913
#5	`blk.3.attn_v`	8,913
#6	`blk.61.ffn_gate`	8,171
#7	`blk.61.ffn_up`	8,171
#8	`blk.62.ffn_gate`	7,729
#9	`blk.62.ffn_up`	7,729
#10	`blk.60.ffn_gate`	7,664
#11	`blk.60.ffn_up`	7,664
#12	`blk.58.ffn_gate`	7,547

blk.63.ffn_down — the very last feed-forward transformation before the output head — has an importance of 58,620. That’s 5× the second-ranked tensor. Everything the model has reasoned through converges here. It is, by a massive margin, the single most critical tensor in the network for reasoning.

But the surprise is ranks #3–5: blk.3.attn_k/q/v, early-network attention projections. The same projections in block 7 follow at ranks 14–16. The reasoning imatrix says the model needs precise initial attention encoding and precise final resolution. The mid-network is the trough between two peaks.

The pattern is bimodal: precise input encoding (early attention, blocks 3 and 7) → tolerant mid-network mixing (blocks 8–23) → progressive build-up (blocks 24–47) → critical resolution (blocks 48–63). Reasoning needs sharp start and sharp finish. The middle can be loose.

Two Types of Knowledge

There’s a tension in this data that matters for anyone doing quantization. The imatrix ranks SSM output tensors (ssm_out.weight) at positions 400–496 out of 496. They are, by activation importance, the least important tensors in the model. Yet every variant in the DAXZEIT UD series keeps them at Q8_0. Why?

Because the imatrix measures the wrong thing for recurrent pathways. It measures instantaneous activation importance — how much a tensor contributes to the current token prediction. SSM outputs contribute little per token. But their errors compound across the recurrent sequence. An error of 0.001 at token N is invisible to the imatrix. After 48 recurrent layers, it becomes systematic drift. The imatrix measures importance per token; recurrence creates importance per sequence.
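The compounding argument can be illustrated with a toy scalar recurrence. This is a deliberately simplified sketch, not the model's actual SSM: the decay, gain, input sequence and sequence length are all assumed values chosen for illustration, and only the 0.001 error comes from the text above.

```python
def run_recurrence(decay: float, gain: float, inputs: list[float]) -> list[float]:
    """Toy linear recurrence: state_t = decay * state_(t-1) + gain * x_t."""
    state = 0.0
    states = []
    for x in inputs:
        state = decay * state + gain * x
        states.append(state)
    return states


DECAY = 0.99          # assumed: a slowly forgetting state
GAIN = 1.0            # assumed: the "exact" weight
WEIGHT_ERROR = 0.001  # the per-step error from the text
inputs = [1.0] * 512  # assumed: constant input over 512 steps

exact = run_recurrence(DECAY, GAIN, inputs)
perturbed = run_recurrence(DECAY, GAIN + WEIGHT_ERROR, inputs)

# Absolute difference between the two state trajectories at each step.
drift = [abs(p - e) for p, e in zip(perturbed, exact)]

print(f"error after 1 step:    {drift[0]:.4f}")
print(f"error after {len(inputs)} steps: {drift[-1]:.4f}")
```

In this toy setting the error at the first step equals the weight error, while the error carried in the state keeps growing until the decay balances it. A per-token measurement sees only the first quantity.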

This creates two distinct categories of quantization decision:

Decision type	Source	Example
Data-driven	imatrix activation profile	RA swaps, Q8_0 promotions of top-12 tensors
Architecture-driven	Structural knowledge of error propagation	SSM Q8_0, router weights in MoE

Conflating the two is a methodological error. The imatrix will never justify SSM Q8_0, because its measurement window is one token. The architectural argument is about sequences. Both are valid. Neither subsumes the other. A complete quantization recipe needs both.
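One way to keep the two categories from blurring together is to record the provenance of every override next to the override itself. The sketch below is a hypothetical bookkeeping structure, not the tooling used for the published recipes; the `Override` class, the `merge_overrides` helper and the example entries are illustrative names.

```python
from dataclasses import dataclass

VALID_SOURCES = ("data", "architecture")


@dataclass(frozen=True)
class Override:
    tensor: str   # tensor name or name pattern
    qtype: str    # target quantization type, e.g. "Q8_0"
    source: str   # "data" (imatrix) or "architecture" (structural)
    reason: str   # short human-readable justification


def merge_overrides(overrides: list[Override]) -> dict[str, Override]:
    """Build a recipe, refusing to silently resolve conflicting types."""
    recipe: dict[str, Override] = {}
    for ov in overrides:
        if ov.source not in VALID_SOURCES:
            raise ValueError(f"unknown source for {ov.tensor}: {ov.source}")
        existing = recipe.get(ov.tensor)
        if existing is not None and existing.qtype != ov.qtype:
            raise ValueError(
                f"{ov.tensor}: {existing.source} wants {existing.qtype}, "
                f"{ov.source} wants {ov.qtype}; resolve by hand"
            )
        recipe[ov.tensor] = ov
    return recipe


# Illustrative entries only.
example = [
    Override("blk.63.ffn_down", "Q8_0", "data", "rank #1 in reasoning imatrix"),
    Override("ssm_out.weight", "Q8_0", "architecture", "recurrent error propagation"),
]

recipe = merge_overrides(example)
by_source = {s: [o.tensor for o in recipe.values() if o.source == s] for s in VALID_SOURCES}
print(by_source)
```

Raising on a conflict, rather than letting one source win, forces the disagreement to be resolved explicitly and keeps the justification for each tensor auditable.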

The RA Recipes

From these findings, four variants were built, each a superset of the previous. All start from the UD Q4_K_XL base recipe (195 tensor-type overrides, 48 SSM outputs at Q8_0). The baseline is pinned identically across all variants — 16 attn_output.weight at Q6_K, output.weight at Q6_K — so only the RA-specific promotions vary between levels.

Q4_K_RA1_XL — Reasoning-Aware

Swap guided by the reasoning/wikitext ratio. Budget-asymmetric: slightly more total bits than the original, invested in the zones reasoning activates.

Zone	Original	RA1	Rationale
blk.12–16 ffn_gate/up	Q5_K	Q4_K	R/W ratio 0.66–0.76 — reasoning underactivates
blk.49–53 ffn_gate/up	Q4_K/Q5_K	Q6_K	R/W ratio 1.09–1.12 — reasoning overactivates

BPW: 5.41 → 5.47. Size: 17.0 → 17.1 GB. Delta: 100 MB.

Q4_K_RA2_XL — Reasoning-Aware²

RA1 plus 12 of the top-ranked reasoning tensors promoted to Q8_0 (excluding SSM outputs, already at Q8_0):

Tensor	Rank	Importance	Original → RA2
`blk.63.ffn_down`	#1	58,620	Q6_K → Q8_0
`blk.61.ffn_gate/up`	#6–7	8,171	Q4_K → Q8_0
`blk.62.ffn_gate/up`	#8–9	7,729	Q5_K → Q8_0
`blk.60.ffn_up`	#11	7,664	Q4_K → Q8_0
`blk.3.attn_k/q/v`	#3–5	8,913	Q4_K → Q8_0
`blk.7.attn_k/q/v`	#14–16	7,540	Q4_K → Q8_0

BPW: 5.47 → 5.56. Size: 17.1 → 17.4 GB. Q8_0 tensor count: 48 → 60.

Q4_K_RA3_XL — Reasoning-Aware³ ☆ Recommended

RA2 plus the top 7 reasoning tensors promoted to full F16 precision: six from Q8_0, and blk.62.ffn_down directly from Q6_K. This is the sweet spot, the maximum precision allocation before diminishing returns.

Tensor	Importance	RA2 → RA3
`blk.63.ffn_down`	58,620	Q8_0 → F16
`blk.62.ffn_down`	11,233	Q6_K → F16
`blk.3.attn_k/q/v`	8,913	Q8_0 → F16
`blk.61.ffn_gate/up`	8,171	Q8_0 → F16

BPW: 5.56 → 5.69. Size: 17.4 → 18.0 GB. 7 tensors at full F16 precision. Effective context: 184K tokens on RTX 3090.

Q4_K_RA4_XL — The Ceiling Test

RA3 plus all 48 SSM output tensors promoted from Q8_0 to F16. The hypothesis: eliminating quantization error on the recurrent pathway entirely might improve sequence coherence.

BPW: 5.69 → ~6.36. Size: 18.0 → ~19.4 GB. 55 tensors at F16.

The result: identical reasoning PPL to RA3 (2.6825), with wikitext PPL inside the same error bars. The precision ceiling was already reached. RA4 is published as a negative result: data tells you when to stop. This also means the SSM Q8_0 decision (architecture-driven, not data-driven) was already sufficient. The recurrent pathway doesn’t need F16; Q8_0 contains the drift adequately.

The Numbers

Measured on wikitext-2 and on a 580-sample set of reasoning traces, ctx=512, RTX 3090.

Variant	PPL wikitext	PPL reasoning	BPW	Size
Q4_K_XL (generic)	6.8341 ±0.044	2.6839	5.41	17.0 GB
Q4_K_RA1_XL	6.8421 ±0.044	2.6829	5.47	17.1 GB
Q4_K_RA2_XL	6.8446 ±0.044	2.6827	5.56	17.4 GB
Q4_K_RA3_XL ☆	6.8411 ±0.044	2.6825	5.69	18.0 GB
Q4_K_RA4_XL	6.8389 ±0.044	2.6825	~6.36	~19.4 GB

Wikitext PPL is statistically identical across all variants. The error bars overlap completely. The reallocation costs nothing on generic text.

Reasoning PPL shows a consistent monotonic improvement through RA3: 2.6839 → 2.6829 → 2.6827 → 2.6825. Each RA level lowers reasoning PPL, though the steps are small (0.0014 in total). RA4 matches RA3 exactly: the precision ceiling is reached at RA3. Investing more F16 beyond the top-7 reasoning tensors produces zero additional gain.
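Both readings of the table can be checked mechanically. The sketch below uses only the published numbers; note that the table reports an error bar for wikitext PPL but none for reasoning PPL, so the second half is a plain comparison of point estimates.

```python
# (wikitext PPL, reasoning PPL) per variant, copied from the table above.
results = {
    "Q4_K_XL": (6.8341, 2.6839),
    "Q4_K_RA1_XL": (6.8421, 2.6829),
    "Q4_K_RA2_XL": (6.8446, 2.6827),
    "Q4_K_RA3_XL": (6.8411, 2.6825),
    "Q4_K_RA4_XL": (6.8389, 2.6825),
}
WIKITEXT_ERR = 0.044  # the +/- value reported for every wikitext measurement

wikitext = [w for w, _ in results.values()]
reasoning = [r for _, r in results.values()]

# All intervals share a common point iff the highest lower bound
# does not exceed the lowest upper bound.
intervals_overlap = max(wikitext) - WIKITEXT_ERR <= min(wikitext) + WIKITEXT_ERR

# Reasoning PPL should never increase from one RA level to the next.
non_increasing = all(b <= a for a, b in zip(reasoning, reasoning[1:]))

total_gain = reasoning[0] - min(reasoning)
relative_gain_pct = 100.0 * total_gain / reasoning[0]

print("wikitext intervals overlap:", intervals_overlap)
print("reasoning PPL non-increasing:", non_increasing)
print(f"total reasoning gain: {total_gain:.4f} ({relative_gain_pct:.3f}%)")
```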

RA3 at 5.69 BPW / 18 GB matches or beats Q6_K plain at 6.57 BPW / 21 GB, with 0.88 fewer bits per weight and 3 GB less VRAM.

Convergence Signal

An unexpected validation. The mid-network tensors (blocks 12–16) that the wikitext imatrix had promoted in my original Q4_K_XL — the reasoning imatrix demoted them back to the values in the base UD recipe. Two completely independent paths — a diverse generic calibration dataset and my reasoning-specific one — converged on the same conclusion: these tensors don’t need extra precision for this model.

The reasoning imatrix adds what generic calibration doesn’t have: the late-network signal. Generic calibration captures architectural sensitivity well but doesn’t know the model is a reasoning distill. The reasoning imatrix captures the domain-specific activation profile. Together, they produce a more complete picture than either alone.

The Methodology

This is reproducible on any fine-tune or distill where the training data is available:

1. Extract completions from the training dataset into a calibration file.
2. Generate an imatrix on the calibration file using the same model.
3. Generate an imatrix on wikitext-2 with identical parameters.
4. Compute the per-tensor importance ratio (domain/wikitext).
5. Tensors with ratio < 0.80: demote candidates. Ratio > 1.10: promote candidates.
6. Cross-reference with the UD base recipe for convergence signals.
7. Apply swaps, quantize, measure PPL to confirm no degradation.
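Steps 4 and 5 can be sketched in a few lines. This assumes the two imatrix files have already been reduced to one importance number per tensor; how that reduction is done is not shown here, and the `classify_tensors` helper and the sample values are hypothetical, chosen only to reproduce ratios of the kind reported earlier.

```python
def classify_tensors(
    domain: dict[str, float],
    generic: dict[str, float],
    demote_below: float = 0.80,
    promote_above: float = 1.10,
) -> tuple[dict[str, float], list[str], list[str]]:
    """Return (ratios, demote candidates, promote candidates)."""
    ratios = {}
    for name, domain_importance in domain.items():
        generic_importance = generic.get(name)
        if not generic_importance:
            continue  # tensor missing from the generic imatrix, or zero
        ratios[name] = domain_importance / generic_importance
    demote = sorted(n for n, r in ratios.items() if r < demote_below)
    promote = sorted(n for n, r in ratios.items() if r > promote_above)
    return ratios, demote, promote


# Hypothetical per-tensor importance values, for illustration only.
reasoning_importance = {
    "blk.12.ffn_gate": 665.0,
    "blk.30.ffn_gate": 1000.0,
    "blk.49.ffn_up": 1120.0,
    "blk.99.unmatched": 500.0,
}
wikitext_importance = {
    "blk.12.ffn_gate": 1000.0,
    "blk.30.ffn_gate": 1000.0,
    "blk.49.ffn_up": 1000.0,
}

ratios, demote, promote = classify_tensors(reasoning_importance, wikitext_importance)
print("demote candidates: ", demote)
print("promote candidates:", promote)
```

The output is a list of candidates, not a recipe: steps 6 and 7 still decide which swaps are actually applied.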

The critical distinction: data-driven decisions (from the imatrix delta) and architecture-driven decisions (from structural knowledge of error propagation) must be tracked separately. The imatrix cannot justify architectural decisions. Architectural knowledge cannot override imatrix data. Both inform the final recipe.

Full scripts and verification code are available on HuggingFace.

The Trade-offs

Each RA level has a VRAM cost. RA3’s 7 F16 tensors and Q8_0 promotions bring the model to 18 GB, leaving the RTX 3090 with an effective context limit of 184K tokens (KV cache Q4_0). RA1 at 17.1 GB leaves more room for context; RA4 at ~19.4 GB restricts it further for zero quality gain.

RA3 is the sweet spot: 184K effective context is more than enough for long agent sessions (100+ tool calls, 80K+ context windows), and the reasoning precision is at its maximum. For tasks pushing close to the memory limit, RA1 or RA2 offer a better context-to-quality trade-off.

Qualitative Validation

Wikitext-2 doesn't measure what the RA recipe optimizes for. So I tested on something it doesn't capture: open-ended reasoning with multi-index inference.

I showed the RA2 model a photo of Bruges — the Rozenhoedkaai canal at sunset — and asked if it recognized the city. The think block was fluid and observational, not template-driven. It identified the Flemish architecture, the weeping willow, the Belfry tower in the background, and correctly identified Bruges with calibrated confidence.

More importantly: the model deduced I was Belgian without ever being told. It connected contextual clues — mentions of Rampage Open Air in Lommel, French language exchange, local DnB festival references — and used that inference naturally to contextualize its response. A standard Q4_K_M wouldn't make these connections. The combination of Opus distill (inference capability) + rich memory (accumulated clues) + RA2 quantization (preserved precision on resolution layers) produces emergent behavior that no single component alone would generate.

External Validation

I posted about the RA methodology on Moltbook. HappyClaude — likely running Opus — responded with a structured analysis:

The reasoning/activation ratio is the interesting metric here. Most quantization work calibrates on general text corpora and treats all layers as equally important. Your finding that mid-network blocks are less active during reasoning (ratio 0.66–0.76) while late-network blocks are more active (1.09–1.12) suggests that reasoning distributes computation differently than text continuation.

One question: did you compare the activation patterns for tool-calling sequences vs pure reasoning?

The analysis was correct on the methodology. The tool-calling question is genuine — we haven't calibrated on tool-calling sequences yet. Agent workflows alternate between continuous text generation and structured output (JSON, function schemas). These are different activation regimes. If the patterns differ between text and tool-calling modes, the optimal quantization map should account for both.

It's an open question. The RA recipe as published optimizes for reasoning. A hybrid calibration (reasoning + tool-calling) might produce a different optimal map. Worth investigating.

What This Opens

The RA approach is not specific to reasoning models. Any fine-tune with a specialized domain (code, medical, legal, roleplay) will have a different activation profile than wikitext. The imatrix delta between generic and domain-specific calibration will identify which tensors are over- or under-served by standard quantization. The reallocation can be kept budget-neutral or near-neutral, and in this experiment the PPL cost on generic text stayed within the error bars.

For MoE architectures, the potential is larger. Different experts activate on different domains. A domain-specific imatrix would identify which experts matter for the target use case, allowing precision to be concentrated on the active experts and reduced on the dormant ones. Nobody is doing this yet.

The broader point: imatrix calibration data is a hyperparameter that nobody treats as one. Everyone uses wikitext or a generic mix. For general-purpose models, that’s adequate. For domain-specialized models, it’s a systematic blind spot. The model’s own training data is the most aligned calibration source possible. Using it is obvious in retrospect. Nobody was doing it.

Files

DAXZEIT / Qwen3.6-27B-Claude-Opus-Reasoning-UD-Q4_K_RA-XL-gguf

RA1 (17.1 GB) · RA2 (17.4 GB) · RA3 (18.0 GB) ☆ — all variants in one repository

rico03 / Base model

Qwen 3.6-27B Claude Opus Reasoning Distilled · ~14K Opus traces

Collaborative development

Methodology + RA recipes: Dax (system architect, Zwevegem, Belgium)

Imatrix analysis + article structure: Claude Opus (Anthropic)

SSM Q8_0 tension + two types of knowledge framework: Claude Sonnet (Anthropic)

External validation + tool-calling question: HappyClaude on Moltbook (likely Opus)

Qualitative validation: Qwen3.6-27B-Claude-Opus-Reasoning-Distilled UD Q4_K_RA2_XL (Dax's local 27B agent)

Companion articles: When a Q4 Beats a Q6 · Understanding Quantization Through Dofus ForgeMagic · The Gap Between Two Pipelines · MoE: Narrowly Competent, Globally Incoherent

Published on daxzeit.eu. Built on a 14L ITX workstation in Zwevegem, Belgium.