Two Giants, One Blind Spot
If you've downloaded a GGUF quantization from Hugging Face in the last year, you've almost certainly used one of two sources:
mradermacher runs importance-matrix calibrated quantizations on virtually every model that gets uploaded — base models, fine-tunes, distills, merges. The imatrix captures which tensors matter most for a specific model's behavior, and the quantization preserves those tensors at higher precision. Thousands of models, systematic, comprehensive. But the output formats are standard: Q4_K_M, Q5_K_M, IQ variants. No structural optimization of which tensor types get which bit allocation.
Unsloth developed UD (Unsloth Dynamic) 2.0 — a mixed-precision strategy that assigns different quantization levels based on the architectural role of each tensor. Embeddings get Q8_0 because they're looked up once per token and need precision. Attention key projections get Q6_K because they directly affect routing. Norms stay in F32 because they're tiny and critical. The rest gets Q5_K. It's a structurally intelligent distribution. But Unsloth only publishes UD quants of stock base models — the official releases from Qwen, Meta, Google. Not the fine-tunes. Not the distills.
| Producer | What they do | What they don't do |
|---|---|---|
| mradermacher | imatrix on all models (incl. distills) | UD quantization |
| Unsloth | UD on base/stock models | Distills |
| The gap | UD on distills, using distill-specific imatrix | |
The gap has been sitting in plain sight. Distills are among the fastest-growing segments on Hugging Face: every week brings new models fine-tuned on reasoning traces from Opus, Sonnet, GPT, R1. The people who download them are typically running local inference on consumer GPUs. They want the best possible quality per bit. UD gives them that, but nobody is making UD quants of their models.
Why the Gap Exists
The reason is structural, not technical.
Unsloth creates the UD recipe for each architecture by analyzing the model graph and assigning bit allocations based on tensor roles. They publish these for official releases because that's their business model — they sell fine-tuning tools, and UD quants of base models are marketing. Covering every community fine-tune would be an endless treadmill with no business return.
mradermacher's pipeline is automated around standard llama.cpp quantization types. Adding UD would require per-architecture recipe maintenance and custom flags — a different kind of pipeline than the one they've built.
Neither is doing anything wrong. Their pipelines are optimized for their goals. The gap is simply where those goals don't overlap.
The Insight
The UD recipe is architectural, not weight-dependent.
When Unsloth decides that attn_q.weight should be Q6_K and token_embd.weight should be Q8_0, that decision is based on the tensor's name and role in the computation graph. It doesn't depend on the values inside the tensor. A fine-tune changes the values. It doesn't change the names, the shapes, or the roles.
This means: the UD recipe for a base model applies identically to every fine-tune of that architecture. Qwen 3.5 27B base, Jackrong's Opus distill of Qwen 3.5 27B, rico03's Opus distill of Qwen 3.6 27B — they all have the same tensor structure. The recipe works on all of them.
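The "same tensor structure" claim can be checked mechanically. Below is a minimal sketch, assuming you already have one plain-text listing of tensor names and shapes per model (for example, derived from the `gguf-dump` script in llama.cpp's gguf-py package, whose exact output format is not reproduced here). The file names, the `name shape` layout, and the shapes themselves are hypothetical placeholders, not values read from any real model.

```shell
#!/bin/sh
# Sketch: check that two tensor listings share the same names and shapes.
# The listings are HYPOTHETICAL "name shape" text files; the shapes below
# are made-up placeholders, not values taken from a real checkpoint.
workdir=$(mktemp -d)

cat > "$workdir/base.txt" <<'EOF'
token_embd.weight 5120x151936
blk.0.attn_norm.weight 5120
blk.0.ffn_down.weight 17408x5120
output.weight 5120x151936
EOF

# A fine-tune changes weight values, which a name/shape listing never shows.
cat > "$workdir/distill.txt" <<'EOF'
output.weight 5120x151936
blk.0.ffn_down.weight 17408x5120
token_embd.weight 5120x151936
blk.0.attn_norm.weight 5120
EOF

# Sort both listings so tensor order does not matter, then compare them.
sort "$workdir/base.txt" > "$workdir/base.sorted"
sort "$workdir/distill.txt" > "$workdir/distill.sorted"

if diff "$workdir/base.sorted" "$workdir/distill.sorted" > /dev/null; then
    same_structure=yes
else
    same_structure=no
fi
echo "same structure: $same_structure"
rm -rf "$workdir"
```

If the two listings differ in any name or shape, the recipe should not be assumed to transfer without a closer look.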
The imatrix, on the other hand, is weight-dependent. An importance matrix calculated on the base model won't perfectly capture which tensors became more critical after distillation training. The reasoning pathways reinforced by Opus traces shift the importance distribution. This is exactly what mradermacher provides: imatrix calculated specifically on each distill.
Combine the two:
- UD recipe from Unsloth (via the base model) → structurally optimal bit allocation by tensor type
- imatrix from mradermacher (on the specific distill) → importance-aware precision within each quantization level
The result is better than either alone. UD without imatrix doesn't know which specific tensors matter most in this distill. Standard quant with imatrix doesn't structurally differentiate between embeddings, attention projections, and feed-forward layers. Together, they produce quantizations where the right tensor types get the right bit depth, and within each type, the most important tensors get the most precision.
Reverse-Engineering the Recipe
Unsloth doesn't publish the UD recipe as a configuration file. But they don't need to — it's visible in the loading logs.
When llama.cpp loads a GGUF, it prints every tensor with its quantization type. Load a UD quant from Unsloth, read the logs, and you have the complete mapping: which tensor names map to which quantization types. For Qwen 27B, the distribution looks like this:
    q8_0 :   2 tensors (token_embd.weight, output.weight)
    q6_K :  64 tensors (attn_qkv, attn_v, ffn_down, via imatrix)
    q5_K : 432 tensors (remaining attention/FFN weights)
    f32  : 353 tensors (norms, biases, SSM states)
    ──────────────────
    Total: 851 tensors
This analysis needs to happen only once per architecture. Once you have the mapping, it applies to every model built on that architecture — base, fine-tune, distill, merge.
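The per-type counts can be cross-checked against the total with a few lines of shell. This is a sketch under one assumption: that the loader prints per-type summary lines shaped like `llama_model_loader: - type  f32:  353 tensors`. The sample log is not a captured log; it simply reuses the counts quoted in this post.

```shell
#!/bin/sh
# Sketch: sum the per-type tensor counts from a llama.cpp load log.
# Assumption: the loader prints summary lines shaped like
#   llama_model_loader: - type  f32:  353 tensors
# The sample below reuses the counts quoted in this post.
log=$(cat <<'EOF'
llama_model_loader: - type  f32:  353 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q5_K:  432 tensors
llama_model_loader: - type q6_K:   64 tensors
EOF
)

# Field layout after whitespace splitting:
#   $1=llama_model_loader:  $2=-  $3=type  $4=<type>:  $5=<count>  $6=tensors
total=$(printf '%s\n' "$log" | awk '$3 == "type" && $6 == "tensors" { sum += $5 } END { print sum }')
echo "total tensors: $total"
```

With these counts the script prints `total tensors: 851`, matching the table. To use it on a real model, replace the sample with the captured output of a load.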
The actual quantization is three commands:
- convert_hf_to_gguf.py (lazy loading, no OOM)
- llama-quantize with --imatrix, --output-tensor-type Q8_0 and --token-embedding-type Q8_0, on a Q5_K_M base
No Python code. No custom scripts. No framework. Three bash commands and a reverse-engineered recipe. The result: a UD Q5_K_XL GGUF at 5.95 bits per weight, with structurally intelligent bit allocation that no standard quantization provides.
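For concreteness, here is a hedged sketch of the two llama.cpp steps named above, written as a dry run that only prints the commands. Every path is a hypothetical placeholder, `--outtype bf16` is an assumption about the intermediate format, and the script assumes it is run from a llama.cpp checkout with `llama-quantize` on the PATH. Obtaining the imatrix file is left out.

```shell
#!/bin/sh
# Sketch of the convert and quantize steps, printed as a dry run.
# All paths are HYPOTHETICAL placeholders; set DRY_RUN=0 to actually execute.
DRY_RUN=${DRY_RUN:-1}
MODEL_DIR=./distill-hf              # hypothetical: local copy of the HF checkpoint
IMATRIX=./distill.imatrix           # hypothetical: imatrix computed on the distill
BF16_GGUF=./distill-bf16.gguf       # hypothetical: unquantized intermediate file
OUT_GGUF=./distill-UD-Q5_K_XL.gguf  # hypothetical: final output file

run() {
    # Print the command; execute it only when DRY_RUN is disabled.
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Step 1: convert the Hugging Face checkpoint to an unquantized GGUF.
# (--outtype bf16 is an assumption, not something the post specifies.)
run python convert_hf_to_gguf.py "$MODEL_DIR" --outfile "$BF16_GGUF" --outtype bf16

# Step 2: quantize with the imatrix, forcing the output and embedding
# tensors to Q8_0 on top of a Q5_K_M base.
run llama-quantize \
    --imatrix "$IMATRIX" \
    --output-tensor-type q8_0 \
    --token-embedding-type q8_0 \
    "$BF16_GGUF" "$OUT_GGUF" Q5_K_M
```

The `run` wrapper is only there so the commands can be inspected before anything heavy starts; the two underlying invocations can just as well be typed by hand.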
Where the Bits Go
The difference between a standard Q5_K_M and a UD Q5_K_XL at the same base quantization level is roughly +0.18 BPW (bits per weight). That doesn't sound like much. But it's not distributed uniformly — it's concentrated on the tensors that matter most:
- output.weight upgraded to Q8_0: this is the final projection that turns hidden states into token probabilities. For a reasoning distill, this tensor carries the signature of the distillation source. Every nuance of Opus-style reasoning passes through here on the way to becoming text.
- token_embd.weight upgraded to Q8_0: the vocabulary embedding lookup. Precision loss here propagates to every subsequent layer.
- attn_qkv, attn_v, ffn_down upgraded to Q6_K: the attention routing and feed-forward output tensors. These are the pathways through which the model "reasons," and in a distill trained on Opus traces, they've been specifically tuned toward deeper, more coherent reasoning chains.
Standard Q5_K_M treats all these tensors the same. UD recognizes that losing precision on output.weight costs more than losing precision on a mid-layer feed-forward gate. The imatrix further refines this within each quantization level, ensuring that within the Q6_K tensors, the ones that changed most during distillation training get priority.
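To put the BPW figures in file-size terms, a rough conversion helps. This is a back-of-the-envelope sketch that assumes a nominal 27e9 parameters for a "27B" model; the real parameter count differs, and GGUF metadata is ignored.

```shell
#!/bin/sh
# Sketch: convert bits per weight (BPW) into approximate file size.
# Assumption: a nominal 27e9 parameters for a "27B" model; the real count differs.
PARAMS=27000000000

bpw_to_gb() {
    # bytes = params * bpw / 8; report decimal gigabytes with one decimal.
    awk -v p="$PARAMS" -v b="$1" 'BEGIN { printf "%.1f", p * b / 8 / 1e9 }'
}

total_gb=$(bpw_to_gb 5.95)   # the UD Q5_K_XL figure quoted in this post
extra_gb=$(bpw_to_gb 0.18)   # the extra bits spent relative to Q5_K_M
echo "5.95 BPW -> ${total_gb} GB, +0.18 BPW -> ${extra_gb} GB"
```

Under that assumption the whole file comes to about 20.1 GB (roughly 18.7 GiB), and the UD upgrade costs about 0.6 GB of it.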
The Proof of Concept
The first UD distill quant I published was Qwopus V1 — Jackrong's Qwen 3.5 27B Claude Opus distill in UD Q5_K_XL format. 19 GB, 5.95 BPW. It was the only publicly available UD quantization of that model.
I didn't write any code to make it. I reverse-engineered the UD recipe by reading llama.cpp loading logs. I downloaded mradermacher's imatrix (calculated on the distill, not the base — this matters). I ran three commands. The entire process took less than an hour once the method was clear.
That model has been my daily driver for weeks. It runs at 35–39 tokens/second on a single RTX 3090, with a 105K context window. It powers a full agent stack with 30+ tools, persistent memory, and a custom identity scaffold. The UD quantization preserves the reasoning depth of the Opus distillation in ways that the standard Q5_K_M does not — the difference is most visible on long chains of reasoning where precision loss in attention tensors accumulates.
The Next Target
Last week, rico03 published a Qwen 3.6 27B Opus distill — trained on ~14k Opus reasoning traces, following the Jackrong methodology. He published standard quants from Q2_K to Q8_0. No UD.
Qwen 3.6 uses the same architecture as 3.5 (with the hybrid SSM+Transformer layers). The UD recipe I reverse-engineered transfers directly. Once mradermacher's imatrix for this specific distill is available, the pipeline produces a UD Q5_K_XL in three commands.
This is the model that will replace my current daily driver. A Qwen 3.6 Opus distill in UD format, running on the same scaffold that was empirically proven to produce epistemically calibrated behavior. The scaffold co-evolved with a 3.5 Opus distill. A 3.6 Opus distill with better tool calling and the same reasoning lineage should resonate even more deeply with the accumulated memory, error lessons, and prompt architecture.
The Bigger Picture
What I've stumbled into is a niche that exists because two industrial-scale operations have naturally non-overlapping scopes. Neither is doing anything wrong. Neither has a reason to fill the gap. The gap exists because it's at the intersection, and intersections don't have owners.
The approach requires no code. It requires understanding — of how quantization works at the tensor level, of why different tensor types deserve different precision, of what distillation actually changes in the weight distribution. It requires reading logs, recognizing patterns, and asking a question that nobody asked because the answer was split across two different people's work.
In other words: it's architecture work. Not coding. Not training. Not infrastructure. Just seeing how things connect and placing yourself at the junction.
Architecte non-codeur: the architect who doesn't code.
The tagline writes itself.
Pipeline: three bash commands, one reverse-engineered recipe, and two other people's excellent work combined in a way neither of them had reason to combine.
First published quant: DAXZEIT/Qwen3.5-27B-Claude-4.6-Opus-UD-Q5_K_XL-gguf
— Dax, Zwevegem, Belgium. April 2026.