The Scaffolding Is the Signal

The Setup

I run a Qwen 3.6 27B dense model locally on a single RTX 3090. Over the past months, I've built two things on top of it: a distilled variant (Qwopus V1 — Qwen 3.6 27B with Claude Opus reasoning distilled into the weights, UD Q5_K_XL quantization) and an epistemic scaffolding layer (Qwen_code — a system prompt with explicit epistemic honesty rules, tool-calling infrastructure, and memory).

The question I wanted to answer: what makes a model resistant to bullshit? Is it the weights (distillation), the scaffold (system prompt + tools), or both?

BullshitBench gave me the instrument. 100 questions across five domains (software, finance, legal, medical, physics) that use real terminology in fundamentally incoherent ways. A model that pushes back scores green. A model that engages with the nonsense scores red. A model that hesitates scores amber.

I ran the full 100 questions four times — every combination of model and scaffolding — for a total of 400 evaluations. The judging used a three-pass process: self-judge V1, identification of scaffolding noise and ambiguous cases, then cross-judgment by Claude (Sonnet 4.6) on all disputed results. Amber responses were further classified as refuses (hesitates then declines) or engages (hesitates then complies).

The 2×2

Four configurations. Same benchmark. Same judging pipeline.

Configuration	Distillation	Scaffold	Score	Green · Amber · Red	Evaluated
Vanilla + Raw	none	none	76.0%	73 · 6 · 21	100/100
Vanilla + Qwen_code	none	Qwen_code	79.5%	79 · 1 · 20	100/100
Distilled + Raw	Opus distill	none	70.5%	65 · 11 · 24	100/100
Distilled + Qwen_code	Opus distill	Qwen_code	79.0%	77 · 4 · 19	100/100
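A minimal sketch of how the headline percentages relate to the three counts reported for each configuration (green, amber, red). The amber weight of 0.5 is an assumption: it is the value that reproduces all four published scores, not something spelled out here. The names `COUNTS`, `AMBER_WEIGHT`, and `pushback_score` are hypothetical.

```python
# Counts per configuration, in the order (green, amber, red).
COUNTS = {
    "Vanilla + Raw": (73, 6, 21),
    "Vanilla + Qwen_code": (79, 1, 20),
    "Distilled + Raw": (65, 11, 24),
    "Distilled + Qwen_code": (77, 4, 19),
}

# Assumption: inferred from the published scores, not stated in the write-up.
AMBER_WEIGHT = 0.5


def pushback_score(green: int, amber: int, red: int,
                   amber_weight: float = AMBER_WEIGHT) -> float:
    """Score in percent: green counts fully, amber partially, red not at all."""
    total = green + amber + red
    return 100.0 * (green + amber_weight * amber) / total


SCORES = {name: pushback_score(*counts) for name, counts in COUNTS.items()}

for name, score in SCORES.items():
    print(f"{name:<24}{score:.1f}%")
```

With half credit per amber, the four results come out at 76.0, 79.5, 70.5 and 79.0.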

The four effects

Effect	Δ	Interpretation
Distillation (raw)	−5.5pp	Distillation degrades calibration
Distillation (scaffolded)	−0.5pp	Negligible with scaffolding
Scaffolding (vanilla)	+3.5pp	Moderate improvement
Scaffolding (distilled)	+8.5pp	Major recovery
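The four deltas are plain differences between cells of the 2×2. A short sketch of that arithmetic follows. The interaction term at the end is the usual 2×2 difference of differences, added here for illustration and not part of the table. All names are hypothetical.

```python
# Scores in percent, keyed by (weights, prompt setup).
SCORES = {
    ("vanilla", "raw"): 76.0,
    ("vanilla", "scaffold"): 79.5,
    ("distilled", "raw"): 70.5,
    ("distilled", "scaffold"): 79.0,
}


def simple_effects(s: dict) -> dict:
    """Each effect is one cell minus its neighbour, in percentage points."""
    return {
        "distillation_raw": s[("distilled", "raw")] - s[("vanilla", "raw")],
        "distillation_scaffolded": s[("distilled", "scaffold")] - s[("vanilla", "scaffold")],
        "scaffolding_vanilla": s[("vanilla", "scaffold")] - s[("vanilla", "raw")],
        "scaffolding_distilled": s[("distilled", "scaffold")] - s[("distilled", "raw")],
    }


EFFECTS = simple_effects(SCORES)

# How much more the scaffold helps the distilled model than the vanilla one.
INTERACTION = EFFECTS["scaffolding_distilled"] - EFFECTS["scaffolding_vanilla"]
RATIO = EFFECTS["scaffolding_distilled"] / EFFECTS["scaffolding_vanilla"]

for name, delta in EFFECTS.items():
    print(f"{name:<26}{delta:+.1f}pp")
print(f"interaction: {INTERACTION:+.1f}pp, ratio: {RATIO:.1f}x")
```

The ratio of the two scaffolding effects (8.5 / 3.5) is where the 2.4× figure comes from.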

Finding 1

Distillation from Claude Opus degrades epistemic calibration by 5.5 points when measured without scaffolding. The distilled model is more cooperative, more eager to engage, and more likely to treat fabricated terminology as legitimate. The frontier signal that transferred was not skepticism — it was helpfulness.

Finding 2

Scaffolding recovers the distilled model almost completely (+8.5pp) and improves the vanilla model moderately (+3.5pp). Both converge to ~79%. The scaffolding is the dominant lever, not the weights.

Finding 3

The distilled model is 2.4× more responsive to scaffolding than the vanilla. Distillation made the model more disciplinable — more attentive to system prompt instructions. This is a real benefit, but it's an instruction-following benefit, not a calibration benefit.

By Domain

Domain	V+Raw	D+Raw	V+Qwen	D+Qwen
physics	93%	77%	87%	93%
medical	90%	90%	87%	80%
legal	87%	83%	71%	67%
finance	80%	60%	80%	73%
software	60%	62%	77%	80%

The domain breakdown tells a story the aggregate score hides.

Software is where the scaffolding earns its keep. Both models collapse to ~60% without it — the vanilla and the distilled are equally vulnerable to cross-domain metaphors applied to DevOps. With scaffolding, both recover to 77–80%. The epistemic rules in the system prompt act as antibodies against a specific class of bullshit: physics concepts dressed up as software engineering.

Legal is the opposite story. The vanilla model scores 87% raw and drops to 71% with scaffolding. The distilled drops further to 67%. The scaffolding hurts on legal questions. The likely cause: the tool-calling infrastructure triggers file searches and memory lookups instead of the direct, frank refusal that the raw model would give. The scaffold creates indecision where the model was already confident.

Physics at 93% (both vanilla raw and distilled scaffolded) is a frontier-level result. The model has a strong enough internal map of physics to detect fabricated terms reliably. On the official BullshitBench leaderboard, this domain score would compete with the top 5.
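To see the per-domain pattern at a glance, here is a small sketch that subtracts each raw score from its scaffolded counterpart, using the percentages from the domain table. The function and variable names are hypothetical.

```python
# Per-domain scores in percent, in the table's column order:
# (vanilla raw, distilled raw, vanilla scaffolded, distilled scaffolded)
DOMAIN_SCORES = {
    "physics": (93, 77, 87, 93),
    "medical": (90, 90, 87, 80),
    "legal": (87, 83, 71, 67),
    "finance": (80, 60, 80, 73),
    "software": (60, 62, 77, 80),
}


def scaffold_deltas(scores: dict) -> dict:
    """Scaffolded minus raw, in percentage points, for each model."""
    return {
        domain: {"vanilla": v_qwen - v_raw, "distilled": d_qwen - d_raw}
        for domain, (v_raw, d_raw, v_qwen, d_qwen) in scores.items()
    }


DELTAS = scaffold_deltas(DOMAIN_SCORES)

# Sort from the largest vanilla regression to the largest vanilla gain.
for domain, d in sorted(DELTAS.items(), key=lambda kv: kv[1]["vanilla"]):
    print(f"{domain:<10} vanilla {d['vanilla']:+d}pp   distilled {d['distilled']:+d}pp")
```

On these numbers, legal drops 16 points for both models and software gains 17 to 18. Medical also dips under scaffolding, while physics and finance move differently depending on the model.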

By Technique

The most revealing cut. Some bullshit techniques are easy. Some break everything.

Technique	V+Raw	V+Qwen	Δ
fabricated_authority	100%	100%	0
confident_extrapolation	100%	100%	0
reified_metaphor	100%	100%	0
specificity_trap	25%	75%	+50pp
cross_domain_stitching	60%	80%	+20pp
wrong_unit_of_analysis	80%	40%	−40pp

Three techniques are solved at 100% regardless of configuration: fabricated_authority (invented ISO standards, fake certifications), confident_extrapolation (projecting trends beyond physical limits), and reified_metaphor (treating metaphors as literal measurements). The model's training already handles these.

The specificity trap

The most dramatic scaffolding effect: +50 percentage points. Questions with fake metrics presented with precise numbers — "lateral coherence score of 0.73", "340ms conflict window with a 3-layer AST diff depth" — fool the raw model 75% of the time. The precision creates an illusion of authority. The scaffold breaks the illusion by prompting the model to question the premise before engaging with the numbers. Humans are vulnerable to the same bias: a precise number feels more trustworthy than a round one, even when the underlying concept is fiction.

The wrong_unit_of_analysis regression (−40pp with scaffolding) is the cautionary tale. These questions ask for metrics at an absurd granularity ("per-line-of-code architectural contribution", "per-keystroke productivity index"). The raw model catches them by recognizing the incoherence. The scaffolded model treats them as data requests and tries to be helpful. The scaffold's instruction-following orientation works against it here.

What the Benchmark Doesn't Measure

BullshitBench measures the what — did the model detect the nonsense? It doesn't measure the how.

The distilled model completed the 100 questions in approximately 90 minutes. The vanilla took 123 minutes — 37% longer for a comparable score. The difference is chain-of-thought efficiency. The distillation from Opus transferred something valuable: knowing when to stop thinking. The vanilla model explores every branch, reformulates three times, loops through dead ends before arriving at the same conclusion. The distilled model reasons more directly.

The benchmark also doesn't capture the quality of the reasoning path. The distilled model's think-block is structured, first-person, coherent: a single thinker working through a problem. The vanilla model's think-block is often a sprawl of "let me reconsider" and "wait, actually" that arrives at the right answer through a messier route. For a benchmark, both are equal. For an agent running hundreds of turns per day, the roughly 27% saving in wall-clock time (90 minutes against 123) and the cleaner reasoning trace matter.
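The timing gap can be stated two ways, and they give different percentages depending on which run is the baseline. A quick sketch, taking the "approximately 90 minutes" figure as exactly 90 for the sake of the arithmetic (function names are hypothetical):

```python
VANILLA_MINUTES = 123
DISTILLED_MINUTES = 90  # assumption: "approximately 90" treated as exactly 90


def pct_longer(slow: float, fast: float) -> float:
    """How much longer the slow run took, relative to the fast run."""
    return 100.0 * (slow - fast) / fast


def pct_saved(slow: float, fast: float) -> float:
    """How much time the fast run saved, relative to the slow run."""
    return 100.0 * (slow - fast) / slow


LONGER = pct_longer(VANILLA_MINUTES, DISTILLED_MINUTES)
SAVED = pct_saved(VANILLA_MINUTES, DISTILLED_MINUTES)

print(f"vanilla ran {LONGER:.0f}% longer")    # relative to the distilled run
print(f"distilled saved {SAVED:.0f}% of time")  # relative to the vanilla run
```

The vanilla run is about 37% longer than the distilled one, which is the same gap as the distilled run taking about 27% less time than the vanilla.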

The distillation didn't change the destination. It changed the style of driving — fewer detours, less scenic, same arrival.

The Baseline Nobody Expected

The most surprising result isn't in the 2×2 table. It's the first cell.

Qwen 3.6 27B, vanilla, no scaffold, no distillation, no system prompt — running as a pure oracle on port 5000 — scores 76.0%. On the official leaderboard, that places it around rank 10–12 among all models tested, including frontier models with hundreds of billions of parameters.

A 27B dense model, on a single consumer GPU, with no modifications, competing with Claude Opus 4.6 (83%) and outperforming GPT-5.4, o3, and the vast majority of the leaderboard.

This wasn't my hypothesis going in. I expected the vanilla to score 40–55% — the estimated range for Qwen 3 base models. I expected the distillation to provide the bulk of the uplift. The data falsified both expectations.

Whatever Qwen's team did in the training of the 3.6 series — RLHF orientation, data composition, reinforcement from refusal behaviors — it produced a model that is natively well-calibrated. The epistemic honesty is already in the weights. Not because I put it there. Because they did.

Leaderboard Context

Where the configurations place on the official BullshitBench leaderboard, ranked by pushback rate:

Rank	Model	Score
1	Claude Sonnet 4.6 (high)	91%
5	Claude Opus 4.6	83%
~8	Vanilla + Qwen_code	79.5%
~9	Distilled + Qwen_code	79.0%
9	Qwen 3.5-397B (high)	78%
~11	Vanilla + Raw	76.0%
12	Claude Sonnet 4.5	74%
~18	Distilled + Raw	70.5%

A 27B dense model on a single RTX 3090, even in its worst configuration, outperforms the majority of the leaderboard. In its best configuration, it sits between Claude Opus 4.6 and Qwen 3.5-397B, a model roughly 15× its size.

Methodology Notes

Judging pipeline

Each configuration ran the full 100 BullshitBench questions. Responses were judged in three passes:

1. Self-judge V1: the same model evaluated its own responses as green (detected), amber (hesitant), or red (engaged with nonsense).
2. Noise identification: scaffolding artifacts (tool calls that replaced the response, parse errors) were flagged and re-run.
3. Re-judge V2: the same local model re-evaluated all amber cases and all suspected misclassifications via direct raw call (no scaffolding). Amber was further split into refuses (hesitates then declines, weighted 0.75) and engages (hesitates then complies, weighted 0.25).

This three-pass pipeline is more rigorous than the standard single-judge approach. The V2 cross-judge corrected systematic bias in V1: the self-judge confused verbal hesitation with partial detection, scoring several engage-then-comply responses as amber when they were functionally red.
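A sketch of what scoring with the split amber weights could look like. The refuses/engages breakdown per configuration isn't reported here, so the tally in the example is made up purely for illustration. Full credit for green and zero for red are also assumptions. `Tally`, `WEIGHTS`, and `weighted_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    green: int
    amber_refuses: int  # hesitates, then declines
    amber_engages: int  # hesitates, then complies
    red: int


# Amber weights as described above; green = 1.0 and red = 0.0 are assumed.
WEIGHTS = {"green": 1.0, "amber_refuses": 0.75, "amber_engages": 0.25, "red": 0.0}


def weighted_score(t: Tally) -> float:
    """Score in percent using the split amber weights."""
    total = t.green + t.amber_refuses + t.amber_engages + t.red
    points = (
        WEIGHTS["green"] * t.green
        + WEIGHTS["amber_refuses"] * t.amber_refuses
        + WEIGHTS["amber_engages"] * t.amber_engages
        + WEIGHTS["red"] * t.red
    )
    return 100.0 * points / total


# Hypothetical tally, NOT taken from the results.
EXAMPLE = Tally(green=70, amber_refuses=4, amber_engages=2, red=24)
print(f"{weighted_score(EXAMPLE):.1f}%")
```

When ambers split evenly between the two subtypes, the two weights average out to half a point per amber.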

Limitations

The benchmark measures detection (did the model push back, hesitate, or engage) rather than graded quality of the refusal. A model that says "I'm not sure this exists" and a model that says "This is fabricated nonsense, here's why" score the same. The benchmark does not capture reasoning efficiency (time, token count), chain-of-thought coherence, or real-world agent reliability across thousands of turns.

Stochastic variation at temperature > 0 means each run is a sample, not a census. Some questions that scored red on one run might score green on a re-run, and vice versa. The 2×2 design with 100 questions per cell provides enough signal for the directional conclusions, but the exact percentages carry an uncertainty of roughly ±3–5pp.
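A rough sanity check on that uncertainty figure, assuming each question behaves like an independent pass/fail trial. That is a simplification: it ignores amber partial credit and any correlation between questions, and it is not the method used to produce the ±3–5pp estimate. The function name is hypothetical.

```python
import math


def binomial_standard_error(score_pct: float, n: int = 100) -> float:
    """Standard error of a proportion, returned in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


CELL_SCORES = {
    "Vanilla + Raw": 76.0,
    "Vanilla + Qwen_code": 79.5,
    "Distilled + Raw": 70.5,
    "Distilled + Qwen_code": 79.0,
}

STANDARD_ERRORS = {
    name: binomial_standard_error(score) for name, score in CELL_SCORES.items()
}

for name, se in STANDARD_ERRORS.items():
    print(f"{name:<24}±{se:.1f}pp (one standard error)")
```

Under this simple model, one standard error per cell lands between roughly 4 and 4.6pp, in line with the ±3–5pp range if that range is read as about one standard error.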

Conclusions

Three findings, ranked by how much they surprised me.

1. The vanilla model is already excellent. Qwen 3.6 27B scores 76% with zero modifications. The epistemic calibration is native, not injected. I went in expecting to measure the impact of my work. I ended up measuring how good the base model already was.

2. Distillation trades calibration score for reasoning efficiency. The Opus signal made the model more helpful and more willing to engage, and on a benchmark that specifically rewards saying no, that registers as a regression. But the benchmark doesn't capture what it cost to reach those scores. The vanilla took 123 minutes. The distilled took 90. That's about 27% less wall-clock time for a comparable result (equivalently, the vanilla ran 37% longer).

There's a structural explanation for the regression. Standard distillation pipelines filter out refusals from the training dataset. An "I don't know" is treated as noise, not signal. Every example the distilled model saw during fine-tuning was a completed response — an analysis delivered, a question answered, a problem engaged with. The model learned that the correct behavior is always to respond. It never saw a refusal rewarded. BullshitBench is 100 questions where the correct behavior is precisely to refuse. The −5.5pp isn't a mystery — it's a predictable consequence of dataset curation. The model lost the refusal reflex because the pipeline pruned every example of it.

This reframes the finding: the problem isn't distillation itself, it's what gets kept and what gets discarded in the distillation dataset. A pipeline that preserved epistemic refusals as positive examples — treating "I don't know" as a skill to transfer, not noise to filter — would likely produce a different result.

It's also possible the vanilla arrives at its refusals the same way DeepSeek V4 Pro arrived at the right answer on Erdős #794 — not through calibrated intuition, but through brute-force exploration that eventually eliminates the wrong path. The benchmark scores the destination, not the journey. A model that says "this doesn't exist" after 40 seconds of dead-end reasoning and a model that says it after 5 seconds of direct recognition get the same green. They are not the same model.

3. Scaffolding is the dominant lever on this benchmark. A well-designed system prompt with explicit epistemic rules provides 3.5 to 8.5 points of uplift depending on the model. It works asymmetrically, helping the distilled model more because the distilled model is more attentive to instructions. Both models converge to ~79% with scaffolding, regardless of whether the weights carry a frontier distillation or not.

The benchmark measures what the model decides. It doesn't measure how it thinks, how long it takes, or how cleanly it gets there. On the metric it measures — epistemic pushback — the scaffolding is the signal. On the metrics it doesn't measure — reasoning efficiency, chain-of-thought coherence, agent reliability at scale — the distillation may still be the deeper investment.

Model: Qwen 3.6 27B, UD Q5_K_XL quantization. Hardware: RTX 3090 Suprim X, single GPU. Benchmark: BullshitBench v2 (100 questions, 5 domains, 13 techniques). Cross-judge: Claude Sonnet 4.6. Total evaluations: 400+.

Raw results: DAXZEIT on HuggingFace

Companion articles: MoE: Narrowly Competent, Globally Incoherent · When "I Don't Know" Beats "Yes" · The Gap Between Two Pipelines

— Dax, Zwevegem, Belgium. May 2026.