Decorrelation, the leaky human, and the danger you can't name

The only question

Every argument about AI output quality asks the wrong thing — fast enough, good enough, did the threat-prompt squeeze out another twenty-four percent. The real question is harder: what is validating this? A system that writes its own work and then grades its own work has no validation in it — just a confident number on the far end. The evidence is direct: LLM judges recognize and favor their own outputs, and models cannot reliably self-correct their own reasoning. Real validation is defense in depth — layers, each catching what the last let through — and it only works if the layers fail differently.

Decorrelation, not independence

That last clause is the whole game, and it is the most robust result in reliability engineering: redundancy without diversity fails. Stack identical layers and one common cause defeats all of them at once — nuclear and aviation safety call this common-mode failure, and it is why regulators mandate diverse protection, not merely duplicated protection.
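A back-of-envelope sketch of why duplication without diversity buys so little. It assumes two layers that each miss a given flaw with probability p1 and p2, and a hypothetical correlation coefficient rho between their misses. The function name and the numbers are illustrative assumptions, not measurements from any cited study.

```python
import math


def joint_miss(p1: float, p2: float, rho: float) -> float:
    """Probability that two layers BOTH miss the same flaw.

    p1, p2: each layer's individual miss probability.
    rho:    correlation between the two miss events (0 = independent,
            1 = the layers always fail together when p1 == p2).

    Uses the identity for two Bernoulli events:
        P(A and B) = p1*p2 + rho * sqrt(p1*(1-p1) * p2*(1-p2))
    """
    covariance = rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    return p1 * p2 + covariance


# Two layers that each miss 10% of flaws, at three levels of correlation.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  both miss: {joint_miss(0.1, 0.1, rho):.3f}")
```

Under these assumed numbers, independent layers jointly miss about 1% of flaws, while perfectly correlated layers miss 10%, exactly what a single layer misses alone. The second identical layer added cost and no protection.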

But this is the correction the first draft needed. Genuine independence is far harder to buy than “use a different model.” When Knight and Leveson had 27 versions of the same program written independently from one specification, the versions failed in correlated ways, and the independence assumption was rejected. The same is now measured in language models: their errors correlate well above chance and, the uncomfortable part, that correlation rises with capability, even across different providers and architectures. Shared lineage (a common base model, shared training data, distillation) guarantees shared blind spots.

So independence is not a switch you flip; it is a degree you measure and audit. Rename the variable decorrelation, and rank your options worst to best: same model or a fine-tuned cousin (treat as zero offset); a different model family (partial, shrinking offset — measure it); a different kind of reasoner, a human from first principles (the best available, still imperfect). A relabel is not a layer. That is independence theater.
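One way to put a number on that degree, sketched under assumptions: you have a labelled evaluation set and the answers two candidate checkers gave on it. The function name and the toy data are hypothetical; the statistic is the share of both-wrong items on which the two models gave the same wrong answer.

```python
from typing import Optional, Sequence


def agreement_when_both_err(
    answers_a: Sequence[str],
    answers_b: Sequence[str],
    truth: Sequence[str],
) -> Optional[float]:
    """Among items both models got wrong, the share where they gave
    the SAME wrong answer. Returns None if they never both err."""
    if not (len(answers_a) == len(answers_b) == len(truth)):
        raise ValueError("all three sequences must have the same length")
    # Keep only the items where both models disagree with the label.
    both_wrong = [
        (a, b)
        for a, b, t in zip(answers_a, answers_b, truth)
        if a != t and b != t
    ]
    if not both_wrong:
        return None
    same_wrong = sum(1 for a, b in both_wrong if a == b)
    return same_wrong / len(both_wrong)


# Toy data, invented for illustration only.
truth = ["A", "B", "C", "D", "A", "B"]
model_a = ["A", "C", "D", "D", "B", "C"]
model_b = ["A", "C", "A", "D", "B", "D"]
print(agreement_when_both_err(model_a, model_b, truth))
```

Read the result against a chance baseline: on questions with k answer options, two models erring uniformly at random would land on the same wrong answer about 1/(k-1) of the time. A figure well above that is shared blind spot, not coincidence.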

Decorrelation, illustrated. The two model slabs are different cheeses, yet they share the one gap on the flaw's path — the hazard neither was built to conceive. Independent generation offsets the known holes; it cannot offset the unconceived one. That is why the human, a different kind of reasoner, is the last slice.

What the evidence says

Even different models agree on the wrong answer about 60% of the time when both err; correlation is higher within a provider or architecture and rises with capability, even across providers. (Kim et al. 2025, arXiv:2506.07962)
A model judging its own output skews high — it favors familiar, low-perplexity text regardless of who wrote it (Wataoka et al. 2025); measured self-preference spans roughly −38% to +90% across judges on Arena-Hard (Zhang et al., UDA 2025).
Checker quality is bounded by checker capability — even frontier models fail a large share of objectively-decidable judging cases, so a weaker checker is the least able to catch a stronger model's subtle errors. (Tan et al., JudgeBench 2025)
Judges favor their own model family regardless of quality (Spiliopoulou et al. 2025); combined with errors that correlate most among capable models (Kim et al. 2025), the weak-checks-strong pairing is the worst case and strong-checks-weak the most forgiving.
The reliable gain is aggregating diverse judges, not trusting one: a panel of different models tracks human judgment more closely than any single judge and dampens each model's individual bias. (Verga et al. 2024, arXiv:2404.18796)

A different model offsets some holes — but a majority stay shared, and the offset shrinks as models get more capable. Real decorrelation comes from a different kind of reasoner, and from combining diverse checkers — not from swapping model badges.
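A minimal sketch of combining diverse checkers, assuming each judge returns a categorical verdict and that simple majority voting is the aggregation rule. That rule is an assumption chosen for brevity here, not necessarily the pooling method used in the cited panel study. The function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def panel_verdict(votes: Sequence[str]) -> Optional[str]:
    """Majority vote across judges drawn from different model families.

    Returns None on a tie for first place, so the case can be escalated
    rather than settled by whichever judge happened to be listed first.
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(panel_verdict(["pass", "fail", "pass"]))
print(panel_verdict(["pass", "fail"]))
```

The tie case is the design choice worth noticing: a split panel is itself a signal of disagreement, and it should route upward to a different kind of reasoner instead of being resolved silently.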

Two variables, not five

An earlier version of this taxonomy — A, U, D(K), D(R), D(U) — tangled two different things and only split the top of one of them, which is exactly what made it confusing. Pull them apart. Every model action raises two orthogonal questions:

Severity: the blast radius. How bad is it if this ships? Acceptable, Unacceptable, Dangerous.

Knowability: can you write the rule? Known (specifiable), Recognized (a suspected risk class with no specifics, the known unknown), Unknown (unconceived, the unknown unknown).

Display them together as one grid. Knowability decides which layer can catch it (known → encoded rules; recognized → a decorrelated second reasoner; unknown → only general human judgment, and imperfectly). Severity decides how non-negotiable the check is. The single corner where both max out — Dangerous × Unknown — is the one place you stop checking and start accepting.

The control each cell demands. Columns pick the layer; rows set the mandate. The tinted corner is where no layer suffices and someone must own the residual risk.
Severity ↓ / Knowability →	Known (write the rule)	Recognized (known unknown)	Unknown (unknown unknown)
Acceptable	ship	ship · spot-check	ship · log & monitor
Unacceptable	rule / guardrail	decorrelated model	human review
Dangerous	rule, mandatory	decorrelated model + human	full stack + owner (danger zone)

Automated layers are reliable only leftward (Known); the human is the only candidate in the right column, and a weak one. Severity rises downward, so the bottom-right cell carries the most mandate and the least catchability — hence an owner, not just a checker.
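The grid can be written down as a plain lookup, sketched here with the cell contents copied from the table. The type and function names are hypothetical, and triage (deciding which cell an action belongs in) is deliberately left outside the code.

```python
from enum import Enum


class Severity(Enum):
    ACCEPTABLE = "acceptable"
    UNACCEPTABLE = "unacceptable"
    DANGEROUS = "dangerous"


class Knowability(Enum):
    KNOWN = "known"
    RECOGNIZED = "recognized"
    UNKNOWN = "unknown"


# One entry per cell of the grid: (row, column) -> required control.
CONTROLS = {
    (Severity.ACCEPTABLE, Knowability.KNOWN): "ship",
    (Severity.ACCEPTABLE, Knowability.RECOGNIZED): "ship, spot-check",
    (Severity.ACCEPTABLE, Knowability.UNKNOWN): "ship, log and monitor",
    (Severity.UNACCEPTABLE, Knowability.KNOWN): "rule / guardrail",
    (Severity.UNACCEPTABLE, Knowability.RECOGNIZED): "decorrelated model",
    (Severity.UNACCEPTABLE, Knowability.UNKNOWN): "human review",
    (Severity.DANGEROUS, Knowability.KNOWN): "rule, mandatory",
    (Severity.DANGEROUS, Knowability.RECOGNIZED): "decorrelated model + human",
    (Severity.DANGEROUS, Knowability.UNKNOWN): "full stack + owner",
}


def required_control(severity: Severity, knowability: Knowability) -> str:
    """Look up the control a cell demands."""
    return CONTROLS[(severity, knowability)]


def needs_owner(severity: Severity, knowability: Knowability) -> bool:
    """True only for the corner where no layer suffices."""
    return severity is Severity.DANGEROUS and knowability is Knowability.UNKNOWN


print(required_control(Severity.UNACCEPTABLE, Knowability.UNKNOWN))
print(needs_owner(Severity.DANGEROUS, Knowability.UNKNOWN))
```

The lookup is the easy part. The inputs are the hard part: someone other than the layer being checked has to assign the severity and knowability values.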

The human is necessary, but leaky

The case that automation misses the unknown is solid: guardrails and classifiers are point-in-time defenses against known attack classes — they fail on simple obfuscation and on anything not yet conceived. So a general-judgment human is necessary in the unknown column, because nothing else has a shot. But the first draft over-credited the human. The human-factors record is unforgiving: automation complacency afflicts experts and resists training; vigilance measurably decays within thirty minutes on task; automating the easy parts leaves the human the hardest parts with the least practice (Bainbridge’s “ironies of automation”); and a thin human layer becomes a moral crumple zone, absorbing blame for failures it could not control. The medical independent double-check — the closest analog to “add a second checker” — has notably weak evidence behind it, partly because the second check is rarely truly independent. The deliberate pause helps against rubber-stamping, but it is not enough. The human layer must be engineered — bounded workload, rotation, skill maintenance, decision aids that surface disagreement — not merely inserted.

The fourth layer: the human’s mirror

Behind the human, a strict-rule model catches the human’s lapses — fatigue, volume, the rubber-stamp — against documented guidelines. It is the human’s mirror: tireless where the human drifts, openly blind where the human is essential. This is underwriting QA — a checklist review of the reviewer’s work before it goes out the door. Its caveats are not advice, they are requirements. It flags, it never overrides — the moment a rule-model can veto a human’s judgment on a recognized or unknown hazard, the novel-hazard decision has gone to the layer least able to make it. Its approval never reads as “cleared,” because it cannot see the unconceived. And mind automation bias: a visible net makes the human upstream sloppier, eroding the layer it is meant to back.
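The flag-never-override requirement can be made structural rather than left to policy. A minimal sketch, with hypothetical names and an illustrative two-minute threshold that is purely an assumption: the mirror returns the human's decision unchanged, attaches flags, and has no status that reads as “cleared.”

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Review:
    decision: str          # the human's call, e.g. "approve" or "reject"
    checklist_done: bool
    minutes_spent: float


# A rule inspects a review and returns a flag message, or None if satisfied.
Rule = Callable[[Review], Optional[str]]


@dataclass(frozen=True)
class MirrorResult:
    decision: str          # always the human's decision, passed through
    status: str            # "flagged" or "no rule violations found"
    flags: List[str]


def checklist_rule(review: Review) -> Optional[str]:
    return None if review.checklist_done else "checklist incomplete"


def rubber_stamp_rule(review: Review) -> Optional[str]:
    # The 2-minute threshold is an illustrative assumption.
    return "reviewed in under 2 minutes" if review.minutes_spent < 2 else None


def mirror_check(review: Review, rules: Sequence[Rule]) -> MirrorResult:
    """Run documented rules against a review. Flags only, never a veto."""
    flags = []
    for rule in rules:
        message = rule(review)
        if message is not None:
            flags.append(message)
    status = "flagged" if flags else "no rule violations found"
    return MirrorResult(review.decision, status, flags)


result = mirror_check(
    Review("approve", checklist_done=False, minutes_spent=1.0),
    [checklist_rule, rubber_stamp_rule],
)
print(result.decision, result.status, result.flags)
```

Because `mirror_check` has no code path that changes `decision`, the override question never arises, and the clean-case status says only what the rule model can actually see.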

The recommendation — and where the regress stops

Tier validation to the blast radius. Ship the Acceptable row with light or no check; over-checking the cheap stuff manufactures alarm fatigue and rubber-stamping. Route the Recognized column through a genuinely decorrelated second model. Make decorrelated generation plus an engineered human a mandatory, non-skippable floor for the Dangerous-Recognized and Dangerous-Unknown cells. The expensive, non-outsourceable work is the triage — putting each action in the right cell — and it cannot be done by the layer being checked.

The layer that ends the stack is not a checker. You cannot validate your way to certainty: add layers forever and each buys less while quietly manufacturing false security — when everyone is in charge, no one is. The regress terminates in a named human who signs “this is the residual risk we accept” — owned, documented, senior. That is the real top of the stack: not someone who checks, someone who accepts.

The recommended stack, end to end. Two decorrelated AI layers (amber, copper) catch the known and recognized; the slow human catches most of the unknown. The rule model (indigo — a different kind) catches the human's lapses but is blind to the unconceived, so the residual passes straight through its holes. The final block has no holes and bears a signature, not a checkmark — it doesn't filter, it accepts.

The Rumsfenneahedron

The grid above has nine cells: three severities by three knowabilities. Fold it into a solid and you get a nine-faced object. Call it the Rumsfenneahedron,^† or just the Fenny. Eight of its faces can be turned face-down and handled; each takes a control that fits it. The ninth will not lie flat. However you set the solid on the table, the Dangerous × Unknown face rights itself back up. That is the entire argument in one object: you cannot put the unknown away, so it does not get a checklist. It gets an owner.

Acceptable × Known — ship; light sampling.
Acceptable × Recognized — ship; spot-check.
Acceptable × Unknown — ship; log and watch for drift.
Unacceptable × Known — an encoded rule or guardrail.
Unacceptable × Recognized — a decorrelated second reasoner.
Unacceptable × Unknown — human review.
Dangerous × Known — a mandatory rule, no skip.
Dangerous × Recognized — decorrelated reasoner, plus a human, plus a standing red-team.
Dangerous × Unknown — the face that won’t lie flat: a named owner of residual risk, with graceful degradation as the floor.

^† Named for the 2002 taxonomy of known knowns, known unknowns, and unknown unknowns — the epistemics, not the politics — and for a nine-celled grid that refuses to lie flat. An enneahedron has nine faces; the Fenny has nine cells. Spoken: the Fenny.

Finally, add what the framework was missing. Near-miss and blameless reporting: the loop that converts unknown unknowns into recognized ones over time. Structured, adversarial red-teaming: the discovery engine for the right-hand columns. And sampling with statistical process control, including drift detection, so that a “light check” on the cheap cells is statistically justified rather than a vibe.
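A sketch of what statistical justification for a light check could look like, using standard three-sigma limits for a defect proportion (a p-chart). The baseline rate, sample size, and function names are assumptions for illustration; a real deployment would estimate the baseline from its own history.

```python
import math
from typing import Tuple


def p_chart_limits(baseline_rate: float, sample_size: int) -> Tuple[float, float]:
    """Three-sigma control limits for a defect proportion (p-chart).

    baseline_rate: historical share of sampled outputs found defective.
    sample_size:   number of outputs sampled per period.
    """
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    sigma = math.sqrt(baseline_rate * (1 - baseline_rate) / sample_size)
    lower = max(0.0, baseline_rate - 3 * sigma)
    upper = min(1.0, baseline_rate + 3 * sigma)
    return lower, upper


def out_of_control(defects: int, sample_size: int, baseline_rate: float) -> bool:
    """True if this period's defect share falls outside the control limits."""
    lower, upper = p_chart_limits(baseline_rate, sample_size)
    observed = defects / sample_size
    return observed < lower or observed > upper


# Assumed numbers: 2% historical defect rate, 200 outputs sampled per period.
print(p_chart_limits(0.02, 200))
print(out_of_control(defects=4, sample_size=200, baseline_rate=0.02))
print(out_of_control(defects=12, sample_size=200, baseline_rate=0.02))
```

With these assumed numbers, 4 defects in 200 sits at the baseline and passes, while 12 in 200 (6%) exceeds the upper limit of roughly 5% and should trigger a tighter check on that cell. The point is that the sampling rate and the alarm threshold are both stated in advance.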

You don't validate your way to certainty. You decorrelate what you can — and you name who accepts what's left.

References

Kim, E., Garg, A., Peng, K., & Garg, N. (2025). Correlated Errors in Large Language Models. ICML 2025. arXiv:2506.07962
Wataoka, K., Takahashi, T., & Ri, R. (2025). Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819
Zhang, Y., Wang, C., Wu, L., Yu, W., Wang, Y., Bao, G., & Tang, J. (2025). UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge. arXiv:2508.09724
Tan, S., Zhuang, S., Montgomery, K., et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-based Judges. arXiv:2410.12784
Spiliopoulou, E., et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. arXiv:2508.06709
Verga, P., et al. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796
Knight, J. C., & Leveson, N. G. (1986). An Experimental Evaluation of the Assumption of Independence in Multiversion Programming. IEEE Transactions on Software Engineering, SE-12(1), 96–109.
Reason, J. (2000). Human Error: Models and Management. BMJ, 320(7237), 768–770. doi:10.1136/bmj.320.7237.768
Bainbridge, L. (1983). Ironies of Automation. Automatica, 19(6), 775–779.
Parasuraman, R., & Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3), 381–410. doi:10.1177/0018720810376055
Mackworth, N. H. (1948). The Breakdown of Vigilance During Prolonged Visual Search. Quarterly Journal of Experimental Psychology, 1(1), 6–21.
Elish, M. C. (2019). Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction. Engaging Science, Technology, and Society, 5, 40–60. doi:10.17351/ests2019.260
Koyama, A. K., Maddox, C-S. S., Li, L., Bucknall, T., & Westbrook, J. I. (2020). Effectiveness of Double Checking to Reduce Medication Administration Errors: A Systematic Review. BMJ Quality & Safety, 29(7), 595–603. doi:10.1136/bmjqs-2019-009552