walkingtodo.ai

WTD — WATCHING THE DETECTIVES

Validating AI output · v2

Decorrelation, the leaky human, and the danger you can't name

Validating AI output is not a question of speed. It is a question of whether your layers fail differently — and of naming who accepts the risk no layer can catch.

The only question

Every argument about AI output quality asks the wrong thing — fast enough, good enough, did the threat-prompt squeeze out another twenty-four percent. The real question is harder: what is validating this? A system that writes its own work and then grades its own work has no validation in it — just a confident number on the far end. The evidence is direct: LLM judges recognize and favor their own outputs, and models cannot reliably self-correct their own reasoning. Real validation is defense in depth — layers, each catching what the last let through — and it only works if the layers fail differently.

Decorrelation, not independence

That last clause is the whole game, and it is one of the most robust results in reliability engineering: redundancy without diversity fails. Stack identical layers and one common cause defeats all of them at once. Nuclear and aviation safety call this common-mode failure, and it is why regulators mandate diverse protection, not merely duplicated protection.

Here the first draft needed a correction. Genuine independence is far harder to buy than “use a different model.” When Knight and Leveson had 27 versions of the same program written independently from one specification, the versions failed in correlated ways, and the independence assumption was rejected. The same is now measured in language models: their errors correlate well above chance, and (the uncomfortable part) that correlation rises with capability, even across different providers and architectures. Shared lineage (a common base model, shared training data, distillation) guarantees shared blind spots.

So independence is not a switch you flip; it is a degree you measure and audit. Rename the variable decorrelation, and rank your options worst to best:

  • Same model or a fine-tuned cousin: treat as zero offset.
  • A different model family: a partial, shrinking offset, so measure it.
  • A different kind of reasoner, a human working from first principles: the best available, still imperfect.

A relabel is not a layer; calling it one is independence theater.
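Why the degree matters can be put in one worked equation. This is a sketch under simplifying assumptions that are not from the text: two layers, each missing a given flaw with probability p_1 and p_2, and ρ standing for the Pearson correlation between their "missed it" indicators.

```latex
\[
P(\text{both miss}) = p_1 p_2 + \rho \sqrt{p_1 (1 - p_1)\, p_2 (1 - p_2)}
\]
\[
\text{with } p_1 = p_2 = p:\qquad P(\text{both miss}) = p^2 + \rho\, p (1 - p)
\]
```

Take p = 0.1. At ρ = 0 the pair misses 1 flaw in 100. At ρ = 0.5 it misses 5.5 in 100. At ρ = 1 it misses 10 in 100, exactly what a single layer misses: the second layer bought nothing.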

[Figure: three second-checkers, least to most decorrelated. Who catches the miss? Same model: holes align, the flaw ships. Different model: holes offset, caught. Human in the loop: small holes, slow, caught. The more differently a layer fails, the more it catches.]
Decorrelation, illustrated. The two model slabs are different cheeses, yet they share the one gap on the flaw's path — the hazard neither was built to conceive. Independent generation offsets the known holes; it cannot offset the unconceived one. That is why the human, a different kind of reasoner, is the last slice.
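Measuring decorrelation can start very small. The sketch below assumes you have a labeled set of flawed items and, for each checker, a record of whether it missed each one; the function name and the input format are illustrative, not from any library. It reports the phi coefficient (the Pearson correlation of two yes/no variables) and the observed joint-miss rate next to what independence would predict.

```python
from math import sqrt


def miss_correlation(a_missed, b_missed):
    """Compare two checkers on the same flawed items.

    a_missed, b_missed: equal-length lists of booleans, True where that
    checker failed to catch the flaw (hypothetical input format).
    Returns (phi, joint_miss_rate, rate_expected_if_independent).
    """
    if len(a_missed) != len(b_missed) or not a_missed:
        raise ValueError("need two equal-length, non-empty lists")
    n = len(a_missed)
    p_a = sum(a_missed) / n  # miss rate of checker A
    p_b = sum(b_missed) / n  # miss rate of checker B
    joint = sum(1 for a, b in zip(a_missed, b_missed) if a and b) / n
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    # phi is undefined when either checker never or always misses
    phi = (joint - p_a * p_b) / denom if denom > 0 else float("nan")
    return phi, joint, p_a * p_b


if __name__ == "__main__":
    same_model = [True, True, False, False]
    cousin = [True, True, False, False]      # misses the same items
    other = [True, False, True, False]       # misses different items
    print(miss_correlation(same_model, cousin))
    print(miss_correlation(same_model, other))
```

A joint-miss rate well above the independent expectation is the signal that the second layer is closer to a relabel than to a layer. On small samples the estimate is noisy, so treat it as an audit prompt rather than a verdict.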

Two variables, not five

An earlier version of this taxonomy — A, U, D(K), D(R), D(U) — tangled two different things and only split the top of one of them, which is exactly what made it confusing. Pull them apart. Every model action raises two orthogonal questions:

  • Severity: the blast radius. How bad is it if this ships? Acceptable, Unacceptable, Dangerous.
  • Knowability: can you write the rule? Known (specifiable), Recognized (a suspected risk class with no specifics, the known unknown), Unknown (unconceived, the unknown unknown).

Display them together as one grid. Knowability decides which layer can catch it (known → encoded rules; recognized → a decorrelated second reasoner; unknown → only general human judgment, and imperfectly). Severity decides how non-negotiable the check is. The single corner where both max out — Dangerous × Unknown — is the one place you stop checking and start accepting.

The control each cell demands. Columns pick the layer; rows set the mandate. The tinted corner is where no layer suffices and someone must own the residual risk.
Severity ↓ / Knowability → | Known (write the rule) | Recognized (known unknown) | Unknown (unknown unknown)
Acceptable                 | ship                   | ship · spot-check          | ship · log & monitor
Unacceptable               | rule / guardrail       | decorrelated model         | human review
Dangerous                  | rule, mandatory        | decorrelated model + human | full stack + owner (danger zone)

Automated layers are reliable only leftward (Known); the human is the only candidate in the right column, and a weak one. Severity rises downward, so the bottom-right cell carries the most mandate and the least catchability — hence an owner, not just a checker.
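The grid is small enough to encode as a lookup. The sketch below is one possible encoding, not a prescribed one: the enum and function names are invented for illustration, and the control strings are copied from the grid's cells.

```python
from enum import Enum


class Severity(Enum):
    ACCEPTABLE = 1
    UNACCEPTABLE = 2
    DANGEROUS = 3


class Knowability(Enum):
    KNOWN = 1
    RECOGNIZED = 2
    UNKNOWN = 3


# One entry per cell: columns pick the layer, rows set the mandate.
CONTROLS = {
    (Severity.ACCEPTABLE, Knowability.KNOWN): "ship",
    (Severity.ACCEPTABLE, Knowability.RECOGNIZED): "ship, spot-check",
    (Severity.ACCEPTABLE, Knowability.UNKNOWN): "ship, log & monitor",
    (Severity.UNACCEPTABLE, Knowability.KNOWN): "rule / guardrail",
    (Severity.UNACCEPTABLE, Knowability.RECOGNIZED): "decorrelated model",
    (Severity.UNACCEPTABLE, Knowability.UNKNOWN): "human review",
    (Severity.DANGEROUS, Knowability.KNOWN): "rule, mandatory",
    (Severity.DANGEROUS, Knowability.RECOGNIZED): "decorrelated model + human",
    (Severity.DANGEROUS, Knowability.UNKNOWN): "full stack + owner",
}


def control_for(severity, knowability):
    """Return the control the grid assigns to one cell."""
    return CONTROLS[(severity, knowability)]


def needs_owner(severity, knowability):
    """True only for the corner where no layer suffices."""
    return severity is Severity.DANGEROUS and knowability is Knowability.UNKNOWN


if __name__ == "__main__":
    print(control_for(Severity.UNACCEPTABLE, Knowability.RECOGNIZED))
    print(needs_owner(Severity.DANGEROUS, Knowability.UNKNOWN))
```

The lookup is the easy part. Deciding which cell an action belongs in is a judgment the table cannot make for you.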

The human is necessary, but leaky

The case that automation misses the unknown is solid: guardrails and classifiers are point-in-time defenses against known attack classes — they fail on simple obfuscation and on anything not yet conceived. So a general-judgment human is necessary in the unknown column, because nothing else has a shot. But the first draft over-credited the human. The human-factors record is unforgiving: automation complacency afflicts experts and resists training; vigilance measurably decays within thirty minutes on task; automating the easy parts leaves the human the hardest parts with the least practice (Bainbridge’s “ironies of automation”); and a thin human layer becomes a moral crumple zone, absorbing blame for failures it could not control. The medical independent double-check — the closest analog to “add a second checker” — has notably weak evidence behind it, partly because the second check is rarely truly independent. The deliberate pause helps against rubber-stamping, but it is not enough. The human layer must be engineered — bounded workload, rotation, skill maintenance, decision aids that surface disagreement — not merely inserted.

The fourth layer: the human’s mirror

Behind the human, a strict-rule model catches the human’s lapses — fatigue, volume, the rubber-stamp — against documented guidelines. It is the human’s mirror: tireless where the human drifts, openly blind where the human is essential. This is underwriting QA — a checklist review of the reviewer’s work before it goes out the door. Its caveats are not advice, they are requirements. It flags, it never overrides — the moment a rule-model can veto a human’s judgment on a recognized or unknown hazard, the novel-hazard decision has gone to the layer least able to make it. Its approval never reads as “cleared,” because it cannot see the unconceived. And mind automation bias: a visible net makes the human upstream sloppier, eroding the layer it is meant to back.
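The two requirements (flag, never override; never read as "cleared") can be built into the shape of the data. A minimal sketch, assuming a made-up record of the human's decision and a single hypothetical rule with an invented ten-second threshold; none of these names or numbers come from a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class HumanDecision:
    """The human reviewer's call. Frozen: the QA layer cannot alter it."""
    item_id: str
    approved: bool
    seconds_spent: float


@dataclass
class QAReport:
    decision: HumanDecision  # passed through unchanged
    flags: List[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        # Deliberately never "cleared": the rule model cannot see
        # hazards that nobody has written a rule for.
        if self.flags:
            return "flagged for human re-review"
        return "no documented-rule violations found"


Rule = Callable[[HumanDecision], Optional[str]]


def too_fast(decision: HumanDecision) -> Optional[str]:
    """Hypothetical rule: an approval in under 10 seconds looks like a rubber stamp."""
    if decision.approved and decision.seconds_spent < 10:
        return "approved in under 10 s (possible rubber-stamp)"
    return None


def rule_qa(decision: HumanDecision, rules: List[Rule]) -> QAReport:
    """Run every rule and collect flags. There is no veto path."""
    flags = [msg for rule in rules if (msg := rule(decision)) is not None]
    return QAReport(decision=decision, flags=flags)


if __name__ == "__main__":
    report = rule_qa(HumanDecision("case-1", True, 3.0), [too_fast])
    print(report.status, report.flags, report.decision.approved)
```

The design choice is in what is absent: `rule_qa` has no return path that changes `approved`, and no status string a hurried reader could mistake for sign-off.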

The recommendation — and where the regress stops

Tier validation to the blast radius. Ship the Acceptable row with light or no check; over-checking the cheap stuff manufactures alarm fatigue and rubber-stamping. Route Unacceptable and Dangerous items in the Recognized column through a genuinely decorrelated second model. Make decorrelated generation plus an engineered human a mandatory, non-skippable floor for the Dangerous-Recognized and Dangerous-Unknown cells. The expensive, non-outsourceable work is the triage, putting each action in the right cell, and it cannot be done by the layer being checked.

The layer that ends the stack is not a checker. You cannot validate your way to certainty: add layers forever and each buys less while quietly manufacturing false security — when everyone is in charge, no one is. The regress terminates in a named human who signs “this is the residual risk we accept” — owned, documented, senior. That is the real top of the stack: not someone who checks, someone who accepts.

[Figure: catch what you can, own what's left. Model A generates; Model B runs the decorrelated check; the Human catches the unknown; Rule QA catches lapses; Approval accepts the unknown residual. The last layer doesn't catch; it accepts.]
The recommended stack, end to end. Two decorrelated AI layers (amber, copper) catch the known and recognized; the slow human catches most of the unknown. The rule model (indigo — a different kind) catches the human's lapses but is blind to the unconceived, so the residual passes straight through its holes. The final block has no holes and bears a signature, not a checkmark — it doesn't filter, it accepts.
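A signature, unlike a checkmark, has to name someone. One way to make that concrete is a record that refuses to exist without a named owner and a stated risk. This is a sketch only: the field names and the list of rejected placeholder owners are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical placeholders that name no one in particular.
SHARED_OWNERS = {"team", "everyone", "committee", "tbd"}


@dataclass(frozen=True)
class ResidualRiskAcceptance:
    action: str          # what is being shipped
    residual_risk: str   # what no layer could rule out
    owner: str           # the named human who accepts it
    accepted_on: date

    def __post_init__(self):
        name = self.owner.strip()
        if not name or name.lower() in SHARED_OWNERS:
            raise ValueError("residual risk needs one named owner")
        if not self.residual_risk.strip():
            raise ValueError("state the residual risk being accepted")


if __name__ == "__main__":
    record = ResidualRiskAcceptance(
        action="enable automated refunds",
        residual_risk="novel abuse patterns not covered by any rule",
        owner="Jane Doe",
        accepted_on=date(2025, 1, 15),
    )
    print(record)
```

The validation is trivial on purpose. It cannot check seniority or judgment; it only blocks the failure the text warns about, where everyone is in charge and so no one is.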

The Rumsfenneahedron

The grid from “Two variables, not five” has nine cells: three severities by three knowabilities. Fold it into a solid and you get a nine-faced object. Call it the Rumsfenneahedron, or just the Fenny. Eight of its faces can be turned face-down and handled: each takes a control that fits it. The ninth will not lie flat. However you set the solid on the table, the Dangerous × Unknown face rights itself back up. That is the entire argument in one object: you cannot put the unknown away, so it does not get a checklist. It gets an owner.

  • Acceptable × Known — ship; light sampling.
  • Acceptable × Recognized — ship; spot-check.
  • Acceptable × Unknown — ship; log and watch for drift.
  • Unacceptable × Known — an encoded rule or guardrail.
  • Unacceptable × Recognized — a decorrelated second reasoner.
  • Unacceptable × Unknown — human review.
  • Dangerous × Known — a mandatory rule, no skip.
  • Dangerous × Recognized — decorrelated reasoner, plus a human, plus a standing red-team.
  • Dangerous × Unknown — the face that won’t lie flat: a named owner of residual risk, with graceful degradation as the floor.

Named for the 2002 taxonomy of known knowns, known unknowns, and unknown unknowns — the epistemics, not the politics — and for a nine-celled grid that refuses to lie flat. An enneahedron has nine faces; the Fenny has nine cells. Spoken: the Fenny.

Finally, add what the framework was missing. Near-miss and blameless reporting — the loop that converts unknown unknowns into recognized ones over time. Structured, adversarial red-teaming — the discovery engine for the right-hand columns. And sampling with statistical process control — so “light check” on the cheap cells is statistically justified rather than a vibe, with drift detection.
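For the sampling piece, a standard starting point is the p-chart: track the defect rate in each sampled batch against control limits set from a baseline. The sketch below uses the usual three-sigma limits under a binomial approximation; the function names and the example counts are illustrative, and the approximation is weak when the expected number of defects per sample is very small.

```python
from math import sqrt


def p_chart_limits(baseline_defects, baseline_sampled, sample_size, sigmas=3.0):
    """Return (lower, center, upper) control limits for a defect rate.

    The center line is the baseline defect rate; the limits sit `sigmas`
    binomial standard deviations away, clipped to the range [0, 1].
    """
    p_bar = baseline_defects / baseline_sampled
    half_width = sigmas * sqrt(p_bar * (1 - p_bar) / sample_size)
    return max(0.0, p_bar - half_width), p_bar, min(1.0, p_bar + half_width)


def drifted(defects, sample_size, limits):
    """True when a sample's defect rate falls outside the control limits."""
    lower, _, upper = limits
    rate = defects / sample_size
    return rate < lower or rate > upper


if __name__ == "__main__":
    # Baseline: 20 defects found in 1000 sampled outputs; ongoing samples of 200.
    limits = p_chart_limits(20, 1000, 200)
    print(limits)
    print(drifted(5, 200, limits))   # 2.5% defect rate
    print(drifted(12, 200, limits))  # 6.0% defect rate
```

With these numbers the upper limit lands just under 5%, so a sample at 2.5% stays inside the limits and a sample at 6% trips the alarm. The limits only say how often to look harder; what you find when you look belongs in the near-miss loop.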

You don't validate your way to certainty. You decorrelate what you can — and you name who accepts what's left.

References

  1. Kim, E., Garg, A., Peng, K., & Garg, N. (2025). Correlated Errors in Large Language Models. ICML 2025. arXiv:2506.07962
  2. Wataoka, K., Takahashi, T., & Ri, R. (2025). Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819
  3. Zhang, Y., Wang, C., Wu, L., Yu, W., Wang, Y., Bao, G., & Tang, J. (2025). UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge. arXiv:2508.09724
  4. Tan, S., Zhuang, S., Montgomery, K., et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-based Judges. arXiv:2410.12784
  5. Spiliopoulou, E., et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. arXiv:2508.06709
  6. Verga, P., et al. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796
  7. Knight, J. C., & Leveson, N. G. (1986). An Experimental Evaluation of the Assumption of Independence in Multiversion Programming. IEEE Transactions on Software Engineering, SE-12(1), 96–109.
  8. Reason, J. (2000). Human Error: Models and Management. BMJ, 320(7237), 768–770. doi:10.1136/bmj.320.7237.768
  9. Bainbridge, L. (1983). Ironies of Automation. Automatica, 19(6), 775–779.
  10. Parasuraman, R., & Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3), 381–410. doi:10.1177/0018720810376055
  11. Mackworth, N. H. (1948). The Breakdown of Vigilance During Prolonged Visual Search. Quarterly Journal of Experimental Psychology, 1(1), 6–21.
  12. Elish, M. C. (2019). Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction. Engaging Science, Technology, and Society, 5, 40–60. doi:10.17351/ests2019.260
  13. Koyama, A. K., Maddox, C-S. S., Li, L., Bucknall, T., & Westbrook, J. I. (2020). Effectiveness of Double Checking to Reduce Medication Administration Errors: A Systematic Review. BMJ Quality & Safety, 29(7), 595–603. doi:10.1136/bmjqs-2019-009552