walkingtodo.ai

WTD — WATCHING THE DETECTIVES

The thesis

Why a watcher, not an autopilot

For judgment-critical knowledge work, the durable value of AI is not in removing the human — it is in automating the drafting, keeping the human as the sole judgment gate, and owning the encoded judgment as the asset.

The industry is racing toward autonomy: agents that act, "self-healing" loops, the human demoted to a safety bolt-on who rubber-stamps until the rare day they can't. That race optimizes the one variable — independent action — that is most dangerous in exactly the work where being confidently wrong is most expensive. WTD takes the other side of the trade.

The real problem is silent corruption, not capability

Frontier models are already good enough to draft an epic, a set of acceptance criteria, a current-state analysis. That was never the bottleneck. The bottleneck is that the output is fluent, confident, structurally complete, and sometimes quietly wrong, and the wrongness is the kind that survives a skim. A dropped acceptance criterion. A shifted number. A load-bearing qualifier silently softened. The document reads finished, so it gets passed along unchecked.

  • Sparse but severe — rare enough that you stop expecting it, costly enough that one instance erases the time the tool saved.
  • The middle zone: good enough to trust, wrong enough to hurt. Obvious garbage gets caught. It's the plausible error that ships.
  • Invisible to the producer. The same process that generated the error has no signal that it's there.

Speed and capability don't fix this. They make it worse, because they raise the volume of plausible output faster than any human can scrutinize it. More output is not the goal. More trustworthy output is the goal, and those are different products.

The obvious fixes don't close it

"The model will check its own work." It won't, reliably. The strongest, most-replicated result in this space is that LLMs cannot dependably find their own reasoning errors without external feedback; same-context self-critique tends to amplify confidence without adding information, and can flip correct answers to wrong ones. A self-check in the drafting context is the weakest possible verifier. The fix is structural, not a better prompt: review in a fresh context, by a reviewer with no access to the drafting chain, for the same reason the cop who works the case isn't the one who signs the warrant.
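
One way to make the fresh-context rule structural rather than a matter of discipline is to build the reviewer's input from a type that cannot carry the drafting chain. The sketch below is a minimal illustration under that assumption; DraftResult, ReviewRequest, and to_review_request are hypothetical names, not part of any WTD code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftResult:
    artifact: str        # the finished draft a human would read
    drafting_chain: str  # prompts, scratch reasoning, tool output (assumed shape)


@dataclass(frozen=True)
class ReviewRequest:
    artifact: str
    rules: tuple         # the standards the reviewer checks against


def to_review_request(draft: DraftResult, rules) -> ReviewRequest:
    """Build the reviewer's input from the artifact and the rules only."""
    # drafting_chain is deliberately not copied: the reviewer starts cold.
    return ReviewRequest(artifact=draft.artifact, rules=tuple(rules))
```

Because ReviewRequest has no field for the chain, leaking it to the reviewer takes a code change rather than a careless prompt.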

"A bigger context window, more memory." Longer inputs degrade output: accuracy falls as input length grows, in non-uniform and hard-to-predict ways. A rules file that grows without bound eventually hurts. Judgment has to be retrieved and compacted, not dumped.
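
A minimal sketch of "retrieved and compacted, not dumped", assuming rules are one-line strings and relevance is plain keyword overlap with the task. The function name select_rules, the overlap scoring, and the word budget are illustrative assumptions; a real system would likely use a better relevance measure.

```python
import re


def _terms(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_rules(rules, task, max_words=40):
    """Return the rules most relevant to `task`, capped at `max_words` in total."""
    task_terms = _terms(task)
    scored = []
    for index, rule in enumerate(rules):
        overlap = len(task_terms & _terms(rule))
        if overlap:  # rules with no shared term are never sent to the model
            scored.append((-overlap, index, rule))

    selected, used = [], 0
    for _, _, rule in sorted(scored):  # highest overlap first, ties by ledger order
        cost = len(rule.split())
        if used + cost <= max_words:
            selected.append(rule)
            used += cost
    return selected
```

The budget is the point: the ledger can keep growing while the slice of it that reaches the model stays bounded.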

"Just constrain it to a strict schema." Forcing structure during reasoning costs reasoning — the documented "format tax." And strict schemas convert visible parse errors into invisible quality errors. The honest pattern is reason-first, format-last.

The through-line: you cannot optimize away the need for judgment by letting a cheaper or self-referential process silently decide what matters. That's trading a visible bill for an invisible error — the worst trade in the domain.

The principle: automate the drafting, keep the judgment, own the record

Three commitments, in order:

  • Automate the drafting. The agent does the legwork: investigate, reason, draft. This is cheap, fast, and reversible, and its errors are catchable. Spend leverage freely here. Route by reversibility, not importance: [auto] for the mechanical, reversible, machine-checkable; [judge] for anything irreversible or invisible to mechanical checks.
  • Keep the judgment. The human is the sole judgment gate, and never a rubber stamp by design. Judgment is the part that is costly, irreversible, and invisible to mechanical checks — so it is the part you never automate away.
  • Own the record. Every correction the human makes is captured as a version-controlled rule. The trail is plain text in version control: immutable lineage, provable on any date, no vendor's permission required.
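
The routing rule in the first commitment can be sketched as a small predicate. This is an illustration only: the Step fields and the route function are hypothetical names, and it assumes each step has already been labeled as mechanical, reversible, and machine-checkable or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    mechanical: bool
    reversible: bool
    machine_checkable: bool


def route(step: Step) -> str:
    """Return "[auto]" only when all three conditions hold, else "[judge]"."""
    if step.mechanical and step.reversible and step.machine_checkable:
        return "[auto]"
    # Anything irreversible or uncheckable goes to the human, however small it looks.
    return "[judge]"
```

Note that importance is not an input: a trivial but irreversible step still routes to [judge].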

This is leverage, not autonomy. The distinction is the whole worldview: leverage compounds the operator's judgment; autonomy replaces it. The former gets sharper over time; the latter gets more dangerous the better it works.

The posture is a security argument, not a preference

An agent that simultaneously reads untrusted input, holds sensitive access, and can act outbound is exploitable, and no amount of prompt hygiene closes the gap — hold at most two of the three. WTD holds untrusted input plus sensitive access, and severs outbound. The consequence is the design's signature move: there is no autonomous outbound channel — the human review-and-execute step is the outbound channel.
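
The "at most two of the three" constraint is simple enough to check mechanically. A minimal sketch, assuming each agent configuration declares its three capabilities up front; AgentCapabilities and violates_rule_of_two are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    reads_untrusted_input: bool
    holds_sensitive_access: bool
    acts_outbound: bool


def violates_rule_of_two(caps: AgentCapabilities) -> bool:
    """True when the agent holds all three capabilities at once."""
    held = sum((caps.reads_untrusted_input,
                caps.holds_sensitive_access,
                caps.acts_outbound))
    return held > 2


# The posture described above: untrusted input plus sensitive access, no outbound.
WTD_POSTURE = AgentCapabilities(
    reads_untrusted_input=True,
    holds_sensitive_access=True,
    acts_outbound=False,
)
```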

That makes the integrity of the loop the entire security model. The failure that matters most isn't a missing loop — it's a cosmetic one: a gate that sits in front of a benign action while a privileged capability acts ungated. A loop that looks present but doesn't gate the dangerous action is worse than no loop, because it manufactures the confidence that makes the human stop watching. "Is the gate in front of the thing that can actually cause harm?" is a per-tool audit, run on every change.
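
A per-tool audit of this kind could look like the sketch below. It assumes every tool is declared with two flags, whether it can cause harm and whether a human approves each call; Tool and audit_gates are hypothetical names, and classifying a tool as privileged remains a human judgment the code does not make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    privileged: bool  # can it cause harm (write, send, delete, spend)?
    gated: bool       # does a human approve each call?


def audit_gates(tools):
    """Report privileged tools that run ungated, and flag a cosmetic loop."""
    ungated_privileged = [t.name for t in tools if t.privileged and not t.gated]
    gated_benign = [t.name for t in tools if t.gated and not t.privileged]
    return {
        "ungated_privileged": ungated_privileged,
        # Cosmetic loop: a gate exists, but only in front of benign actions,
        # while at least one privileged tool bypasses it.
        "cosmetic_loop": bool(ungated_privileged) and bool(gated_benign),
    }
```

Run on every change, a non-empty ungated_privileged list is the failing condition; cosmetic_loop marks the worse case where the interface still looks gated.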

The moat is encoded judgment, not the model

Everyone can subscribe to the same frontier model, so the model is a commodity input, not a moat. The durable, ownable asset is the Standards Ledger: a growing, version-controlled record of human judgment captured as rules, built by a rejection flywheel — every correction becomes a one-line rule, tagged [auto] if a machine can check it or [judge] if a human must.

  • Rules, not fine-tuning — legible, auditable, instantly editable, and portable across whatever model is best next quarter. You are not renting your judgment back from a provider's weights.
  • It compounds. The taste layer a competitor can't get by subscribing to the same model is the one you built by being wrong in specific ways and writing down the fix.
  • It's the audit story. In regulated work, "prove what your standard was on the date you made this call" is a requirement. Version-control history answers it for free.
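
To make "every correction becomes a one-line rule" concrete, here is a minimal sketch of a plain-text ledger format. The "[tag] rule text" line layout and the names format_rule and parse_ledger are assumptions for illustration; the source only specifies one-line rules tagged [auto] or [judge] and kept in version control.

```python
import re

RULE_LINE = re.compile(r"^\[(auto|judge)\] (.+)$")


def format_rule(tag, text):
    """Turn a human correction into a single tagged ledger line."""
    if tag not in ("auto", "judge"):
        raise ValueError("tag must be 'auto' or 'judge'")
    one_line = " ".join(text.split())  # collapse newlines and extra spaces
    if not one_line:
        raise ValueError("rule text is empty")
    return f"[{tag}] {one_line}"


def parse_ledger(ledger_text):
    """Split ledger text into machine-checkable and human-judged rules."""
    rules = {"auto": [], "judge": []}
    for line in ledger_text.splitlines():
        match = RULE_LINE.match(line.strip())
        if match:  # lines that do not match the format are skipped
            rules[match.group(1)].append(match.group(2))
    return rules
```

Because the ledger is ordinary text, the version-control history of the file is the dated record; nothing in the format itself needs to carry timestamps.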

Why now: the autonomy hangover

The autonomy-maximalist phase is producing its predictable bill. Bainbridge's 1983 irony of automation (the more reliable the automation, the more de-skilled and complacent the human, and the less able to catch the rare failure) is now playing out with LLM agents at scale. Hallucination is intrinsic, not a bug to be patched out: even purpose-built, grounded legal-research tools were measured hallucinating 17–33% of the time. And the honest builders already don't trust the loop they sell: the tell is that even the "human steps away" pitches bolt a brake back on to make the loop survivable. WTD just makes the brake the design instead of the patch.

The contrarian position — the human never steps away; that's the entire design — looks conservative today and will look correct as the incidents accumulate.

What would prove this thesis wrong

Stated plainly, because a thesis you can't falsify is a slogan:

  • If intrinsic self-correction becomes reliable — if a model can dependably detect and fix its own reasoning errors without an external check — the independent-reviewer leg loses its necessity, and a faster same-context loop wins.
  • If silent semantic corruption stops being the dominant failure, with errors becoming loud and easy to catch, the elaborate review discipline becomes over-engineering.
  • If the market never feels the autonomy bill — if unattended agents prove safe enough in judgment-critical work that buyers stop caring about provenance and gates — the contrarian timing is wrong.
  • If encoded judgment fails to compound — if a captured-rules ledger doesn't measurably beat a bare frontier model on the same task over time — the moat claim is empty.

None of these look likely on the current evidence. All of them are worth watching, and the operating posture is to hold the thesis strongly and the timeline loosely.

Sources

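  • Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet" (ICLR 2024); Kamoi et al., "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs" (TACL, 2024); Tsui, "Self-Correction Bench" (2025).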
  • Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance" (2025); Anthropic, "Context Engineering" (2025).
  • Tam et al., "Let Me Speak Freely?" (EMNLP 2024) — the format tax.
  • Niederhoffer et al., "AI-Generated 'Workslop' Is Destroying Productivity" (HBR, 2025).
  • Laban, Schnabel & Neville, "LLMs Corrupt Your Documents When You Delegate" (DELEGATE-52, Microsoft Research, 2026).
  • Bainbridge, "Ironies of Automation" (Automatica, 1983).
  • Simon Willison, "The Lethal Trifecta" (2025); Meta, "Agents Rule of Two" (2025).
  • PromptArmor, "ChatGPT for Google Sheets Exfiltrates Workbooks" (2026) — the cosmetic-loop case.
  • Magesh et al., "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" (Stanford RegLab/HAI, 2025).

The pilot is the thesis, applied.

One workflow, one automated step, one human judge, one signed manifest — and the deliverable is a signed, queryable record, not a live agent.