Skip to main content

ADR 0008: Empirical proof as a framework primitive

Status

Accepted

Context

Models complete the "successful task" pattern with confident summaries. Without structural proof gates, hallucinated greens pass review.

Decision

Every code-changing task archetype mandates verbatim command output (or bounded equivalent) in ## Self-review / verification sections. Paraphrase is invalid — a passed test with no pasted output is not a proof [REFLEXION]. Skeptic reviewers re-run checks locally.

Encoded in empirical-proof skill plus task templates (/scaffold/.agents/templates/ and each skill's references/task-template.md).

This is a specification-and-conspicuousness primitive, not an enforcement one. Pasted output is agent-self-attested — gameable by fabricated or stale paste — so its power is that omission is visible, not that it is prevented. Mechanical enforcement (re-run the bound commands in a clean checkout, block promotion on failure or empty paste) is the contract a compliant runtime honours, specified by 0023. The Skeptic re-run (0021, rule 5) is the in-framework approximation; a harness is the real thing.

Consequences

  • Positive: makes a whole class of false completions conspicuous — the empty paste block is visible.
  • Negative: self-attested — absent a re-running harness (0023), the gate is the agent grading itself; conspicuousness reduces but does not eliminate fabricated or stale paste.
  • Negative: slower closes; brittle in broken dev envs → must surface explicit ## Blockers.
  • Open: the discipline's effectiveness at reducing hallucinated completion is not yet measured against a corpus-specific benchmark; it rests on the activation/Reflexion evidence.

Ready to run the loop on your own repo? Get started — copy the kit and write your first spec.