Source: corpus/docs/adrs/0033-golden-corpus.md
Modified: 2026-06-23

type: adr
id: 0033-golden-corpus
status: accepted
created: 2026-06-02
updated: 2026-06-02
supersedes:
superseded_by:

ADR-0033: The golden corpus

Context

The conformance contract (§32, ADR 0026) is inert versioned data: the precise definition a future checker would honour. But a contract alone cannot demonstrate conformance — it states the rules, not the verdicts those rules produce on real specs. Without worked, verdict-pinned examples there is nothing to validate a checker (or a human reviewer) against, and nothing that proves the obligation language and its passes behave as the contract claims. Worse, a suite of only valid specimens cannot catch the failure mode corpus exists to prevent: a structurally well-formed artifact that is nonetheless wrong (Invariant 4 — schema-valid ≠ verified). Established compiler-conformance practice resolves this by shipping a suite of both allowed and disallowed productions whose conformity is known without the tool under test (§33.1). The pressure: corpus needs that oracle, but Invariant 1 (NO RUNTIME) forecloses shipping a checker to produce it.

Decision

Conformance is evidenced by a golden corpus of positive (must-compile) and negative (must-be-rejected) fixtures spanning the three recurring domains: auth-refresh, checkout, and payment-5xx. Each positive domain fixture ships the complete spec → obligations → task → trace → verdict → promotion pipeline chain, and each domain carries its canonical defect class encoded with SOL-<LAYER>NNN codes.

The corpus is inert data: it is the oracle, not a running checker. Each fixture's expected verdict is pinned in its metadata header and is known independently of any tool. A conformant tool is checked against the corpus; until a launcher exists, a human validates a repository against it by hand.

The full specification (pipeline chain, per-domain defect classes, task-file negative classes, the labeled prose precision/recall baseline, the pass-output rubrics, and the contamination-hygiene held-out/mutated variants) is detailed in the conformance reference. The corpus ships under starter-kit/.agents/conformance/fixtures/; the three pipeline-complete walkthroughs also ship under docs/examples/ for human readers.
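
The oracle relationship can be sketched as pinned data plus a comparison. This is a minimal illustration only, not part of the corpus (which ships no checker): the `Fixture` record, its field names, the fixture ids, the `PASS`/`FAIL` labels, and the code `SOL-S001` are all hypothetical placeholders, not the shipped metadata header format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Fixture:
    """Hypothetical in-memory view of a fixture's metadata header."""
    fixture_id: str
    domain: str                # "auth-refresh", "checkout", or "payment-5xx"
    polarity: str              # "positive" (must-compile) or "negative" (must-be-rejected)
    expected_verdict: str      # pinned as data, known without any tool
    expected_code: Optional[str] = None  # defect code for negatives (placeholder value)


def disagreements(fixtures: List[Fixture],
                  observed: Dict[str, Tuple[str, Optional[str]]]) -> List[str]:
    """Return ids of fixtures where a tool's output differs from the pinned verdict.

    `observed` maps fixture_id -> (verdict, code). A missing entry counts as a
    disagreement, because the tool produced no verdict at all.
    """
    return [
        fx.fixture_id
        for fx in fixtures
        if observed.get(fx.fixture_id) != (fx.expected_verdict, fx.expected_code)
    ]


CORPUS = [
    Fixture("auth-refresh/positive", "auth-refresh", "positive", "PASS"),
    Fixture("checkout/negative", "checkout", "negative", "FAIL", "SOL-S001"),
]

# A hypothetical tool that wrongly accepts the negative fixture.
observed = {
    "auth-refresh/positive": ("PASS", None),
    "checkout/negative": ("PASS", None),
}

print(disagreements(CORPUS, observed))  # ['checkout/negative']
```

The same comparison is what a human reviewer performs by hand until a launcher exists: read the pinned verdict, derive the verdict independently, and flag any mismatch.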

Alternatives considered

| Alternative | Why rejected |
| --- | --- |
| Ship the conformance contract (§32) alone, no fixtures | A contract states rules but pins no verdicts; nothing validates a checker or a reviewer, and the contract's own claims go undemonstrated (§33.1). |
| Positive (must-compile) fixtures only | Cannot catch the core failure mode: a schema-valid artifact that is wrong (Invariant 4). Compiler-conformance practice requires disallowed productions whose rejection is known (§33.1). |
| Ship a checker to generate verdicts | Violates Invariant 1 (NO RUNTIME). corpus ships the contract and its oracle, never the checker (§32.7); the corpus pins verdicts as data instead. |
| Canonical fixtures only, no held-out/mutated variants | Public fixtures invite contamination: an agent-as-compiler reproduces the labels without performing the passes. The corpus MUST ship a semantically-equivalent mutated twin as the conformance gate (§33.7.1). |
| Grade passes with a Likert/quality score | Quality scores are not decidable against the artifact alone. The pass-output rubrics are checkable boolean predicates; a single failing predicate fails the pass (§33.6). |
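
The contrast between a quality score and boolean predicates can be illustrated with a short sketch. The predicate names and the shape of the pass output below are assumptions made for illustration; the actual rubrics are defined in §33.6 and are not reproduced here.

```python
from typing import Callable, Dict, Tuple

PassOutput = Dict[str, object]
Predicate = Callable[[PassOutput], bool]

# Hypothetical rubric: each entry is a yes/no check decidable from the artifact alone.
RUBRIC: Dict[str, Predicate] = {
    "obligations_preserved": lambda out: set(out["obligations_in"]) <= set(out["obligations_out"]),
    "bindings_preserved": lambda out: out["unbound"] == [],
    "verdict_preserved": lambda out: out["verdict"] == out["pinned_verdict"],
}


def grade(output: PassOutput) -> Tuple[bool, Dict[str, bool]]:
    """Evaluate every predicate; the pass succeeds only if all of them hold."""
    results = {name: bool(pred(output)) for name, pred in RUBRIC.items()}
    return all(results.values()), results


# Illustrative pass output that silently dropped obligation "O2".
sample = {
    "obligations_in": ["O1", "O2"],
    "obligations_out": ["O1"],
    "unbound": [],
    "verdict": "PASS",
    "pinned_verdict": "PASS",
}

passed, detail = grade(sample)
print(passed)                           # False
print(detail["obligations_preserved"])  # False
```

Because there is no averaging, two satisfied predicates cannot compensate for the one that failed, which is the property a Likert-style score lacks.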

Consequences

Positive

  • Gives corpus a tool-independent oracle: expected verdicts are pinned as data, so a future checker has a regression suite and a human has something concrete to validate against.
  • The negative fixtures defend Invariant 4 directly — every error-code family gets a guarding fixture, and the canonical "tests passed" hole is a first-class FAIL fixture (§33.4).
  • The held-out mutated variants make label-memorization detectable, so a passing verdict evidences a correctly executed pass rather than a recognized string (§33.7.1).
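
How a mutated twin exposes label memorization can be sketched as follows. The sketch assumes, as a reading of the text above, that a twin shares its canonical fixture's pinned verdict because it is semantically equivalent; the function name, fixture ids, and verdict labels are hypothetical.

```python
from typing import Dict, List


def memorization_suspects(twins: Dict[str, str],
                          pinned: Dict[str, str],
                          observed: Dict[str, str]) -> List[str]:
    """Return canonical fixture ids whose twin verdict diverges.

    twins:    canonical fixture id -> mutated twin id
    pinned:   canonical fixture id -> pinned verdict (the twin is assumed to share it)
    observed: fixture id -> verdict reported by the tool or agent under test
    """
    suspects = []
    for canonical, twin in twins.items():
        expected = pinned[canonical]
        matched_canonical = observed.get(canonical) == expected
        matched_twin = observed.get(twin) == expected
        # Right on the public fixture but wrong on its equivalent twin:
        # consistent with recognizing the string rather than executing the pass.
        if matched_canonical and not matched_twin:
            suspects.append(canonical)
    return suspects


twins = {"checkout/canonical": "checkout/mutated"}
pinned = {"checkout/canonical": "FAIL"}
observed = {"checkout/canonical": "FAIL", "checkout/mutated": "PASS"}

print(memorization_suspects(twins, pinned, observed))  # ['checkout/canonical']
```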

Negative

  • The corpus is a second representation of the language's rules alongside the §32 contract; the two must stay consistent (the fixtures themselves are the guard — a contract change that breaks a fixture is caught).
  • Curating positive, negative, and mutated variants across three full pipeline chains is a substantial authoring cost, and the prose precision/recall baseline demands inter-annotator-agreement discipline (§33.5).
  • The corpus is inert until a checker or eval harness exists; in the meantime it is validated by hand (NO RUNTIME).

Neutral / tradeoffs

  • The §33.5 precision/recall figures (≥0.90 / ≥0.85) are stated as v0.1 design targets for the curated gold set — chosen acceptance bars, not measurements — and the spec records the lower field ceiling for honest calibration (§0.7, §33.5).
  • The pass-output rubrics grade compiler behaviour (obligation/binding/scope/verdict preservation), not grammar; grammar is already covered by the SOL-S family and §33.4 (§33.6).
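
As a worked illustration of the acceptance bars, the sketch below computes precision and recall from confusion counts and compares them with the two targets. It assumes the figures map in order (precision ≥ 0.90, recall ≥ 0.85); the counts are invented for the example and are not measurements from the corpus.

```python
from typing import Tuple

PRECISION_TARGET = 0.90  # v0.1 design target quoted from §33.5 (assumed to be precision)
RECALL_TARGET = 0.85     # v0.1 design target quoted from §33.5 (assumed to be recall)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """precision = TP / (TP + FP); recall = TP / (TP + FN). Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_targets(tp: int, fp: int, fn: int) -> bool:
    """True only when both bars are cleared."""
    precision, recall = precision_recall(tp, fp, fn)
    return precision >= PRECISION_TARGET and recall >= RECALL_TARGET


# Invented counts: 46 correct findings, 4 spurious, 6 missed.
p, r = precision_recall(tp=46, fp=4, fn=6)
print(f"precision={p:.2f} recall={r:.2f} ok={meets_targets(46, 4, 6)}")
# precision=0.92 recall=0.88 ok=True
```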

Status

Accepted (v0.1).

Affected obligations / constraints

  • Adds: the golden-corpus obligation — a conformant repository MUST ship positive + negative fixtures across the three domains, each positive fixture carrying the full pipeline chain with verdicts pinned as data (§33.1–§33.3).
  • Adds: the held-out mutated-variant obligation — each canonical domain fixture MUST ship at least one semantically-equivalent regenerated twin as the conformance gate (§33.7.1).
  • Modifies: the conformance contract of ADR 0026 / §32 — the contract is now evidenced by a shipped fixture oracle, not by prose alone.
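
The two added obligations amount to a coverage checklist, which can be sketched as a function that lists gaps. Everything about the record shape here (the `id`, `polarity`, `stages`, and `mutated_twin_of` keys and the fixture ids) is a hypothetical stand-in for whatever the fixture metadata actually records; only the domain names and the pipeline stage names come from the text above.

```python
from typing import Dict, List

DOMAINS = ("auth-refresh", "checkout", "payment-5xx")
CHAIN = ("spec", "obligations", "task", "trace", "verdict", "promotion")


def coverage_gaps(fixtures: List[Dict[str, object]]) -> List[str]:
    """List human-readable gaps against the golden-corpus obligations."""
    gaps = []
    for domain in DOMAINS:
        in_domain = [f for f in fixtures if f["domain"] == domain]
        # Each domain needs at least one positive and one negative fixture.
        for polarity in ("positive", "negative"):
            if not any(f["polarity"] == polarity for f in in_domain):
                gaps.append(f"{domain}: no {polarity} fixture")
        # Each domain needs at least one mutated twin of a canonical fixture.
        if not any(f.get("mutated_twin_of") for f in in_domain):
            gaps.append(f"{domain}: no mutated twin")
        # Each positive fixture must carry every stage of the pipeline chain.
        for f in in_domain:
            if f["polarity"] == "positive":
                missing = [s for s in CHAIN if s not in f.get("stages", ())]
                if missing:
                    gaps.append(f"{f['id']}: missing stages {missing}")
    return gaps


fixtures = []
for d in DOMAINS:
    fixtures.append({"id": f"{d}/pos", "domain": d, "polarity": "positive", "stages": CHAIN})
    fixtures.append({"id": f"{d}/pos-twin", "domain": d, "polarity": "positive",
                     "stages": CHAIN, "mutated_twin_of": f"{d}/pos"})
    fixtures.append({"id": f"{d}/neg", "domain": d, "polarity": "negative"})

print(coverage_gaps(fixtures))  # []

# Drop the trace stage from one positive fixture to show a reported gap.
fixtures[0]["stages"] = ("spec", "obligations", "task", "verdict", "promotion")
print(coverage_gaps(fixtures))
```

A reviewer validating a repository by hand walks the same checklist; the sketch only makes the items explicit.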

Ledger note (2026-06-11): refined by ADR-0065; corpus framing superseded by ADR-0066.
