Contract tests for reviewer prompts (A3) #7

@azevedo

Status

Declined for now. Detailed design captured for future revisit.

Context

ce-code-review has fixture-based regression tests for its reviewer prompts: synthetic diffs paired with expected-finding signatures, run periodically to catch prompt drift. The premise is that prompt drift is silent: you "tighten the wording" of a reviewer prompt and, without anyone noticing, it now misses cases it used to catch.
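
To make the mechanism concrete, here is a minimal Python sketch of a fixture and a drift check. The names (`Fixture`, `expected_signatures`, `missed_signatures`) and the `"<category>:<file>"` signature format are hypothetical illustrations, not ce-code-review's actual layout, which could not be inspected during research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixture:
    """A synthetic diff paired with the findings a reviewer is expected to report."""
    name: str
    diff: str
    expected_signatures: frozenset[str]  # hypothetical format: "<category>:<file>"


def missed_signatures(fixture: Fixture, observed: set[str]) -> set[str]:
    """Return expected findings the reviewer no longer reports (the drift signal)."""
    return set(fixture.expected_signatures) - observed


fixture = Fixture(
    name="sql-injection-basic",
    diff='+ cursor.execute("SELECT * FROM users WHERE id = " + user_id)',
    expected_signatures=frozenset({"sql-injection:db.py"}),
)

# Simulated reviewer output before and after a prompt edit (hand-written, not real runs).
before_edit = {"sql-injection:db.py", "naming:db.py"}
after_edit = {"naming:db.py"}

print(missed_signatures(fixture, before_edit))  # set()
print(missed_signatures(fixture, after_edit))   # {'sql-injection:db.py'}
```

A non-empty result is the tripwire: the prompt edit removed a finding that the fixture says should still be there.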

This plugin has seven built-in reviewer prompts (agents/review/*.md), each of which is the only specification of what that reviewer is supposed to do. Prompt edits happen during normal development with no automated regression protection. The plugin is also published (.claude-plugin/plugin.json is versioned per CLAUDE.md), so silent regressions ship to users.

Decision

Declined for now.

Rationale

  1. Cost is high and concentrated up front. A credible corpus of at least 21 fixtures (7 reviewers × 3 fixtures minimum), plus a runner, assertion design, CI wiring, and documentation, is a multi-day project before any benefit accrues.
  2. Benefit scales with edit velocity, which is currently low. Reviewer prompts are not under heavy concurrent edit pressure. The situation that benefits most from regression protection, multiple contributors editing prompts, does not exist yet.
  3. The hard part is the assertion strategy, not the test runner, and that strategy is substantially easier with C2 (structured output) in place. Adopting A3 before C2 means doing the assertion design twice: first as prose matching, then again as structured-field assertions.
  4. LLM output non-determinism makes the value/cost ratio worse than for deterministic code. Contract tests for prompts are inherently harder to keep useful than unit tests for deterministic code.
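
Point 4 can be illustrated with a short Python sketch. Because a single run of an LLM reviewer may pass or fail by chance, a fixture would likely have to be run several times and judged on a pass rate. The function name `passes_contract`, the default run count and threshold, and the stubbed outcomes are assumptions for illustration only; no such runner exists in this plugin.

```python
from typing import Callable


def passes_contract(
    run_once: Callable[[], bool],
    runs: int = 5,
    min_pass_rate: float = 0.8,
) -> bool:
    """Run a non-deterministic check several times and require a minimum pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if run_once())
    return passed / runs >= min_pass_rate


# Stub standing in for a real reviewer invocation: it fails on the third call only.
outcomes = iter([True, True, False, True, True])
print(passes_contract(lambda: next(outcomes)))  # True (4 of 5 runs passed)
```

Every extra run multiplies the cost of each fixture, and the threshold itself has to be tuned and maintained, which is part of why the ratio is worse than for deterministic code.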

Design captured (if revisited)

| Decision | Choice |
| --- | --- |
| Hard dependency | C2 (validator pass / structured output) must land first; required for the assertion strategy |
| Soft dependency | A2 (parameterized output path in dispatch templates); only relevant if A2 lands first |
| Assertion strategy | Structured-field assertions over C2's schema. NOT prose regex matching. |
| Execution mode | Manual only. Single-contributor repo, no PR-based CI. |
| Fixture organization | TBD. Verify against ce-code-review's tests/contracts/ layout before locking. Per-reviewer vs. whole-suite to be decided on data. |
| Negative controls | Yes: include diffs that should not trigger findings for a given reviewer, to catch false-positive regressions. |
| Versioning | Fixtures versioned with both model version (e.g., claude-opus-4-7) and prompt version. Revalidation protocol on model bumps. |
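
As a rough illustration of the choices above, the following Python sketch shows what structured-field assertions with a negative control and version tags might look like once C2 exists. C2's schema is not defined in this issue, so the `Finding` fields (`category`, `file`, `severity`), the `ContractFixture` type, the fixture names, and the `"v1"` prompt version are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real ones would come from C2's schema.
    category: str
    file: str
    severity: str


@dataclass(frozen=True)
class ContractFixture:
    name: str
    model_version: str   # fixture is revalidated when the model changes
    prompt_version: str  # fixture is revalidated when the prompt changes
    must_include: tuple[tuple[str, str], ...]  # (category, file) pairs that must appear
    must_be_empty: bool = False                # negative control: no findings allowed


def check(fixture: ContractFixture, findings: list[Finding]) -> list[str]:
    """Return contract violations; an empty list means the fixture passes."""
    if fixture.must_be_empty:
        return [f"unexpected finding: {f.category} in {f.file}" for f in findings]
    seen = {(f.category, f.file) for f in findings}
    return [
        f"missing finding: {category} in {path}"
        for category, path in fixture.must_include
        if (category, path) not in seen
    ]


positive = ContractFixture(
    "unchecked-input", "claude-opus-4-7", "v1", (("security", "app.py"),)
)
negative = ContractFixture(
    "whitespace-only", "claude-opus-4-7", "v1", (), must_be_empty=True
)

print(check(positive, [Finding("security", "app.py", "high")]))  # []
print(check(negative, [Finding("style", "app.py", "low")]))
# ['unexpected finding: style in app.py']
```

Asserting on fields rather than on prose means a reviewer can reword its explanation freely without breaking the fixture, which is the reason the design rules out regex matching.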

Trigger conditions for revisit

Any one of:

  • C2 lands — removes the hardest design barrier (structured-field assertions become feasible)
  • Contributor count grows beyond one — increases the regression-protection benefit
  • Reviewer prompts enter a heavy edit cadence — increases the protection value
  • An observed regression incident — concrete demand for the tripwire

References

  • ce-code-review test infrastructure (specifics inaccessible during research — references/ and scripts/ subdirs returned 404 from public docs)
  • C2 issue (validator pass) — must land before A3 is revisited
  • Sister candidate that survived the decline: A1 protected-artifacts guard (cheap, accepted)
