Contract tests for reviewer prompts (A3)

## Status

**Declined for now.** Detailed design captured for future revisit.

## Context

`ce-code-review` has fixture-based regression tests for its reviewer prompts — synthetic diffs paired with expected-finding signatures, run periodically to catch prompt drift. The premise: prompt drift is silent. You "tighten the wording" of a reviewer and unbeknownst to you it now misses cases it used to catch.

This plugin has seven built-in reviewer prompts (`agents/review/*.md`), each of which is the only spec of what that reviewer is supposed to do. Edits happen in normal development with zero automated regression protection. The plugin is also published (`.claude-plugin/plugin.json` is versioned per CLAUDE.md) so silent regressions ship to users.

## Decision

**Declined for now.**

## Rationale

1. **Cost is high and concentrated up front.** A credible 21+-fixture corpus (7 reviewers × 3 fixtures minimum) plus a runner plus assertion-design plus CI wiring plus documentation is a multi-day project before any benefit accrues.
2. **Benefit scales with edit velocity, which is currently low.** Reviewer prompts are not under heavy concurrent edit pressure. The most likely consumer of regression-protection (multiple contributors editing prompts) isn't present.
3. **The hard part is assertion strategy, not the test runner.** And that strategy is *substantially* easier with C2 (structured output) in place. Adopting A3 before C2 means doing the assertion design twice — first as prose matching, then re-doing it as structured-field assertions.
4. **LLM output non-determinism makes the value/cost ratio worse than for deterministic code.** Contract tests for prompts are inherently harder to keep useful than tests for unit code.

## Design captured (if revisited)

| Decision | Choice |
|---|---|
| **Hard dependency** | C2 (validator pass / structured output) must land first — required for the assertion strategy |
| **Soft dependency** | A2 (parameterized output path in dispatch templates) — only relevant if A2 lands first |
| **Assertion strategy** | Structured-field assertions over C2's schema. NOT prose regex matching. |
| **Execution mode** | Manual only. Single-contributor repo, no PR-based CI. |
| **Fixture organization** | TBD. Verify against ce-code-review's `tests/contracts/` layout before locking. Per-reviewer vs. whole-suite to be decided on data. |
| **Negative controls** | Yes — include diffs that should *not* trigger findings for a given reviewer, to catch false-positive regressions. |
| **Versioning** | Fixtures versioned with both model version (e.g., claude-opus-4-7) and prompt version. Revalidation protocol on model bumps. |

## Trigger conditions for revisit

Any one of:
- **C2 lands** — removes the hardest design barrier (structured-field assertions become feasible)
- **Contributor count grows beyond one** — increases the regression-protection benefit
- **Reviewer prompts enter a heavy edit cadence** — increases the protection value
- **An observed regression incident** — concrete demand for the tripwire

## References

- ce-code-review test infrastructure (specifics inaccessible during research — `references/` and `scripts/` subdirs returned 404 from public docs)
- C2 issue (validator pass) — must land before A3 is revisited
- Sister candidate that survived the decline: A1 protected-artifacts guard (cheap, accepted)

Decision	Choice
Hard dependency	C2 (validator pass / structured output) must land first — required for the assertion strategy
Soft dependency	A2 (parameterized output path in dispatch templates) — only relevant if A2 lands first
Assertion strategy	Structured-field assertions over C2's schema. NOT prose regex matching.
Execution mode	Manual only. Single-contributor repo, no PR-based CI.
Fixture organization	TBD. Verify against ce-code-review's `tests/contracts/` layout before locking. Per-reviewer vs. whole-suite to be decided on data.
Negative controls	Yes — include diffs that should not trigger findings for a given reviewer, to catch false-positive regressions.
Versioning	Fixtures versioned with both model version (e.g., claude-opus-4-7) and prompt version. Revalidation protocol on model bumps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contract tests for reviewer prompts (A3) #7

Status

Context

Decision

Rationale

Design captured (if revisited)

Trigger conditions for revisit

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Contract tests for reviewer prompts (A3) #7

Description

Status

Context

Decision

Rationale

Design captured (if revisited)

Trigger conditions for revisit

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions