Status
Declined for now. Detailed design captured for future revisit.
Context
ce-code-review has fixture-based regression tests for its reviewer prompts — synthetic diffs paired with expected-finding signatures, run periodically to catch prompt drift. The premise: prompt drift is silent. You "tighten the wording" of a reviewer and unbeknownst to you it now misses cases it used to catch.
This plugin has seven built-in reviewer prompts (agents/review/*.md), each of which is the only spec of what that reviewer is supposed to do. Edits happen in normal development with zero automated regression protection. The plugin is also published (.claude-plugin/plugin.json is versioned per CLAUDE.md) so silent regressions ship to users.
Decision
Declined for now.
Rationale
- Cost is high and concentrated up front. A credible 21+-fixture corpus (7 reviewers × 3 fixtures minimum) plus a runner plus assertion-design plus CI wiring plus documentation is a multi-day project before any benefit accrues.
- Benefit scales with edit velocity, which is currently low. Reviewer prompts are not under heavy concurrent edit pressure. The most likely consumer of regression-protection (multiple contributors editing prompts) isn't present.
- The hard part is assertion strategy, not the test runner. And that strategy is substantially easier with C2 (structured output) in place. Adopting A3 before C2 means doing the assertion design twice — first as prose matching, then re-doing it as structured-field assertions.
- LLM output non-determinism makes the value/cost ratio worse than for deterministic code. Contract tests for prompts are inherently harder to keep useful than tests for unit code.
Design captured (if revisited)
| Decision |
Choice |
| Hard dependency |
C2 (validator pass / structured output) must land first — required for the assertion strategy |
| Soft dependency |
A2 (parameterized output path in dispatch templates) — only relevant if A2 lands first |
| Assertion strategy |
Structured-field assertions over C2's schema. NOT prose regex matching. |
| Execution mode |
Manual only. Single-contributor repo, no PR-based CI. |
| Fixture organization |
TBD. Verify against ce-code-review's tests/contracts/ layout before locking. Per-reviewer vs. whole-suite to be decided on data. |
| Negative controls |
Yes — include diffs that should not trigger findings for a given reviewer, to catch false-positive regressions. |
| Versioning |
Fixtures versioned with both model version (e.g., claude-opus-4-7) and prompt version. Revalidation protocol on model bumps. |
Trigger conditions for revisit
Any one of:
- C2 lands — removes the hardest design barrier (structured-field assertions become feasible)
- Contributor count grows beyond one — increases the regression-protection benefit
- Reviewer prompts enter a heavy edit cadence — increases the protection value
- An observed regression incident — concrete demand for the tripwire
References
- ce-code-review test infrastructure (specifics inaccessible during research —
references/ and scripts/ subdirs returned 404 from public docs)
- C2 issue (validator pass) — must land before A3 is revisited
- Sister candidate that survived the decline: A1 protected-artifacts guard (cheap, accepted)
Status
Declined for now. Detailed design captured for future revisit.
Context
ce-code-reviewhas fixture-based regression tests for its reviewer prompts — synthetic diffs paired with expected-finding signatures, run periodically to catch prompt drift. The premise: prompt drift is silent. You "tighten the wording" of a reviewer and unbeknownst to you it now misses cases it used to catch.This plugin has seven built-in reviewer prompts (
agents/review/*.md), each of which is the only spec of what that reviewer is supposed to do. Edits happen in normal development with zero automated regression protection. The plugin is also published (.claude-plugin/plugin.jsonis versioned per CLAUDE.md) so silent regressions ship to users.Decision
Declined for now.
Rationale
Design captured (if revisited)
tests/contracts/layout before locking. Per-reviewer vs. whole-suite to be decided on data.Trigger conditions for revisit
Any one of:
References
references/andscripts/subdirs returned 404 from public docs)