agentv-bench: automate keep/discard decision in iteration loop #958

## Objective

Automate the keep/discard decision inside the existing `agentv-bench` Step 5 iteration loop so obvious improvements do not require a human pause.

## Design Latitude

**Scope: Skill-only change** — no CLI, schema, or core code changes required.

After each iteration in the bench skill's optimization loop (SKILL.md Step 5), automatically:

1. Run `agentv compare baseline.jsonl candidate.jsonl --json`
2. Parse the structured output: `{ summary: { wins, losses, ties, meanDelta } }`
3. Apply keep/discard rules:
   - `wins > losses` → **keep** the change and promote it to the new baseline
   - `wins <= losses` → **discard** the change, revert, and try a different mutation
   - `meanDelta == 0` and the candidate prompt is simpler → **keep** (simplicity criterion; this takes precedence over the discard rule, since otherwise it could never apply)
4. Log the decision and its rationale before proceeding to the next iteration
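The steps above can be sketched as a small TypeScript helper. This is a minimal sketch, assuming the `summary` shape quoted in step 2 is exactly what `agentv compare --json` emits. `CompareSummary`, `parseSummary`, `decide`, and `formatDecisionLog` are hypothetical names, and `candidateIsSimpler` is an assumed boolean input because this issue does not define how prompt simplicity is measured.

```typescript
// Hypothetical type mirroring the summary shape described in step 2 (assumed, not verified).
interface CompareSummary {
  wins: number;
  losses: number;
  ties: number;
  meanDelta: number;
}

interface Decision {
  keep: boolean;
  rationale: string;
}

// Parse stdout from `agentv compare ... --json` (assumed shape) and
// fail loudly if any expected summary field is missing or not a number.
function parseSummary(stdout: string): CompareSummary {
  const parsed = JSON.parse(stdout) as { summary?: Partial<CompareSummary> };
  const s = parsed.summary;
  for (const field of ["wins", "losses", "ties", "meanDelta"] as const) {
    if (typeof s?.[field] !== "number") {
      throw new Error(`compare output is missing summary.${field}`);
    }
  }
  return s as CompareSummary;
}

// Apply the keep/discard rules. The simplicity criterion is checked before
// the discard rule so it can keep a no-change but simpler candidate.
// `candidateIsSimpler` is an assumed input; judging "simpler" is out of scope here.
function decide(summary: CompareSummary, candidateIsSimpler: boolean): Decision {
  const { wins, losses, meanDelta } = summary;
  if (wins > losses) {
    return { keep: true, rationale: `wins (${wins}) > losses (${losses})` };
  }
  if (meanDelta === 0 && candidateIsSimpler) {
    return { keep: true, rationale: "meanDelta == 0 and candidate prompt is simpler" };
  }
  return { keep: false, rationale: `wins (${wins}) <= losses (${losses})` };
}

// One log line per iteration, emitted before the loop moves on (step 4).
function formatDecisionLog(iteration: number, d: Decision): string {
  return `iteration ${iteration}: ${d.keep ? "KEEP" : "DISCARD"} (${d.rationale})`;
}

// Usage with a made-up compare result:
const exampleSummary = parseSummary('{"summary":{"wins":5,"losses":2,"ties":3,"meanDelta":0.12}}');
console.log(formatDecisionLog(1, decide(exampleSummary, false)));
// prints: iteration 1: KEEP (wins (5) > losses (2))
```

Checking the simplicity rule before the discard rule is what lets a tied-but-simpler candidate survive; reversing the order would make that rule unreachable.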

### Why this stays narrow

- This is the smallest useful improvement to the current bench loop.
- It should remain compatible with human checkpoints at iterations 3, 6, 9.
- It should remain complementary to `#748` rather than expanding into full unattended autoresearch.
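One way to keep the automation compatible with the existing checkpoints is to run the automated decision on every iteration and check for a checkpoint only afterwards, so automation never skips a pause. A minimal sketch, assuming the checkpoints are exactly iterations 3, 6, and 9 as stated above; `HUMAN_CHECKPOINTS` and `nextStep` are hypothetical names, not part of SKILL.md.

```typescript
// Hypothetical: the checkpoint iterations named in this issue.
const HUMAN_CHECKPOINTS: ReadonlySet<number> = new Set([3, 6, 9]);

type NextStep = "continue" | "pause-for-human";

// Called after the automated keep/discard decision has been applied and logged.
// The automated decision never suppresses a checkpoint.
function nextStep(iteration: number): NextStep {
  return HUMAN_CHECKPOINTS.has(iteration) ? "pause-for-human" : "continue";
}

for (let i = 1; i <= 9; i++) {
  // ...automated keep/discard decision for iteration i would run here...
  console.log(`iteration ${i}: ${nextStep(i)}`);
}
```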

## Acceptance Signals

- Clear-cut iterations no longer require manual keep/discard judgment
- Human checkpoints still fire at the existing intervals
- Decision logic uses existing `agentv compare --json` output only
- No new CLI flags, config fields, persistence layer, or runtime memory features are introduced

## Non-Goals

- Not full autoresearch / overnight unattended mutation loops
- Not mutator generation logic (`#746`)
- Not eval bootstrapping (`#747`)
- Not core iteration metadata work (`#335`)
- Not persistent session search, personal memory, or self-improving runtime features

## Related

- #748 — full autoresearch mode
- #746 — mutator subagent
- #747 — eval-generator subagent
- #335 — iteration metadata / termination taxonomy
- #1003 — tracking issue for optimization-loop roadmap coordination
