agentv-bench: automate keep/discard decision in iteration loop #958

@christso

Description

Objective

Automate the keep/discard decision inside the existing agentv-bench Step 5 iteration loop so obvious improvements do not require a human pause.

Design Latitude

Scope: Skill-only change — no CLI, schema, or core code changes required.

After each iteration in the bench skill's optimization loop (SKILL.md Step 5), automatically:

  1. Run agentv compare baseline.jsonl candidate.jsonl --json
  2. Parse the structured output: { summary: { wins, losses, ties, meanDelta } }
  3. Apply keep/discard rules:
    • wins > losses → keep change, promote to new baseline
    • wins <= losses → discard change, revert, try different mutation
    • meanDelta == 0 but simpler prompt → keep (simplicity criterion)
  4. Log the decision and rationale before proceeding to next iteration
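
The four steps above can be sketched roughly as follows. This is an illustration only, since the proposal is a skill-only change and the real logic would live as instructions in SKILL.md. The names `Decision`, `decide`, `run_compare`, and the `candidate_is_simpler` flag are hypothetical. How "simpler" is measured is not specified in this issue, so the caller supplies it. The rule ordering is also an assumption: the simplicity criterion is treated here as an exception checked before the `wins <= losses` discard rule.

```python
import json
import subprocess
from dataclasses import dataclass


@dataclass
class Decision:
    keep: bool
    rationale: str


def decide(summary: dict, candidate_is_simpler: bool = False) -> Decision:
    """Apply the keep/discard rules to the `summary` object from the compare output.

    `candidate_is_simpler` is a hypothetical flag; the issue does not define
    how prompt simplicity is judged.
    """
    wins, losses = summary["wins"], summary["losses"]
    mean_delta = summary["meanDelta"]

    if wins > losses:
        return Decision(True, f"wins ({wins}) > losses ({losses}): keep, promote to new baseline")
    # Assumed precedence: simplicity criterion is checked before the discard rule.
    if mean_delta == 0 and candidate_is_simpler:
        return Decision(True, "meanDelta == 0 and simpler prompt: keep (simplicity criterion)")
    return Decision(False, f"wins ({wins}) <= losses ({losses}): discard, revert, try different mutation")


def run_compare(baseline: str, candidate: str) -> dict:
    """Run the compare command quoted in step 1 and return its `summary` object."""
    result = subprocess.run(
        ["agentv", "compare", baseline, candidate, "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["summary"]


def log_decision(iteration: int, decision: Decision) -> None:
    """Step 4: record the decision and rationale before the next iteration."""
    verdict = "KEEP" if decision.keep else "DISCARD"
    print(f"[iteration {iteration}] {verdict}: {decision.rationale}")
```

A call such as `log_decision(1, decide(run_compare("baseline.jsonl", "candidate.jsonl")))` would cover one pass through steps 1 to 4, assuming the `agentv` binary is on the PATH.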

Why this stays narrow

  • This is the smallest useful improvement to the current bench loop.
  • It should remain compatible with human checkpoints at iterations 3, 6, 9.
  • It should remain complementary to #748 rather than expanding into full unattended autoresearch.
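
One way to picture how the automated decision could coexist with the human checkpoints is sketched below. The helper names are hypothetical, iterations are assumed to be 1-indexed, and the checkpoint set is taken literally from the list above (3, 6, 9). It is also an assumption that the automated keep/discard result is still applied at a checkpoint iteration before the loop pauses; the issue does not say which takes priority.

```python
# Checkpoint iterations as listed in this issue; assumed to be 1-indexed.
HUMAN_CHECKPOINTS = frozenset({3, 6, 9})


def needs_human_review(iteration: int) -> bool:
    """Return True when the loop should pause for a human checkpoint."""
    return iteration in HUMAN_CHECKPOINTS


def next_action(iteration: int, keep: bool) -> str:
    """Describe what the loop does after an automated keep/discard decision."""
    applied = "promote candidate to baseline" if keep else "revert candidate"
    if needs_human_review(iteration):
        # Assumption: the automated result is applied first, then the loop pauses.
        return f"{applied}, then pause for human checkpoint"
    return f"{applied}, then continue to next iteration"
```

Under these assumptions, iterations 1, 2, 4, 5, 7, and 8 proceed without a pause, which is the narrow improvement the issue asks for.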

Acceptance Signals

  • Clear-cut iterations no longer require manual keep/discard judgment
  • Human checkpoints still fire at the existing intervals
  • Decision logic uses existing agentv compare --json output only
  • No new CLI flags, config fields, persistence layer, or runtime memory features are introduced

Non-Goals

  • Not full autoresearch / overnight unattended mutation loops
  • Not mutator generation logic (#746)
  • Not eval bootstrapping (#747)
  • Not core iteration metadata work (#335)
  • Not persistent session search, personal memory, or self-improving runtime features

Metadata

Assignees

No one assigned

    Labels

    autoagent, in-progress (Claimed by an agent; do not duplicate work)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
