Benchmarking: fedify bench compare A/B mode and CI gate robustness #786

@dahlia

Note

Sub-issue of #744. Before reading further, read #744 in full, including all of its comments, where the benchmarking tool's design is worked out in detail. This issue is one slice of that design and assumes the decisions recorded in those comments.

This is step 5 of 5. It depends on #783 and builds on the gating from #784.

Scope

The premise (detailed in the #744 comments): an application developer runs fedify bench in their own CI, where runners are too noisy to gate on precise latency, so the gate leans on robust signals and on same-runner comparison.

fedify bench compare

  • fedify bench compare --base <ref> --head <ref> runs the base and head revisions of the application on the same runner and reports the delta. Running both halves on one machine cancels out the runner's absolute speed, which is the only trustworthy way to detect a latency regression in noisy CI.
  • --max-regression sets the tolerance. A regression fails the gate only when the delta exceeds the measured inter-run noise band, and that band is reported so the gate stays interpretable.
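
The decision described in the two bullets above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names (compareLatency, CompareResult, spreadPct), the choice of percent as the unit for --max-regression, and the definition of the noise band as the larger of the two revisions' relative min-to-max spread. The issue does not specify any of these.

```typescript
// Hypothetical sketch; none of these names or formulas come from fedify itself.
interface CompareResult {
  deltaPct: number; // change of head median relative to base median, in percent
  noiseBandPct: number; // assumed noise band: worst relative spread of either side
  regressed: boolean;
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative spread (max - min) / median of one revision's repeated runs.
function spreadPct(values: number[]): number {
  return ((Math.max(...values) - Math.min(...values)) / median(values)) * 100;
}

function compareLatency(
  baseMs: number[],
  headMs: number[],
  maxRegressionPct: number,
): CompareResult {
  const baseMedian = median(baseMs);
  const deltaPct = ((median(headMs) - baseMedian) / baseMedian) * 100;
  const noiseBandPct = Math.max(spreadPct(baseMs), spreadPct(headMs));
  // Fail only when the slowdown exceeds both the tolerance and the noise band.
  const regressed = deltaPct > maxRegressionPct && deltaPct > noiseBandPct;
  return { deltaPct, noiseBandPct, regressed };
}
```

Returning the noise band next to the delta is what keeps the result interpretable: a reader can see whether a reported slowdown sits inside or outside the spread observed on that runner.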

Gate robustness in the run engine

  • Median-of-N aggregation (runs, default 3) for latency and throughput gates, so one unlucky run does not fail the build; correctness gates (success rate, errors) need only a single run.
  • The robust signals (success rate, an error budget) are the primary gate; throughput and latency serve as gross sanity bounds, not tight gates. This builds on the warn/fail severity from #784.
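
As a rough illustration of median-of-N aggregation, the sketch below takes the per-run latency and throughput figures and reduces each to its median. The RunResult shape and the function names are hypothetical and only stand in for whatever the run engine actually records; correctness metrics are left out because they need only a single run.

```typescript
// Hypothetical sketch; RunResult and aggregateRuns are illustrative names only.
interface RunResult {
  p95LatencyMs: number;
  requestsPerSec: number;
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Reduce N runs to one value per performance metric by taking the median,
// so a single outlier run cannot move the gated number on its own.
function aggregateRuns(runs: RunResult[]): RunResult {
  return {
    p95LatencyMs: median(runs.map((r) => r.p95LatencyMs)),
    requestsPerSec: median(runs.map((r) => r.requestsPerSec)),
  };
}

// Three runs, the second one hit by a noisy neighbor on the CI runner.
const aggregated = aggregateRuns([
  { p95LatencyMs: 120, requestsPerSec: 500 },
  { p95LatencyMs: 900, requestsPerSec: 80 },
  { p95LatencyMs: 125, requestsPerSec: 510 },
]);
console.log(aggregated);
```

With three runs the median simply discards the best and the worst value per metric, which is why one unlucky run does not fail the build.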

Example profiles

  • A CI-safe profile (correctness plus gross bounds) and a perf-lab profile (tight latency with compare in a controlled environment), to make the boundary concrete and steer teams away from brittle tight-latency gates in shared CI.
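
To make the boundary between the two profiles tangible, here is a purely illustrative sketch. The Profile and Gate shapes, the metric names, and every threshold are placeholders invented for this example; the real profile format is whatever #744 and #784 define, and the numbers carry no recommendation.

```typescript
// Hypothetical shapes and placeholder thresholds; not the actual profile format.
type Metric = "successRate" | "errorCount" | "p95LatencyMs" | "requestsPerSec";
type Severity = "warn" | "fail";

interface Gate {
  metric: Metric;
  op: ">=" | "<=";
  threshold: number;
  severity: Severity;
}

interface Profile {
  name: string;
  runs: number;
  gates: Gate[];
}

// CI-safe: correctness gates fail the build, performance bounds only warn.
const ciSafe: Profile = {
  name: "ci-safe",
  runs: 3,
  gates: [
    { metric: "successRate", op: ">=", threshold: 0.99, severity: "fail" },
    { metric: "errorCount", op: "<=", threshold: 5, severity: "fail" },
    { metric: "p95LatencyMs", op: "<=", threshold: 2000, severity: "warn" },
    { metric: "requestsPerSec", op: ">=", threshold: 10, severity: "warn" },
  ],
};

// Perf-lab: tight latency gate, meant for a controlled environment only.
const perfLab: Profile = {
  name: "perf-lab",
  runs: 5,
  gates: [
    { metric: "successRate", op: ">=", threshold: 0.99, severity: "fail" },
    { metric: "p95LatencyMs", op: "<=", threshold: 150, severity: "fail" },
  ],
};

function evaluate(
  profile: Profile,
  measured: Record<Metric, number>,
): { failed: Metric[]; warned: Metric[] } {
  const failed: Metric[] = [];
  const warned: Metric[] = [];
  for (const gate of profile.gates) {
    const value = measured[gate.metric];
    const ok = gate.op === ">=" ? value >= gate.threshold : value <= gate.threshold;
    if (!ok) (gate.severity === "fail" ? failed : warned).push(gate.metric);
  }
  return { failed, warned };
}
```

The same measurement can pass one profile and fail the other, which is the point of keeping them separate: a latency figure that is acceptable noise in shared CI would be a real finding in a controlled environment.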

Dependencies

Depends on #783 (the engine); builds on the expect gating and severity from #784.

Acceptance criteria

  • fedify bench compare runs base and head on the same runner and reports a noise-aware delta gated by --max-regression.
  • Median-of-N (runs) aggregation applies to latency and throughput gates.
  • CI-safe and perf-lab example profiles are provided.

Documentation

Add compare and the CI-safe versus perf-lab guidance to docs/manual/benchmarking.md, drawing the line between what CI should gate and what belongs in a controlled environment.

Metadata

Priority

High

Effort

Medium
