Benchmarking: `fedify bench compare` A/B mode and CI gate robustness

> [!NOTE]
> **Sub-issue of #744.** Before reading further, read #744 in full, including all of its comments, where the benchmarking tool's design is worked out in detail. This issue is one slice of that design and assumes the decisions recorded in those comments.

This is step 5 of 5. It depends on #783 and builds on the gating from #784.


Scope
-----

The premise (detailed in the #744 comments): an application developer runs `fedify bench` in their own CI, where runners are too noisy to gate on precise latency, so the gate leans on robust signals and on same-runner comparison.

### `fedify bench compare`

 -  `fedify bench compare --base <ref> --head <ref>` runs the base and head revisions of the application on the same runner and reports the delta. Running both halves on one machine cancels out the runner's absolute speed, which is the only trustworthy way to detect a latency regression in noisy CI.
 -  `--max-regression` sets the tolerance. The gate fails only when the delta exceeds the measured inter-run noise band, and that band is reported alongside the delta so the gate stays interpretable.
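
As a rough illustration of the noise-aware gate, the sketch below compares latency samples from base and head. Everything in it is an assumption made for illustration, not the design settled in #744: the names `compareLatency`, `relativeSpread`, and `CompareVerdict` are hypothetical, the noise band is estimated as the relative spread (max minus min, divided by the median) of each side's runs, and the gate is assumed to fail only when the delta exceeds both `--max-regression` and that band.

```typescript
// Hypothetical result shape; not an actual fedify type.
interface CompareVerdict {
  deltaRatio: number; // (head median - base median) / base median
  noiseBand: number; // estimated relative inter-run noise
  failed: boolean;
}

function median(values: readonly number[]): number {
  if (values.length === 0) throw new RangeError("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumed noise estimate: spread of the samples relative to their median.
function relativeSpread(samples: readonly number[]): number {
  return (Math.max(...samples) - Math.min(...samples)) / median(samples);
}

// maxRegression is a ratio, e.g. 0.10 for a 10% tolerance (assumed unit).
function compareLatency(
  baseMs: readonly number[],
  headMs: readonly number[],
  maxRegression: number,
): CompareVerdict {
  const baseMedian = median(baseMs);
  const deltaRatio = (median(headMs) - baseMedian) / baseMedian;
  const noiseBand = Math.max(relativeSpread(baseMs), relativeSpread(headMs));
  // Assumption: fail only when the delta clears both the tolerance and the noise.
  const failed = deltaRatio > Math.max(maxRegression, noiseBand);
  return { deltaRatio, noiseBand, failed };
}
```

Under these assumptions, a 3% slowdown measured with a 6% noise band passes even with a 5% tolerance, because the delta cannot be distinguished from runner noise.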

### Gate robustness in the run engine

 -  Median-of-N aggregation (`runs`, default 3) for latency and throughput gates, so one unlucky run does not fail the build; correctness gates (success rate, errors) need only a single run.
 -  The robust signals (success rate, an error budget) are the primary gate; throughput and latency serve as gross sanity bounds, not tight gates. This builds on the `warn`/`fail` severity from #784.
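
A minimal sketch of median-of-N aggregation, to show why one unlucky run does not move the gated value. The `RunMetrics` shape and the `aggregateRuns` name are hypothetical and do not reflect the engine from #783; only the idea of taking the median across `runs` comes from this issue.

```typescript
// Hypothetical per-run metrics; field names are illustrative only.
interface RunMetrics {
  p95LatencyMs: number;
  requestsPerSecond: number;
}

function median(values: readonly number[]): number {
  if (values.length === 0) throw new RangeError("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Aggregate each performance metric independently across the N runs.
function aggregateRuns(runs: readonly RunMetrics[]): RunMetrics {
  return {
    p95LatencyMs: median(runs.map((r) => r.p95LatencyMs)),
    requestsPerSecond: median(runs.map((r) => r.requestsPerSecond)),
  };
}

// Three runs, the second one hit by a noisy neighbour on the runner.
const aggregated = aggregateRuns([
  { p95LatencyMs: 120, requestsPerSecond: 900 },
  { p95LatencyMs: 480, requestsPerSecond: 310 },
  { p95LatencyMs: 130, requestsPerSecond: 880 },
]);
console.log(aggregated); // { p95LatencyMs: 130, requestsPerSecond: 880 }
```

With the default of three runs, a single outlier on either side is discarded entirely, which a mean would not do.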

### Example profiles

 -  A CI-safe profile (correctness plus gross bounds) and a perf-lab profile (tight latency with `compare` in a controlled environment), to make the boundary concrete and steer teams away from brittle tight-latency gates in shared CI.


Dependencies
------------

Depends on #783 (the engine); builds on the `expect` gating and severity from #784.


Acceptance criteria
-------------------

 -  [ ] `fedify bench compare` runs base and head on the same runner and reports a noise-aware delta gated by `--max-regression`.
 -  [ ] Median-of-N (`runs`) aggregation applies to latency and throughput gates.
 -  [ ] CI-safe and perf-lab example profiles are provided.


Documentation
-------------

Add `compare` and the CI-safe versus perf-lab guidance to *docs/manual/benchmarking.md*, drawing the line between what CI should gate and what belongs in a controlled environment.

