Surface skipped conformance scenarios as baselined known failures#3030

Merged: maxisbey merged 3 commits into main from conformance-skips on Jun 30, 2026
Conversation

@maxisbey maxisbey commented Jun 30, 2026

Contributor

A green conformance CI run currently overstates coverage in two ways. First, 11 of the harness's server scenarios are never selected at all: the 10 extension-tagged tasks-* scenarios never match a --spec-version filter and sit on the harness's pending list, so the active, draft, and 2026 legs all exclude them, and server-sse-polling is pending-listed and removed at 2026-07-28, so no leg can reach it. Second, several checks inside running scenarios skip silently when a prerequisite is missing: all 5 SEP-2243 Mcp-Param-* server-validation checks skip because no fixture tool carries an x-mcp-header annotation, and the SEP-2575 subscriptions/listen checks skip on -32601.

This PR makes those surfaces visible as baselined known failures instead of silent exclusions, in three steps (one commit each):

  1. New server leg: bare --suite all (no --spec-version) — selects every scenario shipped with the pinned harness, including the extension-tagged and pending-listed ones that no existing leg reaches. With it, server-sse-polling turns out to already pass (3/3), and the 9 runnable tasks-* scenarios fail and are baselined in expected-failures.yml as a burn-down ledger — the harness's stale-entry rule forces each entry out as tasks support lands. tasks-status-notifications is deliberately not listed: the harness skips it unconditionally (pending its rewrite against subscriptions/listen), and a baseline entry for a scenario with no failing checks is itself flagged stale. This is the same pattern typescript-sdk uses (their extensions leg runs --suite all with no --spec-version and baselines the same nine tasks failures).
  2. test_x_mcp_header fixture tool in the everything-server (string param annotated x-mcp-header: Region via json_schema_extra, mirroring typescript-sdk's fixture) — the 5 Mcp-Param server-validation checks now execute: the 3 accept-path checks pass, the reject-path checks fail because server-side Mcp-Param validation isn't implemented yet, and the scenario is baselined in both expected-failures files.
  3. Conformance pin bump b18aa918 → 4944b268 (0.2.0-alpha.8, which includes conformance#372; not published to npm yet, so the same sha256-verified pkg.pr.new mechanics as before). conformance#372 turns missing-prerequisite skips into "Not testable" failures, which surfaces one more unimplemented surface: server-stateless now fails its 3 subscriptions/listen MUST checks. The everything-server's legacy resources/subscribe handlers make it advertise resources.subscribe in server/discover while subscriptions/listen answers -32601, and advertised-but-rejected is now a failure. The scenario is baselined in both files until the listen runtime lands (Implement SEP-2575: Make MCP Stateless #2804). The 2 listChanged SHOULD checks remain legitimately skipped (declared false, which conformance#372 still treats as honest feature absence), and the tasks-* scenarios fail with higher check counts (their cascade skips now fail too) without needing entry changes.
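
The baseline behaviour these steps rely on can be sketched roughly as follows. This is a toy model, not the harness's code: `check_baseline`, its input shape, and the scenario name `tasks-lifecycle` are hypothetical. The only rules it encodes are the ones stated above: a listed scenario may fail, an unlisted failure is unexpected, and a listed scenario with no failing checks is flagged stale.

```python
def check_baseline(failing_checks: dict[str, int], baseline: set[str]) -> dict[str, list[str]]:
    """Toy model of a scenario-keyed expected-failures baseline.

    failing_checks maps scenario name -> number of failing checks in this run.
    """
    failed = {name for name, count in failing_checks.items() if count > 0}
    return {
        # Failures that no baseline entry waives: these turn the run red.
        "unexpected": sorted(failed - baseline),
        # Entries with nothing left to waive: also flagged, which forces burn-down.
        "stale": sorted(baseline - failed),
    }


run = {
    "tasks-lifecycle": 4,             # hypothetical tasks-* scenario name
    "tasks-status-notifications": 0,  # skipped unconditionally, so no failing checks
    "server-sse-polling": 0,          # already passes
}

# Listing only the scenario that actually fails keeps the baseline clean.
clean = check_baseline(run, {"tasks-lifecycle"})
# Listing the unconditionally skipped scenario as well is reported as stale.
with_skipped = check_baseline(run, {"tasks-lifecycle", "tasks-status-notifications"})
print(clean)
print(with_skipped)
```

Under this model, the same rule that keeps tasks-status-notifications out of the baseline is what removes each tasks-* entry once the corresponding scenario starts passing.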

Net effect on the baseline: the invariant of zero waivers for implemented surfaces is unchanged. Every entry added here tracks a not-yet-built surface that previously produced no signal at all.

Motivation and Context

A green conformance run currently overstates coverage: unimplemented surfaces (tasks extension, server-side Mcp-Param validation, subscriptions/listen) produce no signal because their scenarios are filtered or their checks skip upstream of the expected-failures mechanism. Visible baselined failures are strictly better — they show up in the run output, they can't rot silently across harness pin bumps, and the stale-entry rule turns the baseline into a forced burn-down list as features land.

How Has This Been Tested?

Replicated all six CI legs locally against the new pinned tarball (sha256-verified), exactly as CI invokes them. The four server legs and both client legs exit 0 with "Baseline check passed":

  • --suite active → 42 passed, 0 failed (new entries inert — those scenarios aren't in the suite)
  • --suite draft → 74 passed, 9 expected-failed (server-stateless 3 listen checks + http-custom-header-server-validation)
  • --suite all --spec-version 2026-07-28 → 103 passed, 9 expected-failed (same two)
  • --suite all (new leg) → 134 passed, 34 expected-failed (the above + 9 tasks-*); server-sse-polling and json-schema-2020-12 pass
  • both client legs → 421 and 370 checks passed respectively, 0 failed, 0 warnings

Also verified the generated tools/list schema carries "x-mcp-header": "Region" through pydantic's json_schema_extra, and ran the discovery pass at the new pin before updating baselines to confirm server-stateless was the only newly-unexpected failure (no stale entries, no warnings anywhere).
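
A check along these lines could be used to confirm that an annotated parameter is present in a tools/list result. It is a minimal sketch: `header_annotated_params` is a hypothetical helper, the `echo` tool is made up for contrast, and the result shape is an assumption beyond the `x-mcp-header`, `test_x_mcp_header`, and `region` names mentioned in this PR.

```python
def header_annotated_params(tools: list[dict]) -> dict[str, dict[str, str]]:
    """Map tool name -> {parameter name: header name} for annotated parameters."""
    found: dict[str, dict[str, str]] = {}
    for tool in tools:
        props = tool.get("inputSchema", {}).get("properties", {})
        annotated = {
            param: schema["x-mcp-header"]
            for param, schema in props.items()
            if isinstance(schema, dict) and "x-mcp-header" in schema
        }
        if annotated:
            found[tool["name"]] = annotated
    return found


# Assumed shape of a tools/list result, trimmed to the relevant fields.
tools = [
    {
        "name": "echo",  # hypothetical tool without any annotation
        "inputSchema": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
    {
        "name": "test_x_mcp_header",
        "inputSchema": {
            "type": "object",
            "properties": {"region": {"type": "string", "x-mcp-header": "Region"}},
        },
    },
]

print(header_annotated_params(tools))
```

An empty result corresponds to the situation before this PR, where no fixture tool carried the annotation and the five checks had nothing to run against.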

Breaking Changes

None — CI configuration and a conformance fixture tool only.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • Suite choice for the new leg: --suite pending (14 scenarios) would cover the same job-level union slightly faster, since all = active ∪ draft ∪ pending at this harness and the other two suites already run in this job. Bare --suite all is preferred for the self-contained full-denominator run, robustness to scenarios moving between harness suite lists across pin bumps, and parity with typescript-sdk's CI.
  • Why per-scenario waivers and not something finer: the harness baseline is keyed by scenario name only; one entry waives every failing check in that scenario. For tasks-* and http-custom-header-server-validation that's cheap (each tracks exactly one unimplemented surface). For server-stateless it carries a real masking cost, called out in the baseline comment: while listed, new failures in the scenario's other 25 (currently passing) envelope/error-handling checks are also absorbed. The trade is deliberate — a visible, forced-burn-down known failure for the listen gap beats a silent skip — but it makes landing the listen runtime (Implement SEP-2575: Make MCP Stateless #2804) the way to get those 25 checks back under guard.
  • The capability inconsistency the harness flags is real, not a harness artifact: on the 2026 wire the server advertises resources.subscribe while the subscribe methods themselves answer 404/-32601 there. Implementing listen resolves it from one side; era-aware capability reporting in server/discover would resolve it from the other.
  • The bare-all leg re-runs the active/draft scenarios at their default wire versions (same connections as the existing legs), so most of its runtime is duplicate coverage; the harness has a 10s per-request timeout, and the tasks scenarios fail fast on -32601 rather than hanging. The one genuinely new timing-sensitive scenario is server-sse-polling (real-time SSE reconnect waits); it passed consistently in local runs and runs green in typescript-sdk's identical leg, but if it proves flaky on 2-core runners it cannot be pre-waived (a baseline entry for a passing scenario is flagged stale), so a flake would need a rerun or a harness-side fix.
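
The masking cost of a scenario-keyed waiver described above can be illustrated with a small sketch. The function and the check identifiers (`listen-1`, `envelope-check`, and so on) are hypothetical; only the scenario name and the scenario-level granularity come from the discussion above.

```python
def unexpected_failures(failing: dict[str, set[str]], baseline: set[str]) -> dict[str, set[str]]:
    """Return failing checks per scenario that a scenario-keyed baseline does not absorb."""
    return {
        scenario: checks
        for scenario, checks in failing.items()
        if checks and scenario not in baseline
    }


baseline = {"server-stateless"}

# Today: only the three listen checks fail (check names are made up).
today = {"server-stateless": {"listen-1", "listen-2", "listen-3"}}
# Later: a regression appears in one of the other, previously passing checks.
regressed = {"server-stateless": {"listen-1", "listen-2", "listen-3", "envelope-check"}}

print(unexpected_failures(today, baseline))      # nothing unexpected
print(unexpected_failures(regressed, baseline))  # still nothing: the regression is absorbed
print(unexpected_failures(regressed, set()))     # visible once the entry is removed
```

Because the key is the scenario name, the second call reports nothing unexpected, which is the trade-off the baseline comment calls out.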

AI Disclaimer

maxisbey added 2 commits June 30, 2026 10:34
…wn failures

The tasks-* scenarios are extension-tagged: they never match a
--spec-version filter and sit on the harness's pending list, so none of
the existing server legs can select them — they were silently absent
from CI with no skip line and no waiver. server-sse-polling (pending,
removed at 2026-07-28) was equally unreachable.

Add a bare --suite all server leg, which selects every scenario shipped
with the pinned harness, and baseline the nine runnable tasks-*
scenarios in expected-failures.yml. The stale-entry rule turns the
baseline into a forced burn-down list as tasks support lands.
server-sse-polling turns out to already pass. tasks-status-notifications
is deliberately not baselined: the harness skips it unconditionally, and
a baseline entry for a scenario with no failing checks is itself flagged
stale.
…header fixture tool

All five SEP-2243 Mcp-Param-* server-validation checks in
http-custom-header-server-validation skipped silently because no
everything-server tool carried an x-mcp-header annotation — the scenario
printed a green 0-passed/0-failed in every leg that selected it.

Add a test_x_mcp_header tool whose region parameter carries the
annotation via json_schema_extra (mirroring the typescript-sdk fixture).
The checks now execute: the accept-path checks pass and the reject-path
checks fail because server-side Mcp-Param validation is not implemented
yet, so the scenario is baselined in both expected-failures files. The
stale-entry rule forces the entries out when validation lands.
@maxisbey maxisbey marked this pull request as ready for review June 30, 2026 10:36

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files


@claude claude Bot left a comment

Contributor

LGTM — CI conformance config plus an example-server fixture tool only; no SDK library code touched.

Extended reasoning...

Overview

This PR adds a fourth server conformance leg (bare --suite all) to .github/workflows/conformance.yml, baselines the newly-reachable scenarios (9 tasks-* extension scenarios and http-custom-header-server-validation) in the two expected-failures YAML files, and adds a test_x_mcp_header fixture tool to the everything-server example so the SEP-2243 Mcp-Param server-validation checks actually execute. No SDK library code is modified.

Security risks

None of substance. The new workflow step invokes the existing run-server.sh helper with different flags — no new actions, secrets, permissions, or untrusted inputs are introduced, and the harness package remains pinned by SHA-256 as before. The fixture tool is a trivial echo-style tool in an example conformance server, not in the published package.

Level of scrutiny

Low. The blast radius is limited to CI signal (a wrong baseline entry would fail the conformance job, not affect the shipped SDK) and a conformance fixture in an example server. The harness's stale-entry rule means over-broad waivers are self-correcting as features land. The changes follow the established pattern already used by the existing legs and mirror typescript-sdk's CI.

Other factors

The bug hunting system found no issues, the author documents local replication of all four legs against the pinned harness, and the YAML/workflow edits are mechanical and well-commented. The only operational risk noted (possible server-sse-polling flakiness on small runners) is acknowledged in the PR description and would only require a CI rerun, not a code change.

…cks it surfaces

conformance 4944b268 (0.2.0-alpha.8, unpublished on npm — same
sha256-verified pkg.pr.new mechanics as the previous pin) includes
upstream #372: checks whose prerequisite is missing now fail with a
'Not testable' message instead of skipping outside the pass/fail
denominator.

For this repo that surfaces one new known failure: server-stateless
fails its three subscriptions/listen MUST checks, because the
everything-server's legacy resources/subscribe handlers make it
advertise resources.subscribe in server/discover while
subscriptions/listen answers -32601 — advertised-but-rejected is now a
failure, not a skip. Baseline the scenario in both expected-failures
files until the listen runtime lands. The two listChanged SHOULD checks
remain legitimately skipped (declared false). The tasks-* scenarios
fail with higher check counts (their cascade skips now fail too) but
were already baselined; client legs are unaffected.
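
The distinction this commit message draws between an honest skip and an advertised-but-rejected failure can be modelled roughly as below. This is a toy reading of the rule as described here, not the harness's logic: `classify`, `Outcome`, and the reduction of a check to "capability advertised plus an optional JSON-RPC error code" are all assumptions.

```python
from enum import Enum
from typing import Optional

METHOD_NOT_FOUND = -32601  # JSON-RPC "Method not found"


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"


def classify(advertised: bool, error_code: Optional[int]) -> Outcome:
    """Classify one capability-gated check under the post-#372 rule as described above."""
    if not advertised:
        # Capability declared false: honest feature absence, still a skip.
        return Outcome.SKIP
    if error_code == METHOD_NOT_FOUND:
        # Advertised but rejected: previously a silent skip, now a failure.
        return Outcome.FAIL
    return Outcome.PASS if error_code is None else Outcome.FAIL


print(classify(advertised=True, error_code=METHOD_NOT_FOUND))  # the listen MUST checks
print(classify(advertised=False, error_code=None))             # the listChanged SHOULD checks
```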

@claude claude Bot left a comment

Contributor

I didn't find any bugs — the changes are confined to CI configuration and a conformance fixture tool — but the new bare --suite all leg, the baseline additions, and the harness pin bump to an unpublished pkg.pr.new build are CI policy decisions worth a maintainer's sign-off.

Extended reasoning...

Overview

This PR touches only conformance CI infrastructure: .github/workflows/conformance.yml (bumps the pinned harness from b18aa918 to 4944b268/0.2.0-alpha.8 and adds a fourth server leg running bare --suite all), both expected-failures*.yml baselines (adds server-stateless, http-custom-header-server-validation, and 9 tasks-* scenarios as known failures), and the example everything-server (adds a trivial test_x_mcp_header fixture tool with an x-mcp-header schema annotation). No SDK runtime code is modified.

Security risks

Low. The conformance harness is fetched from a pkg.pr.new URL, but it is pinned by commit SHA and verified against a recorded SHA-256 digest before use, continuing the pattern already established in the workflow; the workflow runs with contents: read only. The new fixture tool is an example-server change with no security surface. I cannot independently verify that the new digest corresponds to a genuine build of the upstream commit, which is one reason a maintainer should glance at the pin bump.
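
For readers unfamiliar with digest pinning, the verification step can be sketched as follows. This is an illustration only: the workflow's actual script is not shown in this conversation, and `sha256_of`, `verify_pin`, and the stand-in file are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pin(path: Path, expected_hex: str) -> bool:
    """Return True only if the file's digest matches the recorded one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Demo with a stand-in file; a real run would point at the downloaded tarball.
with tempfile.TemporaryDirectory() as tmp:
    tarball = Path(tmp) / "harness.tgz"
    payload = b"stand-in for the downloaded tarball"
    tarball.write_bytes(payload)
    recorded = hashlib.sha256(payload).hexdigest()
    ok = verify_pin(tarball, recorded)
    tampered = verify_pin(tarball, "0" * 64)

print(ok, tampered)
```

A check like this confirms that the fetched bytes match the recorded digest, but not that the digest itself was taken from a genuine build, which is the residual gap noted above.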

Level of scrutiny

Moderate. Nothing here ships in the SDK package, and a mistake would surface as a red (or misleadingly green) CI run rather than a user-facing defect. However, the PR encodes several policy choices a maintainer typically owns: pinning CI to an unpublished alpha build, adding a fourth conformance leg (extra CI time, mostly duplicate coverage per the author's own description), and adopting per-scenario baseline waivers as a burn-down ledger — including the caveat the author documents that a server-stateless waiver masks regressions in that scenario's other ~25 currently-passing checks.

Other factors

The bug-hunting pass found no issues, the author reports replicating all four server legs locally against the pinned harness, and the comments in the workflow and baseline files are unusually thorough about the trade-offs. The remaining open questions are judgment calls (suite choice, baseline granularity, flake risk of server-sse-polling on 2-core runners) rather than correctness bugs, so this is a quick review for someone who owns the conformance setup.

@maxisbey maxisbey merged commit 0b200ef into main Jun 30, 2026
61 checks passed
@maxisbey maxisbey deleted the conformance-skips branch June 30, 2026 12:07