Surface skipped conformance scenarios as baselined known failures #3030
…wn failures The tasks-* scenarios are extension-tagged: they never match a --spec-version filter and sit on the harness's pending list, so none of the existing server legs can select them — they were silently absent from CI with no skip line and no waiver. server-sse-polling (pending, removed at 2026-07-28) was equally unreachable. Add a bare --suite all server leg, which selects every scenario shipped with the pinned harness, and baseline the nine runnable tasks-* scenarios in expected-failures.yml. The stale-entry rule turns the baseline into a forced burn-down list as tasks support lands. server-sse-polling turns out to already pass. tasks-status-notifications is deliberately not baselined: the harness skips it unconditionally, and a baseline entry for a scenario with no failing checks is itself flagged stale.
…header fixture tool All five SEP-2243 Mcp-Param-* server-validation checks in http-custom-header-server-validation skipped silently because no everything-server tool carried an x-mcp-header annotation — the scenario printed a green 0-passed/0-failed in every leg that selected it. Add a test_x_mcp_header tool whose region parameter carries the annotation via json_schema_extra (mirroring the typescript-sdk fixture). The checks now execute: the accept-path checks pass and the reject-path checks fail because server-side Mcp-Param validation is not implemented yet, so the scenario is baselined in both expected-failures files. The stale-entry rule forces the entries out when validation lands.
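The prerequisite described in this commit can be sketched as a scan over a tools/list-shaped payload for parameters that carry the x-mcp-header annotation. This is only an illustration under stated assumptions: the function name find_header_params and the echo tool are hypothetical, and the harness's actual detection logic may differ.

```python
from typing import Any


def find_header_params(tools: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Return (tool name, parameter name, header label) for annotated parameters."""
    found = []
    for tool in tools:
        properties = tool.get("inputSchema", {}).get("properties", {})
        for param, schema in properties.items():
            label = schema.get("x-mcp-header")
            if label is not None:
                found.append((tool["name"], param, label))
    return found


# Hypothetical tools/list entries: only the second carries the annotation.
tools = [
    {
        "name": "echo",
        "inputSchema": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
    {
        "name": "test_x_mcp_header",
        "inputSchema": {
            "type": "object",
            "properties": {"region": {"type": "string", "x-mcp-header": "Region"}},
        },
    },
]

without_fixture = find_header_params(tools[:1])  # nothing to validate against
with_fixture = find_header_params(tools)         # the new fixture tool is found
print(without_fixture, with_fixture)
```

With no annotated parameter the scan comes back empty, which corresponds to the situation where the scenario reported 0 passed and 0 failed.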
No issues found across 4 files
LGTM — CI conformance config plus an example-server fixture tool only; no SDK library code touched.
Extended reasoning...
Overview
This PR adds a fourth server conformance leg (bare --suite all) to .github/workflows/conformance.yml, baselines the newly-reachable scenarios (9 tasks-* extension scenarios and http-custom-header-server-validation) in the two expected-failures YAML files, and adds a test_x_mcp_header fixture tool to the everything-server example so the SEP-2243 Mcp-Param server-validation checks actually execute. No SDK library code is modified.
Security risks
None of substance. The new workflow step invokes the existing run-server.sh helper with different flags — no new actions, secrets, permissions, or untrusted inputs are introduced, and the harness package remains pinned by SHA-256 as before. The fixture tool is a trivial echo-style tool in an example conformance server, not in the published package.
Level of scrutiny
Low. The blast radius is limited to CI signal (a wrong baseline entry would fail the conformance job, not affect the shipped SDK) and a conformance fixture in an example server. The harness's stale-entry rule means over-broad waivers are self-correcting as features land. The changes follow the established pattern already used by the existing legs and mirror typescript-sdk's CI.
Other factors
The bug hunting system found no issues, the author documents local replication of all four legs against the pinned harness, and the YAML/workflow edits are mechanical and well-commented. The only operational risk noted (possible server-sse-polling flakiness on small runners) is acknowledged in the PR description and would only require a CI rerun, not a code change.
…cks it surfaces conformance 4944b268 (0.2.0-alpha.8, unpublished on npm — same sha256-verified pkg.pr.new mechanics as the previous pin) includes upstream #372: checks whose prerequisite is missing now fail with a 'Not testable' message instead of skipping outside the pass/fail denominator. For this repo that surfaces one new known failure: server-stateless fails its three subscriptions/listen MUST checks, because the everything-server's legacy resources/subscribe handlers make it advertise resources.subscribe in server/discover while subscriptions/listen answers -32601 — advertised-but-rejected is now a failure, not a skip. Baseline the scenario in both expected-failures files until the listen runtime lands. The two listChanged SHOULD checks remain legitimately skipped (declared false). The tasks-* scenarios fail with higher check counts (their cascade skips now fail too) but were already baselined; client legs are unaffected.
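The skip-versus-fail distinction this commit message describes can be sketched as a small classifier. It is a simplification built only from the behaviour stated here: the names Outcome and classify_listen_check are hypothetical, and a real check would verify far more than the absence of an error code.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"


def classify_listen_check(advertised: bool, error_code: Optional[int]) -> Outcome:
    """Classify a capability-gated check under the post-#372 rule as described above."""
    if not advertised:
        # Capability declared absent: honest feature absence, still a skip.
        return Outcome.SKIP
    if error_code is not None:
        # Advertised but rejected (for example -32601): now a failure, not a skip.
        return Outcome.FAIL
    # Assumption: no error stands in for "the check's assertions held".
    return Outcome.PASS


advertised_but_rejected = classify_listen_check(True, -32601)
declared_false = classify_listen_check(False, -32601)
print(advertised_but_rejected, declared_false)
```

Under this sketch the server-stateless listen checks land in the FAIL branch, while the listChanged checks (declared false) stay in SKIP.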
I didn't find any bugs — the changes are confined to CI configuration and a conformance fixture tool — but the new bare --suite all leg, the baseline additions, and the harness pin bump to an unpublished pkg.pr.new build are CI policy decisions worth a maintainer's sign-off.
Extended reasoning...
Overview
This PR touches only conformance CI infrastructure: .github/workflows/conformance.yml (bumps the pinned harness from b18aa918 to 4944b268/0.2.0-alpha.8 and adds a fourth server leg running bare --suite all), both expected-failures*.yml baselines (adds server-stateless, http-custom-header-server-validation, and 9 tasks-* scenarios as known failures), and the example everything-server (adds a trivial test_x_mcp_header fixture tool with an x-mcp-header schema annotation). No SDK runtime code is modified.
Security risks
Low. The conformance harness is fetched from a pkg.pr.new URL, but it is pinned by commit SHA and verified against a recorded SHA-256 digest before use, continuing the pattern already established in the workflow; the workflow runs with contents: read only. The new fixture tool is an example-server change with no security surface. I cannot independently verify that the new digest corresponds to a genuine build of the upstream commit, which is one reason a maintainer should glance at the pin bump.
Level of scrutiny
Moderate. Nothing here ships in the SDK package, and a mistake would surface as a red (or misleadingly green) CI run rather than a user-facing defect. However, the PR encodes several policy choices a maintainer typically owns: pinning CI to an unpublished alpha build, adding a fourth conformance leg (extra CI time, mostly duplicate coverage per the author's own description), and adopting per-scenario baseline waivers as a burn-down ledger — including the caveat the author documents that a server-stateless waiver masks regressions in that scenario's other ~25 currently-passing checks.
Other factors
The bug-hunting pass found no issues, the author reports replicating all four server legs locally against the pinned harness, and the comments in the workflow and baseline files are unusually thorough about the trade-offs. The remaining open questions are judgment calls (suite choice, baseline granularity, flake risk of server-sse-polling on 2-core runners) rather than correctness bugs, so this is a quick review for someone who owns the conformance setup.
Conformance CI's green currently overstates coverage in two ways. First, 11 of the harness's server scenarios are never selected at all: the 10 extension-tagged tasks-* scenarios never match a --spec-version filter and sit on the harness's pending list, so the active, draft, and 2026 legs all exclude them; server-sse-polling is pending-listed and removed at 2026-07-28, so no leg can reach it. Second, several checks inside running scenarios skip silently when a prerequisite is missing: all 5 SEP-2243 Mcp-Param-* server-validation checks skip because no fixture tool carries an x-mcp-header annotation, and the SEP-2575 subscriptions/listen checks skip on -32601.

This PR makes those surfaces visible as baselined known failures instead of silent exclusions, in three steps (one commit each):

1. --suite all (no --spec-version): selects every scenario shipped with the pinned harness, including the extension-tagged and pending-listed ones that no existing leg reaches. With it, server-sse-polling turns out to already pass (3/3), and the 9 runnable tasks-* scenarios fail and are baselined in expected-failures.yml as a burn-down ledger; the harness's stale-entry rule forces each entry out as tasks support lands. tasks-status-notifications is deliberately not listed: the harness skips it unconditionally (pending its rewrite against subscriptions/listen), and a baseline entry for a scenario with no failing checks is itself flagged stale. This is the same pattern typescript-sdk uses (their extensions leg runs --suite all with no --spec-version and baselines the same nine tasks failures).
2. test_x_mcp_header fixture tool in the everything-server (string param annotated x-mcp-header: Region via json_schema_extra, mirroring typescript-sdk's fixture): the 5 Mcp-Param server-validation checks now execute. The 3 accept-path checks pass, the reject-path checks fail because server-side Mcp-Param validation isn't implemented yet, and the scenario is baselined in both expected-failures files.
3. Harness pin bump b18aa918 → 4944b268 (0.2.0-alpha.8, which includes conformance#372; not published to npm yet, so the same sha256-verified pkg.pr.new mechanics as before). #372 turns missing-prerequisite skips into "Not testable" failures, which surfaces one more unimplemented surface: server-stateless now fails its 3 subscriptions/listen MUST checks. The everything-server's legacy resources/subscribe handlers make it advertise resources.subscribe in server/discover while subscriptions/listen answers -32601, and advertised-but-rejected is now a failure. The scenario is baselined in both files until the listen runtime lands (Implement SEP-2575: Make MCP Stateless, #2804). The 2 listChanged SHOULD checks remain legitimately skipped (declared false, which #372 still treats as honest feature absence), and the tasks-* scenarios fail with higher check counts (their cascade skips now fail too) without needing entry changes.

Net effect on the baseline: the count of waivers for implemented surfaces stays at zero. Every entry added here tracks a not-yet-built surface that previously produced no signal at all.
Motivation and Context
A green conformance run currently overstates coverage: unimplemented surfaces (tasks extension, server-side Mcp-Param validation, subscriptions/listen) produce no signal because their scenarios are filtered or their checks skip upstream of the expected-failures mechanism. Visible baselined failures are strictly better — they show up in the run output, they can't rot silently across harness pin bumps, and the stale-entry rule turns the baseline into a forced burn-down list as features land.
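The baseline mechanics referred to here can be sketched as follows. This is a minimal model of the rules as this PR describes them (unexpected failures fail the run, a listed scenario with no failing checks is stale, entries for scenarios outside the selected suite are inert); the function name check_baseline and the result shape are assumptions, not the harness's actual implementation.

```python
def check_baseline(failed_checks: dict[str, int], baseline: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the baseline check passes.

    failed_checks maps each scenario selected by the suite to its number of
    failing checks (0 for a passing or fully skipped scenario).
    """
    problems = []
    for scenario, failures in sorted(failed_checks.items()):
        if failures > 0 and scenario not in baseline:
            problems.append(f"unexpected failure: {scenario}")
    for scenario in sorted(baseline):
        # Entries for scenarios the suite did not select are inert.
        if scenario in failed_checks and failed_checks[scenario] == 0:
            problems.append(f"stale entry: {scenario}")
    return problems


# Hypothetical run results, using scenario names from this PR.
results = {"server-sse-polling": 0, "server-stateless": 3, "tasks-status-notifications": 0}

ok = check_baseline(results, {"server-stateless"})
stale = check_baseline(results, {"server-stateless", "tasks-status-notifications"})
unexpected = check_baseline(results, set())
print(ok, stale, unexpected)
```

The second call shows why tasks-status-notifications cannot be pre-listed: a scenario with no failing checks turns its own baseline entry into a stale-entry error.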
How Has This Been Tested?
Replicated all six CI legs locally against the new pinned tarball (sha256-verified), exactly as CI invokes them. The four server legs and both client legs exit 0 with "Baseline check passed":

- --suite active → 42 passed, 0 failed (new entries inert: those scenarios aren't in the suite)
- --suite draft → 74 passed, 9 expected-failed (the 3 server-stateless listen checks + http-custom-header-server-validation)
- --suite all --spec-version 2026-07-28 → 103 passed, 9 expected-failed (same two)
- --suite all (new leg) → 134 passed, 34 expected-failed (the above + 9 tasks-*); server-sse-polling and json-schema-2020-12 pass

Also verified the generated tools/list schema carries "x-mcp-header": "Region" through pydantic's json_schema_extra, and ran the discovery pass at the new pin before updating baselines to confirm server-stateless was the only newly-unexpected failure (no stale entries, no warnings anywhere).

Breaking Changes
None — CI configuration and a conformance fixture tool only.
Types of changes
Checklist
Additional context
- --suite pending (14 scenarios) would cover the same job-level union slightly faster, since all = active ∪ draft ∪ pending at this harness and the other two suites already run in this job. Bare --suite all is preferred for the self-contained full-denominator run, robustness to scenarios moving between harness suite lists across pin bumps, and parity with typescript-sdk's CI.
- … tasks-* and http-custom-header-server-validation that's cheap (each tracks exactly one unimplemented surface). For server-stateless it carries a real masking cost, called out in the baseline comment: while listed, new failures in the scenario's other 25 (currently passing) envelope/error-handling checks are also absorbed. The trade is deliberate, since a visible, forced-burn-down known failure for the listen gap beats a silent skip, but it makes landing the listen runtime (Implement SEP-2575: Make MCP Stateless, #2804) the way to get those 25 checks back under guard.
- … resources.subscribe while the subscribe methods themselves answer 404/-32601 there. Implementing listen resolves it from one side; era-aware capability reporting in server/discover would resolve it from the other.
- … -32601 rather than hanging. The one genuinely new timing-sensitive scenario is server-sse-polling (real-time SSE reconnect waits); it passed consistently in local runs and runs green in typescript-sdk's identical leg, but if it proves flaky on 2-core runners it cannot be pre-waived (a baseline entry for a passing scenario is flagged stale), so a flake would need a rerun or a harness-side fix.

AI Disclaimer