Extend the interaction suite to the 2026-07-28 spec #3018
No issues found across 29 files
I didn't find any bugs in this change, but it's a large addition (~109 new test functions across 29 files) that pins spec interpretations and 65 documented divergences for the 2026-07-28 revision, so it deserves a human maintainer's review rather than an automated approval.
Extended reasoning...
Overview
This PR extends tests/interaction/ from the 2025-11-25 baseline to the 2026-07-28 spec revision. It touches 29 test files (no src/ changes), adding roughly 109 new test functions covering MRTR (multi-round-trip requests), the modern streamable-HTTP entry, x-mcp-header validation, SEP-2549 caching hints, the RFC 9207 iss validation table, OAuth refresh rotation / AS binding / step-up bounds, JSON Schema dialect handling, and assorted lifecycle/versioning behaviours. It also reworks the requirements manifest (399 → 605 entries) and registers 64 deferred entries plus 65 recorded divergences.
Security risks
None direct: the change is test-only and does not modify any production or security-sensitive code paths. The auth-related tests exercise the OAuth client/server harness but only add coverage; the harness changes (_harness.py, _provider.py) add test-only knobs (code/iss overrides, non-rotating refresh provider) that are scoped to the test suite.
Level of scrutiny
Although the production-risk level is low (tests only), the PR encodes a large number of spec interpretations and intentionally pins current SDK behaviour where it diverges from the 2026-07-28 spec (65 divergences). Whether those interpretations and pins are the right ones is a maintainer judgment call about the project's conformance posture, not something an automated review can sign off on. The sheer size also exceeds what is appropriate for shadow approval.
Other factors
The bug-hunting pass found no bugs. The author reports the suite green over ten consecutive runs with pyright/ruff clean and the manifest↔test coverage gate passing at every commit, and the commits are structured to be individually reviewable. There are no prior reviews or unresolved reviewer comments on the PR.
I didn't find any bugs, but this is a very large change (~109 new test functions, manifest restructuring, 65 pinned divergences) that warrants a human look — note also that the diff touches src/mcp/server/streamable_http.py (coverage pragma removals), while the description says no src/ changes.
Extended reasoning...
Overview
This PR extends tests/interaction/ from the 2025-11-25 baseline to the 2026-07-28 spec revision: ~109 new test functions across 40 test files, manifest groundwork (requirement id renames, era added_in/removed_in linkage, 64 deferred entries), new test families (MRTR, modern streamable-HTTP entry, x-mcp-header, SEP-2549 caching, RFC 9207 iss validation, AS binding, JSON Schema dialects), and a per-cell coverage gate in test_coverage.py. The only src/ change is removing two coverage pragmas (# pragma: no branch / # pragma: no cover) in streamable_http.py, which is behaviour-neutral but contradicts the PR description's claim that only tests/interaction/ is touched.
Security risks
No production code paths change, so there is no direct security exposure. However, several new tests deliberately pin security-adjacent divergences as green (e.g. the registration application_type echo corruption, the missing capability embed gates, unvalidated Mcp-Param-* headers, the audience-mismatch acceptance) — pinning these is a documented suite policy, but a human should confirm each pinned divergence is intentional rather than accidentally normalising a behaviour the team would prefer to fix first.
Level of scrutiny
Test-only changes are normally low risk, but this PR encodes substantial policy decisions: which spec obligations are deferred, which SDK divergences are accepted and pinned, and how the coverage manifest is structured going forward. Those choices shape future conformance work and the v2 burn-down list, so they merit maintainer review even though the runtime SDK is untouched. The sheer volume (1009 collected cells, 605 manifest entries, a ~440k-character diff) also makes it impractical to certify correctness automatically.
Other factors
The bug-hunting system found no bugs, the author reports ten consecutive green runs with pyright/ruff clean, and the manifest↔test coverage contract is enforced at import. The 15 commits are described as individually reviewable, which should help a human reviewer work through the divergence and deferral decisions commit by commit.
Eight entries cited pages or anchors that do not exist in the 2026-07-28 specification tree (the basic/lifecycle page was split into basic/versioning and server/discover; two anchors were renamed). Repoint each to the verified live section. Also add the missing added_in="2026-07-28" on client-auth:authorization-response:iss-verify: RFC 9207 iss validation is SEP-2468, new at 2026-07-28, and the prior 2025 source page carries no such requirement. Cell generation is unchanged (830 before and after).
Rename 14 requirement ids to the names shared with the typescript-sdk e2e suite (sampling:create:*, elicitation:url:action:*, protocol:meta:*, the client-auth stepup/iss/scope/dcr family, the mcpserver context helpers), updating every @requirement decorator to match. The resources:read:unknown-uri name previously sat on an entry whose test proves lowlevel handler-error passthrough, not the unknown-resource rule; that entry is re-described honestly as protocol:error:handler-error-passthrough (source 'sdk'), and the real unknown-resource entry (-32602 with the URI in error.data, SEP-2164) takes the vacated name with its source repointed to the 2026 page that mandates it. The protocol:meta:request-to-handler 2026 arm exclusion now carries the accurate reason (legacy-only-vocabulary) and a note explaining the envelope-key merge that breaks the equality assertion, so the re-admission checklist finds it. Three test docstrings quoting pre-rename spec anchors updated. Cell generation unchanged (830 before and after).
client-transport:http:protocol-version-stored and transport:streamable-http:origin-validation were second labels on assertions their tests already pin under client-transport:http:protocol-version-header and hosting:http:dns-rebinding respectively (decorators removed, tests kept). lifecycle:stateless:no-initialize described a pin API that no longer exists, deferred against coverage that does not exist, and bound no tests. transport:streamable-http:server-to-client is NOT deleted: the underlying behaviour is real 2025-era wire truth that 2026 forbids (SEP-2322), so it gets removed_in="2026-07-28" with the supersession note; the MRTR successor link lands with the era pass. Cells unchanged (830 before and after).
Sixteen fixes on sixteen entries, each evidenced against the bound test body or the spec text: narrow over-claiming strings to the assertions that exist (403-scope-upgrade's unproven no-loop clause, iss mismatch-only coverage, tools-only registration, arrival-only HTTP notification delivery, the custom-client auth clause), correct two entries that attributed the modern classifier's version rejection to the legacy transport, record the 2026 per-request logLevel gate divergence on the Context logging helper and the list-vs-object structured-content gap on text-mirror, re-ground the null-id deferral on the now-existing fault channel, and fix the discover result field name (supportedVersions) plus the 404 status the spec mandates for initialize at the modern entry. Cells unchanged (830 before and after).
… data Execute the era/supersession pass: 37 new successor entries (the MRTR family that replaces server-initiated requests, the per-request log-level pair, the subscriptions/listen family, discover-side successors), all registered deferred ahead of their tests; 82 existing entries edited - version-wide 2026 arm exclusions on era-retired behaviours become removed_in with a superseded_by link and an explanatory note (transport-shaped exclusions stay), no-heir removals get tombstone notes, and the per-request logLevel divergence is recorded on the three logging entries whose tests pin un-gated delivery on live 2026 cells. 62 supersession pairs, all bidirectional and versioned, enforced by the coverage gate at import. The twelve surviving version-wide exclusions are exactly the documented re-admission checklist. Cells unchanged (830 before and after).
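The bidirectional, versioned pairing described above can be illustrated with a minimal sketch. The `Entry` dataclass and `check_supersession` function are hypothetical names invented for this example; only the field names (`added_in`, `removed_in`, `supersedes`, `superseded_by`) come from the commit text, and the suite's real gate may be shaped quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical manifest entry; only the field names come from the commit text."""

    id: str
    added_in: Optional[str] = None
    removed_in: Optional[str] = None
    supersedes: Optional[str] = None
    superseded_by: Optional[str] = None


def check_supersession(entries: dict[str, Entry]) -> list[str]:
    """Return a list of problems; an empty list means every pair is two-way and versioned."""
    problems: list[str] = []
    for entry in entries.values():
        if entry.superseded_by is not None:
            heir = entries.get(entry.superseded_by)
            if heir is None:
                problems.append(f"{entry.id}: superseded_by points at an unknown id")
            elif heir.supersedes != entry.id:
                problems.append(f"{entry.id}: successor does not link back")
            elif entry.removed_in is None or heir.added_in is None:
                problems.append(f"{entry.id}: pair is missing removed_in/added_in")
        if entry.supersedes is not None:
            old = entries.get(entry.supersedes)
            if old is None or old.superseded_by != entry.id:
                problems.append(f"{entry.id}: supersedes link is one-way")
    return problems
```

Running a check like this at import time, as the commit says the coverage gate does, would make a one-way link fail collection rather than a single test.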
The multi-round-trip request pattern (SEP-2322, the 2026-07-28 replacement for server-initiated requests) gets its first end-to-end coverage: - lowlevel/test_mrtr.py (new): the write-once roundtrip (byte-exact requestState echo, opacity via a non-parseable state, fresh JSON-RPC id on retry), state-only retry, omit-when-absent, and parallel-call isolation via a symmetric rendezvous that provably holds both loops mid-flight. - test_elicitation.py: form-mode elicitation over MRTR (basic, decline, cancel, schema primitives) and the capability gate, which pins the current un-gated embed behaviour with a recorded divergence. - test_resources.py / test_prompts.py: the resources/read and prompts/get MRTR origins (lowlevel-only; MCPServer admits InputRequiredResult on the tools path only). - test_sampling.py / test_roots.py: sampling/createMessage and roots/list embedded as MRTR input requests, with model preferences, system prompt, and context-inclusion pass-through. Seventeen requirement entries flip from deferred to tested; five entries are minted (the request-state client obligations and the two non-tools origins). 830 -> 859 collected cells, every new cell accounted; suite green three consecutive runs; 100% line and branch on the new file with no coverage pragmas.
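To make the write-once roundtrip concrete, here is a toy in-memory sketch of a client retry loop, not the SDK's API. The `requestState`, `inputResponses`, and `input_required` names appear in the commit text; `inputRequests`, `toy_server`, `call_with_retries`, and the message layout are assumptions made for illustration, and the sketch does not model rejection of unrecognized `resultType` values.

```python
import itertools
from typing import Any, Callable

Message = dict[str, Any]


def toy_server(request: Message) -> Message:
    """Hypothetical handler: asks for one input, then completes on the retry."""
    params = request["params"]
    if "requestState" not in params:
        # First round: request input and hand back opaque, deliberately non-JSON state.
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "result": {
                "resultType": "input_required",
                "requestState": "\x00opaque::round-1",
                "inputRequests": {"name": {"prompt": "Your name?"}},
            },
        }
    name = params["inputResponses"]["name"]
    return {"jsonrpc": "2.0", "id": request["id"], "result": {"greeting": f"hello {name}"}}


def call_with_retries(
    send: Callable[[Message], Message],
    method: str,
    base_params: Message,
    answer: Callable[[str], Any],
    max_rounds: int = 4,
) -> tuple[Message, list[Message]]:
    """Retry until the result is no longer input_required, or the rounds cap is hit."""
    ids = itertools.count(1)
    sent: list[Message] = []
    params = dict(base_params)
    for _ in range(max_rounds):
        # Every attempt gets a fresh JSON-RPC id.
        request = {"jsonrpc": "2.0", "id": next(ids), "method": method, "params": params}
        sent.append(request)
        result = send(request)["result"]
        if result.get("resultType") != "input_required":
            return result, sent  # an absent resultType is treated as complete
        params = dict(base_params)
        params["inputResponses"] = {key: answer(key) for key in result["inputRequests"]}
        if "requestState" in result:
            # Echo the state verbatim; the client never parses it.
            params["requestState"] = result["requestState"]
    raise RuntimeError("rounds cap exceeded")
```

The client treats the state as an opaque string and only copies it, which is why a non-parseable value is a useful probe for opacity.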
…rectionality pins Ten more tests: the multi-round completion loop, the rounds cap, the at-least-one-of construction-site rejection, the inputResponses structural validation and key correspondence, the -32042 emission-ban wire scan, and the 2026 directionality edges - the push-API loud-fail split (the standalone leg pins NoBackChannelError green on both 2026 cells; a dedicated in-memory test pins the request-scoped leg still transmitting the forbidden frame, recorded as a per-transport, per-leg divergence so the eventual era-gate fix re-pins mechanically), a wire-trace proof that a 2026 exchange contains no server-initiated requests and no client-sent responses, and the sampling and roots embed capability gates (both pinned un-gated with recorded divergences, completing the embed-gate family). Five entries flip from deferred, five origin-new entries are minted with their tests. 859 -> 876 collected cells, every node accounted; suite green three consecutive runs.
The SEP-2243 header derivation pipeline gets full coverage: static definition validation with per-tool eviction and the logged warning (RFC 9110 token rule, control characters, case-insensitive duplicates, the number type the spec now forbids, items/nested-properties reachability), the base64 sentinel encoding both ways including the collision-escape row from the spec's own table, null/absent argument omission, and the Mcp-Method/Mcp-Name mismatch rejections (400, -32020). The known server-side gap - Mcp-Param-* values are not validated against the body - is pinned as a divergence carrying issue=L110 with the recognized-header judgement call recorded so the fix re-pins under either shape. The modern entry itself: response modes, lazy SSE upgrade, cacheable stamping, disconnect cancellation, header validation arms, and the initialize-removed rejections. 28 entries minted (23 tested, 5 deferred), one flip, and the ledger riders: issue=L109 on the three embed-gate divergences, issue=L107 on the push-API pin. 876 -> 909 collected cells, all accounted; suite green three runs.
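As a rough illustration of the static definition checks, the sketch below applies the RFC 9110 token rule and a case-insensitive duplicate check, then evicts any tool that fails either. `is_token` and `evict_invalid` are hypothetical helpers; the real pipeline also logs a warning and checks types and reachability, which are left out here.

```python
import string

# RFC 9110 section 5.6.2: token = 1*tchar
_TCHAR = frozenset("!#$%&'*+-.^_`|~" + string.digits + string.ascii_letters)


def is_token(name: str) -> bool:
    """True when name is a non-empty run of tchar (so no spaces, colons, or control characters)."""
    return bool(name) and all(ch in _TCHAR for ch in name)


def evict_invalid(tools: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only tools whose declared header names are all tokens and unique ignoring case."""
    kept: dict[str, list[str]] = {}
    for tool, headers in tools.items():
        lowered = [header.lower() for header in headers]
        all_valid = all(is_token(header) for header in headers)
        no_duplicates = len(set(lowered)) == len(lowered)
        if all_valid and no_duplicates:
            kept[tool] = headers
    return kept
```

Eviction is per tool in this sketch: one bad definition removes only the tool that declared it, matching the "per-tool eviction" wording above.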
…n three tests Add the HTTP request-scoped loud-fail test, completing all four legs of the push-API divergence record (both transports x both legs). Add the missing templates/list decorator on the static-and-templated listing test. Redesign the post-connect registration fixture to mutate the tool set between requests rather than from inside a handler, so the fixture itself no longer violates the 2026 list-stability requirement on live cells. Assert that an iss-mismatch rejection never exchanges the authorization code (with a liveness guard on the recorded /token calls). 909 -> 910 cells.
SEP-2549 server-side caching: cache hints pass through unmodified on prompts, resource-template, and discover results (the discover hints were previously pinned nowhere); ttlMs zero means immediately stale; absent hints default per the 2025 rules on the legacy cells; the interim input_required frame carries no hints while the same exchange's complete result does. Two recorded divergences: cross-page cacheScope consistency is delegated to handler authors (the spec MUST binds the server), and a negative ttlMs raises a validation error where the spec says clients should ignore-as-zero - the divergence note records that emission-side strictness is correct and only the inbound parse should clamp. Discover and versioning: instructions and derived capabilities ride DiscoverResult (with vacuity guards against silent legacy fallback), auto mode probes before negotiating, era-cached results are reused identically, the -32022 supported-list retry, dual-era precedence, and the era method gate - a 2025 method on a 2026 connection is method-not-found before any handler lookup, proven by a registered handler that never runs. 27 entries minted (13 tested, 14 deferred with greppable re-open tokens), 3 flips. 910 -> 931 cells, all accounted; suite green three runs.
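The ttlMs rules above (zero is immediately stale, a negative inbound value is treated as zero) might look like this on the client side. This is a sketch under assumptions: `effective_ttl_ms` and `is_stale` are hypothetical names, timestamps are plain millisecond integers, and absent hints are not modelled at all.

```python
def effective_ttl_ms(ttl_ms: int) -> int:
    """Inbound view of a ttlMs hint: negative values clamp to zero."""
    return max(0, ttl_ms)


def is_stale(fetched_at_ms: int, now_ms: int, ttl_ms: int) -> bool:
    """A cached result is stale once its age reaches the effective TTL."""
    age_ms = now_ms - fetched_at_ms
    return age_ms >= effective_ttl_ms(ttl_ms)
```

Clamping only on the inbound parse is the direction the divergence note records; the emission side can stay strict.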
…and refresh The full iss validation table from the 2026 authorization-response rules: match accepted, trailing-slash difference rejected without normalization (both comparison strings pinned as harness literals so server-side issuer serialization changes cannot invert the test), missing-iss rejected when advertised and tolerated when not, an unadvertised-but-present iss still validated, and an error redirect with a mismatched iss rejected on iss before the missing-code error - the ordering that proves validation applies equally to error responses. Step-up bounds: a second insufficient-scope 403 after one step-up surfaces as an error without another authorize round trip, and a 403 on the GET stream open steps up and reopens with the upgraded token (era-bound: the GET stream is removed at 2026-07-28). DCR defaults (grant_types omitted and passed through verbatim), refresh-token rotation handling at the single-refresh seam (replacement stored, preservation honoured when the server does not rotate), and a non-2xx token response surfacing typed. The as-binding entry splits into its two spec obligations (re-register and no-credential-reuse), both decorating the existing test unchanged. Harness: three small review-approved knobs (iss visibility, code override, persistent step-up shim, non-rotating provider). 16 entries minted (13 tested, 3 deferred), 931 -> 944 cells exact; suite green three runs.
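The iss table above reduces to a small decision function. The sketch below is illustrative only: `check_iss`, `handle_redirect`, and `IssuerMismatch` are hypothetical names, and `advertised` stands for the authorization server metadata flag `authorization_response_iss_parameter_supported` from RFC 9207.

```python
from typing import Mapping, Optional


class IssuerMismatch(Exception):
    """Raised when the authorization response fails iss validation."""


def check_iss(expected_issuer: str, received_iss: Optional[str], advertised: bool) -> None:
    if received_iss is None:
        if advertised:
            raise IssuerMismatch("iss missing although the AS advertises support")
        return  # tolerated when the AS does not advertise the parameter
    # Plain string comparison: no trailing-slash or case normalization.
    if received_iss != expected_issuer:
        raise IssuerMismatch(f"iss {received_iss!r} does not match {expected_issuer!r}")


def handle_redirect(params: Mapping[str, str], expected_issuer: str, advertised: bool) -> str:
    """Validate iss first, so error redirects are covered too, then extract the code."""
    check_iss(expected_issuer, params.get("iss"), advertised)
    if "code" not in params:
        raise ValueError("authorization response carries no code")
    return params["code"]
```

Because `check_iss` runs before the code lookup, an error redirect with a mismatched iss fails on the issuer rather than on the missing code, which is the ordering the test pins.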
…_type pass-through Pre-registered credentials bound to a different issuer are silently discarded and re-registered - the path the spec blesses only for DCR-persisted credentials; for manually provisioned ones it says an error should surface. The new test pins the silent replacement (flow completes, no error, the seeded credential never presented, storage rebound to the current AS) under a recorded divergence scoped to the issuer-stamped arm; the unbound arm is a documented limitation in the entry note, since a mismatch cannot be detected for credentials that never recorded a binding. A consumer-set application_type of web on a loopback redirect - a value the derivation heuristic would never produce - reaches the /register body verbatim, distinguishing pass-through from any future heuristic. Also: the last caching deferral gains its greppable re-open token, the omit_iss precedence is documented in the harness, and the app-type heuristic note cross-references its tested override sibling. 944 -> 946 cells; suite green three runs.
… capability Every behaviour the analysis identified that the SDK cannot yet express now has a manifest entry with a deferral stating exactly what is missing at this commit: the subscriptions/listen runtime family (types vendored, runtime absent, all carrying the greppable re-open token), the requestState integrity obligations (application-owned, the SDK passes opaque state through), the extension declaration surface, the legacy 2025 jsonschema wrap family (era-bound to the cells where it applies), the hosting-side auth surfaces, stdio-2026 service, and the cross-AS credential obligations (m2m credentials re-spelled into the as-binding family, targeting the spec obligation rather than another SDK's knob). Deferral reasons are re-grounded at this commit - no stale premises, no PR numbers, no internal references; two stale source attributions upgraded to the spec URLs that carry the requirement verbatim. Cells unchanged (946): deferred entries register coverage debt without running anything.
JSON Schema handling: prefixItems vocabulary enforcement, the 2020-12 default dialect (with a declared-dialect violation arm proving validation follows the tag), falsy structured content reaching the validator, and non-object outputs - plus the null structured-content divergence: a tool legitimately returning JSON null is indistinguishable from one returning nothing (the model collapses both to None and the dump drops them), so a spec-legal value raises; pinned with the fix direction recorded (absent-vs-null at the model layer, not a looser client check). MRTR edges: a retry missing a requested key is re-prompted rather than errored, unknown response keys are ignored, the resultType seam (absent means complete, input_required is never masked, unrecognized values rejected - flipping the deferred entry to a pinned divergence), and the max-tokens pass-through. Auth: refresh tokens are not reused across an AS change, CIMD documents are portable, the all-scopes single challenge, and the bundled AS registration echo dropping application_type (pinned against its ledger row). The scatter: list results are connection-independent and deterministically ordered, an empty-string cursor is a valid cursor (a 2026 rule the changelog never mentioned), cancellation stops notification delivery, SSE comment lines are ignored, legacy error codes pass through opaquely, sampling messages are not retained across rounds, multi-content reads, path-traversal rejection, and resource links in prompt content. 26 entries minted, one flipped, four divergences recorded. 946 -> 1009 cells, every node accounted; suite green three consecutive runs.
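The absent-vs-null fix direction named above can be sketched with a sentinel. This is a sketch, not the SDK's model layer: `ABSENT` and `dump_result` are hypothetical, and `structuredContent` is assumed here as the wire field name for structured content.

```python
import json
from typing import Any


class _Absent:
    """Sentinel type: distinguishes 'no structured content' from an explicit JSON null."""

    def __repr__(self) -> str:
        return "ABSENT"


ABSENT = _Absent()


def dump_result(content: list[Any], structured: Any = ABSENT) -> str:
    """Serialize a result, emitting the structured field only when one was provided."""
    payload: dict[str, Any] = {"content": content}
    if structured is not ABSENT:
        # None is a legal value here and serializes as JSON null.
        payload["structuredContent"] = structured
    return json.dumps(payload)
```

With `None` reserved for JSON null, a tool that returns null and a tool that returns nothing produce different wire output, so the client check does not need loosening.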
The unrecognized-resultType and scope-aggregation divergences gained tracking entries after their pins landed; wire the issue fields so the fixer trail is complete for every pinned divergence in the manifest.
MCPServer now passes InputRequiredResult through its prompt and resource pipelines, so the two origin entries' notes and the matching test docstrings no longer claim it cannot; the mcpserver mirrors are recorded as possible and not yet covered. No behaviour or assertion changes - the full suite is green unchanged against current main.
b46806f to 011bbd5
I didn't find any bugs in the new tests, but this is a large PR (~109 new test functions across 29 files) that pins many spec divergences as expected behaviour and reshapes the requirements manifest, so it warrants a human maintainer's review of those judgment calls.
Extended reasoning...
Overview
This PR touches only tests/interaction/ — no src/ changes — extending the interaction suite from the 2025-11-25 baseline to the 2026-07-28 spec revision. It adds roughly 109 new test functions across 29 files, plus harness/provider extensions (refresh-token rotation knob, iss/code overrides on the headless OAuth callback, persistent step-up shim) and substantial manifest re-keying (requirement id renames, era links, 64 deferred entries, 65 recorded divergences).
Security risks
There is no production code change, so no direct security exposure. The auth tests do exercise security-relevant client behaviour (RFC 9207 iss validation, refresh-token rotation, AS-binding credential discard, scope step-up), and several tests deliberately pin current SDK behaviour that diverges from the spec (e.g. pre-registered credentials silently replaced rather than erroring, Mcp-Param header values not validated, no embed capability gate). Pinning these as green tests is a policy decision: it makes today's gaps regression-stable, but a maintainer should confirm each pinned divergence is acceptable rather than something that should be fixed before being enshrined in the test suite.
Level of scrutiny
As a tests-only change the blast radius is limited to CI signal, which lowers the bar relative to runtime code. However, the PR is far from mechanical: it encodes interpretations of a draft 2026 spec, decides which behaviours are SDK-defined versus spec-mandated, and restructures the requirements manifest that gates coverage. Those are design/judgment decisions the approval guidelines say a human should weigh in on, and the sheer volume (1000+ added lines of intricate, heavily-documented test logic) exceeds what I can confidently rubber-stamp.
Other factors
The bug-hunting pass found no bugs, the description reports ten consecutive green runs with pyright/ruff clean, and the tests themselves are unusually well documented (each pins a named requirement with rationale). There are no prior reviewer comments to address. The main thing a human reviewer should focus on is the divergence list and the manifest id renames rather than line-by-line test logic.
Extends tests/interaction/ from its 2025-11-25 baseline to full coverage of the 2026-07-28 revision. Only tests/interaction/ is touched; no src/ changes.

Part of #2891.
What's here
The 15 commits are individually reviewable; each is one coherent batch that landed green.
Manifest groundwork (first five): dead 2026 source URLs repointed at live sections, ids aligned with the typescript-sdk e2e suite vocabulary, three redundant entries retired, over-claiming behaviour strings narrowed to what their tests actually prove, and the era pass: retired behaviours get removed_in, their replacements get added_in, and the pairs are linked supersedes/superseded_by (62 bidirectional pairs, enforced by the coverage gate at import). The 2025→2026 transition is queryable data, and no test body branches on a version literal.

The 2026 families (next nine): MRTR end to end (the write-once roundtrip, requestState echo/omission/opacity, parallel-call isolation via a symmetric rendezvous, multi-round completion and bounds, and all three origin methods), plus the 2026 message-direction rules (a wire trace contains no server-initiated requests and no client-sent responses), the modern streamable-HTTP entry (response modes, lazy SSE upgrade, header validation, cacheable stamping), the x-mcp-header pipeline including both directions of the base64 sentinel encoding, SEP-2549 caching, server/discover and versioning (including the era method gate: a 2025 method on a 2026 connection is method-not-found before any handler lookup), the auth additions (the RFC 9207 iss validation table, step-up bounds, DCR defaults, refresh rotation, AS binding and its pre-registered-credentials arm), JSON Schema handling (dialects, prefixItems, falsy/non-object/null structured content), and a tail of smaller obligations down to the empty-string-cursor rule.

Tracking for what the SDK can't express yet (one commit): 64 deferred entries registered with reasons grounded at this commit. The manifest records the full 2026 coverage surface, not just what runs today, and every deferral names what unblocks it.
Numbers
830 → 1009 collected cells, 399 → 605 manifest entries, ~109 new test functions. The suite is green over ten consecutive runs, pyright and ruff are clean, and the manifest↔test coverage contract passes at every commit in the range.
Divergences
Where current SDK behaviour differs from the spec, the suite follows its documented divergence lifecycle: the test pins today's behaviour green and the entry records the divergence, with the re-pin instruction in the test docstring so the eventual fix is mechanical. 65 divergences are recorded. The ones with a verified root cause carry an issue= tag referencing the v2 burn-down list; these will swap to GitHub issue links as those are filed. The ones most worth knowing about:

- the input_required capability embed gate is not enforced for any of the three features,
- Mcp-Param-* header values are not validated against the request body,
- null structured content collapses to absent and a spec-legal value is rejected.

Conformance: cross-checked against the conformance suite at the CI pin. Behaviours covered there and here agree; gaps that exist only upstream are tracked for separate filing.
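For readers unfamiliar with the divergence lifecycle, here is one way a pinned divergence could be recorded. Everything in it is a hypothetical stand-in: the `requirement` decorator signature, the `REGISTRY` dict, the `example:embed-gate` id, and `toy_sdk_embeds_without_capability` are invented for illustration and are not the suite's actual helpers.

```python
from typing import Any, Callable, Optional

# Hypothetical registry mapping requirement ids to the tests that pin them.
REGISTRY: dict[str, list[dict[str, Any]]] = {}


def requirement(req_id: str, *, divergence: Optional[str] = None, issue: Optional[str] = None):
    """Hypothetical decorator: binds a test to a requirement id and records any divergence."""

    def wrap(fn: Callable[[], None]) -> Callable[[], None]:
        REGISTRY.setdefault(req_id, []).append(
            {"test": fn.__name__, "divergence": divergence, "issue": issue}
        )
        return fn

    return wrap


def toy_sdk_embeds_without_capability() -> bool:
    """Stand-in for current behaviour: the embed happens even without the capability."""
    return True


@requirement("example:embed-gate", divergence="embed is not capability-gated", issue="L109")
def test_embed_is_ungated() -> None:
    """Pins today's behaviour green.

    Re-pin: once the gate lands, assert the embed is rejected and drop the divergence.
    """
    assert toy_sdk_embeds_without_capability() is True
```

Keeping the re-pin instruction in the docstring means the eventual fix changes one assertion and one manifest field, which is the "mechanical" property the description refers to.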
AI Disclaimer