Extend the interaction suite to the 2026-07-28 spec #3018

Open
maxisbey wants to merge 16 commits into main from interaction-2026-requirements

Conversation

@maxisbey

Contributor

Extends tests/interaction/ from its 2025-11-25 baseline to full coverage of the 2026-07-28 revision. Only tests/interaction/ is touched — no src/ changes.

Part of #2891.

What's here

The 15 commits are individually reviewable; each is one coherent batch that landed green.

Manifest groundwork (first five): dead 2026 source URLs repointed at live sections, ids aligned with the typescript-sdk e2e suite vocabulary, three redundant entries retired, over-claiming behaviour strings narrowed to what their tests actually prove, and the era pass — retired behaviours get removed_in, their replacements get added_in, and the pairs are linked supersedes/superseded_by (62 bidirectional pairs, enforced by the coverage gate at import). The 2025→2026 transition is queryable data, and no test body branches on a version literal.
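
A minimal sketch of the bidirectional link check described above, assuming a hypothetical `Entry` record and `check_supersession` helper. The real manifest types and the coverage gate in tests/interaction/ may be shaped differently; only the field names (`added_in`, `removed_in`, `supersedes`, `superseded_by`) come from this description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical manifest entry; not the suite's actual type."""

    id: str
    added_in: str | None = None
    removed_in: str | None = None
    supersedes: str | None = None
    superseded_by: str | None = None


def check_supersession(entries: list[Entry]) -> list[str]:
    """Return a list of problems; an empty list means every link is mirrored and versioned."""
    by_id = {entry.id: entry for entry in entries}
    problems: list[str] = []
    for entry in entries:
        if entry.superseded_by is not None:
            heir = by_id.get(entry.superseded_by)
            if heir is None or heir.supersedes != entry.id:
                problems.append(f"{entry.id}: superseded_by link is not mirrored")
            if entry.removed_in is None:
                problems.append(f"{entry.id}: superseded entry lacks removed_in")
        if entry.supersedes is not None:
            old = by_id.get(entry.supersedes)
            if old is None or old.superseded_by != entry.id:
                problems.append(f"{entry.id}: supersedes link is not mirrored")
            if entry.added_in is None:
                problems.append(f"{entry.id}: successor entry lacks added_in")
    return problems
```

Running a check like this when the manifest module is imported would make a one-sided link fail collection rather than a single test.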

The 2026 families (next nine):

  • MRTR end to end: the write-once roundtrip, requestState echo/omission/opacity, parallel-call isolation via a symmetric rendezvous, multi-round completion and bounds, and all three origin methods,
  • the 2026 message-direction rules (a wire trace contains no server-initiated requests and no client-sent responses),
  • the modern streamable-HTTP entry (response modes, lazy SSE upgrade, header validation, cacheable stamping),
  • the x-mcp-header pipeline, including both directions of the base64 sentinel encoding,
  • SEP-2549 caching,
  • server/discover and versioning (including the era method gate: a 2025 method on a 2026 connection is method-not-found before any handler lookup),
  • the auth additions (the RFC 9207 iss validation table, step-up bounds, DCR defaults, refresh rotation, AS binding and its pre-registered-credentials arm),
  • JSON Schema handling (dialects, prefixItems, falsy/non-object/null structured content),
  • and a tail of smaller obligations down to the empty-string-cursor rule.

Tracking for what the SDK can't express yet (one commit): 64 deferred entries registered with reasons grounded at this commit — the manifest records the full 2026 coverage surface, not just what runs today, and every deferral names what unblocks it.

Numbers

830 → 1009 collected cells, 399 → 605 manifest entries, ~109 new test functions. The suite is green over ten consecutive runs, pyright and ruff are clean, and the manifest↔test coverage contract passes at every commit in the range.

Divergences

Where current SDK behaviour differs from the spec, the suite follows its documented divergence lifecycle: the test pins today's behaviour green and the entry records the divergence, with the re-pin instruction in the test docstring so the eventual fix is mechanical. 65 divergences are recorded. The ones with a verified root cause carry an issue= tag referencing the v2 burn-down list; these will swap to GitHub issue links as those are filed. The ones most worth knowing about:

  • the push-style server APIs still transmit on stdio/in-memory at a 2026 negotiated version (the gate is channel-based, not era-based),
  • the input_required capability embed gate is not enforced for any of the three features,
  • Mcp-Param-* header values are not validated against the request body,
  • a JSON null structured content collapses to absent, so a spec-legal value is rejected.
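
A minimal sketch of what a pinned divergence can look like, using the Mcp-Param-* item as the example. `accept_request`, the header name `Mcp-Param-City`, and the request shape are hypothetical stand-ins, not the SDK's API or the suite's actual test.

```python
def accept_request(headers: dict[str, str], body: dict) -> bool:
    """Toy stand-in for current behaviour: header values are never compared with the body."""
    return True


def test_param_header_mismatch_is_accepted_today() -> None:
    """Pins current behaviour: a header that disagrees with the body is accepted.

    Divergence: the spec expects the mismatch to be rejected.
    Re-pin: once validation lands, flip the assertion to expect rejection.
    """
    headers = {"Mcp-Param-City": "Paris"}  # hypothetical header name
    body = {"params": {"arguments": {"city": "Berlin"}}}
    assert accept_request(headers, body) is True
```

The test stays green today, and the docstring tells whoever fixes the gap exactly which assertion to change.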

Conformance: cross-checked against the conformance suite at the CI pin — behaviours covered there and here agree; gaps that exist only upstream are tracked for separate filing.

AI Disclaimer

maxisbey marked this pull request as ready for review June 29, 2026 10:18

cubic-dev-ai (bot) left a comment

No issues found across 29 files

claude (bot) left a comment

Contributor

I didn't find any bugs in this change, but it's a large addition (~109 new test functions across 29 files) that pins spec interpretations and 65 documented divergences for the 2026-07-28 revision, so it deserves a human maintainer's review rather than an automated approval.

Extended reasoning...

Overview

This PR extends tests/interaction/ from the 2025-11-25 baseline to the 2026-07-28 spec revision. It touches 29 test files (no src/ changes), adding roughly 109 new test functions covering MRTR (multi-round-trip requests), the modern streamable-HTTP entry, x-mcp-header validation, SEP-2549 caching hints, the RFC 9207 iss validation table, OAuth refresh rotation / AS binding / step-up bounds, JSON Schema dialect handling, and assorted lifecycle/versioning behaviours. It also reworks the requirements manifest (399 → 605 entries) and registers 64 deferred entries plus 65 recorded divergences.

Security risks

None direct: the change is test-only and does not modify any production or security-sensitive code paths. The auth-related tests exercise the OAuth client/server harness but only add coverage; the harness changes (_harness.py, _provider.py) add test-only knobs (code/iss overrides, non-rotating refresh provider) that are scoped to the test suite.

Level of scrutiny

Although the production-risk level is low (tests only), the PR encodes a large number of spec interpretations and intentionally pins current SDK behaviour where it diverges from the 2026-07-28 spec (65 divergences). Whether those interpretations and pins are the right ones is a maintainer judgment call about the project's conformance posture, not something an automated review can sign off on. The sheer size also exceeds what is appropriate for shadow approval.

Other factors

The bug-hunting pass found no bugs. The author reports the suite green over ten consecutive runs with pyright/ruff clean and the manifest↔test coverage gate passing at every commit, and the commits are structured to be individually reviewable. There are no prior reviews or unresolved reviewer comments on the PR.

claude (bot) left a comment

Contributor

I didn't find any bugs, but this is a very large change (~109 new test functions, manifest restructuring, 65 pinned divergences) that warrants a human look — note also that the diff touches src/mcp/server/streamable_http.py (coverage pragma removals), while the description says no src/ changes.

Extended reasoning...

Overview

This PR extends tests/interaction/ from the 2025-11-25 baseline to the 2026-07-28 spec revision: ~109 new test functions across 40 test files, manifest groundwork (requirement id renames, era added_in/removed_in linkage, 64 deferred entries), new test families (MRTR, modern streamable-HTTP entry, x-mcp-header, SEP-2549 caching, RFC 9207 iss validation, AS binding, JSON Schema dialects), and a per-cell coverage gate in test_coverage.py. The only src/ change is removing two coverage pragmas (# pragma: no branch / # pragma: no cover) in streamable_http.py, which is behaviour-neutral but contradicts the PR description's claim that only tests/interaction/ is touched.

Security risks

No production code paths change, so there is no direct security exposure. However, several new tests deliberately pin security-adjacent divergences as green (e.g. the registration application_type echo corruption, the missing capability embed gates, unvalidated Mcp-Param-* headers, the audience-mismatch acceptance) — pinning these is a documented suite policy, but a human should confirm each pinned divergence is intentional rather than accidentally normalising a behaviour the team would prefer to fix first.

Level of scrutiny

Test-only changes are normally low risk, but this PR encodes substantial policy decisions: which spec obligations are deferred, which SDK divergences are accepted and pinned, and how the coverage manifest is structured going forward. Those choices shape future conformance work and the v2 burn-down list, so they merit maintainer review even though the runtime SDK is untouched. The sheer volume (1009 collected cells, 605 manifest entries, a ~440k-character diff) also makes it impractical to certify correctness automatically.

Other factors

The bug-hunting system found no bugs, the author reports ten consecutive green runs with pyright/ruff clean, and the manifest↔test coverage contract is enforced at import. The 15 commits are described as individually reviewable, which should help a human reviewer work through the divergence and deferral decisions commit by commit.

maxisbey added 16 commits June 29, 2026 15:59
Eight entries cited pages or anchors that do not exist in the 2026-07-28
specification tree (the basic/lifecycle page was split into basic/versioning
and server/discover; two anchors were renamed). Repoint each to the verified
live section. Also add the missing added_in="2026-07-28" on
client-auth:authorization-response:iss-verify: RFC 9207 iss validation is
SEP-2468, new at 2026-07-28, and the prior 2025 source page carries no such
requirement. Cell generation is unchanged (830 before and after).

Rename 14 requirement ids to the names shared with the typescript-sdk e2e
suite (sampling:create:*, elicitation:url:action:*, protocol:meta:*, the
client-auth stepup/iss/scope/dcr family, the mcpserver context helpers),
updating every @requirement decorator to match.

The resources:read:unknown-uri name previously sat on an entry whose test
proves lowlevel handler-error passthrough, not the unknown-resource rule;
that entry is re-described honestly as protocol:error:handler-error-
passthrough (source 'sdk'), and the real unknown-resource entry (-32602 with
the URI in error.data, SEP-2164) takes the vacated name with its source
repointed to the 2026 page that mandates it.

The protocol:meta:request-to-handler 2026 arm exclusion now carries the
accurate reason (legacy-only-vocabulary) and a note explaining the envelope-
key merge that breaks the equality assertion, so the re-admission checklist
finds it. Three test docstrings quoting pre-rename spec anchors updated.
Cell generation unchanged (830 before and after).

client-transport:http:protocol-version-stored and
transport:streamable-http:origin-validation were second labels on assertions
their tests already pin under client-transport:http:protocol-version-header
and hosting:http:dns-rebinding respectively (decorators removed, tests kept).
lifecycle:stateless:no-initialize described a pin API that no longer exists,
deferred against coverage that does not exist, and bound no tests.

transport:streamable-http:server-to-client is NOT deleted: the underlying
behaviour is real 2025-era wire truth that 2026 forbids (SEP-2322), so it
gets removed_in="2026-07-28" with the supersession note; the MRTR successor
link lands with the era pass. Cells unchanged (830 before and after).

Sixteen fixes on sixteen entries, each evidenced against the bound test body
or the spec text: narrow over-claiming strings to the assertions that exist
(403-scope-upgrade's unproven no-loop clause, iss mismatch-only coverage,
tools-only registration, arrival-only HTTP notification delivery, the
custom-client auth clause), correct two entries that attributed the modern
classifier's version rejection to the legacy transport, record the 2026
per-request logLevel gate divergence on the Context logging helper and the
list-vs-object structured-content gap on text-mirror, re-ground the null-id
deferral on the now-existing fault channel, and fix the discover result
field name (supportedVersions) plus the 404 status the spec mandates for
initialize at the modern entry. Cells unchanged (830 before and after).

… data

Execute the era/supersession pass: 37 new successor entries (the MRTR family
that replaces server-initiated requests, the per-request log-level pair, the
subscriptions/listen family, discover-side successors), all registered
deferred ahead of their tests; 82 existing entries edited - version-wide
2026 arm exclusions on era-retired behaviours become removed_in with a
superseded_by link and an explanatory note (transport-shaped exclusions
stay), no-heir removals get tombstone notes, and the per-request logLevel
divergence is recorded on the three logging entries whose tests pin un-gated
delivery on live 2026 cells. 62 supersession pairs, all bidirectional and
versioned, enforced by the coverage gate at import. The twelve surviving
version-wide exclusions are exactly the documented re-admission checklist.
Cells unchanged (830 before and after).

The multi-round-trip request pattern (SEP-2322, the 2026-07-28 replacement
for server-initiated requests) gets its first end-to-end coverage:

- lowlevel/test_mrtr.py (new): the write-once roundtrip (byte-exact
  requestState echo, opacity via a non-parseable state, fresh JSON-RPC id on
  retry), state-only retry, omit-when-absent, and parallel-call isolation
  via a symmetric rendezvous that provably holds both loops mid-flight.
- test_elicitation.py: form-mode elicitation over MRTR (basic, decline,
  cancel, schema primitives) and the capability gate, which pins the
  current un-gated embed behaviour with a recorded divergence.
- test_resources.py / test_prompts.py: the resources/read and prompts/get
  MRTR origins (lowlevel-only; MCPServer admits InputRequiredResult on the
  tools path only).
- test_sampling.py / test_roots.py: sampling/createMessage and roots/list
  embedded as MRTR input requests, with model preferences, system prompt,
  and context-inclusion pass-through.
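
A minimal sketch of the write-once roundtrip, assuming a toy in-memory server. The frame shapes here are illustrative only: `requestState`, `resultType`, and `input_required` are names used in this PR, while `toy_server`, `call_with_retry`, and the `echoed` field are hypothetical and do not describe the SDK's API or the exact wire format.

```python
import itertools

_next_id = itertools.count(1)

# Deliberately not JSON: the client must treat the state as opaque.
OPAQUE_STATE = "state\x00not-json"


def toy_server(request: dict) -> dict:
    """Hypothetical handler: asks for input once, completes on the retry."""
    params = request["params"]
    if "requestState" not in params:
        result = {"resultType": "input_required", "requestState": OPAQUE_STATE}
    else:
        # resultType is omitted here; this sketch treats absent as complete.
        result = {"echoed": params["requestState"]}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


def call_with_retry(method: str, params: dict, max_rounds: int = 4):
    """Send a request and retry with the echoed state until it completes."""
    sent = []
    state = None
    for _ in range(max_rounds):
        round_params = dict(params)
        if state is not None:
            round_params["requestState"] = state  # echoed unchanged, never parsed
        request = {
            "jsonrpc": "2.0",
            "id": next(_next_id),  # every attempt gets a fresh JSON-RPC id
            "method": method,
            "params": round_params,
        }
        sent.append(request)
        result = toy_server(request)["result"]
        if result.get("resultType") != "input_required":
            return result, sent
        state = result["requestState"]
    raise RuntimeError("rounds cap reached without completion")
```

The first request omits `requestState` entirely, and the retry carries it back exactly as received under a new id.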

Seventeen requirement entries flip from deferred to tested; five entries
are minted (the request-state client obligations and the two non-tools
origins). 830 -> 859 collected cells, every new cell accounted; suite green
three consecutive runs; 100% line and branch on the new file with no
coverage pragmas.

…rectionality pins

Ten more tests: the multi-round completion loop, the rounds cap, the
at-least-one-of construction-site rejection, the inputResponses structural
validation and key correspondence, the -32042 emission-ban wire scan, and
the 2026 directionality edges - the push-API loud-fail split (the standalone
leg pins NoBackChannelError green on both 2026 cells; a dedicated in-memory
test pins the request-scoped leg still transmitting the forbidden frame,
recorded as a per-transport, per-leg divergence so the eventual era-gate fix
re-pins mechanically), a wire-trace proof that a 2026 exchange contains no
server-initiated requests and no client-sent responses, and the sampling and
roots embed capability gates (both pinned un-gated with recorded
divergences, completing the embed-gate family).
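
A minimal sketch of the wire-trace directionality scan, assuming the trace is a list of `(sender, frame)` pairs. The trace format and the helper names `kind` and `direction_violations` are hypothetical; the classification itself follows JSON-RPC 2.0, where a request has `method` and `id`, a notification has `method` but no `id`, and a response has neither `method` nor a missing `id`.

```python
def kind(frame: dict) -> str:
    """Classify a JSON-RPC 2.0 frame by its members."""
    if "method" in frame:
        return "request" if "id" in frame else "notification"
    return "response"  # carries "result" or "error"


def direction_violations(trace: list[tuple[str, dict]]) -> list[tuple[str, dict]]:
    """Return frames that are server-initiated requests or client-sent responses."""
    forbidden = {("server", "request"), ("client", "response")}
    return [(sender, frame) for sender, frame in trace if (sender, kind(frame)) in forbidden]
```

Notifications are left alone in this sketch, since the rule quoted above only names requests and responses.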

Five entries flip from deferred, five origin-new entries are minted with
their tests. 859 -> 876 collected cells, every node accounted; suite green
three consecutive runs.

The SEP-2243 header derivation pipeline gets full coverage: static
definition validation with per-tool eviction and the logged warning (RFC
9110 token rule, control characters, case-insensitive duplicates, the
number type the spec now forbids, items/nested-properties reachability),
the base64 sentinel encoding both ways including the collision-escape row
from the spec's own table, null/absent argument omission, and the
Mcp-Method/Mcp-Name mismatch rejections (400, -32020). The known
server-side gap - Mcp-Param-* values are not validated against the body -
is pinned as a divergence carrying issue=L110 with the recognized-header
judgement call recorded so the fix re-pins under either shape.
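
A minimal sketch of the static definition check with per-tool eviction, assuming each tool maps to a list of header names. The `usable_tools` helper and the input shape are hypothetical; the character class is the `tchar` set from RFC 9110 section 5.6.2. The number-type and reachability rules are out of scope here.

```python
import logging
import re

logger = logging.getLogger("header-definitions")

# RFC 9110 section 5.6.2: token = 1*tchar
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def usable_tools(definitions: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep tools whose header names are valid tokens with no case-insensitive duplicates."""
    kept: dict[str, list[str]] = {}
    for tool, names in definitions.items():
        folded = [name.lower() for name in names]
        if not all(_TOKEN.fullmatch(name) for name in names):
            # Spaces and control characters fall outside tchar and land here.
            logger.warning("evicting %s: header name is not an RFC 9110 token", tool)
        elif len(set(folded)) != len(folded):
            logger.warning("evicting %s: duplicate header name (case-insensitive)", tool)
        else:
            kept[tool] = names
    return kept
```

Only the offending tool is dropped; the other definitions survive.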

The modern entry itself: response modes, lazy SSE upgrade, cacheable
stamping, disconnect cancellation, header validation arms, and the
initialize-removed rejections.

28 entries minted (23 tested, 5 deferred), one flip, and the ledger riders:
issue=L109 on the three embed-gate divergences, issue=L107 on the push-API
pin. 876 -> 909 collected cells, all accounted; suite green three runs.

…n three tests

Add the HTTP request-scoped loud-fail test, completing all four legs of the
push-API divergence record (both transports x both legs). Add the missing
templates/list decorator on the static-and-templated listing test. Redesign
the post-connect registration fixture to mutate the tool set between
requests rather than from inside a handler, so the fixture itself no longer
violates the 2026 list-stability requirement on live cells. Assert that an
iss-mismatch rejection never exchanges the authorization code (with a
liveness guard on the recorded /token calls). 909 -> 910 cells.

SEP-2549 server-side caching: cache hints pass through unmodified on
prompts, resource-template, and discover results (the discover hints were
previously pinned nowhere); ttlMs zero means immediately stale; absent
hints default per the 2025 rules on the legacy cells; the interim
input_required frame carries no hints while the same exchange's complete
result does. Two recorded divergences: cross-page cacheScope consistency
is delegated to handler authors (the spec MUST binds the server), and a
negative ttlMs raises a validation error where the spec says clients
should ignore-as-zero - the divergence note records that emission-side
strictness is correct and only the inbound parse should clamp.
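
A minimal sketch of the split the divergence note describes: strict on emission, clamping on inbound parse. Both function names are hypothetical and this is not the SDK's current behaviour, which raises on a negative value.

```python
from __future__ import annotations


def validate_outbound_ttl_ms(value: int) -> int:
    """Emission side: refuse to send a negative ttlMs."""
    if value < 0:
        raise ValueError("ttlMs must be non-negative")
    return value


def parse_inbound_ttl_ms(raw: int | None) -> int | None:
    """Inbound side: treat a negative ttlMs as zero, meaning immediately stale."""
    if raw is None:
        return None  # absent hint; defaults are decided elsewhere
    return max(raw, 0)
```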

Discover and versioning: instructions and derived capabilities ride
DiscoverResult (with vacuity guards against silent legacy fallback), auto
mode probes before negotiating, era-cached results are reused identically,
the -32022 supported-list retry, dual-era precedence, and the era method
gate - a 2025 method on a 2026 connection is method-not-found before any
handler lookup, proven by a registered handler that never runs.
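
A minimal sketch of the era method gate and of the "registered handler that never runs" proof. The `dispatch` function, the `removed_in` table, and the method name `legacy/example` are hypothetical; -32601 is the standard JSON-RPC method-not-found code.

```python
METHOD_NOT_FOUND = -32601


def dispatch(method: str, negotiated_version: str, handlers: dict, removed_in: dict) -> dict:
    """Apply the era gate first, then look up the handler."""
    not_found = {"error": {"code": METHOD_NOT_FOUND, "message": "Method not found"}}
    cutoff = removed_in.get(method)
    # ISO dates compare correctly as strings.
    if cutoff is not None and negotiated_version >= cutoff:
        return not_found  # the handler table is never consulted
    handler = handlers.get(method)
    if handler is None:
        return not_found
    return {"result": handler()}


calls: list[str] = []


def legacy_handler() -> str:
    calls.append("ran")
    return "ok"


handlers = {"legacy/example": legacy_handler}
removed_in = {"legacy/example": "2026-07-28"}
```

Dispatching `legacy/example` at `2026-07-28` returns the error and leaves `calls` empty, while the same call at `2025-11-25` runs the handler.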

27 entries minted (13 tested, 14 deferred with greppable re-open tokens),
3 flips. 910 -> 931 cells, all accounted; suite green three runs.

…and refresh

The full iss validation table from the 2026 authorization-response rules:
match accepted, trailing-slash difference rejected without normalization
(both comparison strings pinned as harness literals so server-side issuer
serialization changes cannot invert the test), missing-iss rejected when
advertised and tolerated when not, an unadvertised-but-present iss still
validated, and an error redirect with a mismatched iss rejected on iss
before the missing-code error - the ordering that proves validation applies
equally to error responses.
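
A minimal sketch of the table above as one function, assuming the redirect parameters arrive as a dict. The function and exception names are hypothetical and this is not the SDK's client code; the comparison is the exact string match RFC 9207 calls for, with no normalization.

```python
class IssuerMismatch(Exception):
    """The iss parameter is missing when required, or differs from the expected issuer."""


class MissingCode(Exception):
    """The redirect carried no authorization code."""


def check_authorization_response(params: dict, expected_issuer: str, iss_advertised: bool) -> str:
    """Validate iss first, then extract the code."""
    iss = params.get("iss")
    if iss is None:
        if iss_advertised:
            raise IssuerMismatch("iss missing although the server advertises support")
    elif iss != expected_issuer:  # a trailing slash counts as a difference
        raise IssuerMismatch(f"unexpected issuer {iss!r}")
    # Only after iss passes do error redirects and missing codes surface.
    if "code" not in params:
        raise MissingCode(params.get("error", "no code in response"))
    return params["code"]
```

Because the iss check runs before the code lookup, an error redirect with a mismatched iss fails on the issuer rather than on the missing code.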

Step-up bounds: a second insufficient-scope 403 after one step-up surfaces
as an error without another authorize round trip, and a 403 on the GET
stream open steps up and reopens with the upgraded token (era-bound: the
GET stream is removed at 2026-07-28). DCR defaults (grant_types omitted and
passed through verbatim), refresh-token rotation handling at the
single-refresh seam (replacement stored, preservation honoured when the
server does not rotate), and a non-2xx token response surfacing typed.
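
A minimal sketch of the rotation rule at a single refresh, assuming the token response is a parsed JSON dict. `refresh_token_to_store` is a hypothetical helper, not the SDK's storage API.

```python
def refresh_token_to_store(current: str, token_response: dict) -> str:
    """Store the replacement when the server rotates, otherwise keep the current token."""
    return token_response.get("refresh_token", current)
```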

The as-binding entry splits into its two spec obligations (re-register and
no-credential-reuse), both decorating the existing test unchanged. Harness:
four small review-approved knobs (iss visibility, code override, persistent
step-up shim, non-rotating provider). 16 entries minted (13 tested, 3
deferred), 931 -> 944 cells exact; suite green three runs.

…_type pass-through

Pre-registered credentials bound to a different issuer are silently
discarded and re-registered - the path the spec blesses only for
DCR-persisted credentials; for manually provisioned ones it says an error
should surface. The new test pins the silent replacement (flow completes,
no error, the seeded credential never presented, storage rebound to the
current AS) under a recorded divergence scoped to the issuer-stamped arm;
the unbound arm is a documented limitation in the entry note, since a
mismatch cannot be detected for credentials that never recorded a binding.

A consumer-set application_type of web on a loopback redirect - a value
the derivation heuristic would never produce - reaches the /register body
verbatim, distinguishing pass-through from any future heuristic.

Also: the last caching deferral gains its greppable re-open token, the
omit_iss precedence is documented in the harness, and the app-type
heuristic note cross-references its tested override sibling. 944 -> 946
cells; suite green three runs.

… capability

Every behaviour the analysis identified that the SDK cannot yet express now
has a manifest entry with a deferral stating exactly what is missing at this
commit: the subscriptions/listen runtime family (types vendored, runtime
absent, all carrying the greppable re-open token), the requestState
integrity obligations (application-owned, the SDK passes opaque state
through), the extension declaration surface, the legacy 2025 jsonschema wrap
family (era-bound to the cells where it applies), the hosting-side auth
surfaces, stdio-2026 service, and the cross-AS credential obligations
(m2m credentials re-spelled into the as-binding family, targeting the spec
obligation rather than another SDK's knob).

Deferral reasons are re-grounded at this commit - no stale premises, no PR
numbers, no internal references; two stale source attributions upgraded to
the spec URLs that carry the requirement verbatim. Cells unchanged (946):
deferred entries register coverage debt without running anything.

JSON Schema handling: prefixItems vocabulary enforcement, the 2020-12
default dialect (with a declared-dialect violation arm proving validation
follows the tag), falsy structured content reaching the validator, and
non-object outputs - plus the null structured-content divergence: a tool
legitimately returning JSON null is indistinguishable from one returning
nothing (the model collapses both to None and the dump drops them), so a
spec-legal value raises; pinned with the fix direction recorded
(absent-vs-null at the model layer, not a looser client check).
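
A minimal sketch of the collapse and of the recorded fix direction, using plain dicts instead of the SDK's models. Both dump functions and the `MISSING` sentinel are hypothetical; `structuredContent` is the field under discussion.

```python
import json

MISSING = object()  # sentinel meaning "the tool returned no structured content"


def dump_collapsing(structured=None) -> str:
    """Today's shape: None stands for both 'absent' and 'JSON null', so null is dropped."""
    out = {"content": []}
    if structured is not None:
        out["structuredContent"] = structured
    return json.dumps(out)


def dump_distinguishing(structured=MISSING) -> str:
    """Fix direction: absence is tracked separately, so an explicit null survives."""
    out = {"content": []}
    if structured is not MISSING:
        out["structuredContent"] = structured
    return json.dumps(out)
```

Falsy values such as `0` survive both versions; only null is lost in the collapsing one.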

MRTR edges: a retry missing a requested key is re-prompted rather than
errored, unknown response keys are ignored, the resultType seam (absent
means complete, input_required is never masked, unrecognized values
rejected - flipping the deferred entry to a pinned divergence), and the
max-tokens pass-through. Auth: refresh tokens are not reused across an AS
change, CIMD documents are portable, the all-scopes single challenge, and
the bundled AS registration echo dropping application_type (pinned against
its ledger row). The scatter: list results are connection-independent and
deterministically ordered, an empty-string cursor is a valid cursor (a
2026 rule the changelog never mentioned), cancellation stops notification
delivery, SSE comment lines are ignored, legacy error codes pass through
opaquely, sampling messages are not retained across rounds, multi-content
reads, path-traversal rejection, and resource links in prompt content.
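
A minimal sketch of why the empty-string cursor rule matters for a client pagination loop. The two-page `PAGES` table and both function names are hypothetical; the point is that the loop must test for an absent cursor, not a falsy one.

```python
from __future__ import annotations

# Toy server state: cursor -> (items, next cursor). The first page hands out "".
PAGES: dict[str | None, tuple[list[str], str | None]] = {
    None: (["a", "b"], ""),
    "": (["c"], None),
}


def list_page(cursor: str | None) -> tuple[list[str], str | None]:
    return PAGES[cursor]


def list_all() -> list[str]:
    items: list[str] = []
    cursor: str | None = None
    while True:
        batch, next_cursor = list_page(cursor)
        items.extend(batch)
        if next_cursor is None:  # `if not next_cursor` would stop after the first page
            return items
        cursor = next_cursor
```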

26 entries minted, one flipped, four divergences recorded. 946 -> 1009
cells, every node accounted; suite green three consecutive runs.

The unrecognized-resultType and scope-aggregation divergences gained
tracking entries after their pins landed; wire the issue fields so the
fixer trail is complete for every pinned divergence in the manifest.

MCPServer now passes InputRequiredResult through its prompt and resource
pipelines, so the two origin entries' notes and the matching test
docstrings no longer claim it cannot; the mcpserver mirrors are recorded
as possible and not yet covered. No behaviour or assertion changes - the
full suite is green unchanged against current main.

maxisbey force-pushed the interaction-2026-requirements branch from b46806f to 011bbd5 June 29, 2026 16:01

claude (bot) left a comment

Contributor

I didn't find any bugs in the new tests, but this is a large PR (~109 new test functions across 29 files) that pins many spec divergences as expected behaviour and reshapes the requirements manifest, so it warrants a human maintainer's review of those judgment calls.

Extended reasoning...

Overview

This PR touches only tests/interaction/ — no src/ changes — extending the interaction suite from the 2025-11-25 baseline to the 2026-07-28 spec revision. It adds roughly 109 new test functions across 29 files, plus harness/provider extensions (refresh-token rotation knob, iss/code overrides on the headless OAuth callback, persistent step-up shim) and substantial manifest re-keying (requirement id renames, era links, 64 deferred entries, 65 recorded divergences).

Security risks

There is no production code change, so no direct security exposure. The auth tests do exercise security-relevant client behaviour (RFC 9207 iss validation, refresh-token rotation, AS-binding credential discard, scope step-up), and several tests deliberately pin current SDK behaviour that diverges from the spec (e.g. pre-registered credentials silently replaced rather than erroring, Mcp-Param header values not validated, no embed capability gate). Pinning these as green tests is a policy decision: it makes today's gaps regression-stable, but a maintainer should confirm each pinned divergence is acceptable rather than something that should be fixed before being enshrined in the test suite.

Level of scrutiny

As a tests-only change the blast radius is limited to CI signal, which lowers the bar relative to runtime code. However, the PR is far from mechanical: it encodes interpretations of a draft 2026 spec, decides which behaviours are SDK-defined versus spec-mandated, and restructures the requirements manifest that gates coverage. Those are design/judgment decisions the approval guidelines say a human should weigh in on, and the sheer volume (1000+ added lines of intricate, heavily-documented test logic) exceeds what I can confidently rubber-stamp.

Other factors

The bug-hunting pass found no bugs, the description reports ten consecutive green runs with pyright/ruff clean, and the tests themselves are unusually well documented (each pins a named requirement with rationale). There are no prior reviewer comments to address. The main thing a human reviewer should focus on is the divergence list and the manifest id renames rather than line-by-line test logic.
