Skip to content

[Repo Assist] fix(setup): cancel wizard session before disconnect to prevent stale-session retry errors#718

Draft
github-actions[bot] wants to merge 1 commit into
mainfrom
repo-assist/fix-issue-709-wizard-stale-session-4f1a22b14ae6daa1
Draft

[Repo Assist] fix(setup): cancel wizard session before disconnect to prevent stale-session retry errors#718
github-actions[bot] wants to merge 1 commit into
mainfrom
repo-assist/fix-issue-709-wizard-stale-session-4f1a22b14ae6daa1

Conversation

@github-actions

@github-actions github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🤖 This PR was created by Repo Assist, an automated AI assistant.

Closes #709

Root Cause

When the Teams channel setup wizard hit a timeout (EnterWizardErrorAsync), the code called DisconnectAsync() immediately — which nulls _client and tears down the connection — before giving the user any recovery UI. Then CancelCurrentSessionAsync() checks _client != null first and short-circuits, so no wizard.cancel RPC is ever sent to the gateway. The server-side wizard session stays alive.

When the user retries, StartWizardAsync sends a fresh wizard.start into an already-active session and receives "wizard already running".

A secondary path: StartWizardAsync itself also called DisconnectAsync() at the top (to clean up any pre-existing connection) — same problem if the client was connected when an error payload arrived via ApplyPayloadAsync.

Fix

src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs (2 locations)

  • EnterWizardErrorAsync: replace await DisconnectAsync() with await CancelCurrentSessionAsync().
    CancelCurrentSessionAsync already calls DisconnectAsync() at the end, uses a 10-second timeout with catch {} for resilience, and is a no-op when _client is null or _currentSessionId is empty — so it is safe to call here and also sends wizard.cancel while the connection is still live.

  • StartWizardAsync: same replacement for the "clean up before starting" call at the top of the method.

Trade-offs

  • No new gateway RPCs are added; wizard.cancel was already the correct cleanup path.
  • CancelCurrentSessionAsync has a 10-second internal timeout — acceptable for an error-recovery path where we're already in a degraded state.
  • WinUI code-behind has no unit test infrastructure in this repo; the fix cannot be covered by an automated test. The existing UI flow is the only practical validation path.
  • PR Extend setup wizard timeout for Teams channel setup #648 (extend wizard timeout) is complementary — it addresses the root cause of the timeout itself; this PR addresses recovery when the timeout fires.

Test Status

  • Shared tests: ✅ 2122 passed, 8 pre-existing failures (Windows path / networking tests unrelated to this change)
  • Tray tests: ✅ 973 passed, 0 failed
  • Build: ⚠️ ./build.ps1 skipped — GitVersion MSBuild task requires GITHUB_ENV path present on CI runner; not available in agent environment (pre-existing infrastructure limitation, not caused by this change)

Generated by 🌈 Repo Assist, see workflow run. Learn more.

Add this agentic workflows to your repo

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a

…session retry errors

When wizard.next timed out (e.g. Teams channel selection hanging),
EnterWizardErrorAsync called DisconnectAsync which nulled _client, then
showed "Start wizard again" / "Skip wizard" buttons. CancelCurrentSessionAsync
checked _client != null and skipped the wizard.cancel call — leaving the
server-side session active. Subsequent "Start wizard again" clicks then hit a
gateway "wizard already running" error.

Fix: replace await DisconnectAsync() with await CancelCurrentSessionAsync()
in both EnterWizardErrorAsync and StartWizardAsync. CancelCurrentSessionAsync
sends wizard.cancel (best-effort, catch ignored) then calls DisconnectAsync,
so the disconnect still happens. The session cancel is a no-op when _client
is already null or _sessionId is empty, so the first-start path is unaffected.

Closes #709

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper

clawsweeper Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 10:05 PM ET / 02:05 UTC.

Summary
The PR changes two setup wizard UI paths to call CancelCurrentSessionAsync() before disconnecting so stale server-side wizard sessions can be cleared before retry or skip.

Reproducibility: Not fully. Source inspection shows current master disconnects before cancellation in the reported recovery path, but I do not have a high-confidence live reproduction against a real gateway in this read-only review.

Review metrics: 2 noteworthy metrics.

  • Patch size: 1 file changed, 8 added, 2 removed. The implementation surface is narrow, but it touches setup wizard session cleanup behavior.
  • Validation gap: 1 required command skipped. AGENTS.md requires the full repo build after setup code changes, and the PR body says ./build.ps1 was not run.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Add redacted live proof that timeout recovery can retry or skip without a lingering wizard already running error.
  • [P1] Demonstrate the required ./build.ps1 validation or provide maintainer-accepted equivalent build evidence.

Proof guidance:

  • [P1] Needs real behavior proof before merge: Missing: add redacted after-fix proof such as a screenshot/video, terminal output, logs, recording, or linked artifact showing retry or skip after timeout; update the PR body to trigger re-review or ask a maintainer to comment @clawsweeper re-review.

Mantis proof suggestion
A short visible setup recovery proof would materially help because the behavior depends on real gateway session state, not just unit-level code flow. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

visual task: verify the setup wizard can recover from a Teams channel timeout and then retry or skip without a lingering "wizard already running" error.

Risk before merge

  • [P2] No after-fix live setup proof is attached, so source review alone does not prove the real gateway accepts wizard.cancel after the timeout path or that retry/skip no longer reports wizard already running.
  • [P1] The PR body says ./build.ps1 was skipped, so the repository-required full validation set has not been demonstrated for this branch.
  • [P2] The related Teams timeout work remains open at Extend setup wizard timeout for Teams channel setup #648; this PR addresses stale-session recovery after a timeout, not the underlying channel step hang.

Maintainer options:

  1. Require live wizard recovery proof (recommended)
    Ask for redacted logs, terminal output, or a short recording showing a timed-out Teams wizard can retry or skip without a lingering server session before merging.
  2. Accept source-only recovery risk
    Maintainers may choose to merge based on source review alone, but they would own the risk that the real gateway timeout path still leaves stale wizard state.
  3. Wait for the timeout fix
    Pause this branch until the related Teams timeout PR clarifies whether stale-session recovery is still needed as a separate fix.

Next step before merge

  • [P2] There is no narrow code repair to queue; maintainers need live wizard recovery proof and full validation before deciding whether to merge this draft PR.

Security
Cleared: The diff only changes setup wizard cleanup calls and does not add dependency, credential, permission, or code-execution surface.

Review details

Best possible solution:

Land a narrow cancel-before-disconnect fix only after redacted live wizard recovery proof and the full repository validation set demonstrate that retry and skip work without leaving a gateway session behind.

Do we have a high-confidence way to reproduce the issue?

Not fully. Source inspection shows current master disconnects before cancellation in the reported recovery path, but I do not have a high-confidence live reproduction against a real gateway in this read-only review.

Is this the best way to solve the issue?

Likely yes, but not proven. Reusing CancelCurrentSessionAsync() is the narrowest maintainable code path, but the branch still needs live gateway recovery proof and full validation before merge.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d1b136347e95.

Label changes

Label justifications:

  • P1: The PR addresses a setup wizard retry path that can block users after a channel setup timeout.
  • merge-risk: 🚨 session-state: The diff changes wizard session cancellation and retry cleanup, which affects whether the gateway keeps or clears an active setup session.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Missing: add redacted after-fix proof such as a screenshot/video, terminal output, logs, recording, or linked artifact showing retry or skip after timeout; update the PR body to trigger re-review or ask a maintainer to comment @clawsweeper re-review.
Evidence reviewed

Acceptance criteria:

  • [P1] ./build.ps1.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore.

What I checked:

Likely related people:

  • Christine Yan: Blame and git log -S tie the current WizardPage.xaml.cs implementation, including StartWizardAsync, EnterWizardErrorAsync, and CancelCurrentSessionAsync, to commit 85445c7. (role: introduced behavior; confidence: high; commits: 85445c78066b; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
  • Barbara Kudiess: Recent commit 8f134b3 changed the same wizard UI file for gateway console output and OAuth recovery visibility. (role: recent area contributor; confidence: medium; commits: 8f134b354df0; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
  • ranjeshj: The related timeout PR is authored by ranjeshj, and commit afa6218 recently consolidated setup UI into the tray host and touched wizard cancellation/disconnect behavior. (role: adjacent setup owner; confidence: high; commits: afa6218338d6; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 Urgent regression or broken agent/channel workflow affecting real users now. merge-risk: 🚨 session-state 🚨 Merging this PR could lose, corrupt, stale, or mis-associate session or agent state. labels Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation merge-risk: 🚨 session-state 🚨 Merging this PR could lose, corrupt, stale, or mis-associate session or agent state. P1 Urgent regression or broken agent/channel workflow affecting real users now. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. repo-assist status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Teams SDK Channel selection can't continue

0 participants