Skip to content

[Repo Assist] fix(setup): cancel wizard session before disconnect to prevent stale-session errors#717

Draft
github-actions[bot] wants to merge 1 commit into
mainfrom
repo-assist/fix-issue-709-wizard-stale-session-a883ff0fff229859
Draft

[Repo Assist] fix(setup): cancel wizard session before disconnect to prevent stale-session errors#717
github-actions[bot] wants to merge 1 commit into
mainfrom
repo-assist/fix-issue-709-wizard-stale-session-a883ff0fff229859

Conversation

@github-actions

@github-actions github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🤖 This fix is proposed by Repo Assist, an automated AI assistant.

Problem

When the setup wizard's wizard.next call timed out (e.g. Teams channel selection hung for longer than the step timeout), the error recovery path left the gateway-side wizard session active. Subsequent attempts to "Start wizard again" then failed with a "wizard already running" gateway error — causing users to be permanently stuck in a broken setup loop.

Root cause (src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs):

EnterWizardErrorAsync called DisconnectAsync() which nulls _client, then showed "Start wizard again" / "Skip wizard" buttons. When the user clicked either button, CancelCurrentSessionAsync checked _client != null and skipped sending wizard.cancel — leaving the server-side session alive. The next wizard.start hit a live existing session.

Fix

Replace await DisconnectAsync() with await CancelCurrentSessionAsync() in two places:

  1. EnterWizardErrorAsync — cancel the session immediately when entering error state, before disconnect, so subsequent "Start wizard again" clicks don't hit a stale session.
  2. StartWizardAsync — cancel any prior server-side session before starting a fresh wizard.start. This covers the ShowError path in ApplyPayloadAsync where _client is still connected when the error is shown.

CancelCurrentSessionAsync already:

  • Sends wizard.cancel best-effort (catch { } ignores failures)
  • Uses a 10s timeout so it can't block indefinitely
  • Calls DisconnectAsync() at the end — so the disconnect still happens
  • Is a no-op when _client is null or _sessionId is empty (no change to first-start path)

Test Status

  • Shared tests: 2122 passed, 8 pre-existing failures (unchanged)
  • Tray tests: 973 passed, 0 failed, 2 skipped
  • ⚠️ Build: GitVersion MSBuild task fails due to missing GITHUB_ENV file (environment infrastructure issue, not caused by this change; pre-existing in this run environment)
  • i️ UI tests: WizardPage.xaml.cs is in a WinUI project; no unit tests exist for this UI code-behind file — WinUI runtime required. The fix is a minimal 2-line substitution.

Closes #709

Generated by 🌈 Repo Assist, see workflow run. Learn more.

Add this agentic workflows to your repo

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a

…session retry errors

When wizard.next timed out (e.g. Teams channel selection hanging),
EnterWizardErrorAsync called DisconnectAsync which nulled _client, then
showed "Start wizard again" / "Skip wizard" buttons. CancelCurrentSessionAsync
checked _client != null and skipped the wizard.cancel call — leaving the
server-side session active. Subsequent "Start wizard again" clicks then hit a
gateway "wizard already running" error.

Fix: replace await DisconnectAsync() with await CancelCurrentSessionAsync()
in both EnterWizardErrorAsync and StartWizardAsync. CancelCurrentSessionAsync
sends wizard.cancel (best-effort, catch ignored) then calls DisconnectAsync,
so the disconnect still happens. The session cancel is a no-op when _client
is already null or _sessionId is empty, so the first-start path is unaffected.

Closes #709

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper

clawsweeper Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 10:08 PM ET / 02:08 UTC.

Summary
The PR changes WizardPage.xaml.cs to call CancelCurrentSessionAsync() instead of DisconnectAsync() when starting a fresh wizard and when entering wizard error recovery.

Reproducibility: yes. at source level: current main disconnects and nulls the wizard client before retry/skip can send wizard.cancel, and the linked report describes the resulting stuck setup loop. I did not run a live WinUI/gateway reproduction because this review is read-only.

Review metrics: 2 noteworthy metrics.

  • Diff scope: 1 file changed, 2 call sites replaced. The patch is small enough for focused source review of wizard session cleanup.
  • Validation signal: Build blocked; shared tests non-clean; tray tests reported clean. AGENTS.md requires the build plus shared and tray test commands before completion.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Add redacted real setup proof showing the timeout/error recovery path and successful retry or skip without a wizard already running error.
  • [P1] Provide clean required validation or maintainer-confirmed CI evidence for the AGENTS.md build and test commands.

Proof guidance:

  • [P1] Needs real behavior proof before merge: No after-fix real setup proof is in the PR body; add a redacted recording, terminal/live output, or logs showing retry or skip after timeout, then update the PR body for re-review or ask a maintainer to comment @clawsweeper re-review.

Mantis proof suggestion
A short setup wizard/gateway smoke recording would materially prove the visible retry and skip recovery behavior. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

visual task: verify the setup wizard can recover from a timed-out gateway wizard step by retrying or skipping without a wizard already running error.

Risk before merge

  • [P2] No after-fix real setup proof demonstrates the Teams timeout/retry path; tests alone do not prove the gateway-side wizard session is actually cancelled in a live WinUI flow.
  • [P1] Repository-required validation is not clean in the PR body: the build is reported blocked and shared tests are reported with pre-existing failures.
  • [P1] The changed WinUI code-behind path has no focused automated regression test, so a manual or recorded gateway setup smoke is important before merge.

Maintainer options:

  1. Decide the mitigation before merge
    Land the narrow cancellation change after redacted live setup proof and required validation show retry or skip recovery no longer hits a stale gateway wizard session.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P2] The remaining blocker is contributor or maintainer proof and validation for a real WinUI gateway setup path, not an automated code repair.

Security
Cleared: The diff only changes local setup wizard cleanup calls in one C# file and does not add dependency, credential, permission, workflow, or supply-chain exposure.

Review details

Best possible solution:

Land the narrow cancellation change after redacted live setup proof and required validation show retry or skip recovery no longer hits a stale gateway wizard session.

Do we have a high-confidence way to reproduce the issue?

Yes at source level: current main disconnects and nulls the wizard client before retry/skip can send wizard.cancel, and the linked report describes the resulting stuck setup loop. I did not run a live WinUI/gateway reproduction because this review is read-only.

Is this the best way to solve the issue?

Likely yes: reusing the existing CancelCurrentSessionAsync() helper is the narrowest maintainable fix and matches the console runner's cleanup pattern. The remaining gap is proof and validation, not a different code direction.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d1b136347e95.

Label changes

Label justifications:

  • P1: The PR targets a setup wizard recovery loop that can block users from completing gateway setup.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: No after-fix real setup proof is in the PR body; add a redacted recording, terminal/live output, or logs showing retry or skip after timeout, then update the PR body for re-review or ask a maintainer to comment @clawsweeper re-review.
Evidence reviewed

Acceptance criteria:

  • [P1] ./build.ps1.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore.
  • [P2] Manual or recorded setup wizard smoke: force a wizard.next timeout/error, then retry or skip without a wizard already running response.

What I checked:

  • Current main leaves the stale-session path plausible: On current main, StartWizardAsync and EnterWizardErrorAsync call DisconnectAsync(), while CancelCurrentSessionAsync() only sends wizard.cancel when _client and _sessionId are still available. (src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs:59, d1b136347e95)
  • PR diff is focused on the two cleanup call sites: The PR head replaces the two direct disconnect calls with CancelCurrentSessionAsync(), leaving the existing best-effort cancel helper and disconnect behavior in place. (src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs:59, 58599fa173b2)
  • Console setup runner already cancels unfinished wizard sessions: SetupWizardRunner sends wizard.cancel for unfinished sessions in cancellation/finally paths before disconnecting the client, which supports the PR's cleanup direction. (src/OpenClaw.SetupEngine/SetupWizardRunner.cs:188, d1b136347e95)
  • Onboarding docs define recovery as user-visible behavior: The onboarding docs say recovery choices remain available when gateway setup stalls or the wizard connection is lost so users are not trapped retrying a broken session. (docs/ONBOARDING_WIZARD.md:39, d1b136347e95)
  • Feature history: Blame and log history tie the current WinUI wizard implementation to the v0.6.3-era WizardPage introduction, with later adjacent setup wizard changes in the same file. (src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs:52, 85445c78066b)
  • Proof and validation remain incomplete: The PR body reports tray tests passed, shared tests had pre-existing failures, the build was blocked, and it does not include after-fix live setup proof of retry/skip after a wizard timeout. (58599fa173b2)

Likely related people:

  • Christine Yan: Blame points the WinUI WizardPage setup flow, including the existing disconnect/cancel helper structure, to commit 85445c7. (role: introduced behavior; confidence: medium; commits: 85445c78066b; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
  • Barbara Kudiess: Recent wizard UI work in commit 8f134b3 adjusted WizardPage.xaml.cs around the same setup wizard and recovery-adjacent flow. (role: recent area contributor; confidence: medium; commits: 8f134b354df0; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
  • Ranjesh: Commit afa6218 consolidated setup UI host behavior and touched navigation paths adjacent to this wizard page. (role: adjacent owner; confidence: low; commits: afa6218338d6; files: src/OpenClaw.SetupEngine.UI/Pages/WizardPage.xaml.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 Urgent regression or broken agent/channel workflow affecting real users now. labels Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation P1 Urgent regression or broken agent/channel workflow affecting real users now. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. repo-assist status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Teams SDK Channel selection can't continue

0 participants