Collect a hang dump on the wedging Windows PowerShell CI leg #2326

Draft
andyleejordan wants to merge 1 commit into main from andyleejordan-diagnose-flaky-ci

What

CI's windows-latest leg still intermittently wedges and rides its
timeout-minutes: 60 cap to a cancelled CI Tests run, even after #2318's
discovery-time skips. This PR doesn't add another skip; it instruments CI so
the next hang is actionable instead of burning an hour blind.

Where it wedges now

Pulling the Windows test-result artifacts from the last two hung runs and a
passing one (TestFull order is TestPS74 → TestE2EPwsh → TestPS51 → …):

| Run | Result | Legs that produced a .trx | net462 TestPS51? |
|---|---|---|---|
| #2319 e34a3ffee | ✅ ~11 min | 7 (incl. net462 TestPS51, 338 tests) | yes |
| #2325 dbd5dfc | ❌ 1 h cap | 2 (TestPS74, TestE2EPwsh) | no |
| #2319 7e0aa34 | ❌ 1 h cap | 2 (TestPS74, TestE2EPwsh) | no |

Both hung runs stop right after TestE2EPwsh and never emit a TestPS51
.trx, so they wedge in the net462 / Windows PowerShell 5.1 unit leg, not in
the E2E server path that #2318 skipped. That unit leg is not covered by any
SkippableFactOnWindowsPowerShell guard (those all live in the E2E project).
All three commits contain #2318's skip (verified via merge-base), so this is
not a missing rebase — the 20260614 runner-image regression still races,
just less often. See #2323 for the image root cause.

Changes

  • PowerShellEditorServices.build.ps1 — add
    --blame-hang --blame-hang-timeout 10m --blame-hang-dump-type full to the CI
    dotnet test invocations, gated on $env:GITHUB_ACTIONS (local runs are
    byte-identical). Any single test that wedges past 10 minutes is dumped and
    its host process tree terminated, so the leg fails fast and names the test
    instead of riding the cap.
  • .github/workflows/ci-test.yml — best-effort ProcDump install on the
    Windows leg (VSTest's built-in hang dumper only handles .NET Core hosts, so
    capturing a dump of the net462 host needs ProcDump, which isn't on the runner
    image; a download failure only warns and never fails CI), and extend the
    test-results upload to **/*.dmp + **/*_Sequence.xml.
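
As a rough illustration of the best-effort ProcDump step described above, the sketch below downloads ProcDump and only warns on failure. It is not the code from this PR: the function name, the default download URL (the public Sysinternals zip), and the destination folder are assumptions for illustration. It also assumes that VSTest's blame collector finds ProcDump through the `PROCDUMP_PATH` environment variable.

```powershell
# Hypothetical helper: none of these names come from the PR itself.
function Install-ProcDumpBestEffort {
    param(
        # Assumed download location (public Sysinternals zip).
        [string]$Url = 'https://download.sysinternals.com/files/Procdump.zip',
        # Assumed extraction folder under the temp directory.
        [string]$Destination = (Join-Path ([System.IO.Path]::GetTempPath()) 'procdump')
    )
    try {
        $zip = Join-Path ([System.IO.Path]::GetTempPath()) 'Procdump.zip'
        Invoke-WebRequest -Uri $Url -OutFile $zip -UseBasicParsing -ErrorAction Stop
        Expand-Archive -Path $zip -DestinationPath $Destination -Force -ErrorAction Stop

        # Assumption: the blame collector locates procdump.exe via PROCDUMP_PATH.
        # Appending to GITHUB_ENV exports the variable to later workflow steps.
        if ($env:GITHUB_ENV) {
            "PROCDUMP_PATH=$Destination" |
                Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append
        }
        return $true
    }
    catch {
        # '::warning::' is the GitHub Actions workflow command for a warning
        # annotation; the step itself still succeeds.
        Write-Host "::warning::ProcDump install failed; continuing without it: $($_.Exception.Message)"
        return $false
    }
}
```

Returning a boolean instead of throwing keeps a flaky download from turning into a second source of red CI runs, which matches the "only warns and never fails CI" intent.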

Caveat

If the wedge is in host startup/discovery rather than a running test, the
per-test timer may not fire — but _Sequence.xml will still show how far
discovery got, which is itself diagnostic. Once a hung run identifies the
offending net462 unit test, we can give it the same targeted
Skip.If(Desktop) / polling fix we used for #2307 and #2314 instead of a
broad leg-wide skip.

Validation

Local arg construction verified in both modes (CI inserts the flags with
--framework still last; local is unchanged); build script pwsh-parsed and
workflow YAML validated. The instrumentation itself can only be exercised on
the GitHub-hosted Windows image where the hang reproduces.
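
The arg-construction check described above could look roughly like the sketch below. It is a minimal stand-in, not the build script's actual code: `Get-DotnetTestArgs`, its parameters, and the base `test --logger trx` arguments are hypothetical, and only the three `--blame-hang*` flags and the "`--framework` stays last" rule are taken from this PR.

```powershell
# Hypothetical helper; the real build script's names and base args may differ.
function Get-DotnetTestArgs {
    param(
        [Parameter(Mandatory)][string]$Framework,
        # GitHub Actions sets GITHUB_ACTIONS=true; unset converts to $false.
        [bool]$IsCI = [bool]$env:GITHUB_ACTIONS
    )
    $testArgs = @('test', '--logger', 'trx')
    if ($IsCI) {
        # Hang instrumentation is added only in CI, so local runs are unchanged.
        $testArgs += @('--blame-hang', '--blame-hang-timeout', '10m',
                       '--blame-hang-dump-type', 'full')
    }
    # --framework and its value are always appended last.
    $testArgs += @('--framework', $Framework)
    return $testArgs
}

$localArgs = Get-DotnetTestArgs -Framework 'net462' -IsCI $false
$ciArgs    = Get-DotnetTestArgs -Framework 'net462' -IsCI $true

# Local mode must not contain any blame flags.
if ($localArgs -contains '--blame-hang') { throw 'blame flags leaked into local args' }

# CI mode must contain the flags and still end with --framework <tfm>.
if ($ciArgs -notcontains '--blame-hang') { throw 'blame flags missing in CI args' }
if ($ciArgs[-2] -ne '--framework' -or $ciArgs[-1] -ne 'net462') {
    throw '--framework is no longer the last argument'
}
```

Passing `-IsCI` explicitly lets both modes be checked on a developer machine without setting or clearing the real environment variable.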

Drafted by Copilot (Claude Opus 4.8) on Andy's behalf.

The `windows-latest` `CI Tests` leg still intermittently rides its
`timeout-minutes` cap to a cancelled run even after #2318's skips. Pulling
the test-result artifacts from the last two hung runs shows they stop right
after `TestE2EPwsh` and never emit a `TestPS51` `.trx`, so the wedge is now
in the net462 / Windows PowerShell 5.1 unit leg — not the E2E server path
#2318 skipped. The earlier assumption in #2323 that `TestPS51` was unaffected
no longer holds, and `dotnetTestArgs` had no `--blame-hang`, so a stuck unit
host produced no dump and no test name; it just burned the hour.

This doesn't skip anything — it instruments CI so the next hang is actionable:

- Add `--blame-hang --blame-hang-timeout 10m --blame-hang-dump-type full` to
  the CI `dotnet test` invocations (gated to `GITHUB_ACTIONS`, so local runs
  are byte-identical). Any single test that wedges past 10 minutes is dumped
  and its host tree terminated, failing the leg fast and naming the test.
- Install ProcDump (best-effort) on the Windows leg. VSTest's built-in hang
  dumper only handles .NET Core hosts, so dumping the net462 host needs
  ProcDump, which isn't on the runner image. A download failure only warns.
- Upload `**/*.dmp` and `**/*_Sequence.xml` alongside the `.trx`.

Caveat: if the wedge is in host startup/discovery rather than a running test,
the per-test timer may not fire — but `_Sequence.xml` still shows how far
discovery got. Once a hung run names the net462 test, we can give it the same
targeted treatment as #2307 and #2314 instead of skipping the whole leg.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andyleejordan added the Area-Test and Ignore (Exclude from the changelog) labels on Jun 23, 2026