Collect a hang dump on the wedging Windows PowerShell CI leg #2326
Draft
andyleejordan wants to merge 1 commit into
The `windows-latest` `CI Tests` leg still intermittently rides its `timeout-minutes` cap to a cancelled run even after #2318's skips. Pulling the test-result artifacts from the last two hung runs shows they stop right after `TestE2EPwsh` and never emit a `TestPS51` `.trx`, so the wedge is now in the net462 / Windows PowerShell 5.1 unit leg, not the E2E server path #2318 skipped. The earlier assumption in #2323 that `TestPS51` was unaffected no longer holds, and `dotnetTestArgs` had no `--blame-hang`, so a stuck unit host produced no dump and no test name; it just burned the hour.

This doesn't skip anything; it instruments CI so the next hang is actionable:

- Add `--blame-hang --blame-hang-timeout 10m --blame-hang-dump-type full` to the CI `dotnet test` invocations (gated to `GITHUB_ACTIONS`, so local runs are byte-identical). Any single test that wedges past 10 minutes is dumped and its host tree terminated, failing the leg fast and naming the test.
- Install ProcDump (best-effort) on the Windows leg. VSTest's built-in hang dumper only handles .NET Core hosts, so dumping the net462 host needs ProcDump, which isn't on the runner image. A download failure only warns.
- Upload `**/*.dmp` and `**/*_Sequence.xml` alongside the `.trx`.

Caveat: if the wedge is in host startup/discovery rather than a running test, the per-test timer may not fire, but `_Sequence.xml` still shows how far discovery got. Once a hung run names the net462 test, we can give it the same targeted treatment as #2307 and #2314 instead of skipping the whole leg.

Drafted by Copilot (Claude Opus 4.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
CI's `windows-latest` leg still intermittently wedges and rides its `timeout-minutes: 60` cap to a cancelled `CI Tests` run, even after #2318's discovery-time skips. This PR doesn't add another skip; it instruments CI so the next hang is actionable instead of burning an hour blind.
Where it wedges now
Pulling the Windows test-result artifacts from the last two hung runs and a passing one (`TestFull` order is `TestPS74 → TestE2EPwsh → TestPS51 → …`):

| Commit | `.trx` files emitted | `TestPS51`? |
| --- | --- | --- |
| `e34a3ffee` (passing) | includes `TestPS51` (338 tests) | yes |
| `dbd5dfc` (hung) | `TestPS74`, `TestE2EPwsh` | no |
| `7e0aa34` (hung) | `TestPS74`, `TestE2EPwsh` | no |

Both hung runs stop right after `TestE2EPwsh` and never emit a `TestPS51` `.trx`, so they wedge in the net462 / Windows PowerShell 5.1 unit leg. That is not the E2E server path #2318 skipped, and it is not covered by any `SkippableFactOnWindowsPowerShell` guard (those all live in the E2E project).

All three commits contain #2318's skip (verified via `merge-base`), so this is not a missing rebase: the `20260614` runner-image regression still races, just less often. See #2323 for the image root cause.
Changes
- `PowerShellEditorServices.build.ps1`: add `--blame-hang --blame-hang-timeout 10m --blame-hang-dump-type full` to the CI `dotnet test` invocations, gated on `$env:GITHUB_ACTIONS` (local runs are byte-identical). Any single test that wedges past 10 minutes is dumped and its host process tree terminated, so the leg fails fast and names the test instead of riding the cap.
- `.github/workflows/ci-test.yml`: best-effort ProcDump install on the Windows leg (VSTest's built-in hang dumper only handles .NET Core hosts, so capturing a dump of the net462 host needs ProcDump, which isn't on the runner image; a download failure only warns and never fails CI), and extend the test-results upload to `**/*.dmp` + `**/*_Sequence.xml`.

Caveat
If the wedge is in host startup/discovery rather than a running test, the per-test timer may not fire, but `_Sequence.xml` will still show how far discovery got, which is itself diagnostic. Once a hung run identifies the offending net462 unit test, we can give it the same targeted `Skip.If(Desktop)` / polling fix we used for #2307 and #2314 instead of a broad leg-wide skip.
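For reviewers who want to see the shape of the best-effort ProcDump install, here is a minimal sketch of what such a workflow step could run. It is not the code in `ci-test.yml`: the function name `Install-ProcDumpBestEffort`, the download URL, and the destination folder are illustrative assumptions, and it assumes VSTest's blame collector finds ProcDump through the `PROCDUMP_PATH` environment variable.

```powershell
# Hypothetical helper; the real workflow step may be structured differently.
function Install-ProcDumpBestEffort {
    param(
        # Assumed Sysinternals download location; pin or mirror it as needed.
        [string] $Url = 'https://download.sysinternals.com/files/Procdump.zip',
        # Assumed scratch location for the extracted binaries.
        [string] $Destination = (Join-Path ([IO.Path]::GetTempPath()) 'procdump')
    )

    try {
        $zip = Join-Path ([IO.Path]::GetTempPath()) 'Procdump.zip'
        Invoke-WebRequest -Uri $Url -OutFile $zip -UseBasicParsing -ErrorAction Stop
        Expand-Archive -Path $zip -DestinationPath $Destination -Force -ErrorAction Stop

        # Export PROCDUMP_PATH to later steps when running under GitHub Actions.
        if ($env:GITHUB_ENV) {
            Add-Content -Path $env:GITHUB_ENV -Value "PROCDUMP_PATH=$Destination"
        }
        return $true
    }
    catch {
        # Best-effort: report the failure as a warning and let the job continue.
        Write-Warning "ProcDump install failed; net462 hang dumps may be unavailable: $_"
        return $false
    }
}
```

A step would call `Install-ProcDumpBestEffort` once and ignore the result, so a failed download only costs the dump and never the CI run.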
Validation
Local arg construction verified in both modes (CI inserts the flags with `--framework` still last; local is unchanged); build script pwsh-parsed and workflow YAML validated. The instrumentation itself can only be exercised on the GitHub-hosted Windows image where the hang reproduces.
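The arg-construction check can be illustrated with a small sketch. It is not the code in `PowerShellEditorServices.build.ps1`: the function name `Get-DotnetTestArgs`, the `-IsCI` switch-like parameter, and the baseline arguments are assumptions made for the example; only the three `--blame-hang*` flags and the "`--framework` stays last" property come from this PR.

```powershell
# Hypothetical helper; names and baseline arguments are illustrative only.
function Get-DotnetTestArgs {
    param(
        [Parameter(Mandatory)] [string] $Framework,
        # GitHub Actions sets GITHUB_ACTIONS to 'true' on its runners.
        [bool] $IsCI = ($env:GITHUB_ACTIONS -eq 'true')
    )

    # Assumed baseline arguments shared by CI and local runs.
    $testArgs = @('test', '--no-build', '--logger', 'trx')

    if ($IsCI) {
        # Dump and terminate the test host if a single test runs past 10 minutes.
        $testArgs += @(
            '--blame-hang',
            '--blame-hang-timeout', '10m',
            '--blame-hang-dump-type', 'full'
        )
    }

    # Append --framework last so CI and local differ only by the inserted flags.
    $testArgs += @('--framework', $Framework)
    return $testArgs
}

$ci    = Get-DotnetTestArgs -Framework 'net462' -IsCI $true
$local = Get-DotnetTestArgs -Framework 'net462' -IsCI $false
```

Comparing `$ci` and `$local` shows the two properties the validation relies on: the local argument list carries no blame flags, and both lists end with `--framework net462`.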
Drafted by Copilot (Claude Opus 4.8) on Andy's behalf.