[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU by hujc7 · Pull Request #5933 · isaac-sim/IsaacLab

hujc7 · 2026-06-03T02:52:04Z

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU

Summary

Adds opt-in ISAACLAB_PIN_KIT_GPU env var. When truthy, AppLauncher appends --/renderer/multiGpu/enabled=False, --/renderer/multiGpu/autoEnable=False, --/renderer/multiGpu/maxGpuCount=1 to the Kit command line, pinning each Kit process to a single GPU.
Addresses the documented upstream cubric / PhysX-fabric GPU-interop race that surfaces in concurrent multi-GPU CI as [Error] [omni.physx.plugin] Stage X already attached mid-test and SimulationApp.close hanging >52s in teardown.
Off by default; single-GPU and user-facing rendering paths are unchanged.

1. Problem

The default apps/isaaclab.python.headless.kit sets:

renderer.multiGpu.enabled = true
renderer.multiGpu.autoEnable = true

So every Kit process enumerates every visible GPU at startup. AppLauncher pins active_gpu + physics_gpu to the requested device, but renderer.multiGpu.enabled stays true.

Under multi-GPU CI shards (--gpus all per container, one Kit per non-default cuda device), three sibling Kit processes each scan all 4 visible GPUs. cubric's GPU-interop state and PhysX-fabric attach paths then race across processes on the same host. Direct evidence from CI run 26698100037:

[cuda:1] [Error] [omni.physx.plugin] Stage 9223032 already attached.
[cuda:1] [Error] [omni.physx.plugin] Stage 9223035 already attached.
[cuda:1] source/isaaclab_physx/test/assets/test_articulation.py: shutdown hanged (killed after 52s, test had completed)
[cuda:2] source/isaaclab_newton/test/assets/test_articulation.py: Process killed by signal 1 (SIGHUP — session leader exit or orphaned process cleanup)

Same shape as NVBug 5687364 ("PhysX Fabric Manager SIGSEGV crash during multi-GPU distributed training on Kit 109", Gerald Gainant: "corrupted cuda context (crashed device) detected by cubric lib but occurring upstream in physics or fabric"). Tracked at #3475.

2. Fix

Kelly Guo's documented workaround from #omni-kit thread (Slack, 2024-2025):

"we set --/renderer/multiGpu/enabled to False if training on multi-GPUs. by default we have renderer.multiGpu.enabled = true and renderer.multiGpu.autoEnable = true in the app files, but renderer/multiGpu/enabled is overridden with distributed training, where we then also set /renderer/activeGpu and /physics/cudaDevice to the same device ID. For rendering app files, we also have renderer.multiGpu.maxGpuCount=1"

Implementation: AppLauncher checks ISAACLAB_PIN_KIT_GPU. When truthy (1, true, yes, etc.), three flags are appended to sys.argv before SimulationApp boots:

if os.environ.get("ISAACLAB_PIN_KIT_GPU", "0").lower() not in {"", "0", "false", "no", "off"}:
    sys.argv.append("--/renderer/multiGpu/enabled=False")
    sys.argv.append("--/renderer/multiGpu/autoEnable=False")
    sys.argv.append("--/renderer/multiGpu/maxGpuCount=1")

Off by default. Existing single-GPU CI and user code that wants multi-GPU rendering (rare for IsaacLab use cases) are untouched.

3. Validation

3.1 Local

Local Blackwell (RTX PRO 6000) cannot reproduce the upstream hang at the >=3 concurrent Kit threshold due to MIG topology limitations: only 3 torch-visible cuda devices (cuda:0/cuda:1/cuda:2) are reachable inside one container, not the 4 needed to match CI's cuda:1/2/3 shard layout. Earlier audit's "3-MIG cross-GPU FAILS identically to CI" was on prior L40 hardware.

Smoke test confirmed the env var parsing and Kit cmdline injection work correctly: ISAACLAB_PIN_KIT_GPU={1,true,yes,TRUE} → pin=True; ={"", 0, false, no, off} → pin=False.

3.2 CI A/B (the actual validation)

Pending: cherry-pick this PR onto #5875's branch, add -e ISAACLAB_PIN_KIT_GPU=1 to the per-shard docker run, temporarily drop one MULTI_GPU_SKIP_REASON marker, and observe whether the 52s shutdown hang + SIGHUP cascade clears.

4. Non-scope

This PR does NOT modify the default .kit file; the change is opt-in only. Users running standalone Kit with multi-GPU rendering are unaffected.
This PR does NOT fix the upstream Kit cubric/fabric race itself — it works around it by limiting each Kit's GPU enumeration. The upstream fix (Isaac Sim MR !8539, Kelly Guo, May 2026: "should fix our hanging issue") is tracked separately.
Complementary to #5886 ([MGPU] App: bounded shutdown): [MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang #5886 bounds how long the hang can stall teardown via ISAACLAB_FORCE_EXIT_TIMEOUT; this PR addresses the precondition that triggers the hang in the first place.

The default ``apps/isaaclab.python.headless.kit`` sets ``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable = true``, so each Kit process enumerates every visible GPU at startup. Under concurrent multi-GPU CI shards (``--gpus all`` per container, one Kit per non-default cuda device), that produces a shared cubric / PhysX-fabric GPU-interop context across sibling processes -- surfacing as ``[Error] [omni.physx.plugin] Stage X already attached`` mid-test and ``SimulationApp.close`` hanging >52s in teardown. Tracked upstream at IsaacLab isaac-sim#3475 / NVBug 5687364. Kelly Guo's documented WAR (#omni-kit thread, 2024-2025): set ``renderer.multiGpu.enabled = false`` + ``maxGpuCount = 1`` so each Kit only touches its assigned GPU. Adds opt-in ``ISAACLAB_PIN_KIT_GPU`` env var. When truthy, AppLauncher appends three flags to the Kit command line: - ``--/renderer/multiGpu/enabled=False`` - ``--/renderer/multiGpu/autoEnable=False`` - ``--/renderer/multiGpu/maxGpuCount=1`` Off by default; single-GPU and user-facing rendering paths are unchanged. CI workflows that need bounded resource visibility set ``ISAACLAB_PIN_KIT_GPU=1`` on the runner. Local validation: Blackwell hardware (current Horde) does not reproduce the upstream hang due to MIG topology limitations (only 3 torch-visible cuda devices), so the change is shipped as a CI A/B hypothesis test rather than a verified fix. The implementation is small, opt-in, and reversible.

isaaclab-review-bot

🤖 IsaacLab Review Bot — PR #5933

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU

✅ Overall Assessment: LGTM (minor suggestions)

This is a clean, well-scoped, opt-in workaround for a documented upstream Kit bug. The implementation is minimal, reversible, and placed correctly.

📋 Summary of Changes

File	Change
`source/isaaclab/isaaclab/app/app_launcher.py`	+19 lines: env var check + 3 Kit CLI flags + log message
`source/isaaclab/changelog.d/jichuanh-mgpu-pin-kit-resources.rst`	+14 lines: changelog fragment

👍 Strengths

Opt-in by default — Zero risk to existing single-GPU or user-facing paths. The env var must be explicitly set.
Correct placement — The pinning logic sits in _resolve_device_settings() right after active_gpu/physics_gpu assignment, which is the logical place for renderer GPU configuration.
Robust truthiness parsing — The not in {"", "0", "false", "no", "off"} pattern after .lower() covers common shell conventions.
Good logging — logger.info(...) emits a clear signal when the pin is active.
Thorough PR description — Excellent traceability to upstream bugs (NVBug 5687364, #3475) and documented WAR source (Kelly Guo).
Changelog present — Properly formatted RST fragment.

💡 Minor Suggestions (non-blocking)

Idempotency guard: If a user already passes --/renderer/multiGpu/enabled=False manually via CLI, this would append a duplicate. Consider checking sys.argv for existing overrides before appending:
```
if not any("--/renderer/multiGpu/enabled" in arg for arg in sys.argv):
    sys.argv.append("--/renderer/multiGpu/enabled=False")
```
(Low priority — Kit likely handles duplicates gracefully by last-wins, but explicit > implicit.)
Consider logging the trigger source: The log message could include the env var value for debugging:
```
logger.info("ISAACLAB_PIN_KIT_GPU=%s: pinning Kit renderer to a single GPU", raw_val)
```
Test coverage: The PR description mentions smoke testing locally, but there's no unit test for the env var parsing logic. A small parametrized test (mocking os.environ and checking sys.argv after _resolve_device_settings) would be valuable for regression, especially since this interacts with CI infrastructure.

🔍 Potential Concerns

CI validation pending — The PR body notes that the actual A/B test against the multi-GPU hang is still pending (cherry-pick onto #5875). This is appropriate for merge given it's opt-in, but the CI experiment should be tracked to confirm efficacy.

CI Status

✅ pre-commit, labeler, changelog check, build wheel: pass
⏳ Installation tests, Docker builds, docs: pending (expected for fresh push)

Verdict: Ship it. Clean opt-in workaround with proper documentation. The minor suggestions above are non-blocking improvements that could be addressed in a follow-up.

Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).

github-actions Bot added the isaac-lab Related to Isaac Lab team label Jun 3, 2026

isaaclab-review-bot Bot reviewed Jun 3, 2026

View reviewed changes

This was referenced Jun 3, 2026

[DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin #5934

Draft

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang #5886

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/mgpu-pin-kit-resources

hujc7 commented Jun 3, 2026

Uh oh!

isaaclab-review-bot Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hujc7 commented Jun 3, 2026

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU

Summary

1. Problem

2. Fix

3. Validation

3.1 Local

3.2 CI A/B (the actual validation)

4. Non-scope

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 IsaacLab Review Bot — PR #5933

✅ Overall Assessment: LGTM (minor suggestions)

📋 Summary of Changes

👍 Strengths

💡 Minor Suggestions (non-blocking)

🔍 Potential Concerns

CI Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant