Skip to content

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933

Draft
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/mgpu-pin-kit-resources
Draft

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU#5933
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/mgpu-pin-kit-resources

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented Jun 3, 2026

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU

Summary

  • Adds opt-in ISAACLAB_PIN_KIT_GPU env var. When truthy, AppLauncher appends --/renderer/multiGpu/enabled=False, --/renderer/multiGpu/autoEnable=False, --/renderer/multiGpu/maxGpuCount=1 to the Kit command line, pinning each Kit process to a single GPU.
  • Addresses the documented upstream cubric / PhysX-fabric GPU-interop race that surfaces in concurrent multi-GPU CI as [Error] [omni.physx.plugin] Stage X already attached mid-test and SimulationApp.close hanging >52s in teardown.
  • Off by default; single-GPU and user-facing rendering paths are unchanged.

1. Problem

The default apps/isaaclab.python.headless.kit sets:

renderer.multiGpu.enabled = true
renderer.multiGpu.autoEnable = true

So every Kit process enumerates every visible GPU at startup. AppLauncher pins active_gpu + physics_gpu to the requested device, but renderer.multiGpu.enabled stays true.

Under multi-GPU CI shards (--gpus all per container, one Kit per non-default cuda device), three sibling Kit processes each scan all 4 visible GPUs. cubric's GPU-interop state and PhysX-fabric attach paths then race across processes on the same host. Direct evidence from CI run 26698100037:

[cuda:1] [Error] [omni.physx.plugin] Stage 9223032 already attached.
[cuda:1] [Error] [omni.physx.plugin] Stage 9223035 already attached.
[cuda:1] source/isaaclab_physx/test/assets/test_articulation.py: shutdown hanged (killed after 52s, test had completed)
[cuda:2] source/isaaclab_newton/test/assets/test_articulation.py: Process killed by signal 1 (SIGHUP — session leader exit or orphaned process cleanup)

Same shape as NVBug 5687364 ("PhysX Fabric Manager SIGSEGV crash during multi-GPU distributed training on Kit 109", Gerald Gainant: "corrupted cuda context (crashed device) detected by cubric lib but occurring upstream in physics or fabric"). Tracked at #3475.

2. Fix

Kelly Guo's documented workaround from #omni-kit thread (Slack, 2024-2025):

"we set --/renderer/multiGpu/enabled to False if training on multi-GPUs. by default we have renderer.multiGpu.enabled = true and renderer.multiGpu.autoEnable = true in the app files, but renderer/multiGpu/enabled is overridden with distributed training, where we then also set /renderer/activeGpu and /physics/cudaDevice to the same device ID. For rendering app files, we also have renderer.multiGpu.maxGpuCount=1"

Implementation: AppLauncher checks ISAACLAB_PIN_KIT_GPU. When truthy (1, true, yes, etc.), three flags are appended to sys.argv before SimulationApp boots:

if os.environ.get("ISAACLAB_PIN_KIT_GPU", "0").lower() not in {"", "0", "false", "no", "off"}:
    sys.argv.append("--/renderer/multiGpu/enabled=False")
    sys.argv.append("--/renderer/multiGpu/autoEnable=False")
    sys.argv.append("--/renderer/multiGpu/maxGpuCount=1")

Off by default. Existing single-GPU CI and user code that wants multi-GPU rendering (rare for IsaacLab use cases) are untouched.

3. Validation

3.1 Local

Local Blackwell (RTX PRO 6000) cannot reproduce the upstream hang at the >=3 concurrent Kit threshold due to MIG topology limitations: only 3 torch-visible cuda devices (cuda:0/cuda:1/cuda:2) are reachable inside one container, not the 4 needed to match CI's cuda:1/2/3 shard layout. Earlier audit's "3-MIG cross-GPU FAILS identically to CI" was on prior L40 hardware.

Smoke test confirmed the env var parsing and Kit cmdline injection work correctly: ISAACLAB_PIN_KIT_GPU={1,true,yes,TRUE} → pin=True; ={"", 0, false, no, off} → pin=False.

3.2 CI A/B (the actual validation)

Pending: cherry-pick this PR onto #5875's branch, add -e ISAACLAB_PIN_KIT_GPU=1 to the per-shard docker run, temporarily drop one MULTI_GPU_SKIP_REASON marker, and observe whether the 52s shutdown hang + SIGHUP cascade clears.

4. Non-scope

  • This PR does NOT modify the default .kit file; the change is opt-in only. Users running standalone Kit with multi-GPU rendering are unaffected.
  • This PR does NOT fix the upstream Kit cubric/fabric race itself — it works around it by limiting each Kit's GPU enumeration. The upstream fix (Isaac Sim MR !8539, Kelly Guo, May 2026: "should fix our hanging issue") is tracked separately.
  • Complementary to #5886 ([MGPU] App: bounded shutdown): [MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang #5886 bounds how long the hang can stall teardown via ISAACLAB_FORCE_EXIT_TIMEOUT; this PR addresses the precondition that triggers the hang in the first place.

The default ``apps/isaaclab.python.headless.kit`` sets
``renderer.multiGpu.enabled = true`` + ``renderer.multiGpu.autoEnable
= true``, so each Kit process enumerates every visible GPU at startup.
Under concurrent multi-GPU CI shards (``--gpus all`` per container, one
Kit per non-default cuda device), that produces a shared cubric /
PhysX-fabric GPU-interop context across sibling processes -- surfacing
as ``[Error] [omni.physx.plugin] Stage X already attached`` mid-test
and ``SimulationApp.close`` hanging >52s in teardown.

Tracked upstream at IsaacLab isaac-sim#3475 / NVBug 5687364. Kelly Guo's
documented WAR (#omni-kit thread, 2024-2025): set
``renderer.multiGpu.enabled = false`` + ``maxGpuCount = 1`` so each
Kit only touches its assigned GPU.

Adds opt-in ``ISAACLAB_PIN_KIT_GPU`` env var. When truthy, AppLauncher
appends three flags to the Kit command line:
- ``--/renderer/multiGpu/enabled=False``
- ``--/renderer/multiGpu/autoEnable=False``
- ``--/renderer/multiGpu/maxGpuCount=1``

Off by default; single-GPU and user-facing rendering paths are
unchanged. CI workflows that need bounded resource visibility set
``ISAACLAB_PIN_KIT_GPU=1`` on the runner.

Local validation: Blackwell hardware (current Horde) does not
reproduce the upstream hang due to MIG topology limitations (only 3
torch-visible cuda devices), so the change is shipped as a CI A/B
hypothesis test rather than a verified fix. The implementation is
small, opt-in, and reversible.
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label Jun 3, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 IsaacLab Review Bot — PR #5933

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU

✅ Overall Assessment: LGTM (minor suggestions)

This is a clean, well-scoped, opt-in workaround for a documented upstream Kit bug. The implementation is minimal, reversible, and placed correctly.


📋 Summary of Changes

File Change
source/isaaclab/isaaclab/app/app_launcher.py +19 lines: env var check + 3 Kit CLI flags + log message
source/isaaclab/changelog.d/jichuanh-mgpu-pin-kit-resources.rst +14 lines: changelog fragment

👍 Strengths

  1. Opt-in by default — Zero risk to existing single-GPU or user-facing paths. The env var must be explicitly set.
  2. Correct placement — The pinning logic sits in _resolve_device_settings() right after active_gpu/physics_gpu assignment, which is the logical place for renderer GPU configuration.
  3. Robust truthiness parsing — The not in {"", "0", "false", "no", "off"} pattern after .lower() covers common shell conventions.
  4. Good logginglogger.info(...) emits a clear signal when the pin is active.
  5. Thorough PR description — Excellent traceability to upstream bugs (NVBug 5687364, #3475) and documented WAR source (Kelly Guo).
  6. Changelog present — Properly formatted RST fragment.

💡 Minor Suggestions (non-blocking)

  1. Idempotency guard: If a user already passes --/renderer/multiGpu/enabled=False manually via CLI, this would append a duplicate. Consider checking sys.argv for existing overrides before appending:

    if not any("--/renderer/multiGpu/enabled" in arg for arg in sys.argv):
        sys.argv.append("--/renderer/multiGpu/enabled=False")

    (Low priority — Kit likely handles duplicates gracefully by last-wins, but explicit > implicit.)

  2. Consider logging the trigger source: The log message could include the env var value for debugging:

    logger.info("ISAACLAB_PIN_KIT_GPU=%s: pinning Kit renderer to a single GPU", raw_val)
  3. Test coverage: The PR description mentions smoke testing locally, but there's no unit test for the env var parsing logic. A small parametrized test (mocking os.environ and checking sys.argv after _resolve_device_settings) would be valuable for regression, especially since this interacts with CI infrastructure.


🔍 Potential Concerns

  • CI validation pending — The PR body notes that the actual A/B test against the multi-GPU hang is still pending (cherry-pick onto #5875). This is appropriate for merge given it's opt-in, but the CI experiment should be tracked to confirm efficacy.

CI Status

  • ✅ pre-commit, labeler, changelog check, build wheel: pass
  • ⏳ Installation tests, Docker builds, docs: pending (expected for fresh push)

Verdict: Ship it. Clean opt-in workaround with proper documentation. The minor suggestions above are non-blocking improvements that could be addressed in a follow-up.

hujc7 added a commit to hujc7/IsaacLab that referenced this pull request Jun 3, 2026
Per per-PR minimum-needed analysis:
- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have;
  isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot).
  Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in
  AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of
  this diagnostic bundle to test whether isaac-sim#5933 alone is enough for
  newton test_articulation (which calls
  build_simulation_context(sim_cfg=, device=) at line 2427, so still
  needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton
  test_articulation kitless conversion and the schemas.py
  _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit
XML path-collision fix in conftest. If green, confirms only 4 PRs
are needed for multi-GPU CI green (with test_articulation un-gated).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant