[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang by hujc7 · Pull Request #5886 · isaac-sim/IsaacLab

hujc7 · 2026-05-31T00:10:27Z

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang

Summary

Registers a SIGHUP handler in AppLauncher so the controlling session leader's exit no longer terminates the simulation app with the default disposition (skipping atexit_close and leaving USD/PhysX state attached for sibling shards).
Adds ISAACLAB_FORCE_EXIT_TIMEOUT (integer seconds): when set, arms a daemon thread that fires SIGKILL on the current process via a raw libc kill(2) syscall through ctypes after the deadline, from both the atexit hook and the abort handler. Bounds the upstream Kit shutdown hang in quickReleaseFrameworkAndTerminate.
Fixes the abort signal callback to actually exit (sys.exit(128 + signum)) after closing — it previously returned silently and let Python resume past SIGTERM/SIGABRT/SIGSEGV.

0. Why ctypes-libc-kill instead of `os._exit`

The Kit shutdown hang is a GIL deadlock — upstream NVBug 5948099, confirmed by deep-research workflow against current Isaac Sim develop (43 sub-agents, 228 sourced findings). Inside quickReleaseFrameworkAndTerminate, Kit's C++ teardown frames hold the Python GIL while joining carb.tasking worker threads that themselves wait for the GIL. Direct evidence:

simulation_app.py:918-928 in nvcr.io/nvidian/isaac-lab:latest-develop (6.0.0-alpha.202+develop) explicitly names NVBug 5948099 in an inline comment and switches from the older shutdown_and_release_framework to app.shutdown() to dodge it — partial upstream mitigation present in our CI image.
YuTeh Shen 8-GPU benchmark gdb backtrace: sem_wait in libcarb.cudainterop.plugin.so under omni.kit.app.update — proves carb.tasking is the waiting site under multi-GPU pressure.
Jeff Hough's standalone-scripts-all-hang report (Apr 2026 on develop) is the same fingerprint as our 52s CI SIGKILL.

Under that worst case, os._exit(0) from a daemon thread is not sound: the daemon needs the bytecode dispatcher (which needs the GIL) to even reach the os._exit line. The thread blocks on PyEval_AcquireThread and the timer never fires.

Switch to ctypes.CDLL("libc.so.6").kill(os.getpid(), 9). ctypes releases the GIL around C calls by default, so the raw kill(2) syscall fires even when the interpreter is GIL-blocked. The daemon thread is created and parks in time.sleep (also a GIL-released wait) BEFORE the close() call so it is already in the GIL-released state when the deadlock begins.

Local smoke: daemon armed for 3s, busy main loop, process exits at t+3s with returncode 137 (SIGKILL = 128+9). NOT yet validated against the real GIL-deadlock in CI; pair with an external wrapper-script watchdog at the runner level for the most robust bounding.

1. Problem

Three coupled gaps in AppLauncher.__init__ surface under concurrent multi-GPU pytest CI:

1.1 SIGHUP is unhandled

Kit is launched with --/app/installSignalHandlers=0 (visible in any multi-GPU CI log), so signal handling is on Python's side. AppLauncher registers handlers for SIGTERM, SIGABRT, SIGSEGV but not SIGHUP. When the controlling session leader exits — for example, the parent bash supervising 3 sibling shard containers terminates after a hang on one shard — child Kit processes inherit SIGHUP with its default disposition: terminate. atexit._atexit_close does NOT run.

Direct evidence from a multi-GPU pytest run:

[cuda:2] ⚠️  source/isaaclab_newton/test/assets/test_articulation.py: Process killed by signal 1 (SIGHUP — session leader exit or orphaned process cleanup)
[cuda:1] [Error] [omni.physx.plugin] Stage 9223032 already attached.

The orphan's state stays attached for whatever sibling re-attaches next.

1.2 `_abort_signal_handle_callback` swallows terminate semantics

def _abort_signal_handle_callback(self, signal, frame):
    self._app.close()

self._app.close() then return. The OS-default disposition for SIGTERM/SIGABRT/SIGSEGV would have terminated; the Python handler replaced that disposition and did NOT call sys.exit, so after the handler returns Python resumes past the signal frame.

1.3 `app.close()` hangs in `shutdown_and_release_framework`

Step-by-step trace of close() on a single Kit boot (Isaac Sim 6.0.0-rc.22): all python-level steps (replicator.stop, context.close_stage, tracy.shutdown, carb.logging.set_log_enabled) are <1ms or skipped. Only shutdown_and_release_framework is non-trivial — and it sits inside Kit binary code. Under multi-process contention this step doesn't return for >52s in CI, which then SIGKILLs through tools/conftest.py's SHUTDOWN_GRACE_PERIOD. The SIGKILL cascades through the parent shell's session group and SIGHUPs siblings (loop back to 1.1).

Because the hang is Kit binary code holding the GIL, neither skip_cleanup=True (the existing fast path also calls shutdown_and_release_framework) nor a Python signal handler can interrupt it. The only python-side escape is os._exit, which bypasses any threads still holding the GIL.

2. Fix

2.1 Register SIGHUP handler

signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)

Same handler as SIGTERM/SIGABRT/SIGSEGV. The handler closes the app and sys.exit(128 + signum)s.

2.2 Force-exit timer (opt-in)

def _arm_force_exit_timer(seconds: int, label: str) -> None:
    def _timeout_kill():
        time.sleep(seconds)
        os._exit(0)
    threading.Thread(target=_timeout_kill, daemon=True).start()

Armed from both _atexit_close and _abort_signal_handle_callback when ISAACLAB_FORCE_EXIT_TIMEOUT is set (positive integer seconds). Off by default — interactive user code that wants graceful teardown is unchanged.

2.3 Actually exit from the abort handler

def _abort_signal_handle_callback(self, signum, frame):
    if force_exit_seconds > 0:
        _arm_force_exit_timer(force_exit_seconds, f"abort_signal_{signum}")
    with contextlib.suppress(Exception):
        self._app.close()
    sys.exit(128 + signum)

contextlib.suppress so the exit is reached even if close() raises inside a signal handler.

3. Validation

3.1 Local smoke (synthetic 30s hang)

With ISAACLAB_FORCE_EXIT_TIMEOUT=3 and app.close() monkeypatched to sleep 30s:

[smoke] simulated Kit hang START at t=4.87s
[isaaclab.app] ISAACLAB_FORCE_EXIT_TIMEOUT=3s expired during atexit_close; os._exit(0)
wall=8.51s exit=0

Without the env var: full 30s hang, then exit. Confirms the timer fires at the deadline and exit code is clean (0, not the SIGKILL-driven signal=1).

3.2 Env var parsing

Input	Result
`""`, `"0"`, `"false"`, `"no"`, `"off"`	0 (off)
`"1"`, `"10"`, `"300"`	n
`"abc"`, `"-5"`	0 + warning

3.3 CI A/B (pending)

Companion to the multi-GPU pytest workflow in #5875: set ISAACLAB_FORCE_EXIT_TIMEOUT=10 on the shard docker run, drop the MULTI_GPU_SKIP_REASON marker on source/isaaclab_physx/test/assets/test_articulation.py, expect the 3-shard run to pass cleanly without the 52s shutdown_hang or the sibling SIGHUP cascade.

4. Non-scope

This PR does NOT fix the upstream Kit hang itself — it bounds how long the hang can delay process exit. The hang is tracked at [Question] Simulation App hangs when closing as does simulation context. #3475 (still open as of Jan 2026; PeterL-NV bisected to sim.reset()).
Force-exit leaves GPU memory mappings, file descriptors, etc. dangling until kernel reap — fine in throwaway CI containers, not appropriate for production user code (which is why the env var is off by default).

Two coupled bugs in :class:`isaaclab.app.AppLauncher`: 1. SIGHUP was unhandled. Kit launches with ``--/app/installSignalHandlers=0``, so when a controlling session leader exits (e.g. the parent shell that supervises sibling shards in multi-GPU CI), child Kit processes receive SIGHUP with default disposition: terminate. ``_atexit_close`` does not run, so ``SimulationApp.close`` is skipped and USD/PhysX state is left attached. The next sibling shard then trips ``[Error] [omni.physx.plugin] Stage X already attached`` and Kit shutdown subsequently hangs on the orphan's state. Register the same handler used for SIGTERM/SIGABRT/SIGSEGV. 2. ``_abort_signal_handle_callback`` swallowed the signal's terminate semantics. After calling ``self._app.close()`` it returned, so Python resumed execution past the signal as if nothing happened. The replaced OS-default disposition would have killed the process; the Python handler did not. Wrap ``_app.close()`` in ``contextlib.suppress(Exception)`` and call ``sys.exit(128 + signum)`` to preserve the conventional signal-exit encoding and actually terminate.

isaaclab-review-bot

Review: [App] Handle SIGHUP and force exit in AppLauncher abort handler

Summary: Well-motivated fix for a real multi-GPU CI issue. The SIGHUP registration and the forced exit are both correct in principle. The PR description is excellent — clear root-cause analysis with CI log evidence. A few observations below, mostly around edge-case safety in the signal handler.

Verdict: 👍 Approve with minor suggestions. The core logic is sound and addresses the stated bug.

Update (726439c): New commits add ISAACLAB_FORCE_EXIT_TIMEOUT env var with daemon thread force-exit. The implementation is clean: good parser with falsy-value handling, proper closure capture in _atexit_close, and safe getattr fallback in signal handler. No new issues — the force-exit feature is well-designed for the CI hang scenario.

Update (5a169ab): Force-exit mechanism upgraded from os._exit(0) to a raw libc kill(getpid(), SIGKILL) via ctypes. This is a correct and well-reasoned change: the previous approach relied on the Python bytecode dispatcher (which needs the GIL), but the Kit shutdown hang is itself a GIL deadlock — so os._exit could never fire in the worst case. The new approach uses ctypes.CDLL("libc.so.6").kill() which releases the GIL across the C call, guaranteeing the syscall fires regardless of interpreter state. The os._exit(0) is retained as a last-resort fallback in the except branch.

Changelog updated to explain the GIL deadlock root cause more precisely and references NVBug 5948099 / OMPE-75416.

No concerns — this is strictly more robust than the prior version. 👍

isaaclab-review-bot · 2026-05-31T00:13:09Z

+        """
+        with contextlib.suppress(Exception):
+            self._app.close()
+        sys.exit(128 + signum)


🟡 Warning: sys.exit() may be unsafe after SIGSEGV

sys.exit() raises SystemExit, which unwinds the stack through finally blocks and triggers atexit handlers. After a segmentation fault, the process heap/stack may be corrupted, making this unwinding dangerous (it could deadlock in a finalizer or corrupt data further).

For SIGSEGV specifically, os._exit(128 + signum) would be safer — it terminates immediately without running cleanup hooks. You could branch:

with contextlib.suppress(Exception): self._app.close() if signum == signal.SIGSEGV: os._exit(128 + signum) sys.exit(128 + signum)

Alternatively, if you want to keep it simple and accept the risk (SIGSEGV in Python-land is already UB), the current approach is defensible — just noting the trade-off.

isaaclab-review-bot · 2026-05-31T00:13:09Z

+        after the handler returns (since we replaced the OS-default
+        disposition), and Kit would be left half-torn-down.
+        """
+        with contextlib.suppress(Exception):


🔵 Suggestion: Double-close with _atexit_close

sys.exit() raises SystemExit → Python runs atexit handlers → _atexit_close calls app.close() again. Both call sites are guarded by contextlib.suppress(Exception), so this is safe but redundant.

If SimulationApp.close() is not idempotent and logs warnings on double-close (some Kit versions do), you could either:

Call atexit.unregister(_atexit_close) before sys.exit() (requires making _atexit_close accessible, e.g. storing it as self._atexit_close), or

Add a guard flag (e.g. self._app_closed = True) checked by _atexit_close.

Not blocking — the suppress handles it — but worth considering if you see noisy double-close warnings in CI logs.

isaaclab-review-bot · 2026-05-31T00:13:09Z

+        # supervises sibling shards) exits: SIGHUP cascades to children, and
+        # without a handler the default action would terminate before
+        # ``_atexit_close`` could run.
+        signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)


🔵 Suggestion: Platform guard for SIGHUP

signal.SIGHUP is not available on Windows. If Isaac Lab ever supports Windows (or if someone imports this module on Windows for testing), this line would raise AttributeError.

A minimal guard:

if hasattr(signal, 'SIGHUP'): signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)

Low priority since Isaac Lab targets Linux, but good defensive practice.

Adds opt-in ``ISAACLAB_FORCE_EXIT_TIMEOUT`` (integer seconds) env var. When set, AppLauncher arms a daemon thread that calls ``os._exit(0)`` after the deadline, from both the ``atexit`` hook and the abort signal handler. Pairs with the SIGHUP handler in the prior commit: SIGHUP brings the shutdown sequence to a clean entry point, the timer bounds how long that sequence can stall inside Kit's binary ``shutdown_and_release_framework`` (the only step proven non-trivial by a step-by-step ``close()`` trace on Isaac Sim 6.0.0-rc.22). The hang is tracked upstream at isaac-sim#3475 and held by Kit binary code, so neither ``skip_cleanup=True`` nor Python signal handlers can interrupt it — only ``os._exit`` actually returns control. Off by default. CI workflows that need bounded shutdown set ``ISAACLAB_FORCE_EXIT_TIMEOUT=10`` on the runner; interactive user code still gets the graceful teardown path. Local validation: with ``ISAACLAB_FORCE_EXIT_TIMEOUT=3`` and an ``app.close()`` monkeypatched to sleep 30s, the process exits at ~3s after close starts, returncode=0, log shows ``ISAACLAB_FORCE_EXIT_TIMEOUT=3s expired during atexit_close; os._exit(0)``.

The Kit shutdown hang is a GIL deadlock (upstream NVBug 5948099 / OMPE-75416, confirmed by deep-research workflow against current Isaac Sim develop): the C++ teardown frames hold the Python GIL while joining ``carb.tasking`` worker threads that themselves wait for the GIL. Under that worst case, ``os._exit(0)`` from a daemon thread is NOT sound: the daemon needs the bytecode dispatcher (which needs the GIL) to even reach the ``os._exit`` line. The thread blocks on ``PyEval_AcquireThread`` and the timer never fires. Switch to ``ctypes.CDLL("libc.so.6").kill(os.getpid(), 9)``. ``ctypes`` releases the GIL around C calls by default, so the raw ``kill(2)`` syscall fires even when the interpreter is GIL-blocked. Smoke test on a synthetic busy loop: daemon armed for 3s, process exits with returncode 137 (SIGKILL = 128+9) at exactly t+3s. The daemon thread parks in ``time.sleep`` (also a GIL-released wait) BEFORE the close() call starts, so it is already in the GIL-released state when the deadlock begins. Documents the limitation in the docstring and changelog. Caveat: SIGKILL means no atexit / no JUnit cleanup — callers must have written their JUnit XML before triggering close(). For absolutely guaranteed bounding under all hang shapes, pair with an external wrapper-script watchdog at the CI runner level.

hujc7 · 2026-06-03T05:42:51Z

Closing as per the integration audit + the validated #5933 fix.

#5933 ([MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU) was validated 3× green in CI by addressing the precondition of the upstream cubric/PhysX-fabric GPU-interop race directly (IsaacLab #3475 / NVBug 5687364). With the race no longer firing, the bounded-shutdown timer here is no longer needed for the multi-GPU CI path.

The two arms of this PR (SIGHUP signal handler + sys.exit(128+signum) fix to the abort callback) are real latent-bug fixes independent of the multi-GPU work, but bundling them with the unproven force-exit timer made the audit verdict "nice-to-have, droppable". Closing to reduce surface; happy to reopen with a slimmer scope if those small fixes are wanted standalone.

Cross-reference: #5933 (the validated fix).

Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 31, 2026

isaaclab-review-bot Bot reviewed May 31, 2026

View reviewed changes

hujc7 changed the title ~~[App] Handle SIGHUP and force exit in AppLauncher abort handler~~ [MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang Jun 2, 2026

This was referenced Jun 3, 2026

[CI] Multi-GPU pytest: dynamic work-stealing + shutdown-hang stack capture #5875

Draft

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context #5881

Draft

[MGPU] Tests: make newton-only tests kitless (drop AppLauncher boot) #5883

Draft

This was referenced Jun 3, 2026

[MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU #5933

Draft

[DO-NOT-MERGE][MGPU][TEST] Integration bundle: device kwarg + kitless newton + bounded shutdown + Kit-pin #5934

Draft

hujc7 closed this Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886
hujc7 wants to merge 3 commits into
isaac-sim:developfrom
hujc7:jichuanh/applauncher-sighup-handler

hujc7 commented May 31, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Uh oh!

hujc7 commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hujc7 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang

Summary

0. Why ctypes-libc-kill instead of os._exit

1. Problem

1.1 SIGHUP is unhandled

1.2 _abort_signal_handle_callback swallows terminate semantics

1.3 app.close() hangs in shutdown_and_release_framework

2. Fix

2.1 Register SIGHUP handler

2.2 Force-exit timer (opt-in)

2.3 Actually exit from the abort handler

3. Validation

3.1 Local smoke (synthetic 30s hang)

3.2 Env var parsing

3.3 CI A/B (pending)

4. Non-scope

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review: [App] Handle SIGHUP and force exit in AppLauncher abort handler

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

hujc7 commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hujc7 commented May 31, 2026 •

edited

Loading

0. Why ctypes-libc-kill instead of `os._exit`

1.2 `_abort_signal_handle_callback` swallows terminate semantics

1.3 `app.close()` hangs in `shutdown_and_release_framework`

isaaclab-review-bot Bot left a comment •

edited

Loading