[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886
[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886hujc7 wants to merge 3 commits into
Conversation
Two coupled bugs in :class:`isaaclab.app.AppLauncher`: 1. SIGHUP was unhandled. Kit launches with ``--/app/installSignalHandlers=0``, so when a controlling session leader exits (e.g. the parent shell that supervises sibling shards in multi-GPU CI), child Kit processes receive SIGHUP with default disposition: terminate. ``_atexit_close`` does not run, so ``SimulationApp.close`` is skipped and USD/PhysX state is left attached. The next sibling shard then trips ``[Error] [omni.physx.plugin] Stage X already attached`` and Kit shutdown subsequently hangs on the orphan's state. Register the same handler used for SIGTERM/SIGABRT/SIGSEGV. 2. ``_abort_signal_handle_callback`` swallowed the signal's terminate semantics. After calling ``self._app.close()`` it returned, so Python resumed execution past the signal as if nothing happened. The replaced OS-default disposition would have killed the process; the Python handler did not. Wrap ``_app.close()`` in ``contextlib.suppress(Exception)`` and call ``sys.exit(128 + signum)`` to preserve the conventional signal-exit encoding and actually terminate.
There was a problem hiding this comment.
Review: [App] Handle SIGHUP and force exit in AppLauncher abort handler
Summary: Well-motivated fix for a real multi-GPU CI issue. The SIGHUP registration and the forced exit are both correct in principle. The PR description is excellent — clear root-cause analysis with CI log evidence. A few observations below, mostly around edge-case safety in the signal handler.
Verdict: 👍 Approve with minor suggestions. The core logic is sound and addresses the stated bug.
Update (726439c): New commits add ISAACLAB_FORCE_EXIT_TIMEOUT env var with daemon thread force-exit. The implementation is clean: good parser with falsy-value handling, proper closure capture in _atexit_close, and safe getattr fallback in signal handler. No new issues — the force-exit feature is well-designed for the CI hang scenario.
Update (5a169ab): Force-exit mechanism upgraded from os._exit(0) to a raw libc kill(getpid(), SIGKILL) via ctypes. This is a correct and well-reasoned change: the previous approach relied on the Python bytecode dispatcher (which needs the GIL), but the Kit shutdown hang is itself a GIL deadlock — so os._exit could never fire in the worst case. The new approach uses ctypes.CDLL("libc.so.6").kill() which releases the GIL across the C call, guaranteeing the syscall fires regardless of interpreter state. The os._exit(0) is retained as a last-resort fallback in the except branch.
Changelog updated to explain the GIL deadlock root cause more precisely and references NVBug 5948099 / OMPE-75416.
No concerns — this is strictly more robust than the prior version. 👍
| """ | ||
| with contextlib.suppress(Exception): | ||
| self._app.close() | ||
| sys.exit(128 + signum) |
There was a problem hiding this comment.
🟡 Warning: sys.exit() may be unsafe after SIGSEGV
sys.exit() raises SystemExit, which unwinds the stack through finally blocks and triggers atexit handlers. After a segmentation fault, the process heap/stack may be corrupted, making this unwinding dangerous (it could deadlock in a finalizer or corrupt data further).
For SIGSEGV specifically, os._exit(128 + signum) would be safer — it terminates immediately without running cleanup hooks. You could branch:
with contextlib.suppress(Exception):
self._app.close()
if signum == signal.SIGSEGV:
os._exit(128 + signum)
sys.exit(128 + signum)Alternatively, if you want to keep it simple and accept the risk (SIGSEGV in Python-land is already UB), the current approach is defensible — just noting the trade-off.
| after the handler returns (since we replaced the OS-default | ||
| disposition), and Kit would be left half-torn-down. | ||
| """ | ||
| with contextlib.suppress(Exception): |
There was a problem hiding this comment.
🔵 Suggestion: Double-close with _atexit_close
sys.exit() raises SystemExit → Python runs atexit handlers → _atexit_close calls app.close() again. Both call sites are guarded by contextlib.suppress(Exception), so this is safe but redundant.
If SimulationApp.close() is not idempotent and logs warnings on double-close (some Kit versions do), you could either:
- Call
atexit.unregister(_atexit_close)beforesys.exit()(requires making_atexit_closeaccessible, e.g. storing it asself._atexit_close), or - Add a guard flag (e.g.
self._app_closed = True) checked by_atexit_close.
Not blocking — the suppress handles it — but worth considering if you see noisy double-close warnings in CI logs.
| # supervises sibling shards) exits: SIGHUP cascades to children, and | ||
| # without a handler the default action would terminate before | ||
| # ``_atexit_close`` could run. | ||
| signal.signal(signal.SIGHUP, self._abort_signal_handle_callback) |
There was a problem hiding this comment.
🔵 Suggestion: Platform guard for SIGHUP
signal.SIGHUP is not available on Windows. If Isaac Lab ever supports Windows (or if someone imports this module on Windows for testing), this line would raise AttributeError.
A minimal guard:
if hasattr(signal, 'SIGHUP'):
signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)Low priority since Isaac Lab targets Linux, but good defensive practice.
Adds opt-in ``ISAACLAB_FORCE_EXIT_TIMEOUT`` (integer seconds) env var. When set, AppLauncher arms a daemon thread that calls ``os._exit(0)`` after the deadline, from both the ``atexit`` hook and the abort signal handler. Pairs with the SIGHUP handler in the prior commit: SIGHUP brings the shutdown sequence to a clean entry point, the timer bounds how long that sequence can stall inside Kit's binary ``shutdown_and_release_framework`` (the only step proven non-trivial by a step-by-step ``close()`` trace on Isaac Sim 6.0.0-rc.22). The hang is tracked upstream at isaac-sim#3475 and held by Kit binary code, so neither ``skip_cleanup=True`` nor Python signal handlers can interrupt it — only ``os._exit`` actually returns control. Off by default. CI workflows that need bounded shutdown set ``ISAACLAB_FORCE_EXIT_TIMEOUT=10`` on the runner; interactive user code still gets the graceful teardown path. Local validation: with ``ISAACLAB_FORCE_EXIT_TIMEOUT=3`` and an ``app.close()`` monkeypatched to sleep 30s, the process exits at ~3s after close starts, returncode=0, log shows ``ISAACLAB_FORCE_EXIT_TIMEOUT=3s expired during atexit_close; os._exit(0)``.
The Kit shutdown hang is a GIL deadlock (upstream NVBug 5948099 /
OMPE-75416, confirmed by deep-research workflow against current Isaac
Sim develop): the C++ teardown frames hold the Python GIL while
joining ``carb.tasking`` worker threads that themselves wait for the
GIL.
Under that worst case, ``os._exit(0)`` from a daemon thread is NOT
sound: the daemon needs the bytecode dispatcher (which needs the GIL)
to even reach the ``os._exit`` line. The thread blocks on
``PyEval_AcquireThread`` and the timer never fires.
Switch to ``ctypes.CDLL("libc.so.6").kill(os.getpid(), 9)``.
``ctypes`` releases the GIL around C calls by default, so the raw
``kill(2)`` syscall fires even when the interpreter is GIL-blocked.
Smoke test on a synthetic busy loop: daemon armed for 3s, process
exits with returncode 137 (SIGKILL = 128+9) at exactly t+3s. The
daemon thread parks in ``time.sleep`` (also a GIL-released wait)
BEFORE the close() call starts, so it is already in the GIL-released
state when the deadlock begins.
Documents the limitation in the docstring and changelog. Caveat:
SIGKILL means no atexit / no JUnit cleanup — callers must have
written their JUnit XML before triggering close(). For absolutely
guaranteed bounding under all hang shapes, pair with an external
wrapper-script watchdog at the CI runner level.
|
Closing as per the integration audit + the validated #5933 fix. #5933 ( The two arms of this PR (SIGHUP signal handler + Cross-reference: #5933 (the validated fix). |
Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).
[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang
Summary
AppLauncherso the controlling session leader's exit no longer terminates the simulation app with the default disposition (skippingatexit_closeand leaving USD/PhysX state attached for sibling shards).ISAACLAB_FORCE_EXIT_TIMEOUT(integer seconds): when set, arms a daemon thread that fires SIGKILL on the current process via a raw libckill(2)syscall throughctypesafter the deadline, from both the atexit hook and the abort handler. Bounds the upstream Kit shutdown hang inquickReleaseFrameworkAndTerminate.sys.exit(128 + signum)) after closing — it previously returned silently and let Python resume past SIGTERM/SIGABRT/SIGSEGV.0. Why ctypes-libc-kill instead of
os._exitThe Kit shutdown hang is a GIL deadlock — upstream NVBug 5948099, confirmed by deep-research workflow against current Isaac Sim develop (43 sub-agents, 228 sourced findings). Inside
quickReleaseFrameworkAndTerminate, Kit's C++ teardown frames hold the Python GIL while joiningcarb.taskingworker threads that themselves wait for the GIL. Direct evidence:simulation_app.py:918-928innvcr.io/nvidian/isaac-lab:latest-develop(6.0.0-alpha.202+develop) explicitly names NVBug 5948099 in an inline comment and switches from the oldershutdown_and_release_frameworktoapp.shutdown()to dodge it — partial upstream mitigation present in our CI image.sem_waitinlibcarb.cudainterop.plugin.sounderomni.kit.app.update— provescarb.taskingis the waiting site under multi-GPU pressure.Under that worst case,
os._exit(0)from a daemon thread is not sound: the daemon needs the bytecode dispatcher (which needs the GIL) to even reach theos._exitline. The thread blocks onPyEval_AcquireThreadand the timer never fires.Switch to
ctypes.CDLL("libc.so.6").kill(os.getpid(), 9).ctypesreleases the GIL around C calls by default, so the rawkill(2)syscall fires even when the interpreter is GIL-blocked. The daemon thread is created and parks intime.sleep(also a GIL-released wait) BEFORE the close() call so it is already in the GIL-released state when the deadlock begins.Local smoke: daemon armed for 3s, busy main loop, process exits at t+3s with returncode 137 (SIGKILL = 128+9). NOT yet validated against the real GIL-deadlock in CI; pair with an external wrapper-script watchdog at the runner level for the most robust bounding.
1. Problem
Three coupled gaps in
AppLauncher.__init__surface under concurrent multi-GPU pytest CI:1.1 SIGHUP is unhandled
Kit is launched with
--/app/installSignalHandlers=0(visible in any multi-GPU CI log), so signal handling is on Python's side.AppLauncherregisters handlers forSIGTERM,SIGABRT,SIGSEGVbut notSIGHUP. When the controlling session leader exits — for example, the parent bash supervising 3 sibling shard containers terminates after a hang on one shard — child Kit processes inheritSIGHUPwith its default disposition: terminate.atexit._atexit_closedoes NOT run.Direct evidence from a multi-GPU pytest run:
The orphan's state stays attached for whatever sibling re-attaches next.
1.2
_abort_signal_handle_callbackswallows terminate semanticsself._app.close()then return. The OS-default disposition for SIGTERM/SIGABRT/SIGSEGV would have terminated; the Python handler replaced that disposition and did NOT callsys.exit, so after the handler returns Python resumes past the signal frame.1.3
app.close()hangs inshutdown_and_release_frameworkStep-by-step trace of
close()on a single Kit boot (Isaac Sim 6.0.0-rc.22): all python-level steps (replicator.stop,context.close_stage,tracy.shutdown,carb.logging.set_log_enabled) are <1ms or skipped. Onlyshutdown_and_release_frameworkis non-trivial — and it sits inside Kit binary code. Under multi-process contention this step doesn't return for >52s in CI, which then SIGKILLs throughtools/conftest.py'sSHUTDOWN_GRACE_PERIOD. The SIGKILL cascades through the parent shell's session group and SIGHUPs siblings (loop back to 1.1).Because the hang is Kit binary code holding the GIL, neither
skip_cleanup=True(the existing fast path also callsshutdown_and_release_framework) nor a Python signal handler can interrupt it. The only python-side escape isos._exit, which bypasses any threads still holding the GIL.2. Fix
2.1 Register SIGHUP handler
Same handler as
SIGTERM/SIGABRT/SIGSEGV. The handler closes the app andsys.exit(128 + signum)s.2.2 Force-exit timer (opt-in)
Armed from both
_atexit_closeand_abort_signal_handle_callbackwhenISAACLAB_FORCE_EXIT_TIMEOUTis set (positive integer seconds). Off by default — interactive user code that wants graceful teardown is unchanged.2.3 Actually exit from the abort handler
contextlib.suppressso the exit is reached even ifclose()raises inside a signal handler.3. Validation
3.1 Local smoke (synthetic 30s hang)
With
ISAACLAB_FORCE_EXIT_TIMEOUT=3andapp.close()monkeypatched to sleep 30s:Without the env var: full 30s hang, then exit. Confirms the timer fires at the deadline and exit code is clean (0, not the SIGKILL-driven signal=1).
3.2 Env var parsing
"","0","false","no","off""1","10","300""abc","-5"3.3 CI A/B (pending)
Companion to the multi-GPU pytest workflow in #5875: set
ISAACLAB_FORCE_EXIT_TIMEOUT=10on the sharddocker run, drop theMULTI_GPU_SKIP_REASONmarker onsource/isaaclab_physx/test/assets/test_articulation.py, expect the 3-shard run to pass cleanly without the 52sshutdown_hangor the sibling SIGHUP cascade.4. Non-scope
sim.reset()).