Skip to content

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886

Closed
hujc7 wants to merge 3 commits into
isaac-sim:developfrom
hujc7:jichuanh/applauncher-sighup-handler
Closed

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang#5886
hujc7 wants to merge 3 commits into
isaac-sim:developfrom
hujc7:jichuanh/applauncher-sighup-handler

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 31, 2026

[MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang

Summary

  • Registers a SIGHUP handler in AppLauncher so the controlling session leader's exit no longer terminates the simulation app with the default disposition (skipping atexit_close and leaving USD/PhysX state attached for sibling shards).
  • Adds ISAACLAB_FORCE_EXIT_TIMEOUT (integer seconds): when set, arms a daemon thread that fires SIGKILL on the current process via a raw libc kill(2) syscall through ctypes after the deadline, from both the atexit hook and the abort handler. Bounds the upstream Kit shutdown hang in quickReleaseFrameworkAndTerminate.
  • Fixes the abort signal callback to actually exit (sys.exit(128 + signum)) after closing — it previously returned silently and let Python resume past SIGTERM/SIGABRT/SIGSEGV.

0. Why ctypes-libc-kill instead of os._exit

The Kit shutdown hang is a GIL deadlock — upstream NVBug 5948099, confirmed by deep-research workflow against current Isaac Sim develop (43 sub-agents, 228 sourced findings). Inside quickReleaseFrameworkAndTerminate, Kit's C++ teardown frames hold the Python GIL while joining carb.tasking worker threads that themselves wait for the GIL. Direct evidence:

  • simulation_app.py:918-928 in nvcr.io/nvidian/isaac-lab:latest-develop (6.0.0-alpha.202+develop) explicitly names NVBug 5948099 in an inline comment and switches from the older shutdown_and_release_framework to app.shutdown() to dodge it — partial upstream mitigation present in our CI image.
  • YuTeh Shen 8-GPU benchmark gdb backtrace: sem_wait in libcarb.cudainterop.plugin.so under omni.kit.app.update — proves carb.tasking is the waiting site under multi-GPU pressure.
  • Jeff Hough's standalone-scripts-all-hang report (Apr 2026 on develop) is the same fingerprint as our 52s CI SIGKILL.

Under that worst case, os._exit(0) from a daemon thread is not sound: the daemon needs the bytecode dispatcher (which needs the GIL) to even reach the os._exit line. The thread blocks on PyEval_AcquireThread and the timer never fires.

Switch to ctypes.CDLL("libc.so.6").kill(os.getpid(), 9). ctypes releases the GIL around C calls by default, so the raw kill(2) syscall fires even when the interpreter is GIL-blocked. The daemon thread is created and parks in time.sleep (also a GIL-released wait) BEFORE the close() call so it is already in the GIL-released state when the deadlock begins.

Local smoke: daemon armed for 3s, busy main loop, process exits at t+3s with returncode 137 (SIGKILL = 128+9). NOT yet validated against the real GIL-deadlock in CI; pair with an external wrapper-script watchdog at the runner level for the most robust bounding.

1. Problem

Three coupled gaps in AppLauncher.__init__ surface under concurrent multi-GPU pytest CI:

1.1 SIGHUP is unhandled

Kit is launched with --/app/installSignalHandlers=0 (visible in any multi-GPU CI log), so signal handling is on Python's side. AppLauncher registers handlers for SIGTERM, SIGABRT, SIGSEGV but not SIGHUP. When the controlling session leader exits — for example, the parent bash supervising 3 sibling shard containers terminates after a hang on one shard — child Kit processes inherit SIGHUP with its default disposition: terminate. atexit._atexit_close does NOT run.

Direct evidence from a multi-GPU pytest run:

[cuda:2] ⚠️  source/isaaclab_newton/test/assets/test_articulation.py: Process killed by signal 1 (SIGHUP — session leader exit or orphaned process cleanup)
[cuda:1] [Error] [omni.physx.plugin] Stage 9223032 already attached.

The orphan's state stays attached for whatever sibling re-attaches next.

1.2 _abort_signal_handle_callback swallows terminate semantics

def _abort_signal_handle_callback(self, signal, frame):
    self._app.close()

self._app.close() then return. The OS-default disposition for SIGTERM/SIGABRT/SIGSEGV would have terminated; the Python handler replaced that disposition and did NOT call sys.exit, so after the handler returns Python resumes past the signal frame.

1.3 app.close() hangs in shutdown_and_release_framework

Step-by-step trace of close() on a single Kit boot (Isaac Sim 6.0.0-rc.22): all python-level steps (replicator.stop, context.close_stage, tracy.shutdown, carb.logging.set_log_enabled) are <1ms or skipped. Only shutdown_and_release_framework is non-trivial — and it sits inside Kit binary code. Under multi-process contention this step doesn't return for >52s in CI, which then SIGKILLs through tools/conftest.py's SHUTDOWN_GRACE_PERIOD. The SIGKILL cascades through the parent shell's session group and SIGHUPs siblings (loop back to 1.1).

Because the hang is Kit binary code holding the GIL, neither skip_cleanup=True (the existing fast path also calls shutdown_and_release_framework) nor a Python signal handler can interrupt it. The only python-side escape is os._exit, which bypasses any threads still holding the GIL.

2. Fix

2.1 Register SIGHUP handler

signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)

Same handler as SIGTERM/SIGABRT/SIGSEGV. The handler closes the app and sys.exit(128 + signum)s.

2.2 Force-exit timer (opt-in)

def _arm_force_exit_timer(seconds: int, label: str) -> None:
    def _timeout_kill():
        time.sleep(seconds)
        os._exit(0)
    threading.Thread(target=_timeout_kill, daemon=True).start()

Armed from both _atexit_close and _abort_signal_handle_callback when ISAACLAB_FORCE_EXIT_TIMEOUT is set (positive integer seconds). Off by default — interactive user code that wants graceful teardown is unchanged.

2.3 Actually exit from the abort handler

def _abort_signal_handle_callback(self, signum, frame):
    if force_exit_seconds > 0:
        _arm_force_exit_timer(force_exit_seconds, f"abort_signal_{signum}")
    with contextlib.suppress(Exception):
        self._app.close()
    sys.exit(128 + signum)

contextlib.suppress so the exit is reached even if close() raises inside a signal handler.

3. Validation

3.1 Local smoke (synthetic 30s hang)

With ISAACLAB_FORCE_EXIT_TIMEOUT=3 and app.close() monkeypatched to sleep 30s:

[smoke] simulated Kit hang START at t=4.87s
[isaaclab.app] ISAACLAB_FORCE_EXIT_TIMEOUT=3s expired during atexit_close; os._exit(0)
wall=8.51s exit=0

Without the env var: full 30s hang, then exit. Confirms the timer fires at the deadline and exit code is clean (0, not the SIGKILL-driven signal=1).

3.2 Env var parsing

Input Result
"", "0", "false", "no", "off" 0 (off)
"1", "10", "300" n
"abc", "-5" 0 + warning

3.3 CI A/B (pending)

Companion to the multi-GPU pytest workflow in #5875: set ISAACLAB_FORCE_EXIT_TIMEOUT=10 on the shard docker run, drop the MULTI_GPU_SKIP_REASON marker on source/isaaclab_physx/test/assets/test_articulation.py, expect the 3-shard run to pass cleanly without the 52s shutdown_hang or the sibling SIGHUP cascade.

4. Non-scope

  • This PR does NOT fix the upstream Kit hang itself — it bounds how long the hang can delay process exit. The hang is tracked at [Question] Simulation App hangs when closing as does simulation context. #3475 (still open as of Jan 2026; PeterL-NV bisected to sim.reset()).
  • Force-exit leaves GPU memory mappings, file descriptors, etc. dangling until kernel reap — fine in throwaway CI containers, not appropriate for production user code (which is why the env var is off by default).

Two coupled bugs in :class:`isaaclab.app.AppLauncher`:

1. SIGHUP was unhandled. Kit launches with
   ``--/app/installSignalHandlers=0``, so when a controlling session
   leader exits (e.g. the parent shell that supervises sibling shards in
   multi-GPU CI), child Kit processes receive SIGHUP with default
   disposition: terminate. ``_atexit_close`` does not run, so
   ``SimulationApp.close`` is skipped and USD/PhysX state is left
   attached. The next sibling shard then trips
   ``[Error] [omni.physx.plugin] Stage X already attached`` and Kit
   shutdown subsequently hangs on the orphan's state.

   Register the same handler used for SIGTERM/SIGABRT/SIGSEGV.

2. ``_abort_signal_handle_callback`` swallowed the signal's terminate
   semantics. After calling ``self._app.close()`` it returned, so
   Python resumed execution past the signal as if nothing happened.
   The replaced OS-default disposition would have killed the process;
   the Python handler did not.

   Wrap ``_app.close()`` in ``contextlib.suppress(Exception)`` and call
   ``sys.exit(128 + signum)`` to preserve the conventional signal-exit
   encoding and actually terminate.
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 31, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [App] Handle SIGHUP and force exit in AppLauncher abort handler

Summary: Well-motivated fix for a real multi-GPU CI issue. The SIGHUP registration and the forced exit are both correct in principle. The PR description is excellent — clear root-cause analysis with CI log evidence. A few observations below, mostly around edge-case safety in the signal handler.

Verdict: 👍 Approve with minor suggestions. The core logic is sound and addresses the stated bug.


Update (726439c): New commits add ISAACLAB_FORCE_EXIT_TIMEOUT env var with daemon thread force-exit. The implementation is clean: good parser with falsy-value handling, proper closure capture in _atexit_close, and safe getattr fallback in signal handler. No new issues — the force-exit feature is well-designed for the CI hang scenario.


Update (5a169ab): Force-exit mechanism upgraded from os._exit(0) to a raw libc kill(getpid(), SIGKILL) via ctypes. This is a correct and well-reasoned change: the previous approach relied on the Python bytecode dispatcher (which needs the GIL), but the Kit shutdown hang is itself a GIL deadlock — so os._exit could never fire in the worst case. The new approach uses ctypes.CDLL("libc.so.6").kill() which releases the GIL across the C call, guaranteeing the syscall fires regardless of interpreter state. The os._exit(0) is retained as a last-resort fallback in the except branch.

Changelog updated to explain the GIL deadlock root cause more precisely and references NVBug 5948099 / OMPE-75416.

No concerns — this is strictly more robust than the prior version. 👍

"""
with contextlib.suppress(Exception):
self._app.close()
sys.exit(128 + signum)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Warning: sys.exit() may be unsafe after SIGSEGV

sys.exit() raises SystemExit, which unwinds the stack through finally blocks and triggers atexit handlers. After a segmentation fault, the process heap/stack may be corrupted, making this unwinding dangerous (it could deadlock in a finalizer or corrupt data further).

For SIGSEGV specifically, os._exit(128 + signum) would be safer — it terminates immediately without running cleanup hooks. You could branch:

with contextlib.suppress(Exception):
    self._app.close()
if signum == signal.SIGSEGV:
    os._exit(128 + signum)
sys.exit(128 + signum)

Alternatively, if you want to keep it simple and accept the risk (SIGSEGV in Python-land is already UB), the current approach is defensible — just noting the trade-off.

after the handler returns (since we replaced the OS-default
disposition), and Kit would be left half-torn-down.
"""
with contextlib.suppress(Exception):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Double-close with _atexit_close

sys.exit() raises SystemExit → Python runs atexit handlers → _atexit_close calls app.close() again. Both call sites are guarded by contextlib.suppress(Exception), so this is safe but redundant.

If SimulationApp.close() is not idempotent and logs warnings on double-close (some Kit versions do), you could either:

  1. Call atexit.unregister(_atexit_close) before sys.exit() (requires making _atexit_close accessible, e.g. storing it as self._atexit_close), or
  2. Add a guard flag (e.g. self._app_closed = True) checked by _atexit_close.

Not blocking — the suppress handles it — but worth considering if you see noisy double-close warnings in CI logs.

# supervises sibling shards) exits: SIGHUP cascades to children, and
# without a handler the default action would terminate before
# ``_atexit_close`` could run.
signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Platform guard for SIGHUP

signal.SIGHUP is not available on Windows. If Isaac Lab ever supports Windows (or if someone imports this module on Windows for testing), this line would raise AttributeError.

A minimal guard:

if hasattr(signal, 'SIGHUP'):
    signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)

Low priority since Isaac Lab targets Linux, but good defensive practice.

Adds opt-in ``ISAACLAB_FORCE_EXIT_TIMEOUT`` (integer seconds) env var.
When set, AppLauncher arms a daemon thread that calls ``os._exit(0)``
after the deadline, from both the ``atexit`` hook and the abort
signal handler.

Pairs with the SIGHUP handler in the prior commit: SIGHUP brings the
shutdown sequence to a clean entry point, the timer bounds how long
that sequence can stall inside Kit's binary
``shutdown_and_release_framework`` (the only step proven non-trivial
by a step-by-step ``close()`` trace on Isaac Sim 6.0.0-rc.22). The
hang is tracked upstream at
isaac-sim#3475 and held by Kit
binary code, so neither ``skip_cleanup=True`` nor Python signal
handlers can interrupt it — only ``os._exit`` actually returns
control.

Off by default. CI workflows that need bounded shutdown set
``ISAACLAB_FORCE_EXIT_TIMEOUT=10`` on the runner; interactive user
code still gets the graceful teardown path.

Local validation: with ``ISAACLAB_FORCE_EXIT_TIMEOUT=3`` and an
``app.close()`` monkeypatched to sleep 30s, the process exits at
~3s after close starts, returncode=0, log shows
``ISAACLAB_FORCE_EXIT_TIMEOUT=3s expired during atexit_close; os._exit(0)``.
@hujc7 hujc7 changed the title [App] Handle SIGHUP and force exit in AppLauncher abort handler [MGPU] App: bounded shutdown — SIGHUP handler + force-exit on hang Jun 2, 2026
The Kit shutdown hang is a GIL deadlock (upstream NVBug 5948099 /
OMPE-75416, confirmed by deep-research workflow against current Isaac
Sim develop): the C++ teardown frames hold the Python GIL while
joining ``carb.tasking`` worker threads that themselves wait for the
GIL.

Under that worst case, ``os._exit(0)`` from a daemon thread is NOT
sound: the daemon needs the bytecode dispatcher (which needs the GIL)
to even reach the ``os._exit`` line. The thread blocks on
``PyEval_AcquireThread`` and the timer never fires.

Switch to ``ctypes.CDLL("libc.so.6").kill(os.getpid(), 9)``.
``ctypes`` releases the GIL around C calls by default, so the raw
``kill(2)`` syscall fires even when the interpreter is GIL-blocked.
Smoke test on a synthetic busy loop: daemon armed for 3s, process
exits with returncode 137 (SIGKILL = 128+9) at exactly t+3s. The
daemon thread parks in ``time.sleep`` (also a GIL-released wait)
BEFORE the close() call starts, so it is already in the GIL-released
state when the deadlock begins.

Documents the limitation in the docstring and changelog. Caveat:
SIGKILL means no atexit / no JUnit cleanup — callers must have
written their JUnit XML before triggering close(). For absolutely
guaranteed bounding under all hang shapes, pair with an external
wrapper-script watchdog at the CI runner level.
@hujc7
Copy link
Copy Markdown
Collaborator Author

hujc7 commented Jun 3, 2026

Closing as per the integration audit + the validated #5933 fix.

#5933 ([MGPU] App: pin Kit renderer to single GPU under ISAACLAB_PIN_KIT_GPU) was validated 3× green in CI by addressing the precondition of the upstream cubric/PhysX-fabric GPU-interop race directly (IsaacLab #3475 / NVBug 5687364). With the race no longer firing, the bounded-shutdown timer here is no longer needed for the multi-GPU CI path.

The two arms of this PR (SIGHUP signal handler + sys.exit(128+signum) fix to the abort callback) are real latent-bug fixes independent of the multi-GPU work, but bundling them with the unproven force-exit timer made the audit verdict "nice-to-have, droppable". Closing to reduce surface; happy to reopen with a slimmer scope if those small fixes are wanted standalone.

Cross-reference: #5933 (the validated fix).

@hujc7 hujc7 closed this Jun 3, 2026
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request Jun 3, 2026
Per per-PR minimum-needed analysis:
- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have;
  isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot).
  Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in
  AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of
  this diagnostic bundle to test whether isaac-sim#5933 alone is enough for
  newton test_articulation (which calls
  build_simulation_context(sim_cfg=, device=) at line 2427, so still
  needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton
  test_articulation kitless conversion and the schemas.py
  _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit
XML path-collision fix in conftest. If green, confirms only 4 PRs
are needed for multi-GPU CI green (with test_articulation un-gated).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant