Skip to content

Add repro script for ManagerBasedEnv.close memory leak#5898

Closed
cvolkcvolk wants to merge 1 commit into
developfrom
cvolk/repro-manager-based-env-close-leak
Closed

Add repro script for ManagerBasedEnv.close memory leak#5898
cvolkcvolk wants to merge 1 commit into
developfrom
cvolk/repro-manager-based-env-close-leak

Conversation

@cvolkcvolk
Copy link
Copy Markdown
Collaborator

Description

Adds a runnable check script that reproduces the ManagerBasedEnv.close memory leak fixed by #5896. Useful for verifying the fix is in place and as a starting point for a future regression test.

The script builds the smallest manager-based RL task with an image observation (Isaac-Cartpole-RGB-v0), bumps its camera to 720×1280 so each gym.spaces.Box bounds array is large enough to be visible above RSS noise, and runs a tight construct → step → close loop. After each close it counts live gym.spaces.Box instances and sums the bytes pinned by their low / high / bounded_below / bounded_above numpy attributes.

Default invocation monkey-patches ManagerBasedEnv.close back to its pre-#5896 form, so the leak reproduces even when the installed IsaacLab is already patched. --patched leaves the installed close() in place.

Depends on (but does not block) #5896. The script works on either branch — the --patched flag toggles the close() body the script uses.

Fixes # (no associated issue — diagnostic script, no behaviour change)

Repro output

Pre-fix (default):

--- cycle 04 ---
   after cycle 04  RSS=  5491.4 MB   live Boxes=  4   bounds-arrays held:   131.8 MB

Post-fix (--patched):

--- cycle 04 ---
   after cycle 04  RSS=  5438.0 MB   live Boxes=  0   bounds-arrays held:     0.0 MB

Type of change

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation — N/A
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works — This is a runnable repro; happy to convert to a pytest under source/isaaclab/test/envs/ if maintainers prefer
  • I have updated the changelog and the corresponding version — .skip fragment added (test-only, no version bump)
  • I have added my name to the CONTRIBUTORS.md

Runnable check script that demonstrates the leak fixed by PR #5896.
Builds Isaac-Cartpole-RGB-v0 in a tight construct->step->close loop and
counts live gym.spaces.Box objects plus the bytes their bounds arrays
pin. Default invocation monkey-patches close() back to the pre-fix
form so the leak reproduces on a patched IsaacLab; --patched uses the
installed close() to show the fix releases everything.

No library code changes; changelog fragment is .skip.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label Jun 1, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds a standalone repro/diagnostic script that demonstrates the ManagerBasedEnv.close memory leak fixed by PR #5896. The script creates a tight construct-step-close loop with a high-resolution camera env (Isaac-Cartpole-RGB-v0 at 720x1280) and tracks live gymnasium.spaces.Box instances and their pinned numpy bytes across cycles. It monkey-patches the pre-fix close() by default so the leak reproduces even on a patched install.

Design Assessment

File placement: Correctly placed under source/isaaclab/test/envs/ with the check_ prefix, following the project convention for standalone verification scripts (vs. test_ prefix for pytest-integrated tests).

Changelog: .skip fragment in source/isaaclab/changelog.d/ is appropriate - no library behavior change.

Architecture: The script is well-structured:

  • Clean separation of concerns (monkey-patch setup, snapshot utility, cycle runner, delta report)
  • Uses AppLauncher and parse_env_cfg correctly
  • Proper SPDX header and docstring with usage examples
  • Configurable via argparse (--iterations, --num_envs, --camera_width, --camera_height, --patched)

Findings

Minor Suggestions (non-blocking)

  1. gc.get_objects() iteration can be slow/fragile - Walking all GC-tracked objects in snapshot() is fine for a diagnostic script but would be expensive in CI. The try/except guard around type(obj) is Box is a good defensive pattern.

  2. No assertion / exit-code signaling - The script prints results but always exits 0. If this were to be integrated as a CI regression test in the future, it would need a pass/fail threshold (e.g., sys.exit(1) if live_boxes > 0 after close on --patched mode). The author already notes willingness to convert to pytest if maintainers prefer.

  3. pre_fix_close may drift - The monkey-patched pre-fix close() is a frozen snapshot of the old implementation. If ManagerBasedEnv.close gains new cleanup logic in future PRs, this frozen version will not reflect that. A code comment noting the specific commit/PR it mimics would help future maintainers.

Positive Observations

  • Excellent docstring with exact pre-fix vs post-fix expected output - makes it trivial to verify behavior
  • torch.cuda.empty_cache() after each cycle avoids GPU OOM confounding the RSS measurement
  • Parameterized camera resolution means the script can be tuned for faster runs on CI vs. more dramatic leak demonstration
  • Delta table at the end makes the leak trend immediately visible

Test Coverage Assessment

This script serves as a manual verification tool rather than an automated regression test. It requires Isaac Sim runtime and GPU, making it unsuitable for lightweight CI. However, it would be valuable as:

  • A nightly/weekly GPU-CI gate if the team runs sim-dependent tests
  • A reference for writing a future lightweight mock-based unit test that verifies close() cleans up wrappers without needing the full simulator

The companion PR #5896 (the actual fix) is where the real regression test should live. This PR adds diagnostic value.

Verdict

No issues found - Well-crafted diagnostic script with correct placement, proper conventions, and clear documentation. The minor suggestions above are non-blocking improvements the author could address if desired.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds a diagnostic script that reproduces the ManagerBasedEnv.close memory leak addressed by PR #5896. The script uses an elegant monkey-patching approach to demonstrate the leak with pre-fix behavior by default, while --patched tests the fixed behavior. The implementation is well-structured with clear documentation and useful output formatting.

Design Assessment

Design is sound. The file placement in source/isaaclab/test/envs/ is appropriate for a diagnostic/test utility. The approach of monkey-patching the close() method to reproduce pre-fix behavior is clever and allows the script to be useful on both patched and unpatched IsaacLab installations. The gc-based introspection for counting Box instances is the correct approach since numpy arrays aren't gc-tracked directly.

Findings

🟡 Warning: Broad exception handler in gc inspectionsource/isaaclab/test/envs/check_manager_based_env_close_leak.py:103-107

The except Exception: continue pattern in the snapshot() function could silently hide unexpected errors during gc object inspection. While this is somewhat defensible for gc introspection (objects can be in unusual states), it would be better to at least log unexpected exceptions to aid debugging.

🔵 Suggestion: Consider adding return code for CI integrationsource/isaaclab/test/envs/check_manager_based_env_close_leak.py:155-162

The script currently only prints results. For potential CI integration or automated regression testing, consider adding a non-zero exit code when the leak is detected (i.e., when --patched is used but live boxes remain after close). This would make it usable in automated pipelines.

🔵 Suggestion: Minor — unused variable in step loopsource/isaaclab/test/envs/check_manager_based_env_close_leak.py:138

The obs variable from env.step() is reassigned but the value from env.reset() (line 135) is immediately overwritten. This is harmless but the initial assignment could use _ to signal intent.

Test Coverage

  • Bug fix: N/A — This PR is a diagnostic tool for #5896, not the fix itself
  • Script itself: The script IS a runnable regression test; it successfully demonstrates the leak when run with default flags vs --patched
  • Pytest integration: The author correctly notes this could be converted to a pytest if maintainers prefer. For a memory leak test that requires full Isaac Sim initialization, a standalone script may actually be more practical.

CI Status

Some checks are still running (Installation Tests, Docker builds). Core checks (Load Config, Detect Changes) have passed.

Verdict

Minor fixes needed

The implementation is solid and serves its purpose well. The broad exception handler is the only notable concern, and even that is defensible given gc introspection quirks. The suggestions for CI integration and minor cleanup are optional improvements. This is a useful diagnostic asset for verifying the memory leak fix in #5896.

makes via ``np.full`` and ``astype``.
"""
gc.collect()
rss_mb = process.memory_info().rss / 1024 / 1024
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Warning: Broad exception handler

This except Exception: continue could silently hide unexpected errors during gc object inspection. While defensible for gc introspection (objects can be in unusual states mid-collection), consider at least logging unexpected exceptions:

Suggested change
rss_mb = process.memory_info().rss / 1024 / 1024
except ReferenceError:
# Object was collected between gc.get_objects() and our access
continue
except Exception as e:
# Unexpected error - log but continue to avoid blocking diagnostics
print(f"Warning: unexpected error inspecting gc object: {e}", flush=True)
continue

Alternatively, if you want to keep it silent, a comment explaining why would help future readers.

d_bounds = history[cycle_index][2] - history[cycle_index - 1][2]
print(f" {cycle_index:>6} {d_rss:>+11.1f} {d_box:>+11d} {d_bounds:>+13.1f}", flush=True)

simulation_app.close()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Add exit code for CI integration

For automated regression testing, consider exiting non-zero when the leak is detected:

# After the per-cycle deltas printout:
final_boxes = history[-1][1]
if args_cli.patched and final_boxes > 0:
    print(f"\n❌ LEAK DETECTED: {final_boxes} Box instances still live after close", flush=True)
    sys.exit(1)
elif not args_cli.patched and final_boxes == 0:
    print(f"\n⚠️  Expected leak not reproduced (is the monkey-patch working?)", flush=True)

This would allow the script to be used in CI pipelines to catch regressions.

# RSS noise. Bump to 720x1280 so each Box's bounds array is ~33 MB.
env_cfg.scene.tiled_camera.width = args_cli.camera_width
env_cfg.scene.tiled_camera.height = args_cli.camera_height
env = gym.make("Isaac-Cartpole-RGB-v0", cfg=env_cfg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Minor — clarify intent with _

The obs from reset is immediately overwritten by the step loop. Consider using _ to signal this value is intentionally discarded:

Suggested change
env = gym.make("Isaac-Cartpole-RGB-v0", cfg=env_cfg)
_, _ = env.reset()

Or keep obs only in the step loop where it's actually used for the del cleanup.

@cvolkcvolk
Copy link
Copy Markdown
Collaborator Author

Closing — the formal regression test landed on #5896 itself (test_manager_based_env_close_leak.py). The standalone runnable repro was useful for diagnosing the leak internally but is redundant alongside the pytest.

@cvolkcvolk cvolkcvolk closed this Jun 1, 2026
@cvolkcvolk cvolkcvolk deleted the cvolk/repro-manager-based-env-close-leak branch June 2, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant