Skip to content

Fix memory leak in ManagerBasedEnv.close#5896

Closed
cvolkcvolk wants to merge 2 commits into
developfrom
cvolk/fix-manager-based-env-close-leaks
Closed

Fix memory leak in ManagerBasedEnv.close#5896
cvolkcvolk wants to merge 2 commits into
developfrom
cvolk/fix-manager-based-env-close-leaks

Conversation

@cvolkcvolk
Copy link
Copy Markdown
Collaborator

@cvolkcvolk cvolkcvolk commented Jun 1, 2026

Description

ManagerBasedEnv.close() does not release two categories of per-instance state, both of which then survive close() because gymnasium's wrapper chain (OrderEnforcing plus the make() registry) keeps the wrapped env reachable:

  1. self.obs_buf — the dict of cached observation tensors populated by every step() / reset(). For image sensors at (num_envs, H, W, C) this is tens of MB per camera.
  2. self.observation_space / self.action_space (and the vectorised-API single_*_space variants set by ManagerBasedRLEnv). Each gym.spaces.Box for an image observation at (4, 720, 1280, 3) float32 pins ~110 MB of bounds arrays (low, high as float32 + bounded_below, bounded_above as bool).

Net effect: ~one env's worth of gym.spaces.Box bounds arrays plus the last observation buffer are retained for the lifetime of the process even after close() returns. In evaluation loops that rebuild envs across tasks — or in any context where module reloads orphan multiple env instances against type-equality replacement — the leak compounds into multi-GB territory and ultimately OOMs the process.

No associated issue (found while debugging an evaluation-sweep OOM in a downstream project).

Fixes # (no associated issue)

Repro & regression test

Two commits in this PR:

  1. The fix in ManagerBasedEnv.close — clears obs_buf, nulls space attributes.
  2. A regression test at source/isaaclab/test/envs/test_manager_based_env_close_leak.py that builds Isaac-Cartpole-v0 via gym.make, runs reset + one step, calls close(), and asserts obs_buf is empty and the space attributes are None. Parametrised over cuda:0 and cpu.

Verified the test:

  • Fails on the pre-fix close() with AssertionError: obs_buf was not cleared on close.
  • Passes with the fix applied.

A standalone runnable repro that demonstrates the leak magnitude visually (live gym.spaces.Box count and total bounds-array MB across cycles, with a --patched toggle) is in companion draft PR #5898. Run it as::

./isaaclab.sh -p source/isaaclab/test/envs/check_manager_based_env_close_leak.py
# pre-fix: live Boxes=4, bounds-arrays held: 131.8 MB
./isaaclab.sh -p source/isaaclab/test/envs/check_manager_based_env_close_leak.py --patched
# post-fix: live Boxes=0, bounds-arrays held: 0.0 MB

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation — N/A, internal lifecycle method
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file — followed the new changelog.d/ workflow in AGENTS.md; happy to add a fragment if maintainers want a patch-bump entry
  • I have added my name to the CONTRIBUTORS.md — happy to add if maintainers want

@github-actions github-actions Bot added bug Something isn't working documentation Improvements or additions to documentation asset New asset feature or request isaac-sim Related to Isaac Sim team isaac-mimic Related to Isaac Mimic team infrastructure labels Jun 1, 2026
@cvolkcvolk cvolkcvolk changed the base branch from main to develop June 1, 2026 13:57
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes a memory leak in ManagerBasedEnv.close() where cached observation buffers (self.obs_buf) and gym.spaces.Box bounds arrays (observation_space, action_space, and their single_*_space variants) were not released during teardown. Since Gymnasium's wrapper chain keeps the env reachable past close(), these allocations survived and accumulated across construct/teardown cycles (~440 MB host + ~33 MB GPU per cycle for typical image-observation environments).

Design Assessment

Architecture: The fix is well-placed within the existing close() lifecycle. Clearing obs_buf before manager destruction ensures tensor references are dropped before their parent managers are deleted, which is the correct order. Nulling the space attributes after manager destruction is also appropriate since no downstream code should reference spaces after managers are gone.

Design Pattern: The approach of explicit cleanup in close() is the correct pattern for Gymnasium environments — the library design intentionally keeps env instances reachable through wrapper chains, so resource owners must actively release their references.

Scope: The change is minimal and surgical (4 actual logic insertions), which is appropriate for a bug fix in a core environment lifecycle method.

Findings

✅ No Issues — Correctness

  • obs_buf.clear() correctly empties the dict while preserving the attribute (avoids AttributeError if anything inspects it after close)
  • Setting spaces to None rather than del avoids AttributeError on subsequent access attempts (defensive)
  • hasattr guard for single_observation_space/single_action_space correctly handles the case where close() is called on a ManagerBasedEnv that hasn't been wrapped as an RL env

✅ No Issues — Cleanup Order

  • obs_buf cleared before manager deletion (tensors may hold refs to manager-owned buffers)
  • Spaces nulled after manager deletion (correct — no manager depends on spaces)
  • Both happen before sim.clear_instance() (correct ordering)

💡 Suggestion (Non-blocking)

  • Consider wrapping the space-nulling in a try/except: While unlikely, if close() is called on a partially-initialized env (e.g., an exception during __init__ after super().__init__() but before space construction), self.observation_space = None would succeed harmlessly (setting a new attribute), so this is actually safe as-is. No action needed.

💡 Suggestion (Non-blocking)

  • GPU tensor release: The PR description mentions ~33 MB GPU per cycle. obs_buf.clear() drops Python references but CUDA memory may not be reclaimed until torch.cuda.empty_cache(). Consider adding a note in the changelog that GPU memory may require an explicit torch.cuda.empty_cache() call after close() if immediate reclamation is needed. This is not a bug in the PR — it's standard PyTorch behavior.

Test Coverage

The PR includes a comprehensive standalone repro script (isaaclab_close_leak_repro.py) in the PR description that demonstrates the before/after behavior. The author offers to convert this to a pytest if maintainers prefer. Recommendation: A lightweight regression test under source/isaaclab/test/envs/ that asserts env.obs_buf == {} and env.observation_space is None after close() would be valuable for CI, but the repro script is sufficient to validate the fix.

Verdict

No issues found — This is a clean, well-documented, minimal bug fix that correctly addresses a real memory leak in the environment lifecycle. The cleanup ordering is correct, edge cases are handled defensively, and the changelog entry is thorough. Ready to merge.


Update (9327f33): Reviewed incremental changes since 5ec2c26. The new commit adds a clarifying comment above obs_buf.clear() — no functional changes to the fix logic. All previous findings still apply. ✅ Still LGTM — ready to merge.


Update (d835ee5): ✅ Regression test added at source/isaaclab/test/envs/test_manager_based_env_close_leak.py — this directly addresses my earlier recommendation for CI coverage. The test is well-structured: parametrized across cuda:0/cpu, verifies preconditions before close(), and asserts obs_buf, observation_space, and action_space are properly cleaned up. No new issues. Still LGTM.


Update (cda5aeb): Minor defensive improvement: obs_buf.clear() now guarded by isinstance(getattr(self, "obs_buf", None), dict) to safely handle partially-initialized envs. Space-nulling simplified (removed hasattr guards since attrs are always set by _configure_gym_env_spaces). No functional changes to the fix logic. ✅ Still LGTM.

@cvolkcvolk cvolkcvolk force-pushed the cvolk/fix-manager-based-env-close-leaks branch from 5ec2c26 to 9327f33 Compare June 1, 2026 14:44
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label Jun 1, 2026
@cvolkcvolk cvolkcvolk marked this pull request as draft June 1, 2026 14:48
self.obs_buf and self.observation_space/action_space (plus single_*_space
variants set by ManagerBasedRLEnv) were never released by close(). For
image-observation envs at 720x1280x3 with num_envs=4 that retains tens of
megabytes of cached observation tensors plus ~110 MB per gym.spaces.Box
bounds-array set, accumulating one full env's worth on each
construct/teardown cycle since gymnasium's wrapper chain (OrderEnforcing
plus the make() registry) keeps the env reachable past close.

Measured 440 MB host + 33 MB GPU retained per cycle on a typical
three-image-camera setup; after the fix the same cycle releases the
spaces (-370 MB) and obs tensors (-74 MB CUDA) at close-time, eliminating
the per-cycle accumulation.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Builds Isaac-Cartpole-v0 via gym.make, runs reset + a step to populate
obs_buf and exercise gym space construction, then asserts after close()
that obs_buf is cleared and observation_space/action_space are nulled.

Verified to fail on the pre-fix close() with AssertionError on the
obs_buf assertion (the spaces assertion would also fail), and to pass
with the fix applied. Parametrized over cuda:0 and cpu devices.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
@cvolkcvolk
Copy link
Copy Markdown
Collaborator Author

Re-raising this from my fork to avoid adding branches to the main repo (per maintainer request): superseded by #5925. Closing this one. The branch here will be deleted.

@cvolkcvolk cvolkcvolk closed this Jun 2, 2026
@cvolkcvolk cvolkcvolk deleted the cvolk/fix-manager-based-env-close-leaks branch June 2, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

asset New asset feature or request bug Something isn't working documentation Improvements or additions to documentation infrastructure isaac-lab Related to Isaac Lab team isaac-mimic Related to Isaac Mimic team isaac-sim Related to Isaac Sim team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant