Fix memory leak in ManagerBasedEnv.close#5896
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a memory leak in ManagerBasedEnv.close() where cached observation buffers (self.obs_buf) and gym.spaces.Box bounds arrays (observation_space, action_space, and their single_*_space variants) were not released during teardown. Since Gymnasium's wrapper chain keeps the env reachable past close(), these allocations survived and accumulated across construct/teardown cycles (~440 MB host + ~33 MB GPU per cycle for typical image-observation environments).
Design Assessment
Architecture: The fix is well-placed within the existing close() lifecycle. Clearing obs_buf before manager destruction ensures tensor references are dropped before their parent managers are deleted, which is the correct order. Nulling the space attributes after manager destruction is also appropriate since no downstream code should reference spaces after managers are gone.
Design Pattern: The approach of explicit cleanup in close() is the correct pattern for Gymnasium environments — the library design intentionally keeps env instances reachable through wrapper chains, so resource owners must actively release their references.
Scope: The change is minimal and surgical (4 actual logic insertions), which is appropriate for a bug fix in a core environment lifecycle method.
Findings
✅ No Issues — Correctness
obs_buf.clear()correctly empties the dict while preserving the attribute (avoidsAttributeErrorif anything inspects it after close)- Setting spaces to
Nonerather thandelavoidsAttributeErroron subsequent access attempts (defensive) hasattrguard forsingle_observation_space/single_action_spacecorrectly handles the case whereclose()is called on aManagerBasedEnvthat hasn't been wrapped as an RL env
✅ No Issues — Cleanup Order
obs_bufcleared before manager deletion (tensors may hold refs to manager-owned buffers)- Spaces nulled after manager deletion (correct — no manager depends on spaces)
- Both happen before
sim.clear_instance()(correct ordering)
💡 Suggestion (Non-blocking)
- Consider wrapping the space-nulling in a
try/except: While unlikely, ifclose()is called on a partially-initialized env (e.g., an exception during__init__aftersuper().__init__()but before space construction),self.observation_space = Nonewould succeed harmlessly (setting a new attribute), so this is actually safe as-is. No action needed.
💡 Suggestion (Non-blocking)
- GPU tensor release: The PR description mentions ~33 MB GPU per cycle.
obs_buf.clear()drops Python references but CUDA memory may not be reclaimed untiltorch.cuda.empty_cache(). Consider adding a note in the changelog that GPU memory may require an explicittorch.cuda.empty_cache()call afterclose()if immediate reclamation is needed. This is not a bug in the PR — it's standard PyTorch behavior.
Test Coverage
The PR includes a comprehensive standalone repro script (isaaclab_close_leak_repro.py) in the PR description that demonstrates the before/after behavior. The author offers to convert this to a pytest if maintainers prefer. Recommendation: A lightweight regression test under source/isaaclab/test/envs/ that asserts env.obs_buf == {} and env.observation_space is None after close() would be valuable for CI, but the repro script is sufficient to validate the fix.
Verdict
No issues found — This is a clean, well-documented, minimal bug fix that correctly addresses a real memory leak in the environment lifecycle. The cleanup ordering is correct, edge cases are handled defensively, and the changelog entry is thorough. Ready to merge.
Update (9327f33): Reviewed incremental changes since 5ec2c26. The new commit adds a clarifying comment above obs_buf.clear() — no functional changes to the fix logic. All previous findings still apply. ✅ Still LGTM — ready to merge.
Update (d835ee5): ✅ Regression test added at source/isaaclab/test/envs/test_manager_based_env_close_leak.py — this directly addresses my earlier recommendation for CI coverage. The test is well-structured: parametrized across cuda:0/cpu, verifies preconditions before close(), and asserts obs_buf, observation_space, and action_space are properly cleaned up. No new issues. Still LGTM.
Update (cda5aeb): Minor defensive improvement: obs_buf.clear() now guarded by isinstance(getattr(self, "obs_buf", None), dict) to safely handle partially-initialized envs. Space-nulling simplified (removed hasattr guards since attrs are always set by _configure_gym_env_spaces). No functional changes to the fix logic. ✅ Still LGTM.
5ec2c26 to
9327f33
Compare
self.obs_buf and self.observation_space/action_space (plus single_*_space variants set by ManagerBasedRLEnv) were never released by close(). For image-observation envs at 720x1280x3 with num_envs=4 that retains tens of megabytes of cached observation tensors plus ~110 MB per gym.spaces.Box bounds-array set, accumulating one full env's worth on each construct/teardown cycle since gymnasium's wrapper chain (OrderEnforcing plus the make() registry) keeps the env reachable past close. Measured 440 MB host + 33 MB GPU retained per cycle on a typical three-image-camera setup; after the fix the same cycle releases the spaces (-370 MB) and obs tensors (-74 MB CUDA) at close-time, eliminating the per-cycle accumulation. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Builds Isaac-Cartpole-v0 via gym.make, runs reset + a step to populate obs_buf and exercise gym space construction, then asserts after close() that obs_buf is cleared and observation_space/action_space are nulled. Verified to fail on the pre-fix close() with AssertionError on the obs_buf assertion (the spaces assertion would also fail), and to pass with the fix applied. Parametrized over cuda:0 and cpu devices. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
d835ee5 to
cda5aeb
Compare
|
Re-raising this from my fork to avoid adding branches to the main repo (per maintainer request): superseded by #5925. Closing this one. The branch here will be deleted. |
Description
ManagerBasedEnv.close()does not release two categories of per-instance state, both of which then surviveclose()because gymnasium's wrapper chain (OrderEnforcingplus themake()registry) keeps the wrapped env reachable:self.obs_buf— the dict of cached observation tensors populated by everystep()/reset(). For image sensors at(num_envs, H, W, C)this is tens of MB per camera.self.observation_space/self.action_space(and the vectorised-APIsingle_*_spacevariants set byManagerBasedRLEnv). Eachgym.spaces.Boxfor an image observation at(4, 720, 1280, 3)float32 pins ~110 MB of bounds arrays (low,highas float32 +bounded_below,bounded_aboveas bool).Net effect: ~one env's worth of
gym.spaces.Boxbounds arrays plus the last observation buffer are retained for the lifetime of the process even afterclose()returns. In evaluation loops that rebuild envs across tasks — or in any context where module reloads orphan multiple env instances against type-equality replacement — the leak compounds into multi-GB territory and ultimately OOMs the process.No associated issue (found while debugging an evaluation-sweep OOM in a downstream project).
Fixes # (no associated issue)
Repro & regression test
Two commits in this PR:
ManagerBasedEnv.close— clearsobs_buf, nulls space attributes.source/isaaclab/test/envs/test_manager_based_env_close_leak.pythat buildsIsaac-Cartpole-v0viagym.make, runs reset + one step, callsclose(), and assertsobs_bufis empty and the space attributes areNone. Parametrised overcuda:0andcpu.Verified the test:
close()withAssertionError: obs_buf was not cleared on close.A standalone runnable repro that demonstrates the leak magnitude visually (live
gym.spaces.Boxcount and total bounds-array MB across cycles, with a--patchedtoggle) is in companion draft PR #5898. Run it as::Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfile — followed the newchangelog.d/workflow in AGENTS.md; happy to add a fragment if maintainers want a patch-bump entryCONTRIBUTORS.md— happy to add if maintainers want