Skip to content

Fix cartpole-camera frame-stacking performance regression#5849

Draft
jmart-nv wants to merge 11 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf
Draft

Fix cartpole-camera frame-stacking performance regression#5849
jmart-nv wants to merge 11 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

Conversation

@jmart-nv
Copy link
Copy Markdown

@jmart-nv jmart-nv commented May 28, 2026

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

  1. Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
  2. Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
  3. Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
  4. Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
  5. Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change Why
uint8 ring buffer. Defer the x/255 − mean normalize past the frame-stack buffer for RGB-like data types. Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New CircularBuffer.stack_dim arg + .stacked property arrange storage so the channel-stacked output is a contiguous view of internal storage. Eliminates one (B, H, W, K*C)-sized allocation and a permute kernel per env.step.
Drop torch.roll on append. Replace with a front-to-back memmove. One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (normalize_image_uint8). One kernel: (uint8 / 255) − per-(batch, channel) mean → float32. Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (spatial_sum_uint8_tiled). Warp kernel writes (B, NUM_TILES, C) int32 partials along H; PyTorch collapses the partial dim. int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared isaaclab.utils.images.normalize_camera_image dispatch. One helper used by both image() (DirectRLEnv) and stacked_image.__call__ (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise. Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
image(clone=False) opt-out. Callers that immediately copy into their own storage skip the defensive clone. Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. _get_observations allocates the normalize output fresh each call (no persistent out= buffer). The RL trainer holds a reference to the previous-iteration observation until record_transition returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:

Stage FPS step_ms Peak VRAM Mean reward (32k)
Pre-framestack baseline 54.62 18.67 3.13 GB ~102 (unsolvable — no temporal info)
Frame stacking from #5574 31.72 32.07 8.37 GB 297.68
This PR 67.87 15.32 4.41 GB 297.60

Screenshots

pr_figure

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

jmart-nv added 10 commits May 28, 2026 15:16
…isaac-sim#5574.

- circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property.
- observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates).
- cartpole_camera_presets_env.py: same `.clone()` drop in the direct env.
- delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.
Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.
Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv.

Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.
The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).
- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions
- normalize_image_uint8: docstring warning on the out= aliasing hazard
- CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage
- CircularBuffer.stacked: drop unreachable empty-buffer guard
- _uint8_sum_partials_cache: store the tensor directly, not a tuple
- Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Incremental Update (a86c6a7)

The author addressed review feedback in commit a86c6a7:

Docstring improvementsstack_dim parameter now includes a concrete example (stack_dim=-1 on (B, H, W, C)(B, H, W, K*C)) making the API much more intuitive.

In-place subtract in normalize fallbackimages.py now uses images -= torch.mean(...) instead of allocating a new tensor, addressing the memory optimization suggestion.

Cache growth documented — The _uint8_sum_partials_cache comment now states it "Typically holds one entry per training run (camera resolution and device are fixed)," making the bounded growth clear.

Tile size rationale_UINT8_SUM_TILE_HW comment now documents it was "Tuned on L40; robust across modern NVIDIA arches (Ampere/Ada/Hopper) at R256," addressing the hardcoded value concern.

Shift loop documented — Added "Cheap at the typical frame-stack K=2-4" comment clarifying expected usage.

Minor fix — Corrected TiledCameraCfgCameraCfg in is_rgb_like docstring.

Test Coverage

Excellent coverage (unchanged from previous push). The PR includes comprehensive tests for the circular buffer stack_dim mode, normalize dispatch paths, Warp kernel correctness, and critically a consecutive-call aliasing regression test.

CI Status

In progress — 9 checks passed, 6 running, 2 skipped on latest commit. No failures.

Verdict

Approve — All review suggestions have been addressed cleanly in the follow-up commit. The changes are documentation and minor optimization improvements that don't affect correctness. Ready to merge once CI completes.

Comment thread source/isaaclab/isaaclab/utils/images.py Outdated
Comment thread source/isaaclab/isaaclab/utils/buffers/circular_buffer.py
# Lazily allocated on the first ``append``.
self._buffer: torch.Tensor = None # type: ignore

self._stack_dim_arg: int | None = stack_dim
Copy link
Copy Markdown
Collaborator

@pbarejko pbarejko May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me some time to understand stacking terminology here. Why this is called stacking? This is like gluing frames side-by-side. Initially, I was under the impression that you actually have n-buffers represent stack, but what happens here is frame concatenation based on stack dimension?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. "Frame stacking" is the conventional naming used for this technique. Under the hood, you're right that it's more of a concatenation. If we truly "stacked" on a new dimension, that would change the CNN's input shape, forcing a different architecture with higher complexity. Concatenating along the channel axis (and keeping K adjacent in ring buffer memory) allows us to avoid an expensive reshape when the policy asks for the obs.

…y-up)

- images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class)
- images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path)
- circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop
- ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants