Fix cartpole-camera frame-stacking performance regression by jmart-nv · Pull Request #5849 · isaac-sim/IsaacLab

jmart-nv · 2026-05-28T21:44:39Z

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change	Why
uint8 ring buffer. Defer the `x/255 − mean` normalize past the frame-stack buffer for RGB-like data types.	Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New `CircularBuffer.stack_dim` arg + `.stacked` property arrange storage so the channel-stacked output is a contiguous view of internal storage.	Eliminates one `(B, H, W, K*C)`-sized allocation and a permute kernel per env.step.
Drop `torch.roll` on append. Replace with a front-to-back memmove.	One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (`normalize_image_uint8`). One kernel: `(uint8 / 255) − per-(batch, channel) mean → float32`.	Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (`spatial_sum_uint8_tiled`). Warp kernel writes `(B, NUM_TILES, C)` int32 partials along H; PyTorch collapses the partial dim.	int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared `isaaclab.utils.images.normalize_camera_image` dispatch. One helper used by both `image()` (DirectRLEnv) and `stacked_image.__call__` (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise.	Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
`image(clone=False)` opt-out. Callers that immediately copy into their own storage skip the defensive clone.	Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. `_get_observations` allocates the normalize output fresh each call (no persistent `out=` buffer).	The RL trainer holds a reference to the previous-iteration observation until `record_transition` returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:

Stage	FPS	step_ms	Peak VRAM	Mean reward (32k)
Pre-framestack baseline	54.62	18.67	3.13 GB	~102 (unsolvable — no temporal info)
Frame stacking from #5574	31.72	32.07	8.37 GB	297.68
This PR	67.87	15.32	4.41 GB	297.60

Screenshots

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

…isaac-sim#5574. - circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property. - observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates). - cartpole_camera_presets_env.py: same `.clone()` drop in the direct env. - delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.

…t on append

…stack consumers

…ut reshape

Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.

Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv. Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.

The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).

- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions - normalize_image_uint8: docstring warning on the out= aliasing hazard - CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage - CircularBuffer.stacked: drop unreachable empty-buffer guard - _uint8_sum_partials_cache: store the tensor directly, not a tuple - Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Incremental Update (`a86c6a7`)

The author addressed review feedback in commit a86c6a7:

✅ Docstring improvements — stack_dim parameter now includes a concrete example (stack_dim=-1 on (B, H, W, C) → (B, H, W, K*C)) making the API much more intuitive.

✅ In-place subtract in normalize fallback — images.py now uses images -= torch.mean(...) instead of allocating a new tensor, addressing the memory optimization suggestion.

✅ Cache growth documented — The _uint8_sum_partials_cache comment now states it "Typically holds one entry per training run (camera resolution and device are fixed)," making the bounded growth clear.

✅ Tile size rationale — _UINT8_SUM_TILE_HW comment now documents it was "Tuned on L40; robust across modern NVIDIA arches (Ampere/Ada/Hopper) at R256," addressing the hardcoded value concern.

✅ Shift loop documented — Added "Cheap at the typical frame-stack K=2-4" comment clarifying expected usage.

✅ Minor fix — Corrected TiledCameraCfg → CameraCfg in is_rgb_like docstring.

Test Coverage

✅ Excellent coverage (unchanged from previous push). The PR includes comprehensive tests for the circular buffer stack_dim mode, normalize dispatch paths, Warp kernel correctness, and critically a consecutive-call aliasing regression test.

CI Status

⏳ In progress — 9 checks passed, 6 running, 2 skipped on latest commit. No failures.

Verdict

Approve — All review suggestions have been addressed cleanly in the follow-up commit. The changes are documentation and minor optimization improvements that don't affect correctness. Ready to merge once CI completes.

pbarejko · 2026-05-29T21:21:22Z

+        # Lazily allocated on the first ``append``.
        self._buffer: torch.Tensor = None  # type: ignore

+        self._stack_dim_arg: int | None = stack_dim


It took me some time to understand stacking terminology here. Why this is called stacking? This is like gluing frames side-by-side. Initially, I was under the impression that you actually have n-buffers represent stack, but what happens here is frame concatenation based on stack dimension?

Yes. "Frame stacking" is the conventional naming used for this technique. Under the hood, you're right that it's more of a concatenation. If we truly "stacked" on a new dimension, that would change the CNN's input shape, forcing a different architecture with higher complexity. Concatenating along the channel axis (and keeping K adjacent in ring buffer memory) allows us to avoid an expensive reshape when the policy asks for the obs.

…y-up) - images.py: TiledCameraCfg -> CameraCfg in is_rgb_like docstring (deprecated class) - images.py: PyTorch normalize fallback uses in-place subtract (one fewer alloc on the cold path) - circular_buffer.py: clarified stack_dim docstring with a (B,H,W,C) example; noted the typical K=2-4 range on the append shift loop - ops.py: noted the realistic 1-entry bound on _uint8_sum_partials_cache; noted L40 tuning provenance of _UINT8_SUM_TILE_HW=32

jmart-nv added 10 commits May 28, 2026 15:16

Drop read-side torch.roll from CircularBuffer.buffer via memmove-shif…

32024a4

…t on append

Add CircularBuffer.copy_to_stacked and clone=False opt-out for frame-…

347f976

…stack consumers

Add stack_dim mode to CircularBuffer for free contiguous stacked-outp…

2dad31b

…ut reshape

Changelog fragments for framestack performance regression fix

8222de6

Restore original comments that were accidentally trimmed

c554513

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

pbarejko reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cartpole-camera frame-stacking performance regression#5849

Fix cartpole-camera frame-stacking performance regression#5849
jmart-nv wants to merge 11 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

jmart-nv commented May 28, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

pbarejko May 29, 2026 •

edited

Loading

Uh oh!

jmart-nv Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jmart-nv commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Results

Screenshots

Environment

Checklist

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Incremental Update (a86c6a7)

Test Coverage

CI Status

Verdict

Uh oh!

Uh oh!

Uh oh!

pbarejko May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmart-nv Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jmart-nv commented May 28, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading

Incremental Update (`a86c6a7`)

pbarejko May 29, 2026 •

edited

Loading