Add chunked-dispatch mode to eval_runner by cvolkcvolk · Pull Request #751 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-06-02T08:55:51Z

Summary

Adds --chunk_size N to eval_runner. When set and the jobs config has more than N jobs, the runner splits the config into chunks and launches one fresh subprocess per chunk. Unset → unchanged single-process behavior.

Why

Long evaluation sweeps OOM. Each env build/teardown cycle leaves ~72 MB of host memory unreclaimable from Python. Restarting the process is the only way to reclaim that memory

The in-process leak will be partly fixed upstream (isaac-sim/IsaacLab#5896 cutting it from ~540 MB/cycle to ~72 MB/cycle).
Chunking is the workaround until what's left is identified and fixed.

greptile-apps · 2026-06-02T08:59:00Z

Greptile Summary

This PR adds a --chunk_size N flag to eval_runner that splits a large jobs config into chunks and launches one fresh subprocess per chunk, reclaiming host memory that leaks from Python's per-cycle env build/teardown. When unset, behaviour is unchanged.

eval_runner_cli.py adds the --chunk_size int argument (default None).
eval_runner.py adds _run_in_chunks(), which writes per-chunk temp configs, spawns sequential subprocesses forwarding the original sys.argv with the temp path appended, and exits immediately if any chunk fails.

Confidence Score: 4/5

The chunked-dispatch mechanism is well-structured and the core subprocess-forwarding logic is correct; the main gap is observability rather than correctness.

The chunk_size <= 0 guard, temp-file cleanup in finally, and the argparse last-value-wins override for --eval_jobs_config are all handled correctly. The one structural issue is that per-chunk subprocesses each emit their own job-status summary but the orchestrating parent emits nothing after all chunks complete, meaning a user running with --continue_on_error has no consolidated view of failures.

isaaclab_arena/evaluation/eval_runner.py — specifically the _run_in_chunks function and how it reports overall completion status.

Important Files Changed

Filename	Overview
isaaclab_arena/evaluation/eval_runner.py	Adds _run_in_chunks() and dispatches to it from main(). Core logic is correct; the parent emits no aggregate status after all chunks complete, making --continue_on_error failures invisible at the orchestrator level.
isaaclab_arena/evaluation/eval_runner_cli.py	Adds the --chunk_size int argument with a clear help string; straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant User
    participant Parent as eval_runner (parent)
    participant Tmp as temp JSON file
    participant Child as eval_runner (subprocess per chunk)

    User->>Parent: python eval_runner.py --chunk_size N --eval_jobs_config master.json
    Parent->>Parent: "load master_cfg, len(jobs) > N?"
    alt chunked mode
        loop for each chunk_idx in range(n_chunks)
            Parent->>Tmp: "json.dump({jobs: chunk_jobs})"
            Parent->>Child: subprocess.run with appended --eval_jobs_config chunk_path
            Child->>Child: load chunk config, SimulationAppContext, run jobs
            Child->>Child: job_manager.print_jobs_info() + metrics_logger.print_metrics()
            Child-->>Parent: returncode
            Parent->>Tmp: unlink(chunk_path)
            alt "returncode != 0"
                Parent->>User: sys.exit(returncode)
            end
        end
        Parent-->>User: return (silent)
    else single-process mode
        Parent->>Parent: SimulationAppContext, run all jobs inline
        Parent-->>User: return
    end

_{Reviews (5): Last reviewed commit: "Add chunked-dispatch mode to eval_runner" | Re-trigger Greptile}

greptile-apps · 2026-06-02T08:59:04Z

+    chunk_size = args_cli.chunk_size
+    start_chunk = args_cli.start_chunk
+    n_total = len(jobs)
+    n_chunks = (n_total + chunk_size - 1) // chunk_size


ZeroDivisionError on --chunk_size 0

n_chunks = (n_total + chunk_size - 1) // chunk_size raises ZeroDivisionError when chunk_size=0 because len(jobs) > 0 is true for any non-empty config, so _run_in_chunks is entered. Negative values reach the same path and produce a negative n_chunks, causing range(start_chunk, n_chunks) to be empty while the final print incorrectly reports "all N chunks completed successfully". Neither value is validated before reaching this arithmetic. A guard like if args_cli.chunk_size is not None and args_cli.chunk_size < 1: raise ValueError("--chunk_size must be >= 1") at the top of main() — or type=lambda n: int(n) if int(n) >= 1 else parser.error(...) in the CLI — would prevent both cases.

greptile-apps · 2026-06-02T08:59:04Z

+    for chunk_idx in range(start_chunk, n_chunks):
+        start = chunk_idx * chunk_size


start_chunk >= n_chunks silently skips all chunks but still exits with code 0 and prints "all N chunks completed successfully", which would mislead a user who accidentally supplies an out-of-range resume index. A quick bounds check makes the mistake obvious.

Suggested change

for chunk_idx in range(start_chunk, n_chunks):

start = chunk_idx * chunk_size

if start_chunk >= n_chunks:

print(

f"[eval_runner] WARNING: --start_chunk {start_chunk} is out of range"

f" (0..{n_chunks - 1}); no chunks will run.",

flush=True,

)

return 1

for chunk_idx in range(start_chunk, n_chunks):

start = chunk_idx * chunk_size

greptile-apps · 2026-06-02T08:59:06Z

+    for chunk_idx in range(start_chunk, n_chunks):
+        start = chunk_idx * chunk_size
+        end = min(start + chunk_size, n_total)
+        chunk_cfg = {"jobs": jobs[start:end]}


The chunk config drops any top-level keys other than jobs from the master config. If the config schema gains (or already has) global keys — e.g. metadata, default overrides — those are silently stripped from every child invocation. Preserving the full structure and only replacing the jobs list is a safer default.

Suggested change

chunk_cfg = {"jobs": jobs[start:end]}

chunk_cfg = {**master_cfg, "jobs": jobs[start:end]}

isaaclab-review-bot

Code Review: Add chunked-dispatch mode to eval_runner

The approach is well-motivated — process restart is the correct strategy for reclaiming C-level memory (USD/Hydra state) that survives Python teardown. The implementation is clean and minimal: dispatcher logic stays in the runner, children reuse the existing in-process flow without code duplication. Resume support via --start_chunk is a nice touch for long sweeps.

A few issues worth addressing before merge:

Summary of Findings

#	Severity	File	Issue
1	🟡 Medium	eval_runner.py	`sys.argv[0]` not reliable for module-style invocation
2	🟡 Medium	eval_runner.py	`--start_chunk` ≥ actual chunk count silently succeeds
3	🟡 Medium	eval_runner.py	`--chunk_size 0` causes `ZeroDivisionError`
4	🟢 Low	eval_runner.py	`raise SystemExit(0)` vs implicit `None` return
5	🟢 Low	eval_runner.py	No unit tests for `_argv_without_dispatcher_flags`

Overall: solid implementation with minor hardening needed. The core dispatch logic is correct and the fail-fast + resume pattern is well thought out.

isaaclab-review-bot · 2026-06-02T08:59:15Z

+        end = min(start + chunk_size, n_total)
+        chunk_cfg = {"jobs": jobs[start:end]}
+        # Don't ``delete=True`` — eval_runner reopens the file in its own subprocess.
+        with tempfile.NamedTemporaryFile(mode="w", suffix=f"_chunk{chunk_idx:02d}.json", delete=False) as tmp:


🟡 Medium: sys.argv[0] is unreliable for module-style invocation

If the runner is invoked via python -m isaaclab_arena.evaluation.eval_runner, then sys.argv[0] will be the module path (e.g., /path/to/isaaclab_arena/evaluation/eval_runner.py) which happens to work. However, if invoked through an entry-point wrapper script or via runpy, the value may not be what you expect.

More importantly, using sys.executable + sys.argv[0] won't reproduce any special environment setup (e.g., isaaclab launcher scripts that set LD_LIBRARY_PATH or PYTHONPATH).

Consider documenting this assumption or using [sys.executable, "-m", "isaaclab_arena.evaluation.eval_runner"] as an alternative that's more robust for module-based invocation:

# Option: detect module vs script invocation import __main__ if hasattr(__main__, '__spec__') and __main__.__spec__ is not None: cmd = [sys.executable, "-m", __main__.__spec__.name, ...] else: cmd = [sys.executable, sys.argv[0], ...]

isaaclab-review-bot · 2026-06-02T08:59:15Z

+    forwarded_argv = _argv_without_dispatcher_flags(sys.argv[1:])
+
+    print(
+        f"[eval_runner] chunked dispatch: {n_total} jobs → {n_chunks} chunks of <= {chunk_size}"


🟡 Medium: --start_chunk ≥ chunk count silently reports success

If a user passes --start_chunk 10 but there are only 4 chunks, range(10, 4) is empty and the function prints "all 4 chunks completed successfully" — which is misleading since nothing actually ran.

Suggested fix:

if start_chunk >= n_chunks: print( f"[eval_runner] warning: --start_chunk {start_chunk} >= total chunks {n_chunks}." f" Nothing to dispatch.", flush=True, ) return 0

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR adds a chunked-dispatch mode to eval_runner that splits large job sweeps across subprocess boundaries, allowing the OS to reclaim memory leaks between chunks. The implementation is clean and follows standard subprocess patterns.

Design Assessment

Design is sound. Process-restart is the only reliable way to reclaim C-level memory leaks from USD/Hydra. The implementation correctly:

Preserves single-process behavior when --chunk_size is unset or job count is small
Cleans up temp files in finally blocks
Propagates exit codes for CI integration

Findings

🟡 Warning: chunk_size ≤ 0 validation missing — isaaclab_arena/evaluation/eval_runner.py:135

The validation if chunk_size <= 0 raises ValueError, which is good. However, since chunk_size comes from argparse with type=int, a non-integer value would fail at CLI parsing time. The current validation is sufficient.

🟡 Warning: subprocess signal handling — isaaclab_arena/evaluation/eval_runner.py:154-155

When a subprocess is killed by a signal (e.g., SIGKILL from OOM killer), result.returncode will be negative (-signal). The current message "exit code -9" is technically correct but could be more informative for users debugging OOM issues.

Consider:

if result.returncode < 0:
    sig = -result.returncode
    print(f"[eval_runner] chunk {chunk_idx} killed by signal {sig} (possibly OOM).", flush=True)
else:
    print(f"[eval_runner] chunk {chunk_idx} failed (exit {result.returncode}).", flush=True)

🔵 Suggestion: non-jobs keys dropped — isaaclab_arena/evaluation/eval_runner.py:143

The chunked config only preserves the "jobs" key: json.dump({"jobs": jobs[start:end]}, tmp). If the master config has other top-level keys (e.g., metadata, version), they won't be forwarded to child processes. This is fine if the schema only has jobs, but worth noting if the schema might expand.

Test Coverage

Feature PR: No tests included
Coverage: Partial — existing test_eval_runner.py tests single-process flow but not chunked dispatch
Gaps: The chunked dispatch logic is untested. Consider adding an integration test verifying chunked dispatch produces the same results as single-process execution (a 3-job config with --chunk_size 1 should produce identical metrics to running without chunking).

CI Status

Pre-commit check queued.

Verdict

Minor fixes needed

The implementation is correct and well-structured. The signal-handling improvement is nice-to-have for UX but not blocking.

Update (eb022d2): Reviewed incremental commit that rewrites the chunked-dispatch implementation from scratch, simplifying it significantly:

Changes:

Removed --start_chunk — The previous version had resume support via --start_chunk; this is now removed, simplifying the implementation. Resume can be achieved externally if needed.
Removed _argv_without_dispatcher_flags helper — The new approach forwards sys.argv[1:] directly and appends --eval_jobs_config <chunk_path> at the end, relying on argparse's last-value-wins semantics. This is cleaner and eliminates potential bugs in flag-stripping logic.
Simplified loop — Straightforward for chunk_idx in range(n_chunks) without start_chunk offset.

Previous concerns resolved:

✅ --start_chunk validation concern is now moot (parameter removed)

Remaining observations:

The signal-aware error messaging suggestion still applies
The non-jobs keys observation still applies (low priority)

For long evaluation sweeps that build and tear down hundreds of envs in one process, a residual host-RSS leak below the Python layer (e.g. USD asset cache, Hydra render delegate state) eventually OOM-kills the process even after every Python-side cleanup has been done. The only reliable way to reclaim that memory is to let the OS release it on process exit. Adds --chunk_size N. When set and the jobs config has more than N jobs, eval_runner splits the config into chunks of N jobs and dispatches one subprocess per chunk; each child boots a fresh SimulationApp and runs only its chunk in-process. If --chunk_size is unset, or len(jobs) <= chunk_size, the dispatcher is inert: behavior matches the original single-process flow exactly. A child subprocess receives at most chunk_size jobs so it also runs in-process, even with --chunk_size still on the command line. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

xyao-nv · 2026-06-02T16:23:54Z

+    parser.add_argument(
+        "--chunk_size",
+        type=int,
+        default=None,


Do we have a recommendation to user how many jobs to run per chunk based on how large gpu memory?

xyao-nv · 2026-06-02T16:25:41Z

        _close_env(env)


+def _run_in_chunks(args_cli: argparse.Namespace, master_cfg: dict) -> None:


Shall eval_runner also include aggregating chunked results into one centralized view at the end?

xyao-nv · 2026-06-02T17:07:28Z

Do you know where this 72 MB comes from?

cvolkcvolk requested review from alexmillane, peterd-NV, qianl-nv, viiik-inside, xyao-nv and zhx06 as code owners June 2, 2026 08:55

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

isaaclab-review-bot Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py Outdated

Comment thread isaaclab_arena/evaluation/eval_runner.py

cvolkcvolk force-pushed the cvolk/feature/eval-runner-chunked-dispatch branch 11 times, most recently from e686814 to 4793daf Compare June 2, 2026 09:52

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py

cvolkcvolk force-pushed the cvolk/feature/eval-runner-chunked-dispatch branch from 4793daf to eb022d2 Compare June 2, 2026 09:58

xyao-nv reviewed Jun 2, 2026

View reviewed changes

xyao-nv approved these changes Jun 2, 2026

View reviewed changes

		for chunk_idx in range(start_chunk, n_chunks):
		start = chunk_idx * chunk_size

-    for chunk_idx in range(start_chunk, n_chunks):
-        start = chunk_idx * chunk_size
+    if start_chunk >= n_chunks:
+        print(
+            f"[eval_runner] WARNING: --start_chunk {start_chunk} is out of range"
+            f" (0..{n_chunks - 1}); no chunks will run.",
+            flush=True,
+        )
+        return 1
+    for chunk_idx in range(start_chunk, n_chunks):
+        start = chunk_idx * chunk_size

	chunk_cfg = {"jobs": jobs[start:end]}
	chunk_cfg = {**master_cfg, "jobs": jobs[start:end]}

		_close_env(env)


		def _run_in_chunks(args_cli: argparse.Namespace, master_cfg: dict) -> None:

Conversation

cvolkcvolk commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

Code Review: Add chunked-dispatch mode to eval_runner

Summary of Findings

Uh oh!

isaaclab-review-bot Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Design Assessment

Findings

Test Coverage

CI Status

Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xyao-nv Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

xyao-nv Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xyao-nv commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cvolkcvolk commented Jun 2, 2026 •

edited

Loading

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading

xyao-nv Jun 2, 2026 •

edited

Loading