feat: extend training report with profiling and step aggregation by blugassi · Pull Request #959 · NVIDIA/cloudai

blugassi · 2026-07-02T17:53:14Z

Summary

Extends the unified training report (training_report.json) for NeMoRun, MegatronRun, and MegatronBridge with two additions:

Profiling metadata — records profiling_enabled, profiling_start_step, and profiling_stop_step for the run (sourced from the run's CloudAI test definition).
Step aggregation — adds a top-level aggregation block with per-metric mean/min/max/std/t99/t95, computed over the steps left after excluding warmup and the profiling window. The exclusion window is configurable per test via [training_report] (exclude_start_steps, exclude_post_profiling_steps; defaults 5/2).

Test Plan

Environment: Python 3.12 venv; pytest.
Unit: tests/report_generator/ extended with cases for the new profiling and aggregation capabilities (field resolution, aggregation stats, window exclusion, defaults/overrides) — all pass.
End-to-end: ran cloudai generate-report for all three workloads and verified each training_report.json has exactly {config, steps, aggregation}, correct profiling_* fields, and aggregation stats satisfying mean ≤ t95 ≤ max. Confirmed a [training_report] TOML override flows into config.exclude_* and shifts the aggregation window.

coderabbitai · 2026-07-02T17:53:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ca5d4e1c-b20c-4a17-9e7a-caf8ea07a11b

📥 Commits

Reviewing files that changed from the base of the PR and between 075fe29 and 1e978db.

📒 Files selected for processing (6)

src/cloudai/models/workload.py
src/cloudai/report_generator/training/mappings.py
src/cloudai/report_generator/training/models.py
src/cloudai/report_generator/training/parser.py
tests/report_generator/training/test_training_parser.py
tests/test_test_definitions.py

📝 Walkthrough

Walkthrough

Adds TrainingReportConfig to TestDefinition, introduces per-metric training-step aggregation, and refactors TrainingParser to merge model config with test-definition config while applying exclusion-window filtering and optional aggregation.

Changes

Training report aggregation feature

Layer / File(s)	Summary
TrainingReportConfig schema addition `src/cloudai/models/workload.py`	Adds `TrainingReportConfig` with exclusion-window fields and an optional `training_report` field on `TestDefinition`.
Aggregation statistics dataclasses `src/cloudai/report_generator/training/models.py`	Adds `MetricStats` and `StepAggregation`, extends `TrainingConfig` with profiling/exclusion fields, and makes `TrainingResults.aggregation` optional.
Config mapping renames and test-config mappings `src/cloudai/report_generator/training/mappings.py`	Renames `_CONFIG` constants to `_MODEL_CONFIG` and adds `*_TEST_CONFIG` dictionaries for profiling/exclusion fields from `TestDefinition.model_dump()`.
Dual-source config resolution and aggregation in parser `src/cloudai/report_generator/training/parser.py`	Refactors `TrainingParser` to merge model config and test config into `TrainingConfig`, filters steps and computes aggregation, rewrites dotted-path lookup with wildcard support, and updates parser subclasses to use `get_model_config` and renamed mapping attributes.
Test suite updates `tests/report_generator/training/test_training_parser.py`	Updates `_tr` helper and `_Test.model_dump`, adapts config resolution tests to mock `get_model_config`, and adds aggregation and glob-lookup coverage.
Training report config validation `tests/test_test_definitions.py`	Adds tests for `TrainingReportConfig` defaults and negative-value validation.

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant TrainingParser
  participant get_model_config
  participant tr_test
  participant TrainingConfig
  participant filter_steps
  participant aggregate

  TrainingParser->>get_model_config: resolve framework model config
  TrainingParser->>tr_test: resolve TestDefinition.model_dump()
  TrainingParser->>TrainingConfig: merge model config and test config fields
  TrainingParser->>filter_steps: apply exclude_start_steps and profiling window
  filter_steps->>aggregate: compute StepAggregation or None
  TrainingParser-->>TrainingParser: return TrainingResults(steps, aggregation)

Related issues: None specified.

Related PRs: None specified.

Suggested labels: enhancement, report-generator, needs-testing

Suggested reviewers: None specified.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: extending training reports with profiling metadata and step aggregation.
Description check	✅ Passed	The description accurately matches the changeset and explains the new profiling metadata, aggregation, and exclusion window behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/models/workload.py`:
- Around line 92-99: TrainingReportConfig currently allows negative values for
exclude_start_steps and exclude_post_profiling_steps, which can silently alter
the step slice used by TrainingParser._filter_steps. Add explicit non-negative
validation to both fields in TrainingReportConfig so invalid configs fail fast
instead of changing aggregation semantics. Use the TrainingReportConfig model
(and its field definitions) as the place to enforce the constraint, and make
sure the validation applies whenever the config is instantiated.

In `@src/cloudai/report_generator/training/mappings.py`:
- Around line 128-153: The three training config mappings repeat the same
`exclude_start_steps` and `exclude_post_profiling_steps` entries, so factor
those shared `training_report.*` paths into a single reusable mapping and merge
it into `NEMO_TEST_CONFIG`, `MEGATRON_TEST_CONFIG`, and
`MEGATRON_BRIDGE_TEST_CONFIG`. Keep the existing per-framework keys in each
dict, and use the shared constant near the other mapping definitions in
`mappings.py` so future schema changes only need one update.

In `@src/cloudai/report_generator/training/parser.py`:
- Around line 167-174: The wildcard lookup in `_get_from_dict` can resolve an
arbitrary match when `fnmatch.filter` returns multiple keys, which makes
`NEMO_TEST_CONFIG` paths like `extra_cmd_args.*start_step` and `*end_step`
fragile. Update `_get_from_dict` in `parser.py` to detect when a wildcard
segment matches more than one key and either raise or emit a warning instead of
using `next(...)`, so `profiling_start_step` and `profiling_stop_step` are not
silently derived from dict insertion order. Apply the same safeguard anywhere
else this helper is used for wildcard resolution.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 264c8e1b-2c43-458f-9890-1adbdba147ba

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab9048 and 075fe29.

📒 Files selected for processing (5)

src/cloudai/models/workload.py
src/cloudai/report_generator/training/mappings.py
src/cloudai/report_generator/training/models.py
src/cloudai/report_generator/training/parser.py
tests/report_generator/training/test_training_parser.py

Add profiling_enabled/profiling_start_step/profiling_stop_step to TrainingConfig, resolved from the run's CloudAI TestDefinition (a new TEST_CONFIG mapping) rather than the framework artifact (renamed to MODEL_CONFIG). NeMo's nsys-callback step bounds live in free-form extra_cmd_args, so _get_from_dict gains a '*' glob segment to match them independently of the user-chosen callback index. Signed-off-by: Ben Lugassi <blugassi@nvidia.com>

Add a top-level `aggregation` block to training_report.json with per-metric mean/min/max/std/t99/t95 (StepAggregation built from MetricStats). Two TOML-configurable window flags on TestDefinition ([training_report]: exclude_start_steps, exclude_post_profiling_steps) drop warmup and the profiling window before aggregating, and are recorded in TrainingConfig. Signed-off-by: Ben Lugassi <blugassi@nvidia.com>

podkidyshev · 2026-07-03T11:51:36Z

+    model_config = ConfigDict(extra="forbid")
+
+    exclude_start_steps: int = Field(default=5, ge=0)
+    exclude_post_profiling_steps: int = Field(default=2, ge=0)


@alexmanle fyi, I think this is what you've been talking about, right?

blugassi requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners July 2, 2026 17:53

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread src/cloudai/models/workload.py

Comment thread src/cloudai/report_generator/training/mappings.py

Comment thread src/cloudai/report_generator/training/parser.py

blugassi added 2 commits July 2, 2026 21:32

blugassi force-pushed the training-report-profiling branch from 075fe29 to 1e978db Compare July 2, 2026 18:32

podkidyshev reviewed Jul 3, 2026

View reviewed changes

podkidyshev approved these changes Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: extend training report with profiling and step aggregation#959

feat: extend training report with profiling and step aggregation#959
blugassi wants to merge 2 commits into
NVIDIA:mainfrom
blugassi:training-report-profiling

blugassi commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

podkidyshev Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

blugassi commented Jul 2, 2026

Summary

Test Plan

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

podkidyshev Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading