Skip to content

feat: extend training report with profiling and step aggregation#959

Open
blugassi wants to merge 2 commits into
NVIDIA:mainfrom
blugassi:training-report-profiling
Open

feat: extend training report with profiling and step aggregation#959
blugassi wants to merge 2 commits into
NVIDIA:mainfrom
blugassi:training-report-profiling

Conversation

@blugassi

@blugassi blugassi commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Extends the unified training report (training_report.json) for NeMoRun, MegatronRun, and MegatronBridge with two additions:

  • Profiling metadata — records profiling_enabled, profiling_start_step, and profiling_stop_step for the run (sourced from the run's CloudAI test definition).
  • Step aggregation — adds a top-level aggregation block with per-metric mean/min/max/std/t99/t95, computed over the steps left after excluding warmup and the profiling window. The exclusion window is configurable per test via [training_report] (exclude_start_steps, exclude_post_profiling_steps; defaults 5/2).

Test Plan

  • Environment: Python 3.12 venv; pytest.
  • Unit: tests/report_generator/ extended with cases for the new profiling and aggregation capabilities (field resolution, aggregation stats, window exclusion, defaults/overrides) — all pass.
  • End-to-end: ran cloudai generate-report for all three workloads and verified each training_report.json has exactly {config, steps, aggregation}, correct profiling_* fields, and aggregation stats satisfying mean ≤ t95 ≤ max. Confirmed a [training_report] TOML override flows into config.exclude_* and shifts the aggregation window.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ca5d4e1c-b20c-4a17-9e7a-caf8ea07a11b

📥 Commits

Reviewing files that changed from the base of the PR and between 075fe29 and 1e978db.

📒 Files selected for processing (6)
  • src/cloudai/models/workload.py
  • src/cloudai/report_generator/training/mappings.py
  • src/cloudai/report_generator/training/models.py
  • src/cloudai/report_generator/training/parser.py
  • tests/report_generator/training/test_training_parser.py
  • tests/test_test_definitions.py

📝 Walkthrough

Walkthrough

Adds TrainingReportConfig to TestDefinition, introduces per-metric training-step aggregation, and refactors TrainingParser to merge model config with test-definition config while applying exclusion-window filtering and optional aggregation.

Changes

Training report aggregation feature

Layer / File(s) Summary
TrainingReportConfig schema addition
src/cloudai/models/workload.py
Adds TrainingReportConfig with exclusion-window fields and an optional training_report field on TestDefinition.
Aggregation statistics dataclasses
src/cloudai/report_generator/training/models.py
Adds MetricStats and StepAggregation, extends TrainingConfig with profiling/exclusion fields, and makes TrainingResults.aggregation optional.
Config mapping renames and test-config mappings
src/cloudai/report_generator/training/mappings.py
Renames *_CONFIG constants to *_MODEL_CONFIG and adds *_TEST_CONFIG dictionaries for profiling/exclusion fields from TestDefinition.model_dump().
Dual-source config resolution and aggregation in parser
src/cloudai/report_generator/training/parser.py
Refactors TrainingParser to merge model config and test config into TrainingConfig, filters steps and computes aggregation, rewrites dotted-path lookup with wildcard support, and updates parser subclasses to use get_model_config and renamed mapping attributes.
Test suite updates
tests/report_generator/training/test_training_parser.py
Updates _tr helper and _Test.model_dump, adapts config resolution tests to mock get_model_config, and adds aggregation and glob-lookup coverage.
Training report config validation
tests/test_test_definitions.py
Adds tests for TrainingReportConfig defaults and negative-value validation.

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant TrainingParser
  participant get_model_config
  participant tr_test
  participant TrainingConfig
  participant filter_steps
  participant aggregate

  TrainingParser->>get_model_config: resolve framework model config
  TrainingParser->>tr_test: resolve TestDefinition.model_dump()
  TrainingParser->>TrainingConfig: merge model config and test config fields
  TrainingParser->>filter_steps: apply exclude_start_steps and profiling window
  filter_steps->>aggregate: compute StepAggregation or None
  TrainingParser-->>TrainingParser: return TrainingResults(steps, aggregation)
Loading

Related issues: None specified.

Related PRs: None specified.

Suggested labels: enhancement, report-generator, needs-testing

Suggested reviewers: None specified.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: extending training reports with profiling metadata and step aggregation.
Description check ✅ Passed The description accurately matches the changeset and explains the new profiling metadata, aggregation, and exclusion window behavior.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/models/workload.py`:
- Around line 92-99: TrainingReportConfig currently allows negative values for
exclude_start_steps and exclude_post_profiling_steps, which can silently alter
the step slice used by TrainingParser._filter_steps. Add explicit non-negative
validation to both fields in TrainingReportConfig so invalid configs fail fast
instead of changing aggregation semantics. Use the TrainingReportConfig model
(and its field definitions) as the place to enforce the constraint, and make
sure the validation applies whenever the config is instantiated.

In `@src/cloudai/report_generator/training/mappings.py`:
- Around line 128-153: The three training config mappings repeat the same
`exclude_start_steps` and `exclude_post_profiling_steps` entries, so factor
those shared `training_report.*` paths into a single reusable mapping and merge
it into `NEMO_TEST_CONFIG`, `MEGATRON_TEST_CONFIG`, and
`MEGATRON_BRIDGE_TEST_CONFIG`. Keep the existing per-framework keys in each
dict, and use the shared constant near the other mapping definitions in
`mappings.py` so future schema changes only need one update.

In `@src/cloudai/report_generator/training/parser.py`:
- Around line 167-174: The wildcard lookup in `_get_from_dict` can resolve an
arbitrary match when `fnmatch.filter` returns multiple keys, which makes
`NEMO_TEST_CONFIG` paths like `extra_cmd_args.*start_step` and `*end_step`
fragile. Update `_get_from_dict` in `parser.py` to detect when a wildcard
segment matches more than one key and either raise or emit a warning instead of
using `next(...)`, so `profiling_start_step` and `profiling_stop_step` are not
silently derived from dict insertion order. Apply the same safeguard anywhere
else this helper is used for wildcard resolution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 264c8e1b-2c43-458f-9890-1adbdba147ba

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab9048 and 075fe29.

📒 Files selected for processing (5)
  • src/cloudai/models/workload.py
  • src/cloudai/report_generator/training/mappings.py
  • src/cloudai/report_generator/training/models.py
  • src/cloudai/report_generator/training/parser.py
  • tests/report_generator/training/test_training_parser.py

Comment thread src/cloudai/models/workload.py
Comment thread src/cloudai/report_generator/training/mappings.py
Comment thread src/cloudai/report_generator/training/parser.py
blugassi added 2 commits July 2, 2026 21:32
Add profiling_enabled/profiling_start_step/profiling_stop_step to
TrainingConfig, resolved from the run's CloudAI TestDefinition (a new
TEST_CONFIG mapping) rather than the framework artifact (renamed to
MODEL_CONFIG). NeMo's nsys-callback step bounds live in free-form
extra_cmd_args, so _get_from_dict gains a '*' glob segment to match them
independently of the user-chosen callback index.

Signed-off-by: Ben Lugassi <blugassi@nvidia.com>
Add a top-level `aggregation` block to training_report.json with per-metric
mean/min/max/std/t99/t95 (StepAggregation built from MetricStats). Two
TOML-configurable window flags on TestDefinition ([training_report]:
exclude_start_steps, exclude_post_profiling_steps) drop warmup and the
profiling window before aggregating, and are recorded in TrainingConfig.

Signed-off-by: Ben Lugassi <blugassi@nvidia.com>
@blugassi blugassi force-pushed the training-report-profiling branch from 075fe29 to 1e978db Compare July 2, 2026 18:32
model_config = ConfigDict(extra="forbid")

exclude_start_steps: int = Field(default=5, ge=0)
exclude_post_profiling_steps: int = Field(default=2, ge=0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexmanle fyi, I think this is what you've been talking about, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants