feat: extend training report with profiling and step aggregation#959
feat: extend training report with profiling and step aggregation#959blugassi wants to merge 2 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds ChangesTraining report aggregation feature
Estimated code review effort: 4 (Complex) | ~50 minutes Sequence Diagram(s)sequenceDiagram
participant TrainingParser
participant get_model_config
participant tr_test
participant TrainingConfig
participant filter_steps
participant aggregate
TrainingParser->>get_model_config: resolve framework model config
TrainingParser->>tr_test: resolve TestDefinition.model_dump()
TrainingParser->>TrainingConfig: merge model config and test config fields
TrainingParser->>filter_steps: apply exclude_start_steps and profiling window
filter_steps->>aggregate: compute StepAggregation or None
TrainingParser-->>TrainingParser: return TrainingResults(steps, aggregation)
Related issues: None specified. Related PRs: None specified. Suggested labels: enhancement, report-generator, needs-testing Suggested reviewers: None specified. 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/models/workload.py`:
- Around line 92-99: TrainingReportConfig currently allows negative values for
exclude_start_steps and exclude_post_profiling_steps, which can silently alter
the step slice used by TrainingParser._filter_steps. Add explicit non-negative
validation to both fields in TrainingReportConfig so invalid configs fail fast
instead of changing aggregation semantics. Use the TrainingReportConfig model
(and its field definitions) as the place to enforce the constraint, and make
sure the validation applies whenever the config is instantiated.
In `@src/cloudai/report_generator/training/mappings.py`:
- Around line 128-153: The three training config mappings repeat the same
`exclude_start_steps` and `exclude_post_profiling_steps` entries, so factor
those shared `training_report.*` paths into a single reusable mapping and merge
it into `NEMO_TEST_CONFIG`, `MEGATRON_TEST_CONFIG`, and
`MEGATRON_BRIDGE_TEST_CONFIG`. Keep the existing per-framework keys in each
dict, and use the shared constant near the other mapping definitions in
`mappings.py` so future schema changes only need one update.
In `@src/cloudai/report_generator/training/parser.py`:
- Around line 167-174: The wildcard lookup in `_get_from_dict` can resolve an
arbitrary match when `fnmatch.filter` returns multiple keys, which makes
`NEMO_TEST_CONFIG` paths like `extra_cmd_args.*start_step` and `*end_step`
fragile. Update `_get_from_dict` in `parser.py` to detect when a wildcard
segment matches more than one key and either raise or emit a warning instead of
using `next(...)`, so `profiling_start_step` and `profiling_stop_step` are not
silently derived from dict insertion order. Apply the same safeguard anywhere
else this helper is used for wildcard resolution.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 264c8e1b-2c43-458f-9890-1adbdba147ba
📒 Files selected for processing (5)
src/cloudai/models/workload.pysrc/cloudai/report_generator/training/mappings.pysrc/cloudai/report_generator/training/models.pysrc/cloudai/report_generator/training/parser.pytests/report_generator/training/test_training_parser.py
Add profiling_enabled/profiling_start_step/profiling_stop_step to TrainingConfig, resolved from the run's CloudAI TestDefinition (a new TEST_CONFIG mapping) rather than the framework artifact (renamed to MODEL_CONFIG). NeMo's nsys-callback step bounds live in free-form extra_cmd_args, so _get_from_dict gains a '*' glob segment to match them independently of the user-chosen callback index. Signed-off-by: Ben Lugassi <blugassi@nvidia.com>
Add a top-level `aggregation` block to training_report.json with per-metric mean/min/max/std/t99/t95 (StepAggregation built from MetricStats). Two TOML-configurable window flags on TestDefinition ([training_report]: exclude_start_steps, exclude_post_profiling_steps) drop warmup and the profiling window before aggregating, and are recorded in TrainingConfig. Signed-off-by: Ben Lugassi <blugassi@nvidia.com>
075fe29 to
1e978db
Compare
| model_config = ConfigDict(extra="forbid") | ||
|
|
||
| exclude_start_steps: int = Field(default=5, ge=0) | ||
| exclude_post_profiling_steps: int = Field(default=2, ge=0) |
There was a problem hiding this comment.
@alexmanle fyi, I think this is what you've been talking about, right?
Summary
Extends the unified training report (
training_report.json) forNeMoRun,MegatronRun, andMegatronBridgewith two additions:profiling_enabled,profiling_start_step, andprofiling_stop_stepfor the run (sourced from the run's CloudAI test definition).aggregationblock with per-metricmean/min/max/std/t99/t95, computed over the steps left after excluding warmup and the profiling window. The exclusion window is configurable per test via[training_report](exclude_start_steps,exclude_post_profiling_steps; defaults 5/2).Test Plan
pytest.tests/report_generator/extended with cases for the new profiling and aggregation capabilities (field resolution, aggregation stats, window exclusion, defaults/overrides) — all pass.cloudai generate-reportfor all three workloads and verified eachtraining_report.jsonhas exactly{config, steps, aggregation}, correctprofiling_*fields, and aggregation stats satisfyingmean ≤ t95 ≤ max. Confirmed a[training_report]TOML override flows intoconfig.exclude_*and shifts the aggregation window.