Save hf checkpoint at every validation iteration during distillation. #1897
danielkorzekwa wants to merge 9 commits into
Conversation
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
/claude review
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Adds per-validation HuggingFace exports, token-budgeted preprocessing and blend generation, plus tests and documentation for the new iterative workflows.
Changes
Iterative distillation tooling
Estimated code review effort: 4 (Complex) | ~60 minutes
Sequence Diagram(s)
sequenceDiagram
participant distill.py CLI
participant pretrain
participant _HFValidationExportCallback
participant AutoConfig
participant Filesystem
distill.py CLI->>pretrain: pretrain(config, callbacks=[callback])
pretrain->>_HFValidationExportCallback: on_validation_end(iteration)
_HFValidationExportCallback->>Filesystem: create iter_<iteration>/ export path
_HFValidationExportCallback->>AutoConfig: save_pretrained(...)
AutoConfig->>Filesystem: write config.json
_HFValidationExportCallback->>pretrain: torch.distributed.barrier()
sequenceDiagram
participant CLI
participant megatron_preprocess_data
participant process_hf_split
participant process_json_file
participant _encode_docs
CLI->>megatron_preprocess_data: --max_tokens
megatron_preprocess_data->>process_hf_split: remaining_tokens
megatron_preprocess_data->>process_json_file: remaining_tokens
process_hf_split->>_encode_docs: may_stop_early=True
process_json_file->>_encode_docs: may_stop_early=True
process_hf_split-->>megatron_preprocess_data: stop at token budget
process_json_file-->>megatron_preprocess_data: stop at token budget
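The budget flow in the second diagram can be sketched roughly as follows. This is a minimal illustration, not the code in megatron_preprocess_data.py: `encode_source`, `preprocess_with_budget`, and the whitespace "tokenizer" are hypothetical stand-ins, and the sketch assumes the budget is enforced at whole-document granularity.

```python
from __future__ import annotations

from collections.abc import Iterable


def encode_source(
    docs: Iterable[str], remaining_tokens: int | None
) -> tuple[list[list[str]], int]:
    """Encode documents until the remaining budget is used up.

    Hypothetical stand-in: a whitespace split plays the role of the tokenizer.
    Returns the encoded documents and the number of tokens consumed.
    """
    encoded: list[list[str]] = []
    used = 0
    for doc in docs:
        if remaining_tokens is not None and used >= remaining_tokens:
            break  # budget reached: stop early instead of reading the whole source
        tokens = doc.split()
        encoded.append(tokens)
        used += len(tokens)
    return encoded, used


def preprocess_with_budget(
    sources: list[list[str]], max_tokens: int | None
) -> list[list[list[str]]]:
    """Process sources in order, passing the shrinking budget to each one."""
    remaining = max_tokens
    outputs: list[list[list[str]]] = []
    for docs in sources:
        if remaining is not None and remaining <= 0:
            break  # nothing left to spend, so later sources are skipped entirely
        encoded, used = encode_source(docs, remaining)
        outputs.append(encoded)
        if remaining is not None:
            remaining -= used
    return outputs
```

Because the check happens before each document is encoded, the last document can push the total slightly past the budget; a stricter variant would truncate that document instead.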
Pre-merge checks: 6 passed
        )
        # TODO: Use distill(..., callbacks=[callback]) once Megatron-Bridge supports callbacks.
        pretrain(config, forward_step_modelopt, callbacks=[callback])
    else:
[SUGGESTION] When --hf_validation_export_path is set, the training entrypoint switches from distill(config) to pretrain(config, forward_step_modelopt, callbacks=[callback]). This creates two distinct training code paths that must stay behaviorally identical — otherwise enabling validation export silently changes how the model is trained, not just whether checkpoints are dumped.
If distill() does any distillation-specific setup beyond pretrain + forward_step_modelopt (e.g. loss-balancer wiring, KD config injection, provider hooks), that setup would be skipped on the export path. The assert isinstance(config.model, DistillationProvider) guard suggests you've considered this, but it would be worth (1) confirming distill() is genuinely just pretrain(config, forward_step_modelopt, ...) under the hood, and (2) leaving a one-line comment near the fork noting that equivalence, so a future change to distill() doesn't quietly diverge the two paths. The existing TODO about distill(..., callbacks=...) partially covers this, but the equivalence risk is the part worth calling out explicitly.
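One way to keep the fork visible is to route both paths through a single helper that carries the equivalence note. The sketch below is hypothetical: `distill`, `pretrain`, and `forward_step_modelopt` are stubbed locally so the snippet runs on its own, `run_training` is an invented name, and whether the real distill() reduces to pretrain(config, forward_step_modelopt) is exactly the assumption this comment asks to confirm.

```python
calls: list[str] = []


def forward_step_modelopt(*args, **kwargs):
    """Stub for the real forward step; never invoked in this sketch."""


def distill(config):
    """Stub standing in for the Megatron-Bridge distill() entrypoint."""
    calls.append("distill")


def pretrain(config, forward_step, callbacks=None):
    """Stub standing in for the Megatron-Bridge pretrain() entrypoint."""
    calls.append(f"pretrain+{len(callbacks or [])}cb")


def run_training(config, callback=None):
    """Hypothetical helper that keeps the two training entrypoints side by side."""
    if callback is None:
        distill(config)
        return
    # ASSUMPTION to verify: distill(config) is equivalent to
    # pretrain(config, forward_step_modelopt). If distill() gains extra setup,
    # this branch must be updated so validation export does not change training.
    pretrain(config, forward_step_modelopt, callbacks=[callback])
```

Keeping both calls in one function means a reviewer touching either branch sees the other, which is the main point of the suggestion.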
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/megatron_bridge/distill.py (1)
209-214: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win
Reload of AutoConfig on every validation export; duplicated with the final-export path.
AutoConfig.from_pretrained(self.student_hf_path, ...) is re-executed on every on_eval_end call even though student_hf_path/trust_remote_code never change across the run; this repeats file/Hub I/O unnecessarily. The same load-then-save_pretrained pattern (with an identical comment) is also duplicated in main() at Lines 557-560.
Consider loading the config once in __init__ and reusing it in on_eval_end, and/or extracting a small shared helper used by both this callback and the final --hf_export_path block to avoid drift between the two copies.
♻️ Proposed fix to cache the config once
     def __init__(
         self,
         export_dir: str,
         student_hf_model: str,
         student_hf_path: str,
         trust_remote_code: bool,
     ) -> None:
         self.export_dir = Path(export_dir)
-        self.student_hf_path = student_hf_path
-        self.trust_remote_code = trust_remote_code
         self._last_exported_iteration: int | None = None
         self.bridge = AutoBridge.from_hf_pretrained(
             student_hf_model, trust_remote_code=trust_remote_code
         )
+        self._student_config = AutoConfig.from_pretrained(
+            student_hf_path, trust_remote_code=trust_remote_code
+        )

         if dist.rank() == 0:
-            # Preserve the student architecture from student_hf_path, including heterogeneous
-            # layer changes; AutoConfig supports both local paths and Hugging Face model IDs.
-            AutoConfig.from_pretrained(
-                self.student_hf_path, trust_remote_code=self.trust_remote_code
-            ).save_pretrained(output_path)
+            self._student_config.save_pretrained(output_path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/megatron_bridge/distill.py` around lines 209 - 214, `_HFValidationExportCallback.on_eval_end` is reloading the student config on every validation export even though `student_hf_path` and `trust_remote_code` are constant, and the same `AutoConfig.from_pretrained(...).save_pretrained(...)` logic is duplicated in `main()`. Cache the loaded config once in `_HFValidationExportCallback.__init__` (or a shared helper) and reuse it in `on_eval_end`, then have the final `--hf_export_path` export path call the same helper to keep the behavior in sync and avoid repeated I/O.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/megatron_bridge/distill.py`:
- Around line 167-218: The _HFValidationExportCallback currently saves a new
Hugging Face export on every validation with no cleanup, which can cause
unbounded disk growth. Add a retention setting such as keep_last_n to the
callback, and in on_eval_end prune older iter_* export directories after a
successful save, keeping only the most recent exports. Use the existing export
flow in _HFValidationExportCallback and mirror the retention approach already
used for main checkpointing (most_recent_k) so the logic is easy to locate and
consistent.
---
Nitpick comments:
In `@examples/megatron_bridge/distill.py`:
- Around line 209-214: `_HFValidationExportCallback.on_eval_end` is reloading
the student config on every validation export even though `student_hf_path` and
`trust_remote_code` are constant, and the same
`AutoConfig.from_pretrained(...).save_pretrained(...)` logic is duplicated in
`main()`. Cache the loaded config once in `_HFValidationExportCallback.__init__`
(or a shared helper) and reuse it in `on_eval_end`, then have the final
`--hf_export_path` export path call the same helper to keep the behavior in sync
and avoid repeated I/O.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c41374ad-ecce-4dc2-b44d-66aedfb9cdc0
📒 Files selected for processing (3)
- examples/megatron_bridge/README.md
- examples/megatron_bridge/distill.py
- tests/examples/megatron_bridge/test_distill.py
class _HFValidationExportCallback(Callback):
    """Export the live student to Hugging Face format after each validation stage."""

    def __init__(
        self,
        export_dir: str,
        student_hf_model: str,
        student_hf_path: str,
        trust_remote_code: bool,
    ) -> None:
        self.export_dir = Path(export_dir)
        self.student_hf_path = student_hf_path
        self.trust_remote_code = trust_remote_code
        self._last_exported_iteration: int | None = None
        self.bridge = AutoBridge.from_hf_pretrained(
            student_hf_model, trust_remote_code=trust_remote_code
        )

    def on_eval_end(self, context) -> None:
        """Export the student at the iteration that was just validated."""
        iteration = context.state.train_state.step
        # The final iteration can be validated both on its regular interval and after training.
        # Avoid exporting and overwriting the same Hugging Face checkpoint twice.
        if iteration == self._last_exported_iteration:
            return
        output_path = self.export_dir / f"iter_{iteration:07d}"
        print_rank_0(f"Exporting validation checkpoint {iteration} to {output_path}")

        # DistillationModel is the student with teacher and KD-loss modules attached. Hide the
        # auxiliary modules temporarily so the Hugging Face export contains only student weights.
        with contextlib.ExitStack() as stack:
            for model_chunk in unwrap_model(context.model):
                if isinstance(model_chunk, mtd.DistillationModel):
                    stack.enter_context(model_chunk.hide_teacher_model())
                    stack.enter_context(model_chunk.hide_loss_modules())
            self.bridge.save_hf_pretrained(
                context.model,
                output_path,
                show_progress=True,
                strict=True,
            )

        if dist.rank() == 0:
            # Preserve the student architecture from student_hf_path, including heterogeneous
            # layer changes; AutoConfig supports both local paths and Hugging Face model IDs.
            AutoConfig.from_pretrained(
                self.student_hf_path, trust_remote_code=self.trust_remote_code
            ).save_pretrained(output_path)
        torch.distributed.barrier()
        self._last_exported_iteration = iteration
🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win
No retention policy for per-validation HF exports — unbounded disk growth risk.
Every validation stage writes a full HuggingFace checkpoint to iter_<n>/ with no cap on how many are kept. For a realistic run (e.g. the README's --train_iters 15000 --eval_interval 100 example), that's ~150 full model copies with no cleanup — unlike the main Megatron checkpoint config in this same file, which bounds retention via most_recent_k=5 (Line 505). For multi-billion-parameter students this can exhaust disk and fail the job mid-training.
Consider adding a retention parameter (e.g. keep_last_n) to _HFValidationExportCallback that prunes older iter_* export directories, mirroring the most_recent_k pattern already used for the main checkpoint.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/megatron_bridge/distill.py` around lines 167 - 218, The
_HFValidationExportCallback currently saves a new Hugging Face export on every
validation with no cleanup, which can cause unbounded disk growth. Add a
retention setting such as keep_last_n to the callback, and in on_eval_end prune
older iter_* export directories after a successful save, keeping only the most
recent exports. Use the existing export flow in _HFValidationExportCallback and
mirror the retention approach already used for main checkpointing
(most_recent_k) so the logic is easy to locate and consistent.
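The pruning step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `prune_old_exports` and `keep_last_n` are hypothetical names, and the sort relies on the zero-padded iter_<iteration:07d> directory names used by the callback, so name order matches iteration order.

```python
import shutil
from pathlib import Path


def prune_old_exports(export_dir: Path, keep_last_n: int) -> list[Path]:
    """Delete all but the newest `keep_last_n` iter_* directories.

    Relies on zero-padded directory names, so a plain name sort matches
    iteration order. Returns the directories that were removed.
    """
    if keep_last_n < 1:
        raise ValueError("keep_last_n must be at least 1")
    exports = sorted(p for p in export_dir.glob("iter_*") if p.is_dir())
    stale = exports[:-keep_last_n]  # everything except the newest keep_last_n
    for path in stale:
        shutil.rmtree(path)
    return stale
```

In a distributed run this would presumably be called on a single rank only, after the export for the current iteration has completed on all ranks.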
Claude review passed — no blocking issues found. LGTM
Scope: Examples-only PR (3 files: examples/megatron_bridge/distill.py, its README, and tests/examples/megatron_bridge/test_distill.py). megatron.bridge is an external dependency not installed in this environment, so distill()/pretrain() internals could not be introspected; review is based on the diff plus ModelOpt's DistillationModel API.
Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 1
What I verified
- The export path correctly strips auxiliary modules before saving: hide_teacher_model() and hide_loss_modules() exist in modelopt/torch/distill/distillation_model.py and are the same mechanism used by the minimal state-dict export path.
- Distributed ordering is sound: all ranks run the collective save_hf_pretrained, only rank 0 writes config.json, followed by a barrier().
- The _last_exported_iteration guard sensibly prevents double-export/overwrite when the final iteration is validated both on its interval and post-training.
- Argument validation was correctly widened to require --student_hf_model when either HF export flag is set.
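The hide-then-export pattern from the first point can be illustrated with a toy model. Everything here is a hypothetical stand-in (`ToyDistillationModel`, `hide`, `export_student_only`); it only shows how contextlib.ExitStack keeps several temporary "hide" contexts open across all model chunks and restores them afterwards, not how ModelOpt implements hide_teacher_model() or hide_loss_modules().

```python
import contextlib


class ToyDistillationModel:
    """Hypothetical stand-in for a student wrapped with teacher and loss modules."""

    def __init__(self) -> None:
        self.modules = {"student": "S", "teacher": "T", "kd_loss": "L"}

    @contextlib.contextmanager
    def hide(self, name: str):
        """Temporarily remove one auxiliary module, restoring it on exit."""
        hidden = self.modules.pop(name)
        try:
            yield
        finally:
            self.modules[name] = hidden


def export_student_only(chunks: list[ToyDistillationModel]) -> list[list[str]]:
    """Return the module names visible while every chunk's extras are hidden."""
    with contextlib.ExitStack() as stack:
        for chunk in chunks:
            stack.enter_context(chunk.hide("teacher"))
            stack.enter_context(chunk.hide("kd_loss"))
        # All context managers are active here, so only student modules remain.
        return [sorted(chunk.modules) for chunk in chunks]
```

The stack unwinds in reverse order when the with block exits, so the auxiliary modules are restored even if the export step raises.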
Suggestion (non-blocking)
- Enabling --hf_validation_export_path switches training from distill(config) to pretrain(config, forward_step_modelopt, callbacks=[...]). Worth confirming these two paths are behaviorally identical and leaving a comment noting the equivalence, so a future change to distill() doesn't silently diverge the export path from the normal one.
Overall risk: low — additive, opt-in feature gated behind a new flag, backward compatible, with test coverage added.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1897 +/- ##
===========================================
- Coverage 75.20% 61.17% -14.03%
===========================================
Files 515 515
Lines 57245 57274 +29
===========================================
- Hits 43050 35038 -8012
- Misses 14195 22236 +8041
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
308-368: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Add max_tokens to file-backed dataset prefixes
process_json_file() returns the same .bin/.idx prefix regardless of token budget, so reruns with a different target_tokens can silently reuse stale files for "files" sources. This also leaves the file-backed path out of sync with process_hf_split() and the blend test's _tokens expectation. Consider appending the same token_tag there.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 308 - 368, `process_json_file()` currently builds output prefixes only from the input stem, so file-backed runs with different token budgets can reuse the same `.bin/.idx` artifacts. Update `process_json_file` in `megatron_preprocess_data.py` to append the same `token_tag` used by `process_hf_split()` when constructing `output_prefix`/`prefixes`, so the `process_json_file` and `process_hf_split` paths produce consistent, budget-specific dataset names. Ensure the new naming is derived from the existing `max_tokens`/token budget logic and is applied before checking for existing builders or returning skipped prefixes.
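A budget-specific prefix could be built along these lines. This is a sketch only: `budgeted_prefix` and `needs_rebuild` are hypothetical helpers, and the "_<N>tokens" suffix is an illustrative format, since the actual token_tag produced by process_hf_split() is not shown in this thread.

```python
from __future__ import annotations

from pathlib import Path


def budgeted_prefix(
    output_dir: Path, stem: str, json_key: str, max_tokens: int | None
) -> Path:
    """Build an output prefix that changes whenever the token budget changes.

    The "_<N>tokens" suffix is an illustrative format, not the real token_tag.
    """
    token_tag = "" if max_tokens is None else f"_{max_tokens}tokens"
    return output_dir / f"{stem}_{json_key}{token_tag}"


def needs_rebuild(prefix: Path) -> bool:
    """A prefix can be reused only if both the .bin and .idx files already exist."""
    return not all(Path(f"{prefix}{ext}").exists() for ext in (".bin", ".idx"))
```

With the budget encoded in the name, a rerun with a different target_tokens resolves to a different prefix and cannot pick up artifacts from an earlier budget.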
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/dataset/prepare_data_blend.py`:
- Around line 88-141: The weight-to-token allocation in prepare_data_blend can
produce zero or negative max_tokens for the last source when YAML weights are
misconfigured, leading to an empty prefixes list and a later ZeroDivisionError.
Add upfront validation in prepare_data_blend/load_config that source weights sum
to about 100 (or otherwise ensure source_tokens stays positive), and also guard
the prefix_weight calculation after megatron_preprocess_data so an empty
prefixes result raises a clear configuration error instead of dividing by zero.
---
Outside diff comments:
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 308-368: `process_json_file()` currently builds output prefixes
only from the input stem, so file-backed runs with different token budgets can
reuse the same `.bin/.idx` artifacts. Update `process_json_file` in
`megatron_preprocess_data.py` to append the same `token_tag` used by
`process_hf_split()` when constructing `output_prefix`/`prefixes`, so the
`process_json_file` and `process_hf_split` paths produce consistent,
budget-specific dataset names. Ensure the new naming is derived from the
existing `max_tokens`/token budget logic and is applied before checking for
existing builders or returning skipped prefixes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9f8a8b4f-8080-4109-952c-6c7d31f8c071
📒 Files selected for processing (8)
- examples/dataset/MEGATRON_DATA_PREP.md
- examples/dataset/prepare_data_blend.py
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
- examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
- examples/researcher_guide/README.md
- modelopt/torch/utils/plugins/megatron_preprocess_data.py
- tests/examples/dataset/test_prepare_data_blend.py
- tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
✅ Files skipped from review due to trivial changes (3)
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
- examples/researcher_guide/README.md
- examples/dataset/MEGATRON_DATA_PREP.md
    for index, source in enumerate(sources):
        weight = float(source["weight"])
        if total_tokens is None:
            source_tokens = None
        elif index == len(sources) - 1:
            source_tokens = total_tokens - allocated_tokens
        else:
            source_tokens = round(total_tokens * weight / 100)
            allocated_tokens += source_tokens

        dataset = source["hf_dataset"]
        source_dir = output_dir / f"{index:02d}_{dataset.replace('/', '--')}"
        content_field = source["content_field"]
        input_args: dict[str, Any]
        if "files" in source:
            raw_dir = output_dir.parent / "raw" / dataset.replace("/", "--")
            paths = [
                hf_hub_download(
                    repo_id=dataset,
                    filename=file,
                    repo_type="dataset",
                    local_dir=raw_dir,
                )
                for file in source["files"]
            ]
            input_args = {"jsonl_paths": paths}
        else:
            input_args = {
                "hf_dataset": dataset,
                "hf_name": source.get("config"),
                "hf_split": source["split"],
                "hf_max_samples_per_split": source.get("max_samples"),
                "hf_streaming": True,
            }

        # Each prefix is the path shared by a tokenized Megatron .bin/.idx file pair.
        prefixes = megatron_preprocess_data(
            **input_args,
            output_dir=source_dir,
            tokenizer_name_or_path=tokenizer,
            json_keys=content_field,
            # Plain text lacks chat-template boundary tokens, so terminate each document with EOS.
            append_eod=content_field == "text",
            # Join lines in text documents by replacing each newline with a space.
            strip_newlines=content_field == "text",
            reasoning_content="inline" if content_field == "messages" else "strip",
            # Guard against pathological records by capping each tokenized document at 256K tokens.
            max_sequence_length=256_000,
            max_tokens=source_tokens,
            workers=workers,
        )
        prefix_weight = weight / len(prefixes)
        blend.extend((prefix_weight, prefix) for prefix in prefixes)
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Unvalidated weight allocation can crash with ZeroDivisionError on misconfigured YAML.
source_tokens for non-last sources is round(total_tokens * weight / 100) (Line 95), with the last source getting total_tokens - allocated_tokens (Line 93). If the configured weights don't sum to ~100 (or accumulate enough rounding error), the last source's source_tokens can end up <= 0. Inside megatron_preprocess_data, a non-positive max_tokens on the very first split/file causes an immediate break with zero prefixes returned (Lines 600-601, 628-629 of megatron_preprocess_data.py). Back here, prefix_weight = weight / len(prefixes) (Line 139) then raises ZeroDivisionError.
Since load_config/prepare_data_blend is the interface boundary for this user-authored YAML, consider validating that sources weights sum to 100 (within tolerance) up front, and/or guarding against an empty prefixes result with a clear error message instead of a raw ZeroDivisionError.
🛡️ Proposed validation
 def prepare_data_blend(config_path: Path) -> list[tuple[float, str]]:
     """Download and tokenize the configured weighted data sources."""
     config = load_config(config_path)
     output_dir = Path(config["output_dir"])
     output_dir.mkdir(parents=True, exist_ok=True)
     target_tokens = config.get("target_tokens")
     total_tokens = None if target_tokens is None else int(target_tokens)
     tokenizer = str(config["tokenizer"])
+
+    total_weight = sum(float(source["weight"]) for source in config["sources"])
+    if not math.isclose(total_weight, 100, abs_tol=0.5):
+        raise ValueError(f"Source weights must sum to 100, got {total_weight}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/dataset/prepare_data_blend.py` around lines 88 - 141, The
weight-to-token allocation in prepare_data_blend can produce zero or negative
max_tokens for the last source when YAML weights are misconfigured, leading to
an empty prefixes list and a later ZeroDivisionError. Add upfront validation in
prepare_data_blend/load_config that source weights sum to about 100 (or
otherwise ensure source_tokens stays positive), and also guard the prefix_weight
calculation after megatron_preprocess_data so an empty prefixes result raises a
clear configuration error instead of dividing by zero.
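Both guards from this comment can be sketched together as below. The helper names `allocate_tokens` and `split_weight` are hypothetical; the allocation rule (rounded shares, remainder to the last source) follows the snippet quoted in this thread, and the 0.5 tolerance is taken from the proposed validation rather than from the merged code.

```python
import math


def allocate_tokens(weights: list[float], total_tokens: int) -> list[int]:
    """Split a token budget across sources in proportion to percentage weights.

    Uses rounded shares with the remainder going to the last source, and
    rejects configurations that would leave any source without tokens.
    """
    if not weights:
        raise ValueError("At least one source is required")
    total_weight = sum(weights)
    if not math.isclose(total_weight, 100, abs_tol=0.5):
        raise ValueError(f"Source weights must sum to 100, got {total_weight}")

    budgets: list[int] = []
    allocated = 0
    for index, weight in enumerate(weights):
        if index == len(weights) - 1:
            share = total_tokens - allocated  # last source absorbs rounding error
        else:
            share = round(total_tokens * weight / 100)
            allocated += share
        if share <= 0:
            raise ValueError(f"Source {index} would receive {share} tokens")
        budgets.append(share)
    return budgets


def split_weight(weight: float, prefixes: list[str]) -> list[tuple[float, str]]:
    """Divide one source's weight evenly over its prefixes, failing clearly if none exist."""
    if not prefixes:
        raise ValueError("Preprocessing produced no prefixes; check weights and target_tokens")
    return [(weight / len(prefixes), prefix) for prefix in prefixes]
```

Validating up front means a misconfigured YAML fails before any download or tokenization starts, while the prefix guard covers the case where preprocessing legitimately returns nothing.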
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
What does this PR do?
Save hf checkpoint at every validation iteration during distillation.
Usage
Testing
Before your PR is "Ready for review"
Summary by CodeRabbit
- --validate_only mode for distillation (evaluate the student at iteration 0 without training).
- --hf_validation_export_path to export student HuggingFace artifacts after each validation stage.
- prepare_data_blend.py to generate token-budgeted Megatron data blends from YAML.
- --max_tokens to stop Megatron preprocessing after a token budget.
- validate_only, validation exports, blend preparation, and max_tokens truncation.