Create a tool for data blend preparation to enable fast experimentation with distillation#1888
Create a tool for data blend preparation to enable fast experimentation with distillation#1888danielkorzekwa wants to merge 6 commits into
Conversation
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (3)
📝 WalkthroughWalkthroughAdds ChangesToken-budgeted data blend feature
Estimated code review effort: 3 (Moderate) | ~30 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
|
/claude review |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/dataset/prepare_data_blend.py`:
- Around line 88-97: The blend token allocation in prepare_data_blend.py assumes
source weights sum to 100, which can produce zero/negative token counts and
leave prefixes empty, causing a ZeroDivisionError later in the prefix weighting
logic. Update the weight handling in the source iteration and the prefix
computation around the source_tokens and prefix_weight calculations to normalize
by the actual total of all weights instead of a hardcoded 100, and add a guard
in the prefix_weight path to skip or safely handle empty prefixes before
dividing by len(prefixes).
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 592-609: The Hugging Face split loop in the preprocessing flow is
only using the newly processed token count, so cached splits do not advance the
global token budget. Update the logic around final_enc_len and the
process_hf_split call to also account for the token count originally written by
a cached split, using the existing split-processing flow and _tokens... prefix
generation to keep remaining_tokens and cache keys stable across resumes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cb206243-e7be-40fa-9e13-4bf330ae67ae
📒 Files selected for processing (8)
CHANGELOG.rstexamples/dataset/MEGATRON_DATA_PREP.mdexamples/dataset/prepare_data_blend.pyexamples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.mdexamples/researcher_guide/README.mdmodelopt/torch/utils/plugins/megatron_preprocess_data.pytests/examples/dataset/test_prepare_data_blend.pytests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
| for index, source in enumerate(sources): | ||
| weight = float(source["weight"]) | ||
| if total_tokens is None: | ||
| source_tokens = None | ||
| elif index == len(sources) - 1: | ||
| source_tokens = total_tokens - allocated_tokens | ||
| else: | ||
| source_tokens = round(total_tokens * weight / 100) | ||
| allocated_tokens += source_tokens | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Weight math assumes weights sum to 100; can crash with a ZeroDivisionError.
source_tokens = round(total_tokens * weight / 100) hardcodes a divisor of 100, implicitly requiring sum(weight for source in sources) == 100. Nothing validates this. If weights don't sum to 100 (e.g. users pass arbitrary relative weights, as is customary for Megatron blend weights), the last source's total_tokens - allocated_tokens can go negative, or an earlier source with a small weight/small target_tokens can round down to 0.
Passing a max_tokens <= 0 into megatron_preprocess_data makes it break immediately and return an empty prefix list (per the upstream remaining_tokens <= 0: break logic). Back here, prefix_weight = weight / len(prefixes) at Line 139 then raises ZeroDivisionError, crashing the whole script.
🐛 Proposed fix: normalize by the actual weight sum and guard against empty prefixes
workers = min(32, os.cpu_count() or 1)
blend: list[tuple[float, str]] = [] # (weight, shared .bin/.idx path without extension)
allocated_tokens = 0
+ total_weight = sum(float(source["weight"]) for source in sources)
for index, source in enumerate(sources):
weight = float(source["weight"])
if total_tokens is None:
source_tokens = None
elif index == len(sources) - 1:
source_tokens = total_tokens - allocated_tokens
else:
- source_tokens = round(total_tokens * weight / 100)
+ source_tokens = round(total_tokens * weight / total_weight)
allocated_tokens += source_tokens- prefix_weight = weight / len(prefixes)
- blend.extend((prefix_weight, prefix) for prefix in prefixes)
+ if not prefixes:
+ raise RuntimeError(f"No tokenized output produced for source {dataset!r}.")
+ prefix_weight = weight / len(prefixes)
+ blend.extend((prefix_weight, prefix) for prefix in prefixes)Also applies to: 139-140
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/dataset/prepare_data_blend.py` around lines 88 - 97, The blend token
allocation in prepare_data_blend.py assumes source weights sum to 100, which can
produce zero/negative token counts and leave prefixes empty, causing a
ZeroDivisionError later in the prefix weighting logic. Update the weight
handling in the source iteration and the prefix computation around the
source_tokens and prefix_weight calculations to normalize by the actual total of
all weights instead of a hardcoded 100, and add a guard in the prefix_weight
path to skip or safely handle empty prefixes before dividing by len(prefixes).
| # Tokens written across all input files or Hugging Face splits. | ||
| final_enc_len = 0 | ||
| all_prefixes: list[str] = [] | ||
| overall_start = time.time() | ||
|
|
||
| if hf_dataset is not None: | ||
| for config, split in _enumerate_hf_splits(hf_dataset, hf_name, hf_split): | ||
| remaining_tokens = None if max_tokens is None else max_tokens - final_enc_len | ||
| if remaining_tokens is not None and remaining_tokens <= 0: | ||
| break | ||
| enc_len, prefixes = partition.process_hf_split( | ||
| output_dir, | ||
| encoder, | ||
| hf_dataset, | ||
| config, | ||
| split, | ||
| hf_max_samples_per_split, | ||
| remaining_tokens, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the cached-split early return and how final_enc_len feeds token_tag naming.
rg -nP -C3 'return 0, prefixes|token_tag|remaining_tokens' modelopt/torch/utils/plugins/megatron_preprocess_data.pyRepository: NVIDIA/Model-Optimizer
Length of output: 2609
🏁 Script executed:
#!/bin/bash
sed -n '384,459p' modelopt/torch/utils/plugins/megatron_preprocess_data.py
printf '\n----\n'
sed -n '592,609p' modelopt/torch/utils/plugins/megatron_preprocess_data.pyRepository: NVIDIA/Model-Optimizer
Length of output: 4617
🏁 Script executed:
#!/bin/bash
sed -n '459,540p' modelopt/torch/utils/plugins/megatron_preprocess_data.pyRepository: NVIDIA/Model-Optimizer
Length of output: 3410
Keep cached HF splits aligned with the global token cap
When a split is skipped from cache, it still needs to contribute the number of tokens it originally wrote; otherwise later splits compute a larger remaining_tokens and get a different _tokens... prefix, forcing a re-tokenize instead of a cache hit. Returning the cached split’s written token count would keep resume behavior stable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 592 -
609, The Hugging Face split loop in the preprocessing flow is only using the
newly processed token count, so cached splits do not advance the global token
budget. Update the logic around final_enc_len and the process_hf_split call to
also account for the token count originally written by a cached split, using the
existing split-processing flow and _tokens... prefix generation to keep
remaining_tokens and cache keys stable across resumes.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1888 +/- ##
===========================================
+ Coverage 61.17% 76.52% +15.35%
===========================================
Files 515 515
Lines 57207 58634 +1427
===========================================
+ Hits 34994 44870 +9876
+ Misses 22213 13764 -8449
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| input_file_name: str | Path, | ||
| output_dir: str | Path, | ||
| encoder: _Encoder, | ||
| max_tokens: int | None = None, |
There was a problem hiding this comment.
[IMPORTANT Compatibility] Unlike process_hf_split (which now appends token_tag = f"_tokens{max_tokens}" to its output prefix at lines 419/427), process_json_file does not encode max_tokens into its output filename — the prefix stays {stem}_{key} (line 318), independent of the token cap.
Why it matters: the skip-if-exists logic at line 336 assumes a given output prefix always maps to the same content. That breaks for the jsonl path:
- Run with
max_tokensset intoOUTPUT_DIR→ writes a truncated.bin. - Re-run the same file into
OUTPUT_DIRwithoutmax_tokens(full run) → files exist →[SKIP], and the caller silently reuses the stale truncated file and gets0back forenc_len.
This is the exact mixed workflow the docs recommend (scale down with target_tokens, then re-run in full). In a blend mixing files: (jsonl) and split: (HF) sources, HF sources regenerate correctly (their filename changes with the token tag) while jsonl sources stay stale — producing an inconsistent blend with no error.
Fix: mirror the HF path — include a token tag in the jsonl output prefix when max_tokens is not None so a capped run and a full run don't collide.
| "hf_streaming": True, | ||
| } | ||
|
|
||
| # Each prefix is the path shared by a tokenized Megatron .bin/.idx file pair. |
There was a problem hiding this comment.
[SUGGESTION] The per-source token budget hardcodes 100 as the denominator (round(total_tokens * weight / 100)), so it silently assumes the configured weight values sum to exactly 100. The blend paths themselves (_write_data_blend) emit the raw weights, which Megatron normalizes — so nothing else in the pipeline requires a sum of 100.
If a user's weights sum to something other than 100 (e.g. 10/20/30), the token allocation is skewed: each non-final source is under/over-allocated, and the final source absorbs the entire remainder (total_tokens - allocated_tokens), which can hand it a wildly disproportionate — or even negative — budget. Total still hits target_tokens, so the failure is silent.
Consider either normalizing by the actual sum (weight / sum(all weights)) or validating up front that the weights sum to 100, so a misconfigured blend fails loudly rather than producing a skewed subset.
There was a problem hiding this comment.
Claude review summary
Reviewed all 8 changed files (they fit comfortably): the core megatron_preprocess_data.py change, the new prepare_data_blend.py example, its unit test + gpu test, and the docs. Focused on the max_tokens early-stopping logic and the new blend tool's token-budget arithmetic.
Findings by severity: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1
Most impactful
[IMPORTANT Compatibility] Stale-cache collision on the jsonl path. process_hf_split encodes max_tokens into its output filename via token_tag, but process_json_file does not — its prefix stays {stem}_{key}. Because of the skip-if-exists guard, a capped run followed by a full run (or vice versa) into the same output dir silently [SKIP]s and reuses the stale truncated .bin, returning 0 tokens. This is precisely the scale-down-then-scale-up workflow the new docs recommend, and in a mixed jsonl+HF blend the HF sources regenerate while the jsonl sources go stale — an inconsistent blend with no error. Fix: mirror the HF token-tag in the jsonl output prefix.
[SUGGESTION] Token allocation assumes weights sum to 100. prepare_data_blend.py divides by a hardcoded 100; if configured weights sum to anything else, the per-source budget is skewed and the final source absorbs the (possibly negative) remainder — silently. Normalize by the actual sum or validate up front.
Assessment
Low-to-moderate risk. This is an examples/utility change with no mode/state, export, or public-API surface, and the early-stop may_stop_early batching logic (waiting for finite pool.map batches so no worker result is left pending when the caller stops at the token target) is sound and well-commented. The one blocking item is the jsonl filename collision, which produces silent stale output in the documented workflow. The suggestion is non-blocking.
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
What does this PR do?
Create a tool for data blend preparation to enable fast experimentation with distillation
Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
max_tokensoption, enabling early stopping once the target output tokens are reached and producing distinct capped-run outputs.target_tokenssubsets.