Create a tool for data blend preparation to enable fast experimentation with distillation by danielkorzekwa · Pull Request #1888 · NVIDIA/Model-Optimizer

danielkorzekwa · 2026-07-02T17:15:32Z

What does this PR do?

Create a tool for data blend preparation to enable fast experimentation with distillation

Usage

examples/researcher_guide/README.md (## Prepare token-budgeted data blends)
examples/dataset/prepare_data_blend.py

Testing

tests/examples/dataset/test_prepare_data_blend.py
tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅

Additional Information

Summary by CodeRabbit

New Features
- Added token-budgeted preprocessing via a new max_tokens option, enabling early stopping once the target output tokens are reached and producing distinct capped-run outputs.
- Added an end-to-end workflow to build weighted Megatron data blends from YAML, supporting iterative smaller benchmark subsets.
Documentation
- Added a researcher workflow guide for fast experimentation, including iterative evaluation with token limits and token-budgeted distillation data blend setup.
- Updated Megatron/Nemotron tutorials with guidance on creating smaller target_tokens subsets.

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

coderabbitai · 2026-07-02T17:16:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f9fdd02a-9f06-4b5a-b68c-288019596c6e

📥 Commits

Reviewing files that changed from the base of the PR and between aeaba79 and aaccd76.

📒 Files selected for processing (3)

examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
examples/researcher_guide/README.md

✅ Files skipped from review due to trivial changes (3)

examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
examples/researcher_guide/README.md

📝 Walkthrough

Walkthrough

Adds max_tokens early stopping to megatron_preprocess_data, introduces prepare_data_blend.py for token-budgeted weighted blends, adds tests for both paths, and updates changelog and example documentation.

Changes

Token-budgeted data blend feature

Layer / File(s)	Summary
max_tokens early stopping in megatron_preprocess_data `modelopt/torch/utils/plugins/megatron_preprocess_data.py`	Adds `max_tokens` handling to the tokenization pipeline, including batch-based early stopping, HF output naming changes, global token capping, warning updates, and CLI wiring.
prepare_data_blend.py script `examples/dataset/prepare_data_blend.py`	New CLI script that loads a YAML blend config, computes per-source token budgets, tokenizes sources via `megatron_preprocess_data`, writes `data_blend.txt`, and copies the config.
prepare_data_blend.py tests `tests/examples/dataset/test_prepare_data_blend.py`	Adds pytest coverage for YAML loading, mocked dataset download, generated blend weights, token-tagged prefixes, and copied config output.
max_tokens tests for tokenization pipeline `tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py`	Adds GPU tests confirming JSONL and HF split outputs shrink when `max_tokens` is set.
Documentation updates `CHANGELOG.rst`, `examples/dataset/MEGATRON_DATA_PREP.md`, `examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md`, `examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md`, `examples/researcher_guide/README.md`	Updates the changelog and example READMEs to describe the token-budgeted blend workflow and researcher guide.

Estimated code review effort: 3 (Moderate) | ~30 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: a new data blend preparation tool for faster distillation experiments.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	Changed Python files use yaml.safe_load and AutoTokenizer.from_pretrained without trust_remote_code=True; no torch.load/numpy.load/eval/exec/nosec or dependency additions found.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dkorzekwa/prepare_data_blend

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-07-02T17:19:08Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1888/
Built to branch `gh-pages` at 2026-07-02 17:46 UTC. Preview will be ready when the GitHub Pages deployment is complete.

kevalmorabia97 · 2026-07-02T17:20:43Z

/claude review

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/dataset/prepare_data_blend.py`:
- Around line 88-97: The blend token allocation in prepare_data_blend.py assumes
source weights sum to 100, which can produce zero/negative token counts and
leave prefixes empty, causing a ZeroDivisionError later in the prefix weighting
logic. Update the weight handling in the source iteration and the prefix
computation around the source_tokens and prefix_weight calculations to normalize
by the actual total of all weights instead of a hardcoded 100, and add a guard
in the prefix_weight path to skip or safely handle empty prefixes before
dividing by len(prefixes).

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 592-609: The Hugging Face split loop in the preprocessing flow is
only using the newly processed token count, so cached splits do not advance the
global token budget. Update the logic around final_enc_len and the
process_hf_split call to also account for the token count originally written by
a cached split, using the existing split-processing flow and _tokens... prefix
generation to keep remaining_tokens and cache keys stable across resumes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cb206243-e7be-40fa-9e13-4bf330ae67ae

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and aeaba79.

📒 Files selected for processing (8)

CHANGELOG.rst
examples/dataset/MEGATRON_DATA_PREP.md
examples/dataset/prepare_data_blend.py
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
examples/researcher_guide/README.md
modelopt/torch/utils/plugins/megatron_preprocess_data.py
tests/examples/dataset/test_prepare_data_blend.py
tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py

coderabbitai · 2026-07-02T17:22:46Z

+    for index, source in enumerate(sources):
+        weight = float(source["weight"])
+        if total_tokens is None:
+            source_tokens = None
+        elif index == len(sources) - 1:
+            source_tokens = total_tokens - allocated_tokens
+        else:
+            source_tokens = round(total_tokens * weight / 100)
+            allocated_tokens += source_tokens
+


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Weight math assumes weights sum to 100; can crash with a ZeroDivisionError.

source_tokens = round(total_tokens * weight / 100) hardcodes a divisor of 100, implicitly requiring sum(weight for source in sources) == 100. Nothing validates this. If weights don't sum to 100 (e.g. users pass arbitrary relative weights, as is customary for Megatron blend weights), the last source's total_tokens - allocated_tokens can go negative, or an earlier source with a small weight/small target_tokens can round down to 0.

Passing a max_tokens <= 0 into megatron_preprocess_data makes it break immediately and return an empty prefix list (per the upstream remaining_tokens <= 0: break logic). Back here, prefix_weight = weight / len(prefixes) at Line 139 then raises ZeroDivisionError, crashing the whole script.

🐛 Proposed fix: normalize by the actual weight sum and guard against empty prefixes

workers = min(32, os.cpu_count() or 1) blend: list[tuple[float, str]] = [] # (weight, shared .bin/.idx path without extension) allocated_tokens = 0 + total_weight = sum(float(source["weight"]) for source in sources) for index, source in enumerate(sources): weight = float(source["weight"]) if total_tokens is None: source_tokens = None elif index == len(sources) - 1: source_tokens = total_tokens - allocated_tokens else: - source_tokens = round(total_tokens * weight / 100) + source_tokens = round(total_tokens * weight / total_weight) allocated_tokens += source_tokens

- prefix_weight = weight / len(prefixes) - blend.extend((prefix_weight, prefix) for prefix in prefixes) + if not prefixes: + raise RuntimeError(f"No tokenized output produced for source {dataset!r}.") + prefix_weight = weight / len(prefixes) + blend.extend((prefix_weight, prefix) for prefix in prefixes)

Also applies to: 139-140

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/dataset/prepare_data_blend.py` around lines 88 - 97, The blend token allocation in prepare_data_blend.py assumes source weights sum to 100, which can produce zero/negative token counts and leave prefixes empty, causing a ZeroDivisionError later in the prefix weighting logic. Update the weight handling in the source iteration and the prefix computation around the source_tokens and prefix_weight calculations to normalize by the actual total of all weights instead of a hardcoded 100, and add a guard in the prefix_weight path to skip or safely handle empty prefixes before dividing by len(prefixes).

coderabbitai · 2026-07-02T17:22:46Z

+    # Tokens written across all input files or Hugging Face splits.
    final_enc_len = 0
    all_prefixes: list[str] = []
    overall_start = time.time()

    if hf_dataset is not None:
        for config, split in _enumerate_hf_splits(hf_dataset, hf_name, hf_split):
+            remaining_tokens = None if max_tokens is None else max_tokens - final_enc_len
+            if remaining_tokens is not None and remaining_tokens <= 0:
+                break
            enc_len, prefixes = partition.process_hf_split(
                output_dir,
                encoder,
                hf_dataset,
                config,
                split,
                hf_max_samples_per_split,
+                remaining_tokens,


🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Inspect the cached-split early return and how final_enc_len feeds token_tag naming. rg -nP -C3 'return 0, prefixes|token_tag|remaining_tokens' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2609

🏁 Script executed:

#!/bin/bash sed -n '384,459p' modelopt/torch/utils/plugins/megatron_preprocess_data.py printf '\n----\n' sed -n '592,609p' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4617

🏁 Script executed:

#!/bin/bash sed -n '459,540p' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3410

Keep cached HF splits aligned with the global token cap
When a split is skipped from cache, it still needs to contribute the number of tokens it originally wrote; otherwise later splits compute a larger remaining_tokens and get a different _tokens... prefix, forcing a re-tokenize instead of a cache hit. Returning the cached split’s written token count would keep resume behavior stable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 592 - 609, The Hugging Face split loop in the preprocessing flow is only using the newly processed token count, so cached splits do not advance the global token budget. Update the logic around final_enc_len and the process_hf_split call to also account for the token count originally written by a cached split, using the existing split-processing flow and _tokens... prefix generation to keep remaining_tokens and cache keys stable across resumes.

codecov · 2026-07-02T17:25:21Z

Codecov Report

❌ Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.52%. Comparing base (1b03381) to head (aaccd76).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...pt/torch/utils/plugins/megatron_preprocess_data.py	89.74%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1888       +/-   ##
===========================================
+ Coverage   61.17%   76.52%   +15.35%     
===========================================
  Files         515      515               
  Lines       57207    58634     +1427     
===========================================
+ Hits        34994    44870     +9876     
+ Misses      22213    13764     -8449

Flag	Coverage Δ
examples	`42.89% <10.25%> (+10.06%)`	⬆️
gpu	`57.85% <89.74%> (+37.31%)`	⬆️
regression	`14.83% <2.56%> (+0.06%)`	⬆️
unit	`54.87% <2.56%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude · 2026-07-02T17:29:24Z

+        input_file_name: str | Path,
+        output_dir: str | Path,
+        encoder: _Encoder,
+        max_tokens: int | None = None,


[IMPORTANT Compatibility] Unlike process_hf_split (which now appends token_tag = f"_tokens{max_tokens}" to its output prefix at lines 419/427), process_json_file does not encode max_tokens into its output filename — the prefix stays {stem}_{key} (line 318), independent of the token cap.

Why it matters: the skip-if-exists logic at line 336 assumes a given output prefix always maps to the same content. That breaks for the jsonl path:

Run with max_tokens set into OUTPUT_DIR → writes a truncated .bin.

Re-run the same file into OUTPUT_DIR without max_tokens (full run) → files exist → [SKIP], and the caller silently reuses the stale truncated file and gets 0 back for enc_len.

This is the exact mixed workflow the docs recommend (scale down with target_tokens, then re-run in full). In a blend mixing files: (jsonl) and split: (HF) sources, HF sources regenerate correctly (their filename changes with the token tag) while jsonl sources stay stale — producing an inconsistent blend with no error.

Fix: mirror the HF path — include a token tag in the jsonl output prefix when max_tokens is not None so a capped run and a full run don't collide.

claude · 2026-07-02T17:29:41Z

+                "hf_streaming": True,
+            }
+
+        # Each prefix is the path shared by a tokenized Megatron .bin/.idx file pair.


[SUGGESTION] The per-source token budget hardcodes 100 as the denominator (round(total_tokens * weight / 100)), so it silently assumes the configured weight values sum to exactly 100. The blend paths themselves (_write_data_blend) emit the raw weights, which Megatron normalizes — so nothing else in the pipeline requires a sum of 100.

If a user's weights sum to something other than 100 (e.g. 10/20/30), the token allocation is skewed: each non-final source is under/over-allocated, and the final source absorbs the entire remainder (total_tokens - allocated_tokens), which can hand it a wildly disproportionate — or even negative — budget. Total still hits target_tokens, so the failure is silent.

Consider either normalizing by the actual sum (weight / sum(all weights)) or validating up front that the weights sum to 100, so a misconfigured blend fails loudly rather than producing a skewed subset.

claude

Claude review summary

Reviewed all 8 changed files (they fit comfortably): the core megatron_preprocess_data.py change, the new prepare_data_blend.py example, its unit test + gpu test, and the docs. Focused on the max_tokens early-stopping logic and the new blend tool's token-budget arithmetic.

Findings by severity: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1

Most impactful

[IMPORTANT Compatibility] Stale-cache collision on the jsonl path. process_hf_split encodes max_tokens into its output filename via token_tag, but process_json_file does not — its prefix stays {stem}_{key}. Because of the skip-if-exists guard, a capped run followed by a full run (or vice versa) into the same output dir silently [SKIP]s and reuses the stale truncated .bin, returning 0 tokens. This is precisely the scale-down-then-scale-up workflow the new docs recommend, and in a mixed jsonl+HF blend the HF sources regenerate while the jsonl sources go stale — an inconsistent blend with no error. Fix: mirror the HF token-tag in the jsonl output prefix.

[SUGGESTION] Token allocation assumes weights sum to 100. prepare_data_blend.py divides by a hardcoded 100; if configured weights sum to anything else, the per-source budget is skewed and the final source absorbs the (possibly negative) remainder — silently. Normalize by the actual sum or validate up front.

Assessment

Low-to-moderate risk. This is an examples/utility change with no mode/state, export, or public-API surface, and the early-stop may_stop_early batching logic (waiting for finite pool.map batches so no worker result is left pending when the caller stops at the token target) is sound and well-commented. The one blocking item is the jsonl filename collision, which produces silent stale output in the documented workflow. The suggestion is non-blocking.

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

danielkorzekwa added 5 commits July 1, 2026 12:37

Implement prepare_data_blend utility.

510a414

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

remodve not needed __all__ variable

d22f71c

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

implement prepare_data_blend.py

52e6035

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

created a tool for efficient data blend preparation

8d40667

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

add change log and improve a unit test

aeaba79

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

danielkorzekwa requested review from a team as code owners July 2, 2026 17:15

danielkorzekwa requested a review from kevalmorabia97 July 2, 2026 17:15

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

claude Bot reviewed Jul 2, 2026

View reviewed changes

improve docs

aaccd76

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Create a tool for data blend preparation to enable fast experimentation with distillation#1888

Create a tool for data blend preparation to enable fast experimentation with distillation#1888
danielkorzekwa wants to merge 6 commits into
mainfrom
dkorzekwa/prepare_data_blend

danielkorzekwa commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-07-02 17:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

kevalmorabia97 commented Jul 2, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

claude Bot Jul 2, 2026

Uh oh!

claude Bot Jul 2, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

danielkorzekwa commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-07-02 17:46 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

kevalmorabia97 commented Jul 2, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude review summary

Most impactful

Assessment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danielkorzekwa commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-07-02 17:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jul 2, 2026 •

edited

Loading