Skip to content

Create a tool for data blend preparation to enable fast experimentation with distillation#1888

Open
danielkorzekwa wants to merge 6 commits into
mainfrom
dkorzekwa/prepare_data_blend
Open

Create a tool for data blend preparation to enable fast experimentation with distillation#1888
danielkorzekwa wants to merge 6 commits into
mainfrom
dkorzekwa/prepare_data_blend

Conversation

@danielkorzekwa

@danielkorzekwa danielkorzekwa commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Create a tool for data blend preparation to enable fast experimentation with distillation

Usage

  • examples/researcher_guide/README.md (## Prepare token-budgeted data blends)
  • examples/dataset/prepare_data_blend.py

Testing

  • tests/examples/dataset/test_prepare_data_blend.py
  • tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅

Additional Information

Summary by CodeRabbit

  • New Features
    • Added token-budgeted preprocessing via a new max_tokens option, enabling early stopping once the target output tokens are reached and producing distinct capped-run outputs.
    • Added an end-to-end workflow to build weighted Megatron data blends from YAML, supporting iterative smaller benchmark subsets.
  • Documentation
    • Added a researcher workflow guide for fast experimentation, including iterative evaluation with token limits and token-budgeted distillation data blend setup.
    • Updated Megatron/Nemotron tutorials with guidance on creating smaller target_tokens subsets.

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@danielkorzekwa danielkorzekwa requested review from a team as code owners July 2, 2026 17:15
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f9fdd02a-9f06-4b5a-b68c-288019596c6e

📥 Commits

Reviewing files that changed from the base of the PR and between aeaba79 and aaccd76.

📒 Files selected for processing (3)
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/researcher_guide/README.md
✅ Files skipped from review due to trivial changes (3)
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/researcher_guide/README.md

📝 Walkthrough

Walkthrough

Adds max_tokens early stopping to megatron_preprocess_data, introduces prepare_data_blend.py for token-budgeted weighted blends, adds tests for both paths, and updates changelog and example documentation.

Changes

Token-budgeted data blend feature

Layer / File(s) Summary
max_tokens early stopping in megatron_preprocess_data
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Adds max_tokens handling to the tokenization pipeline, including batch-based early stopping, HF output naming changes, global token capping, warning updates, and CLI wiring.
prepare_data_blend.py script
examples/dataset/prepare_data_blend.py
New CLI script that loads a YAML blend config, computes per-source token budgets, tokenizes sources via megatron_preprocess_data, writes data_blend.txt, and copies the config.
prepare_data_blend.py tests
tests/examples/dataset/test_prepare_data_blend.py
Adds pytest coverage for YAML loading, mocked dataset download, generated blend weights, token-tagged prefixes, and copied config output.
max_tokens tests for tokenization pipeline
tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py
Adds GPU tests confirming JSONL and HF split outputs shrink when max_tokens is set.
Documentation updates
CHANGELOG.rst, examples/dataset/MEGATRON_DATA_PREP.md, examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md, examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md, examples/researcher_guide/README.md
Updates the changelog and example READMEs to describe the token-budgeted blend workflow and researcher guide.

Estimated code review effort: 3 (Moderate) | ~30 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: a new data blend preparation tool for faster distillation experiments.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed Changed Python files use yaml.safe_load and AutoTokenizer.from_pretrained without trust_remote_code=True; no torch.load/numpy.load/eval/exec/nosec or dependency additions found.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dkorzekwa/prepare_data_blend

Comment @coderabbitai help to get the list of available commands.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1888/

Built to branch gh-pages at 2026-07-02 17:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@kevalmorabia97

Copy link
Copy Markdown
Collaborator

/claude review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/dataset/prepare_data_blend.py`:
- Around line 88-97: The blend token allocation in prepare_data_blend.py assumes
source weights sum to 100, which can produce zero/negative token counts and
leave prefixes empty, causing a ZeroDivisionError later in the prefix weighting
logic. Update the weight handling in the source iteration and the prefix
computation around the source_tokens and prefix_weight calculations to normalize
by the actual total of all weights instead of a hardcoded 100, and add a guard
in the prefix_weight path to skip or safely handle empty prefixes before
dividing by len(prefixes).

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 592-609: The Hugging Face split loop in the preprocessing flow is
only using the newly processed token count, so cached splits do not advance the
global token budget. Update the logic around final_enc_len and the
process_hf_split call to also account for the token count originally written by
a cached split, using the existing split-processing flow and _tokens... prefix
generation to keep remaining_tokens and cache keys stable across resumes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cb206243-e7be-40fa-9e13-4bf330ae67ae

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and aeaba79.

📒 Files selected for processing (8)
  • CHANGELOG.rst
  • examples/dataset/MEGATRON_DATA_PREP.md
  • examples/dataset/prepare_data_blend.py
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/researcher_guide/README.md
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • tests/examples/dataset/test_prepare_data_blend.py
  • tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py

Comment on lines +88 to +97
for index, source in enumerate(sources):
weight = float(source["weight"])
if total_tokens is None:
source_tokens = None
elif index == len(sources) - 1:
source_tokens = total_tokens - allocated_tokens
else:
source_tokens = round(total_tokens * weight / 100)
allocated_tokens += source_tokens

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Weight math assumes weights sum to 100; can crash with a ZeroDivisionError.

source_tokens = round(total_tokens * weight / 100) hardcodes a divisor of 100, implicitly requiring sum(weight for source in sources) == 100. Nothing validates this. If weights don't sum to 100 (e.g. users pass arbitrary relative weights, as is customary for Megatron blend weights), the last source's total_tokens - allocated_tokens can go negative, or an earlier source with a small weight/small target_tokens can round down to 0.

Passing a max_tokens <= 0 into megatron_preprocess_data makes it break immediately and return an empty prefix list (per the upstream remaining_tokens <= 0: break logic). Back here, prefix_weight = weight / len(prefixes) at Line 139 then raises ZeroDivisionError, crashing the whole script.

🐛 Proposed fix: normalize by the actual weight sum and guard against empty prefixes
     workers = min(32, os.cpu_count() or 1)
     blend: list[tuple[float, str]] = []  # (weight, shared .bin/.idx path without extension)
     allocated_tokens = 0
+    total_weight = sum(float(source["weight"]) for source in sources)

     for index, source in enumerate(sources):
         weight = float(source["weight"])
         if total_tokens is None:
             source_tokens = None
         elif index == len(sources) - 1:
             source_tokens = total_tokens - allocated_tokens
         else:
-            source_tokens = round(total_tokens * weight / 100)
+            source_tokens = round(total_tokens * weight / total_weight)
             allocated_tokens += source_tokens
-        prefix_weight = weight / len(prefixes)
-        blend.extend((prefix_weight, prefix) for prefix in prefixes)
+        if not prefixes:
+            raise RuntimeError(f"No tokenized output produced for source {dataset!r}.")
+        prefix_weight = weight / len(prefixes)
+        blend.extend((prefix_weight, prefix) for prefix in prefixes)

Also applies to: 139-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/dataset/prepare_data_blend.py` around lines 88 - 97, The blend token
allocation in prepare_data_blend.py assumes source weights sum to 100, which can
produce zero/negative token counts and leave prefixes empty, causing a
ZeroDivisionError later in the prefix weighting logic. Update the weight
handling in the source iteration and the prefix computation around the
source_tokens and prefix_weight calculations to normalize by the actual total of
all weights instead of a hardcoded 100, and add a guard in the prefix_weight
path to skip or safely handle empty prefixes before dividing by len(prefixes).

Comment on lines +592 to +609
# Tokens written across all input files or Hugging Face splits.
final_enc_len = 0
all_prefixes: list[str] = []
overall_start = time.time()

if hf_dataset is not None:
for config, split in _enumerate_hf_splits(hf_dataset, hf_name, hf_split):
remaining_tokens = None if max_tokens is None else max_tokens - final_enc_len
if remaining_tokens is not None and remaining_tokens <= 0:
break
enc_len, prefixes = partition.process_hf_split(
output_dir,
encoder,
hf_dataset,
config,
split,
hf_max_samples_per_split,
remaining_tokens,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the cached-split early return and how final_enc_len feeds token_tag naming.
rg -nP -C3 'return 0, prefixes|token_tag|remaining_tokens' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2609


🏁 Script executed:

#!/bin/bash
sed -n '384,459p' modelopt/torch/utils/plugins/megatron_preprocess_data.py
printf '\n----\n'
sed -n '592,609p' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4617


🏁 Script executed:

#!/bin/bash
sed -n '459,540p' modelopt/torch/utils/plugins/megatron_preprocess_data.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3410


Keep cached HF splits aligned with the global token cap
When a split is skipped from cache, it still needs to contribute the number of tokens it originally wrote; otherwise later splits compute a larger remaining_tokens and get a different _tokens... prefix, forcing a re-tokenize instead of a cache hit. Returning the cached split’s written token count would keep resume behavior stable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 592 -
609, The Hugging Face split loop in the preprocessing flow is only using the
newly processed token count, so cached splits do not advance the global token
budget. Update the logic around final_enc_len and the process_hf_split call to
also account for the token count originally written by a cached split, using the
existing split-processing flow and _tokens... prefix generation to keep
remaining_tokens and cache keys stable across resumes.

@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.52%. Comparing base (1b03381) to head (aaccd76).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...pt/torch/utils/plugins/megatron_preprocess_data.py 89.74% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1888       +/-   ##
===========================================
+ Coverage   61.17%   76.52%   +15.35%     
===========================================
  Files         515      515               
  Lines       57207    58634     +1427     
===========================================
+ Hits        34994    44870     +9876     
+ Misses      22213    13764     -8449     
Flag Coverage Δ
examples 42.89% <10.25%> (+10.06%) ⬆️
gpu 57.85% <89.74%> (+37.31%) ⬆️
regression 14.83% <2.56%> (+0.06%) ⬆️
unit 54.87% <2.56%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

input_file_name: str | Path,
output_dir: str | Path,
encoder: _Encoder,
max_tokens: int | None = None,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] Unlike process_hf_split (which now appends token_tag = f"_tokens{max_tokens}" to its output prefix at lines 419/427), process_json_file does not encode max_tokens into its output filename — the prefix stays {stem}_{key} (line 318), independent of the token cap.

Why it matters: the skip-if-exists logic at line 336 assumes a given output prefix always maps to the same content. That breaks for the jsonl path:

  • Run with max_tokens set into OUTPUT_DIR → writes a truncated .bin.
  • Re-run the same file into OUTPUT_DIR without max_tokens (full run) → files exist → [SKIP], and the caller silently reuses the stale truncated file and gets 0 back for enc_len.

This is the exact mixed workflow the docs recommend (scale down with target_tokens, then re-run in full). In a blend mixing files: (jsonl) and split: (HF) sources, HF sources regenerate correctly (their filename changes with the token tag) while jsonl sources stay stale — producing an inconsistent blend with no error.

Fix: mirror the HF path — include a token tag in the jsonl output prefix when max_tokens is not None so a capped run and a full run don't collide.

"hf_streaming": True,
}

# Each prefix is the path shared by a tokenized Megatron .bin/.idx file pair.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The per-source token budget hardcodes 100 as the denominator (round(total_tokens * weight / 100)), so it silently assumes the configured weight values sum to exactly 100. The blend paths themselves (_write_data_blend) emit the raw weights, which Megatron normalizes — so nothing else in the pipeline requires a sum of 100.

If a user's weights sum to something other than 100 (e.g. 10/20/30), the token allocation is skewed: each non-final source is under/over-allocated, and the final source absorbs the entire remainder (total_tokens - allocated_tokens), which can hand it a wildly disproportionate — or even negative — budget. Total still hits target_tokens, so the failure is silent.

Consider either normalizing by the actual sum (weight / sum(all weights)) or validating up front that the weights sum to 100, so a misconfigured blend fails loudly rather than producing a skewed subset.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary

Reviewed all 8 changed files (they fit comfortably): the core megatron_preprocess_data.py change, the new prepare_data_blend.py example, its unit test + gpu test, and the docs. Focused on the max_tokens early-stopping logic and the new blend tool's token-budget arithmetic.

Findings by severity: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1

Most impactful

[IMPORTANT Compatibility] Stale-cache collision on the jsonl path. process_hf_split encodes max_tokens into its output filename via token_tag, but process_json_file does not — its prefix stays {stem}_{key}. Because of the skip-if-exists guard, a capped run followed by a full run (or vice versa) into the same output dir silently [SKIP]s and reuses the stale truncated .bin, returning 0 tokens. This is precisely the scale-down-then-scale-up workflow the new docs recommend, and in a mixed jsonl+HF blend the HF sources regenerate while the jsonl sources go stale — an inconsistent blend with no error. Fix: mirror the HF token-tag in the jsonl output prefix.

[SUGGESTION] Token allocation assumes weights sum to 100. prepare_data_blend.py divides by a hardcoded 100; if configured weights sum to anything else, the per-source budget is skewed and the final source absorbs the (possibly negative) remainder — silently. Normalize by the actual sum or validate up front.

Assessment

Low-to-moderate risk. This is an examples/utility change with no mode/state, export, or public-API surface, and the early-stop may_stop_early batching logic (waiting for finite pool.map batches so no worker result is left pending when the caller stops at the token target) is sound and well-commented. The one blocking item is the jsonl filename collision, which produces silent stale output in the documented workflow. The suggestion is non-blocking.

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants