Skip to content

feat(export): quant-aware reverse weight conversion for unified HF export#1833

Open
Edwardf0t1 wants to merge 12 commits into
mainfrom
feat/unified-export-quant-aware-revert
Open

feat(export): quant-aware reverse weight conversion for unified HF export#1833
Edwardf0t1 wants to merge 12 commits into
mainfrom
feat/unified-export-quant-aware-revert

Conversation

@Edwardf0t1

@Edwardf0t1 Edwardf0t1 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What

ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion (it raises IndexError on ModelOpt's 0-d scalar scale tensors). So when transformers applies a load-time conversion_mapping (renamed MoE leaves, block_sparse_moemlp, reordered model/language_model prefix, fused dense gate_up_proj), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract.

Observed concretely on MiniMax-M3: nvidia/MiniMax-M3-NVFP4-v1 emitted the converted names (model.language_model.*, mlp.experts.*.gate_proj) instead of hub names (language_model.model.*, block_sparse_moe.experts.*.w{1,2,3}).

How

New modelopt/torch/export/quant_aware_conversion.py performs a quantization-aware reverse conversion, derived from the model's conversion mapping via transformers' own reverse_transform() (so anchored regex renamings reverse correctly), carrying each weight's companion scale tensors (weight_scale, weight_scale_2, input_scale, weight_scale_inv, bias):

  • Rename — key-level substitution; scale siblings follow the module path automatically.
  • Split — un-fuse a dense output-dim concatenation (gate_up_projgate_proj+up_proj): split weight/weight_scale/bias on the fused dim, duplicate the 0-d scalars.

Key insight from GPU validation: ModelOpt's export already expands fused, stacked in-memory experts (experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears before save, so the reverse for experts is a pure per-expert leaf rename (gate_projw1, up_projw3, down_projw2), not a 3-D un-stack. Converters are classified by their reversed ops: SplitModulelist present ⇒ expert rename; Chunk-only ⇒ dense split.

export_hf_checkpoint applies this before save_pretrained; anything not reversible raises QuantConversionUnsupportedError and the export falls back to prior behavior with a warning (non-breaking).

Tests

  • Unit (tests/unit/torch/export/test_quant_aware_conversion.py, CPU): rename carries scales; dense gate_up_proj un-fuse with scale split + scalar duplication; 3-D / non-divisible guards; end-to-end MiniMax-M3-like reversal.
  • GPU (tests/gpu/torch/export/test_quant_aware_conversion_gpu.py): quantize a tiny Mixtral to NVFP4 → export_hf_checkpoint → assert exported tensor names exactly equal the canonical hub names from transformers' own revert_weight_conversion on the reference model (experts land as block_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain).

Validation

  • ✅ Unit tests pass (CPU). Existing test_unified_export_hf.py remains green (no regression).
  • GPU end-to-end passes: tiny Mixtral NVFP4 quantize→export yields 0 missing / 0 extra vs. canonical hub names (transformers 5.3.0, RTX 6000 Ada). Mixtral's conversion (block_sparse_moemlp, expert gate/up fuse) is the same machinery MoE VLMs like MiniMax-M3 use.
  • Note: transformers 5.3.0 has no minimax_m3_vl entry, so M3 itself isn't exercised in CI here; Mixtral is the structural surrogate. When a transformers build with the M3 mapping is available, the same test can be parametrized for M3.

Follow-ups

  • Extend the derivation if a model needs stacked-scalar-scale handling that ModelOpt does not pre-expand (none known today).
  • Upstream fix to transformers Chunk.convert() for 0-d tensors, then drop the _patch_revert_weight_conversion no-op.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved Hugging Face export so quantized models save with canonical hub tensor names instead of temporary in-memory names.
    • Fixed export handling for scalar scale tensors and other quantized companion tensors, helping checkpoints round-trip more reliably.
    • Added a fallback path so export continues even when a conversion pattern can’t be safely reversed.
    • Exported Mixtral-style expert weights now preserve the expected hub layout in saved checkpoints.

@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds quant-aware reverse weight conversion to unified HF export, restoring hub-canonical tensor names (with propagated scale tensors) instead of transformers' in-memory names, falling back to prior naming when reversal isn't supported. Includes changelog entry, integration into export flow, and unit/GPU tests.

Changes

Quant-aware reverse weight conversion for HF export

Layer / File(s) Summary
Reverse rule primitives
modelopt/torch/export/quant_aware_conversion.py (lines 1-181)
Defines QuantConversionUnsupportedError, RenameRule, SplitRule, tensor splitting/duplication logic, and apply_reverse_rules to split-then-rename a state dict.
Reverse rule construction
modelopt/torch/export/quant_aware_conversion.py (lines 184-364)
Adds revert_weight_conversion_quant_aware entrypoint, _assert_experts_pre_expanded guard, _build_reverse_rules deriving rules from Transformers conversion mappings, expert leaf rename generation, dense split rule construction, and normalization helpers.
HF export integration and changelog
modelopt/torch/export/unified_export_hf.py, CHANGELOG.rst
export_hf_checkpoint merges state dicts, applies quant-aware reversal with a warning fallback on failure, and saves using the reverted names; changelog documents the bug fix.
Reverse conversion unit tests
tests/unit/torch/export/test_quant_aware_conversion.py
CPU tests cover rename/split behavior, scale propagation, error cases (stacked experts, non-divisible splits, split collisions), expert guard behavior, and Transformers-derived rule building.
GPU export key validation
tests/gpu/torch/export/test_quant_aware_conversion_gpu.py
CUDA-only test quantizes and exports a Mixtral model, verifying exported safetensors keys exactly match canonical hub names and expert layout naming.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant export_hf_checkpoint
  participant revert_weight_conversion_quant_aware
  participant BuildReverseRules
  participant save_pretrained

  export_hf_checkpoint->>export_hf_checkpoint: merge post_state_dict + extra_state_dict
  export_hf_checkpoint->>revert_weight_conversion_quant_aware: reverse quantized weight names
  revert_weight_conversion_quant_aware->>BuildReverseRules: derive split/rename rules from model
  BuildReverseRules-->>revert_weight_conversion_quant_aware: rules or unsupported error
  alt rules supported
    revert_weight_conversion_quant_aware-->>export_hf_checkpoint: reverted state_dict with hub names
  else unsupported pattern
    revert_weight_conversion_quant_aware-->>export_hf_checkpoint: QuantConversionUnsupportedError
    export_hf_checkpoint->>export_hf_checkpoint: warn, keep in-memory names
  end
  export_hf_checkpoint->>save_pretrained: save checkpoint with export_state_dict
Loading
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: adding quant-aware reverse weight conversion for unified Hugging Face export.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed Touched modelopt/export files and tests contain no banned torch.load/numpy.load/trust_remote_code/eval/exec/nosec patterns, and no dependency additions.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/unified-export-quant-aware-revert

Comment @coderabbitai help to get the list of available commands.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1833/

Built to branch gh-pages at 2026-07-03 21:53 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.74699% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.47%. Comparing base (b0ee953) to head (85649f2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/quant_aware_conversion.py 87.74% 19 Missing ⚠️
modelopt/torch/export/unified_export_hf.py 72.72% 3 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (b0ee953) and HEAD (85649f2). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (b0ee953) HEAD (85649f2)
unit 2 1
examples 12 10
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1833      +/-   ##
==========================================
- Coverage   75.20%   68.47%   -6.73%     
==========================================
  Files         515      516       +1     
  Lines       57245    57409     +164     
==========================================
- Hits        43050    39313    -3737     
- Misses      14195    18096    +3901     
Flag Coverage Δ
examples 32.69% <69.87%> (-10.44%) ⬇️
gpu 31.99% <18.67%> (-18.03%) ⬇️
regression 14.84% <18.67%> (+0.08%) ⬆️
unit 55.00% <82.53%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review July 2, 2026 01:08
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners July 2, 2026 01:08
@Edwardf0t1 Edwardf0t1 requested a review from shengliangxu July 2, 2026 01:08

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_hf.py (1)

1505-1517: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore the Transformers patch even if sanitization fails.

_sanitize_generation_config_for_save(model) runs after the global patch but before the try/finally; if it raises, revert_weight_conversion stays patched for the rest of the process.

Proposed fix
         _patches = _patch_revert_weight_conversion()
 
-        _sanitize_generation_config_for_save(model)
-
         try:
+            _sanitize_generation_config_for_save(model)
             model.save_pretrained(
                 export_dir,
                 state_dict=export_state_dict,
                 save_modelopt_state=save_modelopt_state,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1505 - 1517, The
global Transformers patch is left active if
_sanitize_generation_config_for_save(model) throws before the try/finally in the
export flow. Move the sanitization call into the same try block used around
model.save_pretrained in unified_export_hf.py, and keep
_unpatch_revert_weight_conversion(_patches) in finally so
revert_weight_conversion is always restored. Use the existing
_patch_revert_weight_conversion, _sanitize_generation_config_for_save, and
_unpatch_revert_weight_conversion symbols to place the fix.
🧹 Nitpick comments (1)
tests/gpu/torch/export/test_quant_aware_conversion_gpu.py (1)

38-69: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Move these imports to module scope

tests/gpu/torch/export/test_quant_aware_conversion_gpu.py keeps MixtralConfig, MixtralForCausalLM, the conversion helpers, export_hf_checkpoint, and safe_open inside the test body. Hoist the non-optional imports and add a short reason only if any import must stay local for an optional dependency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py` around lines 38 -
69, The test is still doing import setup inside the function body; move the
non-optional imports used by _tiny_mixtral_config and
test_export_tensor_names_match_hub_after_conversion_reverse to module scope so
the test is easier to read and reuse. Keep only truly optional dependency checks
local (for example, the transformers import guards), and if any import must
remain inside the test, add a short inline reason for that exception.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/quant_aware_conversion.py`:
- Around line 151-152: The split logic in quant_aware_conversion’s
_split_leaf_tensor path can overwrite existing keys in state_dict or
extra_state_dict without warning. Add a collision check in the split handling
around the rule.part_suffixes loop, mirroring the existing rename collision
guard, and fail fast if any target key like gate_proj or up_proj already exists
instead of writing corrupted weights.
- Around line 213-230: The Transformers conversion imports in the quant-aware
conversion flow are still outside the guarded fallback, so older or drifted
Transformers versions can raise ImportError before the legacy path is used. Move
the transformers.core_model_loading imports into the same fallback branch in
quant_aware_conversion so they are only loaded after conversion mapping is
successfully resolved, and keep the try/except around the
get_model_conversion_mapping lookup in place. Use the existing symbols
get_model_conversion_mapping and QuantConversionUnsupportedError to locate and
preserve the intended fallback behavior.

---

Outside diff comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1505-1517: The global Transformers patch is left active if
_sanitize_generation_config_for_save(model) throws before the try/finally in the
export flow. Move the sanitization call into the same try block used around
model.save_pretrained in unified_export_hf.py, and keep
_unpatch_revert_weight_conversion(_patches) in finally so
revert_weight_conversion is always restored. Use the existing
_patch_revert_weight_conversion, _sanitize_generation_config_for_save, and
_unpatch_revert_weight_conversion symbols to place the fix.

---

Nitpick comments:
In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py`:
- Around line 38-69: The test is still doing import setup inside the function
body; move the non-optional imports used by _tiny_mixtral_config and
test_export_tensor_names_match_hub_after_conversion_reverse to module scope so
the test is easier to read and reuse. Keep only truly optional dependency checks
local (for example, the transformers import guards), and if any import must
remain inside the test, add a short inline reason for that exception.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77769b4e-1580-4146-a8b6-7501b4474935

📥 Commits

Reviewing files that changed from the base of the PR and between 5177447 and 51c4286.

📒 Files selected for processing (7)
  • CHANGELOG.rst
  • modelopt/torch/export/quant_aware_conversion.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/gpu/torch/export/test_quant_aware_conversion_gpu.py
  • tests/unit/torch/export/test_quant_aware_conversion.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py

Comment thread modelopt/torch/export/quant_aware_conversion.py Outdated
Comment thread modelopt/torch/export/quant_aware_conversion.py Outdated
Edwardf0t1 added a commit that referenced this pull request Jul 2, 2026
…U derivation test

CI fixes for PR #1833:
- Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable
  float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic,
  so float32 keeps full coverage). Linux multi-version jobs were already green.
- Add a CPU test that derives reverse rules from a real transformers Mixtral
  conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts
  to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings
  carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware
  without a GPU (raises codecov coverage of the new module).
- Apply ruff-format (fixes the code-quality pre-commit failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Around line 174-183: The local import of revert_weight_conversion_quant_aware
in test_quant_aware_conversion is unjustified because it is neither optional nor
needed to avoid a circular import; move it to the top-level import block with
the other modelopt imports so the test module follows the repo import convention
and import failures happen during collection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74bfd62d-d728-4de5-a9a6-2a1842c3f4fb

📥 Commits

Reviewing files that changed from the base of the PR and between 51c4286 and 4df7526.

📒 Files selected for processing (1)
  • tests/unit/torch/export/test_quant_aware_conversion.py

Comment thread tests/unit/torch/export/test_quant_aware_conversion.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/unit/torch/export/test_quant_aware_conversion.py (1)

237-237: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Same unjustified local import recurs in the new test.

revert_weight_conversion_quant_aware is an internal ModelOpt symbol (not an optional dependency or circular-import target — quant_aware_conversion.py only imports re, dataclasses, torch at module level), so it doesn't need to be deferred. This repeats the pattern already flagged in a prior review at the earlier occurrence of this same import.

♻️ Proposed fix
-    pytest.importorskip("transformers")
-    from transformers.core_model_loading import WeightRenaming
-
-    from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_aware
-
+    pytest.importorskip("transformers")
+    from transformers.core_model_loading import WeightRenaming
+

Move from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_aware to the top-level import block alongside other modelopt imports.

As per path instructions, "Imports inside functions or test methods without explicit justification" should be flagged; only "circular imports or optional dependencies" are acceptable exceptions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/export/test_quant_aware_conversion.py` at line 237, The new
test repeats an unjustified local import of
revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt
symbol and not a circular/optional dependency, move the import out of the test
body into the top-level import block in test_quant_aware_conversion alongside
the other modelopt imports. Keep the test logic unchanged and use the existing
revert_weight_conversion_quant_aware reference from the module scope.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Line 237: The new test repeats an unjustified local import of
revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt
symbol and not a circular/optional dependency, move the import out of the test
body into the top-level import block in test_quant_aware_conversion alongside
the other modelopt imports. Keep the test logic unchanged and use the existing
revert_weight_conversion_quant_aware reference from the module scope.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1a12cad-7f63-4e8d-ab91-89c8cfcaae04

📥 Commits

Reviewing files that changed from the base of the PR and between 4df7526 and f5f8be7.

📒 Files selected for processing (2)
  • modelopt/torch/export/quant_aware_conversion.py
  • tests/unit/torch/export/test_quant_aware_conversion.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/quant_aware_conversion.py

# we fall back to the in-memory names.
try:
export_state_dict = revert_weight_conversion_quant_aware(model, export_state_dict)
except QuantConversionUnsupportedError as exc:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we extend the broad-catch → re-raise-as-QuantConversionUnsupportedError pattern from the mapping read to the loop + rule application (or just widen this except to Exception)? Then the fallback also covers unanticipated failures, not just the ones we explicitly raise on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — done in d727581. Widened the export-site catch to except Exception so any unanticipated failure (API drift, unexpected shapes) also falls back to in-memory names rather than aborting export; dropped the now-unused QuantConversionUnsupportedError import there.

for key, value in out.items():
new_key = key
for pattern, repl in compiled:
new_key = pattern.sub(repl, new_key)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split path guards against a stray 3-D tensor (_apply_split_rule raises on tensor.dim() >= 3), but this rename path only rewrites keys and never inspects the tensor, so a stacked 3-D experts.gate_up_proj that survives pre-expansion just passes through unrenamed and ships under the wrong name silently. Should we add the same shape guard to the expert path so it falls back like the split path does?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — done in d727581. _build_reverse_rules now returns the in-memory fused expert leaves, and revert_weight_conversion_quant_aware calls _assert_experts_pre_expanded, which raises QuantConversionUnsupportedError (-> fallback) if a stacked/fused .experts.gate_up_proj (or any 3-D expert tensor) survived pre-expansion — so it no longer ships unrenamed. Covered by test_stacked_experts_guard.

Edwardf0t1 added a commit that referenced this pull request Jul 2, 2026
…U derivation test

CI fixes for PR #1833:
- Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable
  float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic,
  so float32 keeps full coverage). Linux multi-version jobs were already green.
- Add a CPU test that derives reverse rules from a real transformers Mixtral
  conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts
  to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings
  carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware
  without a GPU (raises codecov coverage of the new module).
- Apply ruff-format (fixes the code-quality pre-commit failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from d727581 to 0c9230c Compare July 2, 2026 22:48

@meenchen meenchen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that the model.save_pretrianed is broken for this model? What happen if we just load and save the BF16 checkpoint?

@Edwardf0t1

Copy link
Copy Markdown
Contributor Author

Does this mean that the model.save_pretrianed is broken for this model? What happen if we just load and save the BF16 checkpoint?

save_pretrained is not broken for the BF16 model, it's specific to quantized export.

transformers applies a load-time conversion_mapping (fused gate_up_proj, block_sparse_moe↔mlp, model↔language_model reorder), and on save its own revert_weight_conversion reverses it so on-disk names match the hub checkpoint.

But for quantized export, we disable transformers' revert (_patch_revert_weight_conversion) due to ModelOpt's scales are 0-d scalars, and without it the export keeps the in-memory post-conversion names.

Edwardf0t1 added a commit that referenced this pull request Jul 3, 2026
PR #1833 reverse-converts exported weight TENSOR names to the original hub
namespace, but the quantization config's module references (exclude_modules and,
for mixed precision, quantized_layers keys) are built from in-memory module names
and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang)
then matches those hub-mismatched patterns against the reverted hub-named modules,
finds no match, and loads an excluded BF16 layer as quantized -> shape assert at
load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named
model.language_model.layers.N.self_attn* never matched the hub tensors
language_model.model.layers.N.self_attn.*.

Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on
the weight-reverse succeeding so weights and config stay mutually consistent (both
hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude
globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename.

- quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names.
- unified_export_hf.py: reorder so the reverse (weights + config names) runs before
  hf_quant_config.json is written / format-converted.
- add unit test test_revert_quant_config_names_mapper (Mixtral mapping).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 requested a review from a team as a code owner July 3, 2026 03:35
Edwardf0t1 added a commit that referenced this pull request Jul 3, 2026
…U derivation test

CI fixes for PR #1833:
- Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable
  float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic,
  so float32 keeps full coverage). Linux multi-version jobs were already green.
- Add a CPU test that derives reverse rules from a real transformers Mixtral
  conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts
  to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings
  carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware
  without a GPU (raises codecov coverage of the new module).
- Apply ruff-format (fixes the code-quality pre-commit failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Edwardf0t1 added a commit that referenced this pull request Jul 3, 2026
PR #1833 reverse-converts exported weight TENSOR names to the original hub
namespace, but the quantization config's module references (exclude_modules and,
for mixed precision, quantized_layers keys) are built from in-memory module names
and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang)
then matches those hub-mismatched patterns against the reverted hub-named modules,
finds no match, and loads an excluded BF16 layer as quantized -> shape assert at
load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named
model.language_model.layers.N.self_attn* never matched the hub tensors
language_model.model.layers.N.self_attn.*.

Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on
the weight-reverse succeeding so weights and config stay mutually consistent (both
hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude
globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename.

- quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names.
- unified_export_hf.py: reorder so the reverse (weights + config names) runs before
  hf_quant_config.json is written / format-converted.
- add unit test test_revert_quant_config_names_mapper (Mixtral mapping).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from c08d168 to 6ec9ace Compare July 3, 2026 05:59
@Edwardf0t1 Edwardf0t1 enabled auto-merge (squash) July 3, 2026 06:34
Edwardf0t1 and others added 12 commits July 3, 2026 14:49
register_fused_experts_on_the_fly skipped fused-expert modules lacking an
act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom
GPT-OSS-style gated activation between its two F.linear calls instead of an
act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed
experts stayed unquantized (an experts-only recipe matched nothing) and HF
export failed with NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two
F.linear calls, gate_up then down), so act_fn is irrelevant to quantization,
calibration, and export. Drop the requirement from _is_fused_experts_module.
Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3
detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid
HF checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Condense the act_fn explanation in _is_fused_experts_module's docstring
and remove the work-log comment block in the corresponding unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…port

ModelOpt's unified HF export builds the state dict from the in-memory
(transformers post-conversion) module names and disables transformers'
save-side revert_weight_conversion (it raises IndexError on 0-d scalar
scale tensors). As a result, when transformers applies a load-time
conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered
model/language_model prefix), the exported tensor names no longer match
the original HF hub checkpoint, breaking the unified-checkpoint contract
(observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names).

Add a quantization-aware reverse that carries each weight's companion
scale tensors (weight_scale, weight_scale_2, input_scale,
weight_scale_inv, bias) through:
  - renames (key-level substitution; scale siblings follow the module path)
  - output-dim un-fuse splits (split weight/weight_scale on the fused dim,
    duplicate 0-d scalar weight_scale_2/input_scale to each part)
Reverse rules are derived best-effort from the model's conversion mapping;
any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist)
raises QuantConversionUnsupportedError and the export falls back to the
prior in-memory-name behavior with a warning (non-breaking).

Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with
scale split + scalar duplication, 3-D expert + non-divisible guards,
end-to-end MiniMax-M3-like reversal). End-to-end export validation on a
real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU)
is still pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…e test

GPU validation (tiny Mixtral, transformers 5.3.0) revealed the real shape of the
problem: ModelOpt's export already expands the fused, stacked in-memory experts
(experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears
(experts.<i>.gate_proj/up_proj/down_proj) before save. So the reverse for experts
is a pure per-expert leaf rename (gate_proj->w1, up_proj->w3, down_proj->w2), not a
3-D tensor un-stack. The previous MergeModulelist->Unsupported bail was therefore
wrong: it forced the whole export to fall back, leaving experts mis-named.

Rewrite _build_reverse_rules to use transformers' reverse_transform() for correctly
reversed name patterns and to classify converters:
  - WeightRenaming            -> RenameRule
  - expert WeightConverter    -> per-expert leaf RenameRules (SplitModulelist present)
  - dense fusing WeightConverter (Chunk only) -> SplitRule (fused tensor survives)

Add a GPU end-to-end test: quantize tiny Mixtral to NVFP4, export, and assert the
exported tensor names exactly equal the canonical hub names from transformers'
own revert_weight_conversion on the reference model (experts land as
block_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain). Passes.
Existing export unit tests remain green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…U derivation test

CI fixes for PR #1833:
- Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable
  float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic,
  so float32 keeps full coverage). Linux multi-version jobs were already green.
- Add a CPU test that derives reverse rules from a real transformers Mixtral
  conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts
  to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings
  carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware
  without a GPU (raises codecov coverage of the new module).
- Apply ruff-format (fixes the code-quality pre-commit failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The quant-aware reverse conversion derived WeightRenamings from the model's
conversion mapping and applied them in forward (list) order. A component-
reordering rename (`model.language_model` -> `language_model.model`) then ran
before the MoE container renames that anchor on the resulting adjacency
(`.language_model.layers.N.mlp.experts.` -> `.block_sparse_moe.experts.`,
plus the router `gate`/`e_score_correction_bias`). The reorder moved
`language_model` away from `layers`, so those container renames silently
no-op'd and MiniMax-M3 experts exported under the in-memory `mlp.experts.*`
names instead of the hub `block_sparse_moe.experts.*.w{1,2,3}` names, breaking
the unified-checkpoint hub-naming contract.

Mirror transformers' save-path order (converter first, then renamings): collect
expert leaf renames and WeightRenamings separately, apply leaf renames first,
then WeightRenamings in reverse list order so the prefix reorder fires last and
preserves the anchors the container/gate renames depend on.

Add a regression test that reproduces the reorder+container interaction with a
minimal two-renaming mapping (Mixtral does not exercise it: no prefix reorder).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
- Guard split targets against collisions before overwriting keys, mirroring
  the rename collision guard (CodeRabbit).
- Move the transformers.core_model_loading import into a guarded try/except so
  an older/drifted transformers raises QuantConversionUnsupportedError and falls
  back to legacy names instead of a hard ImportError (CodeRabbit).
- Add a stacked/fused-expert guard: _build_reverse_rules now returns the
  in-memory fused expert leaves, and revert_weight_conversion_quant_aware raises
  (-> fallback) if experts were not pre-expanded, so a 3-D experts.gate_up_proj
  can no longer ship silently mis-named on the rename path (reviewer).
- Widen the export-site catch to Exception so any unanticipated reverse-conversion
  failure falls back to in-memory names rather than aborting export (reviewer);
  drop the now-unused QuantConversionUnsupportedError import there.
- Hoist revert_weight_conversion_quant_aware to a top-level test import (CodeRabbit).
- Add unit tests for the split-collision and stacked-expert guards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Document the unified-HF-export fix under 0.46 Bug Fixes: exports now round-trip
to the original hub tensor names (e.g. MiniMax-M3 block_sparse_moe.experts.*.w{1,2,3})
instead of transformers' in-memory post-conversion names, via the new quant-aware
reverse conversion, with fallback to legacy names for unsupported mapping ops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
… fix)

The reorder-ordering regression test guarded transformers presence with
pytest.importorskip("transformers") but then imported WeightRenaming from
transformers.core_model_loading, a transformers 5.x submodule absent in the
minimum-supported version. On the tf_min CI job transformers imports fine but the
submodule does not, so the import raised ModuleNotFoundError (a hard failure)
instead of skipping. Skip on the submodule directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The disabled transformers revert_weight_conversion fails on ModelOpt's 0-d scalar
scale tensors with a RuntimeError from torch.chunk ("chunk expects at least a
1-dimensional tensor"), not IndexError (verified on transformers 5.12.0). Fix the
wording in the quant_aware_conversion docstring, the CHANGELOG entry, and the
_patch_revert_weight_conversion comments, and note the actual Chunk.convert() call
site. Also collapse the two overlapping fallback comment blocks at the export call
site into one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
PR #1833 reverse-converts exported weight TENSOR names to the original hub
namespace, but the quantization config's module references (exclude_modules and,
for mixed precision, quantized_layers keys) are built from in-memory module names
and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang)
then matches those hub-mismatched patterns against the reverted hub-named modules,
finds no match, and loads an excluded BF16 layer as quantized -> shape assert at
load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named
model.language_model.layers.N.self_attn* never matched the hub tensors
language_model.model.layers.N.self_attn.*.

Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on
the weight-reverse succeeding so weights and config stay mutually consistent (both
hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude
globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename.

- quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names.
- unified_export_hf.py: reorder so the reverse (weights + config names) runs before
  hf_quant_config.json is written / format-converted.
- add unit test test_revert_quant_config_names_mapper (Mixtral mapping).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…st-remote-code) export

copy_custom_model_files returned early unless trust_remote_code was set, so a VLM
exported via the native transformers path (no --trust_remote_code) lost its
processor/tokenizer artifacts. transformers 5.x doesn't re-save these for natively
loaded models, so the exported checkpoint was missing e.g. preprocessor_config.json
and failed to serve on vLLM ('Can't load image processor ... containing a
preprocessor_config.json'). Observed on MiniMax-M3 NVFP4.

Split the copy: processor/tokenizer *data* artifacts (preprocessor_config.json,
processor helpers, merges.txt/vocab.json/added_tokens/special_tokens_map, etc.) are
deployment-critical and copied regardless of trust_remote_code; executable custom
model/config code (modeling*.py, configuration_*.py, tokenization_*.py, other JSON)
stays gated on trust_remote_code. config.json / safetensors index still skipped.
The trust_remote_code=True set remains a superset of the previous behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from 6ec9ace to 85649f2 Compare July 3, 2026 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants