feat(export): quant-aware reverse weight conversion for unified HF export by Edwardf0t1 · Pull Request #1833 · NVIDIA/Model-Optimizer

Edwardf0t1 · 2026-06-26T00:20:13Z

What

ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion (it raises IndexError on ModelOpt's 0-d scalar scale tensors). So when transformers applies a load-time conversion_mapping (renamed MoE leaves, block_sparse_moe↔mlp, reordered model/language_model prefix, fused dense gate_up_proj), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract.

Observed concretely on MiniMax-M3: nvidia/MiniMax-M3-NVFP4-v1 emitted the converted names (model.language_model.*, mlp.experts.*.gate_proj) instead of hub names (language_model.model.*, block_sparse_moe.experts.*.w{1,2,3}).

How

New modelopt/torch/export/quant_aware_conversion.py performs a quantization-aware reverse conversion, derived from the model's conversion mapping via transformers' own reverse_transform() (so anchored regex renamings reverse correctly), carrying each weight's companion scale tensors (weight_scale, weight_scale_2, input_scale, weight_scale_inv, bias):

Rename — key-level substitution; scale siblings follow the module path automatically.
Split — un-fuse a dense output-dim concatenation (gate_up_proj → gate_proj+up_proj): split weight/weight_scale/bias on the fused dim, duplicate the 0-d scalars.

Key insight from GPU validation: ModelOpt's export already expands fused, stacked in-memory experts (experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears before save, so the reverse for experts is a pure per-expert leaf rename (gate_proj→w1, up_proj→w3, down_proj→w2), not a 3-D un-stack. Converters are classified by their reversed ops: SplitModulelist present ⇒ expert rename; Chunk-only ⇒ dense split.

export_hf_checkpoint applies this before save_pretrained; anything not reversible raises QuantConversionUnsupportedError and the export falls back to prior behavior with a warning (non-breaking).

Tests

Unit (tests/unit/torch/export/test_quant_aware_conversion.py, CPU): rename carries scales; dense gate_up_proj un-fuse with scale split + scalar duplication; 3-D / non-divisible guards; end-to-end MiniMax-M3-like reversal.
GPU (tests/gpu/torch/export/test_quant_aware_conversion_gpu.py): quantize a tiny Mixtral to NVFP4 → export_hf_checkpoint → assert exported tensor names exactly equal the canonical hub names from transformers' own revert_weight_conversion on the reference model (experts land as block_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain).

Validation

✅ Unit tests pass (CPU). Existing test_unified_export_hf.py remains green (no regression).
✅ GPU end-to-end passes: tiny Mixtral NVFP4 quantize→export yields 0 missing / 0 extra vs. canonical hub names (transformers 5.3.0, RTX 6000 Ada). Mixtral's conversion (block_sparse_moe↔mlp, expert gate/up fuse) is the same machinery MoE VLMs like MiniMax-M3 use.
Note: transformers 5.3.0 has no minimax_m3_vl entry, so M3 itself isn't exercised in CI here; Mixtral is the structural surrogate. When a transformers build with the M3 mapping is available, the same test can be parametrized for M3.

Follow-ups

Extend the derivation if a model needs stacked-scalar-scale handling that ModelOpt does not pre-expand (none known today).
Upstream fix to transformers Chunk.convert() for 0-d tensors, then drop the _patch_revert_weight_conversion no-op.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved Hugging Face export so quantized models save with canonical hub tensor names instead of temporary in-memory names.
- Fixed export handling for scalar scale tensors and other quantized companion tensors, helping checkpoints round-trip more reliably.
- Added a fallback path so export continues even when a conversion pattern can’t be safely reversed.
- Exported Mixtral-style expert weights now preserve the expected hub layout in saved checkpoints.

copy-pr-bot · 2026-06-26T00:20:16Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-26T00:20:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds quant-aware reverse weight conversion to unified HF export, restoring hub-canonical tensor names (with propagated scale tensors) instead of transformers' in-memory names, falling back to prior naming when reversal isn't supported. Includes changelog entry, integration into export flow, and unit/GPU tests.

Changes

Quant-aware reverse weight conversion for HF export

Layer / File(s)	Summary
Reverse rule primitives `modelopt/torch/export/quant_aware_conversion.py` (lines 1-181)	Defines `QuantConversionUnsupportedError`, `RenameRule`, `SplitRule`, tensor splitting/duplication logic, and `apply_reverse_rules` to split-then-rename a state dict.
Reverse rule construction `modelopt/torch/export/quant_aware_conversion.py` (lines 184-364)	Adds `revert_weight_conversion_quant_aware` entrypoint, `_assert_experts_pre_expanded` guard, `_build_reverse_rules` deriving rules from Transformers conversion mappings, expert leaf rename generation, dense split rule construction, and normalization helpers.
HF export integration and changelog `modelopt/torch/export/unified_export_hf.py`, `CHANGELOG.rst`	`export_hf_checkpoint` merges state dicts, applies quant-aware reversal with a warning fallback on failure, and saves using the reverted names; changelog documents the bug fix.
Reverse conversion unit tests `tests/unit/torch/export/test_quant_aware_conversion.py`	CPU tests cover rename/split behavior, scale propagation, error cases (stacked experts, non-divisible splits, split collisions), expert guard behavior, and Transformers-derived rule building.
GPU export key validation `tests/gpu/torch/export/test_quant_aware_conversion_gpu.py`	CUDA-only test quantizes and exports a Mixtral model, verifying exported safetensors keys exactly match canonical hub names and expert layout naming.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant export_hf_checkpoint
  participant revert_weight_conversion_quant_aware
  participant BuildReverseRules
  participant save_pretrained

  export_hf_checkpoint->>export_hf_checkpoint: merge post_state_dict + extra_state_dict
  export_hf_checkpoint->>revert_weight_conversion_quant_aware: reverse quantized weight names
  revert_weight_conversion_quant_aware->>BuildReverseRules: derive split/rename rules from model
  BuildReverseRules-->>revert_weight_conversion_quant_aware: rules or unsupported error
  alt rules supported
    revert_weight_conversion_quant_aware-->>export_hf_checkpoint: reverted state_dict with hub names
  else unsupported pattern
    revert_weight_conversion_quant_aware-->>export_hf_checkpoint: QuantConversionUnsupportedError
    export_hf_checkpoint->>export_hf_checkpoint: warn, keep in-memory names
  end
  export_hf_checkpoint->>save_pretrained: save checkpoint with export_state_dict

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: adding quant-aware reverse weight conversion for unified Hugging Face export.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	Touched modelopt/export files and tests contain no banned torch.load/numpy.load/trust_remote_code/eval/exec/nosec patterns, and no dependency additions.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/unified-export-quant-aware-revert

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-06-26T00:24:27Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1833/
Built to branch `gh-pages` at 2026-07-03 21:53 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-06-26T00:31:10Z

Codecov Report

❌ Patch coverage is 86.74699% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.47%. Comparing base (b0ee953) to head (85649f2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
modelopt/torch/export/quant_aware_conversion.py	87.74%	19 Missing ⚠️
modelopt/torch/export/unified_export_hf.py	72.72%	3 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (b0ee953) and HEAD (85649f2). Click for more details.

HEAD has 3 uploads less than BASE

Flag BASE (b0ee953) HEAD (85649f2)

unit 2 1

examples 12 10

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1833      +/-   ##
==========================================
- Coverage   75.20%   68.47%   -6.73%     
==========================================
  Files         515      516       +1     
  Lines       57245    57409     +164     
==========================================
- Hits        43050    39313    -3737     
- Misses      14195    18096    +3901

Flag	Coverage Δ
examples	`32.69% <69.87%> (-10.44%)`	⬇️
gpu	`31.99% <18.67%> (-18.03%)`	⬇️
regression	`14.84% <18.67%> (+0.08%)`	⬆️
unit	`55.00% <82.53%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modelopt/torch/export/unified_export_hf.py (1)

1505-1517: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore the Transformers patch even if sanitization fails.

_sanitize_generation_config_for_save(model) runs after the global patch but before the try/finally; if it raises, revert_weight_conversion stays patched for the rest of the process.

Proposed fix

         _patches = _patch_revert_weight_conversion()
 
-        _sanitize_generation_config_for_save(model)
-
         try:
+            _sanitize_generation_config_for_save(model)
             model.save_pretrained(
                 export_dir,
                 state_dict=export_state_dict,
                 save_modelopt_state=save_modelopt_state,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1505 - 1517, The
global Transformers patch is left active if
_sanitize_generation_config_for_save(model) throws before the try/finally in the
export flow. Move the sanitization call into the same try block used around
model.save_pretrained in unified_export_hf.py, and keep
_unpatch_revert_weight_conversion(_patches) in finally so
revert_weight_conversion is always restored. Use the existing
_patch_revert_weight_conversion, _sanitize_generation_config_for_save, and
_unpatch_revert_weight_conversion symbols to place the fix.

🧹 Nitpick comments (1)

tests/gpu/torch/export/test_quant_aware_conversion_gpu.py (1)
38-69: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Move these imports to module scope

tests/gpu/torch/export/test_quant_aware_conversion_gpu.py keeps MixtralConfig, MixtralForCausalLM, the conversion helpers, export_hf_checkpoint, and safe_open inside the test body. Hoist the non-optional imports and add a short reason only if any import must stay local for an optional dependency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py` around lines 38 -
69, The test is still doing import setup inside the function body; move the
non-optional imports used by _tiny_mixtral_config and
test_export_tensor_names_match_hub_after_conversion_reverse to module scope so
the test is easier to read and reuse. Keep only truly optional dependency checks
local (for example, the transformers import guards), and if any import must
remain inside the test, add a short inline reason for that exception.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/quant_aware_conversion.py`:
- Around line 151-152: The split logic in quant_aware_conversion’s
_split_leaf_tensor path can overwrite existing keys in state_dict or
extra_state_dict without warning. Add a collision check in the split handling
around the rule.part_suffixes loop, mirroring the existing rename collision
guard, and fail fast if any target key like gate_proj or up_proj already exists
instead of writing corrupted weights.
- Around line 213-230: The Transformers conversion imports in the quant-aware
conversion flow are still outside the guarded fallback, so older or drifted
Transformers versions can raise ImportError before the legacy path is used. Move
the transformers.core_model_loading imports into the same fallback branch in
quant_aware_conversion so they are only loaded after conversion mapping is
successfully resolved, and keep the try/except around the
get_model_conversion_mapping lookup in place. Use the existing symbols
get_model_conversion_mapping and QuantConversionUnsupportedError to locate and
preserve the intended fallback behavior.

---

Outside diff comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1505-1517: The global Transformers patch is left active if
_sanitize_generation_config_for_save(model) throws before the try/finally in the
export flow. Move the sanitization call into the same try block used around
model.save_pretrained in unified_export_hf.py, and keep
_unpatch_revert_weight_conversion(_patches) in finally so
revert_weight_conversion is always restored. Use the existing
_patch_revert_weight_conversion, _sanitize_generation_config_for_save, and
_unpatch_revert_weight_conversion symbols to place the fix.

---

Nitpick comments:
In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py`:
- Around line 38-69: The test is still doing import setup inside the function
body; move the non-optional imports used by _tiny_mixtral_config and
test_export_tensor_names_match_hub_after_conversion_reverse to module scope so
the test is easier to read and reuse. Keep only truly optional dependency checks
local (for example, the transformers import guards), and if any import must
remain inside the test, add a short inline reason for that exception.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77769b4e-1580-4146-a8b6-7501b4474935

📥 Commits

Reviewing files that changed from the base of the PR and between 5177447 and 51c4286.

📒 Files selected for processing (7)

CHANGELOG.rst
modelopt/torch/export/quant_aware_conversion.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/quantization/plugins/huggingface.py
tests/gpu/torch/export/test_quant_aware_conversion_gpu.py
tests/unit/torch/export/test_quant_aware_conversion.py
tests/unit/torch/quantization/plugins/test_fused_experts.py

…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Around line 174-183: The local import of revert_weight_conversion_quant_aware
in test_quant_aware_conversion is unjustified because it is neither optional nor
needed to avoid a circular import; move it to the top-level import block with
the other modelopt imports so the test module follows the repo import convention
and import failures happen during collection.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74bfd62d-d728-4de5-a9a6-2a1842c3f4fb

📥 Commits

Reviewing files that changed from the base of the PR and between 51c4286 and 4df7526.

📒 Files selected for processing (1)

tests/unit/torch/export/test_quant_aware_conversion.py

coderabbitai

♻️ Duplicate comments (1)

tests/unit/torch/export/test_quant_aware_conversion.py (1)
237-237: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Same unjustified local import recurs in the new test.

revert_weight_conversion_quant_aware is an internal ModelOpt symbol (not an optional dependency or circular-import target — quant_aware_conversion.py only imports re, dataclasses, torch at module level), so it doesn't need to be deferred. This repeats the pattern already flagged in a prior review at the earlier occurrence of this same import.
♻️ Proposed fix
-    pytest.importorskip("transformers")
-    from transformers.core_model_loading import WeightRenaming
-
-    from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_aware
-
+    pytest.importorskip("transformers")
+    from transformers.core_model_loading import WeightRenaming
+
Move from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_aware to the top-level import block alongside other modelopt imports.
As per path instructions, "Imports inside functions or test methods without explicit justification" should be flagged; only "circular imports or optional dependencies" are acceptable exceptions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/export/test_quant_aware_conversion.py` at line 237, The new
test repeats an unjustified local import of
revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt
symbol and not a circular/optional dependency, move the import out of the test
body into the top-level import block in test_quant_aware_conversion alongside
the other modelopt imports. Keep the test logic unchanged and use the existing
revert_weight_conversion_quant_aware reference from the module scope.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Line 237: The new test repeats an unjustified local import of
revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt
symbol and not a circular/optional dependency, move the import out of the test
body into the top-level import block in test_quant_aware_conversion alongside
the other modelopt imports. Keep the test logic unchanged and use the existing
revert_weight_conversion_quant_aware reference from the module scope.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1a12cad-7f63-4e8d-ab91-89c8cfcaae04

📥 Commits

Reviewing files that changed from the base of the PR and between 4df7526 and f5f8be7.

📒 Files selected for processing (2)

modelopt/torch/export/quant_aware_conversion.py
tests/unit/torch/export/test_quant_aware_conversion.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/export/quant_aware_conversion.py

juhi10071998 · 2026-07-02T21:56:37Z

+        # we fall back to the in-memory names.
+        try:
+            export_state_dict = revert_weight_conversion_quant_aware(model, export_state_dict)
+        except QuantConversionUnsupportedError as exc:


Should we extend the broad-catch → re-raise-as-QuantConversionUnsupportedError pattern from the mapping read to the loop + rule application (or just widen this except to Exception)? Then the fallback also covers unanticipated failures, not just the ones we explicitly raise on.

Good call — done in d727581. Widened the export-site catch to except Exception so any unanticipated failure (API drift, unexpected shapes) also falls back to in-memory names rather than aborting export; dropped the now-unused QuantConversionUnsupportedError import there.

juhi10071998 · 2026-07-02T21:56:40Z

+    for key, value in out.items():
+        new_key = key
+        for pattern, repl in compiled:
+            new_key = pattern.sub(repl, new_key)


The split path guards against a stray 3-D tensor (_apply_split_rule raises on tensor.dim() >= 3), but this rename path only rewrites keys and never inspects the tensor, so a stacked 3-D experts.gate_up_proj that survives pre-expansion just passes through unrenamed and ships under the wrong name silently. Should we add the same shape guard to the expert path so it falls back like the split path does?

Good catch — done in d727581. _build_reverse_rules now returns the in-memory fused expert leaves, and revert_weight_conversion_quant_aware calls _assert_experts_pre_expanded, which raises QuantConversionUnsupportedError (-> fallback) if a stacked/fused .experts.gate_up_proj (or any 3-D expert tensor) survived pre-expansion — so it no longer ships unrenamed. Covered by test_stacked_experts_guard.

…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

meenchen

Does this mean that the model.save_pretrianed is broken for this model? What happen if we just load and save the BF16 checkpoint?

Edwardf0t1 · 2026-07-03T01:16:43Z

Does this mean that the model.save_pretrianed is broken for this model? What happen if we just load and save the BF16 checkpoint?

save_pretrained is not broken for the BF16 model, it's specific to quantized export.

transformers applies a load-time conversion_mapping (fused gate_up_proj, block_sparse_moe↔mlp, model↔language_model reorder), and on save its own revert_weight_conversion reverses it so on-disk names match the hub checkpoint.

But for quantized export, we disable transformers' revert (_patch_revert_weight_conversion) due to ModelOpt's scales are 0-d scalars, and without it the export keeps the in-memory post-conversion names.

PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

register_fused_experts_on_the_fly skipped fused-expert modules lacking an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom GPT-OSS-style gated activation between its two F.linear calls instead of an act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed experts stayed unquantized (an experts-only recipe matched nothing) and HF export failed with NotImplementedError. _QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls, gate_up then down), so act_fn is irrelevant to quantization, calibration, and export. Drop the requirement from _is_fused_experts_module. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3. Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3 detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid HF checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Condense the act_fn explanation in _is_fused_experts_module's docstring and remove the work-log comment block in the corresponding unit test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…port ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion (it raises IndexError on 0-d scalar scale tensors). As a result, when transformers applies a load-time conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered model/language_model prefix), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract (observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names). Add a quantization-aware reverse that carries each weight's companion scale tensors (weight_scale, weight_scale_2, input_scale, weight_scale_inv, bias) through: - renames (key-level substitution; scale siblings follow the module path) - output-dim un-fuse splits (split weight/weight_scale on the fused dim, duplicate 0-d scalar weight_scale_2/input_scale to each part) Reverse rules are derived best-effort from the model's conversion mapping; any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist) raises QuantConversionUnsupportedError and the export falls back to the prior in-memory-name behavior with a warning (non-breaking). Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with scale split + scalar duplication, 3-D expert + non-divisible guards, end-to-end MiniMax-M3-like reversal). End-to-end export validation on a real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU) is still pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…e test GPU validation (tiny Mixtral, transformers 5.3.0) revealed the real shape of the problem: ModelOpt's export already expands the fused, stacked in-memory experts (experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears (experts.<i>.gate_proj/up_proj/down_proj) before save. So the reverse for experts is a pure per-expert leaf rename (gate_proj->w1, up_proj->w3, down_proj->w2), not a 3-D tensor un-stack. The previous MergeModulelist->Unsupported bail was therefore wrong: it forced the whole export to fall back, leaving experts mis-named. Rewrite _build_reverse_rules to use transformers' reverse_transform() for correctly reversed name patterns and to classify converters: - WeightRenaming -> RenameRule - expert WeightConverter -> per-expert leaf RenameRules (SplitModulelist present) - dense fusing WeightConverter (Chunk only) -> SplitRule (fused tensor survives) Add a GPU end-to-end test: quantize tiny Mixtral to NVFP4, export, and assert the exported tensor names exactly equal the canonical hub names from transformers' own revert_weight_conversion on the reference model (experts land as block_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain). Passes. Existing export unit tests remain green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

The quant-aware reverse conversion derived WeightRenamings from the model's conversion mapping and applied them in forward (list) order. A component- reordering rename (`model.language_model` -> `language_model.model`) then ran before the MoE container renames that anchor on the resulting adjacency (`.language_model.layers.N.mlp.experts.` -> `.block_sparse_moe.experts.`, plus the router `gate`/`e_score_correction_bias`). The reorder moved `language_model` away from `layers`, so those container renames silently no-op'd and MiniMax-M3 experts exported under the in-memory `mlp.experts.*` names instead of the hub `block_sparse_moe.experts.*.w{1,2,3}` names, breaking the unified-checkpoint hub-naming contract. Mirror transformers' save-path order (converter first, then renamings): collect expert leaf renames and WeightRenamings separately, apply leaf renames first, then WeightRenamings in reverse list order so the prefix reorder fires last and preserves the anchors the container/gate renames depend on. Add a regression test that reproduces the reorder+container interaction with a minimal two-renaming mapping (Mixtral does not exercise it: no prefix reorder). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

- Guard split targets against collisions before overwriting keys, mirroring the rename collision guard (CodeRabbit). - Move the transformers.core_model_loading import into a guarded try/except so an older/drifted transformers raises QuantConversionUnsupportedError and falls back to legacy names instead of a hard ImportError (CodeRabbit). - Add a stacked/fused-expert guard: _build_reverse_rules now returns the in-memory fused expert leaves, and revert_weight_conversion_quant_aware raises (-> fallback) if experts were not pre-expanded, so a 3-D experts.gate_up_proj can no longer ship silently mis-named on the rename path (reviewer). - Widen the export-site catch to Exception so any unanticipated reverse-conversion failure falls back to in-memory names rather than aborting export (reviewer); drop the now-unused QuantConversionUnsupportedError import there. - Hoist revert_weight_conversion_quant_aware to a top-level test import (CodeRabbit). - Add unit tests for the split-collision and stacked-expert guards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Document the unified-HF-export fix under 0.46 Bug Fixes: exports now round-trip to the original hub tensor names (e.g. MiniMax-M3 block_sparse_moe.experts.*.w{1,2,3}) instead of transformers' in-memory post-conversion names, via the new quant-aware reverse conversion, with fallback to legacy names for unsupported mapping ops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

… fix) The reorder-ordering regression test guarded transformers presence with pytest.importorskip("transformers") but then imported WeightRenaming from transformers.core_model_loading, a transformers 5.x submodule absent in the minimum-supported version. On the tf_min CI job transformers imports fine but the submodule does not, so the import raised ModuleNotFoundError (a hard failure) instead of skipping. Skip on the submodule directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

The disabled transformers revert_weight_conversion fails on ModelOpt's 0-d scalar scale tensors with a RuntimeError from torch.chunk ("chunk expects at least a 1-dimensional tensor"), not IndexError (verified on transformers 5.12.0). Fix the wording in the quant_aware_conversion docstring, the CHANGELOG entry, and the _patch_revert_weight_conversion comments, and note the actual Chunk.convert() call site. Also collapse the two overlapping fallback comment blocks at the export call site into one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…st-remote-code) export copy_custom_model_files returned early unless trust_remote_code was set, so a VLM exported via the native transformers path (no --trust_remote_code) lost its processor/tokenizer artifacts. transformers 5.x doesn't re-save these for natively loaded models, so the exported checkpoint was missing e.g. preprocessor_config.json and failed to serve on vLLM ('Can't load image processor ... containing a preprocessor_config.json'). Observed on MiniMax-M3 NVFP4. Split the copy: processor/tokenizer *data* artifacts (preprocessor_config.json, processor helpers, merges.txt/vocab.json/added_tokens/special_tokens_map, etc.) are deployment-critical and copied regardless of trust_remote_code; executable custom model/config code (modeling*.py, configuration_*.py, tokenization_*.py, other JSON) stays gated on trust_remote_code. config.json / safetensors index still skipped. The trust_remote_code=True set remains a superset of the previous behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Edwardf0t1 marked this pull request as ready for review July 2, 2026 01:08

Edwardf0t1 requested review from a team as code owners July 2, 2026 01:08

Edwardf0t1 requested a review from shengliangxu July 2, 2026 01:08

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread modelopt/torch/export/quant_aware_conversion.py Outdated

Comment thread modelopt/torch/export/quant_aware_conversion.py Outdated

Edwardf0t1 requested review from chadvoegele, meenchen and realAsma July 2, 2026 02:15

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread tests/unit/torch/export/test_quant_aware_conversion.py Outdated

Edwardf0t1 requested review from jingyu-ml and juhi10071998 July 2, 2026 18:50

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

juhi10071998 reviewed Jul 2, 2026

View reviewed changes

coderabbitai Bot approved these changes Jul 2, 2026

View reviewed changes

Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from d727581 to 0c9230c Compare July 2, 2026 22:48

meenchen reviewed Jul 2, 2026

View reviewed changes

meenchen approved these changes Jul 2, 2026

View reviewed changes

Edwardf0t1 requested a review from a team as a code owner July 3, 2026 03:35

Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from c08d168 to 6ec9ace Compare July 3, 2026 05:59

Edwardf0t1 enabled auto-merge (squash) July 3, 2026 06:34

Edwardf0t1 and others added 12 commits July 3, 2026 14:49

Edwardf0t1 force-pushed the feat/unified-export-quant-aware-revert branch from 6ec9ace to 85649f2 Compare July 3, 2026 21:49

Uh oh!

Conversation

Edwardf0t1 commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

Tests

Validation

Follow-ups

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-07-03 21:53 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

codecov Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

juhi10071998 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

juhi10071998 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

meenchen left a comment

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Edwardf0t1 commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-07-03 21:53 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jun 26, 2026 •

edited

Loading