feat(export): quant-aware reverse weight conversion for unified HF export#1833
feat(export): quant-aware reverse weight conversion for unified HF export#1833Edwardf0t1 wants to merge 12 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds quant-aware reverse weight conversion to unified HF export, restoring hub-canonical tensor names (with propagated scale tensors) instead of transformers' in-memory names, falling back to prior naming when reversal isn't supported. Includes changelog entry, integration into export flow, and unit/GPU tests. ChangesQuant-aware reverse weight conversion for HF export
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant export_hf_checkpoint
participant revert_weight_conversion_quant_aware
participant BuildReverseRules
participant save_pretrained
export_hf_checkpoint->>export_hf_checkpoint: merge post_state_dict + extra_state_dict
export_hf_checkpoint->>revert_weight_conversion_quant_aware: reverse quantized weight names
revert_weight_conversion_quant_aware->>BuildReverseRules: derive split/rename rules from model
BuildReverseRules-->>revert_weight_conversion_quant_aware: rules or unsupported error
alt rules supported
revert_weight_conversion_quant_aware-->>export_hf_checkpoint: reverted state_dict with hub names
else unsupported pattern
revert_weight_conversion_quant_aware-->>export_hf_checkpoint: QuantConversionUnsupportedError
export_hf_checkpoint->>export_hf_checkpoint: warn, keep in-memory names
end
export_hf_checkpoint->>save_pretrained: save checkpoint with export_state_dict
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1833 +/- ##
==========================================
- Coverage 75.20% 68.47% -6.73%
==========================================
Files 515 516 +1
Lines 57245 57409 +164
==========================================
- Hits 43050 39313 -3737
- Misses 14195 18096 +3901
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_hf.py (1)
1505-1517: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winRestore the Transformers patch even if sanitization fails.
_sanitize_generation_config_for_save(model)runs after the global patch but before thetry/finally; if it raises,revert_weight_conversionstays patched for the rest of the process.Proposed fix
_patches = _patch_revert_weight_conversion() - _sanitize_generation_config_for_save(model) - try: + _sanitize_generation_config_for_save(model) model.save_pretrained( export_dir, state_dict=export_state_dict, save_modelopt_state=save_modelopt_state,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_hf.py` around lines 1505 - 1517, The global Transformers patch is left active if _sanitize_generation_config_for_save(model) throws before the try/finally in the export flow. Move the sanitization call into the same try block used around model.save_pretrained in unified_export_hf.py, and keep _unpatch_revert_weight_conversion(_patches) in finally so revert_weight_conversion is always restored. Use the existing _patch_revert_weight_conversion, _sanitize_generation_config_for_save, and _unpatch_revert_weight_conversion symbols to place the fix.
🧹 Nitpick comments (1)
tests/gpu/torch/export/test_quant_aware_conversion_gpu.py (1)
38-69: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMove these imports to module scope
tests/gpu/torch/export/test_quant_aware_conversion_gpu.pykeepsMixtralConfig,MixtralForCausalLM, the conversion helpers,export_hf_checkpoint, andsafe_openinside the test body. Hoist the non-optional imports and add a short reason only if any import must stay local for an optional dependency.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py` around lines 38 - 69, The test is still doing import setup inside the function body; move the non-optional imports used by _tiny_mixtral_config and test_export_tensor_names_match_hub_after_conversion_reverse to module scope so the test is easier to read and reuse. Keep only truly optional dependency checks local (for example, the transformers import guards), and if any import must remain inside the test, add a short inline reason for that exception.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/quant_aware_conversion.py`:
- Around line 151-152: The split logic in quant_aware_conversion’s
_split_leaf_tensor path can overwrite existing keys in state_dict or
extra_state_dict without warning. Add a collision check in the split handling
around the rule.part_suffixes loop, mirroring the existing rename collision
guard, and fail fast if any target key like gate_proj or up_proj already exists
instead of writing corrupted weights.
- Around line 213-230: The Transformers conversion imports in the quant-aware
conversion flow are still outside the guarded fallback, so older or drifted
Transformers versions can raise ImportError before the legacy path is used. Move
the transformers.core_model_loading imports into the same fallback branch in
quant_aware_conversion so they are only loaded after conversion mapping is
successfully resolved, and keep the try/except around the
get_model_conversion_mapping lookup in place. Use the existing symbols
get_model_conversion_mapping and QuantConversionUnsupportedError to locate and
preserve the intended fallback behavior.
---
Outside diff comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1505-1517: The global Transformers patch is left active if
_sanitize_generation_config_for_save(model) throws before the try/finally in the
export flow. Move the sanitization call into the same try block used around
model.save_pretrained in unified_export_hf.py, and keep
_unpatch_revert_weight_conversion(_patches) in finally so
revert_weight_conversion is always restored. Use the existing
_patch_revert_weight_conversion, _sanitize_generation_config_for_save, and
_unpatch_revert_weight_conversion symbols to place the fix.
---
Nitpick comments:
In `@tests/gpu/torch/export/test_quant_aware_conversion_gpu.py`:
- Around line 38-69: The test is still doing import setup inside the function
body; move the non-optional imports used by _tiny_mixtral_config and
test_export_tensor_names_match_hub_after_conversion_reverse to module scope so
the test is easier to read and reuse. Keep only truly optional dependency checks
local (for example, the transformers import guards), and if any import must
remain inside the test, add a short inline reason for that exception.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77769b4e-1580-4146-a8b6-7501b4474935
📒 Files selected for processing (7)
CHANGELOG.rstmodelopt/torch/export/quant_aware_conversion.pymodelopt/torch/export/unified_export_hf.pymodelopt/torch/quantization/plugins/huggingface.pytests/gpu/torch/export/test_quant_aware_conversion_gpu.pytests/unit/torch/export/test_quant_aware_conversion.pytests/unit/torch/quantization/plugins/test_fused_experts.py
…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Around line 174-183: The local import of revert_weight_conversion_quant_aware
in test_quant_aware_conversion is unjustified because it is neither optional nor
needed to avoid a circular import; move it to the top-level import block with
the other modelopt imports so the test module follows the repo import convention
and import failures happen during collection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 74bfd62d-d728-4de5-a9a6-2a1842c3f4fb
📒 Files selected for processing (1)
tests/unit/torch/export/test_quant_aware_conversion.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/unit/torch/export/test_quant_aware_conversion.py (1)
237-237: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winSame unjustified local import recurs in the new test.
revert_weight_conversion_quant_awareis an internal ModelOpt symbol (not an optional dependency or circular-import target —quant_aware_conversion.pyonly importsre,dataclasses,torchat module level), so it doesn't need to be deferred. This repeats the pattern already flagged in a prior review at the earlier occurrence of this same import.♻️ Proposed fix
- pytest.importorskip("transformers") - from transformers.core_model_loading import WeightRenaming - - from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_aware - + pytest.importorskip("transformers") + from transformers.core_model_loading import WeightRenaming +Move
from modelopt.torch.export.quant_aware_conversion import revert_weight_conversion_quant_awareto the top-level import block alongside other modelopt imports.As per path instructions, "Imports inside functions or test methods without explicit justification" should be flagged; only "circular imports or optional dependencies" are acceptable exceptions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/export/test_quant_aware_conversion.py` at line 237, The new test repeats an unjustified local import of revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt symbol and not a circular/optional dependency, move the import out of the test body into the top-level import block in test_quant_aware_conversion alongside the other modelopt imports. Keep the test logic unchanged and use the existing revert_weight_conversion_quant_aware reference from the module scope.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/unit/torch/export/test_quant_aware_conversion.py`:
- Line 237: The new test repeats an unjustified local import of
revert_weight_conversion_quant_aware; since this is a regular internal ModelOpt
symbol and not a circular/optional dependency, move the import out of the test
body into the top-level import block in test_quant_aware_conversion alongside
the other modelopt imports. Keep the test logic unchanged and use the existing
revert_weight_conversion_quant_aware reference from the module scope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e1a12cad-7f63-4e8d-ab91-89c8cfcaae04
📒 Files selected for processing (2)
modelopt/torch/export/quant_aware_conversion.pytests/unit/torch/export/test_quant_aware_conversion.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/quant_aware_conversion.py
| # we fall back to the in-memory names. | ||
| try: | ||
| export_state_dict = revert_weight_conversion_quant_aware(model, export_state_dict) | ||
| except QuantConversionUnsupportedError as exc: |
There was a problem hiding this comment.
Should we extend the broad-catch → re-raise-as-QuantConversionUnsupportedError pattern from the mapping read to the loop + rule application (or just widen this except to Exception)? Then the fallback also covers unanticipated failures, not just the ones we explicitly raise on.
There was a problem hiding this comment.
Good call — done in d727581. Widened the export-site catch to except Exception so any unanticipated failure (API drift, unexpected shapes) also falls back to in-memory names rather than aborting export; dropped the now-unused QuantConversionUnsupportedError import there.
| for key, value in out.items(): | ||
| new_key = key | ||
| for pattern, repl in compiled: | ||
| new_key = pattern.sub(repl, new_key) |
There was a problem hiding this comment.
The split path guards against a stray 3-D tensor (_apply_split_rule raises on tensor.dim() >= 3), but this rename path only rewrites keys and never inspects the tensor, so a stacked 3-D experts.gate_up_proj that survives pre-expansion just passes through unrenamed and ships under the wrong name silently. Should we add the same shape guard to the expert path so it falls back like the split path does?
There was a problem hiding this comment.
Good catch — done in d727581. _build_reverse_rules now returns the in-memory fused expert leaves, and revert_weight_conversion_quant_aware calls _assert_experts_pre_expanded, which raises QuantConversionUnsupportedError (-> fallback) if a stacked/fused .experts.gate_up_proj (or any 3-D expert tensor) survived pre-expansion — so it no longer ships unrenamed. Covered by test_stacked_experts_guard.
…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
d727581 to
0c9230c
Compare
meenchen
left a comment
There was a problem hiding this comment.
Does this mean that the model.save_pretrianed is broken for this model? What happen if we just load and save the BF16 checkpoint?
save_pretrained is not broken for the BF16 model, it's specific to quantized export. transformers applies a load-time But for quantized export, we disable transformers' revert (_patch_revert_weight_conversion) due to ModelOpt's scales are 0-d scalars, and without it the export keeps the in-memory post-conversion names. |
PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
c08d168 to
6ec9ace
Compare
register_fused_experts_on_the_fly skipped fused-expert modules lacking an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom GPT-OSS-style gated activation between its two F.linear calls instead of an act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed experts stayed unquantized (an experts-only recipe matched nothing) and HF export failed with NotImplementedError. _QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls, gate_up then down), so act_fn is irrelevant to quantization, calibration, and export. Drop the requirement from _is_fused_experts_module. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3. Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3 detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid HF checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Condense the act_fn explanation in _is_fused_experts_module's docstring and remove the work-log comment block in the corresponding unit test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…port
ModelOpt's unified HF export builds the state dict from the in-memory
(transformers post-conversion) module names and disables transformers'
save-side revert_weight_conversion (it raises IndexError on 0-d scalar
scale tensors). As a result, when transformers applies a load-time
conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered
model/language_model prefix), the exported tensor names no longer match
the original HF hub checkpoint, breaking the unified-checkpoint contract
(observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names).
Add a quantization-aware reverse that carries each weight's companion
scale tensors (weight_scale, weight_scale_2, input_scale,
weight_scale_inv, bias) through:
- renames (key-level substitution; scale siblings follow the module path)
- output-dim un-fuse splits (split weight/weight_scale on the fused dim,
duplicate 0-d scalar weight_scale_2/input_scale to each part)
Reverse rules are derived best-effort from the model's conversion mapping;
any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist)
raises QuantConversionUnsupportedError and the export falls back to the
prior in-memory-name behavior with a warning (non-breaking).
Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with
scale split + scalar duplication, 3-D expert + non-divisible guards,
end-to-end MiniMax-M3-like reversal). End-to-end export validation on a
real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU)
is still pending.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…e test
GPU validation (tiny Mixtral, transformers 5.3.0) revealed the real shape of the
problem: ModelOpt's export already expands the fused, stacked in-memory experts
(experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears
(experts.<i>.gate_proj/up_proj/down_proj) before save. So the reverse for experts
is a pure per-expert leaf rename (gate_proj->w1, up_proj->w3, down_proj->w2), not a
3-D tensor un-stack. The previous MergeModulelist->Unsupported bail was therefore
wrong: it forced the whole export to fall back, leaving experts mis-named.
Rewrite _build_reverse_rules to use transformers' reverse_transform() for correctly
reversed name patterns and to classify converters:
- WeightRenaming -> RenameRule
- expert WeightConverter -> per-expert leaf RenameRules (SplitModulelist present)
- dense fusing WeightConverter (Chunk only) -> SplitRule (fused tensor survives)
Add a GPU end-to-end test: quantize tiny Mixtral to NVFP4, export, and assert the
exported tensor names exactly equal the canonical hub names from transformers'
own revert_weight_conversion on the reference model (experts land as
block_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain). Passes.
Existing export unit tests remain green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…U derivation test CI fixes for PR #1833: - Drop float8_e4m3 from the synthetic weight_scale (Windows CPU lacks portable float8 ops, failing the `windows` unit job; the reverse logic is dtype-agnostic, so float32 keeps full coverage). Linux multi-version jobs were already green. - Add a CPU test that derives reverse rules from a real transformers Mixtral conversion mapping and asserts a ModelOpt-expanded per-expert state dict reverts to the hub layout (block_sparse_moe.experts.N.w{1,2,3}) with scale siblings carried — covers _build_reverse_rules / revert_weight_conversion_quant_aware without a GPU (raises codecov coverage of the new module). - Apply ruff-format (fixes the code-quality pre-commit failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The quant-aware reverse conversion derived WeightRenamings from the model's
conversion mapping and applied them in forward (list) order. A component-
reordering rename (`model.language_model` -> `language_model.model`) then ran
before the MoE container renames that anchor on the resulting adjacency
(`.language_model.layers.N.mlp.experts.` -> `.block_sparse_moe.experts.`,
plus the router `gate`/`e_score_correction_bias`). The reorder moved
`language_model` away from `layers`, so those container renames silently
no-op'd and MiniMax-M3 experts exported under the in-memory `mlp.experts.*`
names instead of the hub `block_sparse_moe.experts.*.w{1,2,3}` names, breaking
the unified-checkpoint hub-naming contract.
Mirror transformers' save-path order (converter first, then renamings): collect
expert leaf renames and WeightRenamings separately, apply leaf renames first,
then WeightRenamings in reverse list order so the prefix reorder fires last and
preserves the anchors the container/gate renames depend on.
Add a regression test that reproduces the reorder+container interaction with a
minimal two-renaming mapping (Mixtral does not exercise it: no prefix reorder).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
- Guard split targets against collisions before overwriting keys, mirroring the rename collision guard (CodeRabbit). - Move the transformers.core_model_loading import into a guarded try/except so an older/drifted transformers raises QuantConversionUnsupportedError and falls back to legacy names instead of a hard ImportError (CodeRabbit). - Add a stacked/fused-expert guard: _build_reverse_rules now returns the in-memory fused expert leaves, and revert_weight_conversion_quant_aware raises (-> fallback) if experts were not pre-expanded, so a 3-D experts.gate_up_proj can no longer ship silently mis-named on the rename path (reviewer). - Widen the export-site catch to Exception so any unanticipated reverse-conversion failure falls back to in-memory names rather than aborting export (reviewer); drop the now-unused QuantConversionUnsupportedError import there. - Hoist revert_weight_conversion_quant_aware to a top-level test import (CodeRabbit). - Add unit tests for the split-collision and stacked-expert guards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Document the unified-HF-export fix under 0.46 Bug Fixes: exports now round-trip
to the original hub tensor names (e.g. MiniMax-M3 block_sparse_moe.experts.*.w{1,2,3})
instead of transformers' in-memory post-conversion names, via the new quant-aware
reverse conversion, with fallback to legacy names for unsupported mapping ops.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
… fix)
The reorder-ordering regression test guarded transformers presence with
pytest.importorskip("transformers") but then imported WeightRenaming from
transformers.core_model_loading, a transformers 5.x submodule absent in the
minimum-supported version. On the tf_min CI job transformers imports fine but the
submodule does not, so the import raised ModuleNotFoundError (a hard failure)
instead of skipping. Skip on the submodule directly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The disabled transformers revert_weight_conversion fails on ModelOpt's 0-d scalar
scale tensors with a RuntimeError from torch.chunk ("chunk expects at least a
1-dimensional tensor"), not IndexError (verified on transformers 5.12.0). Fix the
wording in the quant_aware_conversion docstring, the CHANGELOG entry, and the
_patch_revert_weight_conversion comments, and note the actual Chunk.convert() call
site. Also collapse the two overlapping fallback comment blocks at the export call
site into one.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
PR #1833 reverse-converts exported weight TENSOR names to the original hub namespace, but the quantization config's module references (exclude_modules and, for mixed precision, quantized_layers keys) are built from in-memory module names and stayed in the post-conversion namespace. A deployment loader (vLLM/SGLang) then matches those hub-mismatched patterns against the reverted hub-named modules, finds no match, and loads an excluded BF16 layer as quantized -> shape assert at load. Observed on MiniMax-M3 experts-only NVFP4: attention/dense excludes named model.language_model.layers.N.self_attn* never matched the hub tensors language_model.model.layers.N.self_attn.*. Apply the same reverse rename to exclude_modules / quantized_layers keys, gated on the weight-reverse succeeding so weights and config stay mutually consistent (both hub or both in-memory on fallback). The name mapper is wildcard-safe: exclude globs (e.g. *.shared_experts*) keep their trailing wildcard through the rename. - quant_aware_conversion.py: add build_reverse_name_mapper + revert_quant_config_names. - unified_export_hf.py: reorder so the reverse (weights + config names) runs before hf_quant_config.json is written / format-converted. - add unit test test_revert_quant_config_names_mapper (Mixtral mapping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…st-remote-code) export
copy_custom_model_files returned early unless trust_remote_code was set, so a VLM
exported via the native transformers path (no --trust_remote_code) lost its
processor/tokenizer artifacts. transformers 5.x doesn't re-save these for natively
loaded models, so the exported checkpoint was missing e.g. preprocessor_config.json
and failed to serve on vLLM ('Can't load image processor ... containing a
preprocessor_config.json'). Observed on MiniMax-M3 NVFP4.
Split the copy: processor/tokenizer *data* artifacts (preprocessor_config.json,
processor helpers, merges.txt/vocab.json/added_tokens/special_tokens_map, etc.) are
deployment-critical and copied regardless of trust_remote_code; executable custom
model/config code (modeling*.py, configuration_*.py, tokenization_*.py, other JSON)
stays gated on trust_remote_code. config.json / safetensors index still skipped.
The trust_remote_code=True set remains a superset of the previous behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
6ec9ace to
85649f2
Compare
What
ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side
revert_weight_conversion(it raisesIndexErroron ModelOpt's 0-d scalar scale tensors). So when transformers applies a load-timeconversion_mapping(renamed MoE leaves,block_sparse_moe↔mlp, reorderedmodel/language_modelprefix, fused densegate_up_proj), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract.Observed concretely on MiniMax-M3:
nvidia/MiniMax-M3-NVFP4-v1emitted the converted names (model.language_model.*,mlp.experts.*.gate_proj) instead of hub names (language_model.model.*,block_sparse_moe.experts.*.w{1,2,3}).How
New
modelopt/torch/export/quant_aware_conversion.pyperforms a quantization-aware reverse conversion, derived from the model's conversion mapping via transformers' ownreverse_transform()(so anchored regex renamings reverse correctly), carrying each weight's companion scale tensors (weight_scale,weight_scale_2,input_scale,weight_scale_inv,bias):gate_up_proj→gate_proj+up_proj): splitweight/weight_scale/biason the fused dim, duplicate the 0-d scalars.Key insight from GPU validation: ModelOpt's export already expands fused, stacked in-memory experts (
experts.gate_up_proj [E,2F,H]) into per-expert 2-D linears before save, so the reverse for experts is a pure per-expert leaf rename (gate_proj→w1,up_proj→w3,down_proj→w2), not a 3-D un-stack. Converters are classified by their reversed ops:SplitModulelistpresent ⇒ expert rename;Chunk-only ⇒ dense split.export_hf_checkpointapplies this beforesave_pretrained; anything not reversible raisesQuantConversionUnsupportedErrorand the export falls back to prior behavior with a warning (non-breaking).Tests
tests/unit/torch/export/test_quant_aware_conversion.py, CPU): rename carries scales; densegate_up_projun-fuse with scale split + scalar duplication; 3-D / non-divisible guards; end-to-end MiniMax-M3-like reversal.tests/gpu/torch/export/test_quant_aware_conversion_gpu.py): quantize a tiny Mixtral to NVFP4 →export_hf_checkpoint→ assert exported tensor names exactly equal the canonical hub names from transformers' ownrevert_weight_conversionon the reference model (experts land asblock_sparse_moe.experts.N.w{1,2,3}; no fused in-memory names remain).Validation
test_unified_export_hf.pyremains green (no regression).block_sparse_moe↔mlp, expert gate/up fuse) is the same machinery MoE VLMs like MiniMax-M3 use.minimax_m3_vlentry, so M3 itself isn't exercised in CI here; Mixtral is the structural surrogate. When a transformers build with the M3 mapping is available, the same test can be parametrized for M3.Follow-ups
Chunk.convert()for 0-d tensors, then drop the_patch_revert_weight_conversionno-op.🤖 Generated with Claude Code
Summary by CodeRabbit