fix: handle Qwen3-VL-MoE expert dimensions#1885
Conversation
Signed-off-by: harshtripathi6 <htripathi6@gmail.com>
📝 WalkthroughWalkthroughReworks ChangesQwen3-VL MoE experts quantization fix
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant Test as Unit Test
participant QuantExperts as _QuantQwen3VLMoeTextExperts
participant HFExperts as Qwen3VLMoeTextExperts
Test->>QuantExperts: _convert_to_quant_wrapper + _setup()
Test->>HFExperts: forward(hidden_states, top_k_index, top_k_weights)
Test->>QuantExperts: forward(hidden_states, top_k_index, top_k_weights)
QuantExperts->>QuantExperts: reshape using hidden_dim
QuantExperts->>QuantExperts: route by (token_idx, top_k_pos)
QuantExperts-->>Test: output reshaped to orig_shape
HFExperts-->>Test: output
Test->>Test: assert outputs match
Suggested reviewers: 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py (2)
16-17: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate license header.
Lines 1-14 already carry the full NVIDIA Apache-2.0 header; the extra
SPDX-FileCopyrightText/SPDX-License-Identifierpair on Lines 16-17 is redundant and may cause theinsert-licensepre-commit hook to flag the file. Remove the duplicate.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 16 - 17, The file has a duplicate SPDX license header, with an extra SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from test_qwen3_vl_moe_experts.py so only the original header remains and the insert-license hook won’t flag it.
32-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse distinct dimensions so the parity test can catch layout/transpose regressions.
With
hidden_size=8andmoe_intermediate_size=4, the fusedgate_up_projhas shape(num_experts, 2*expert_dim=8, hidden_dim=8)— a square inner block. A transpose or wrong split dimension (the exact class of bug this PR fixes) would still produce shape-compatible tensors and pass, so the test provides weak regression protection. Pick dims where2*moe_intermediate_size != hidden_size(e.g.moe_intermediate_size = 6).♻️ Make dimensions asymmetric
class TinyQwen3VLMoeTextConfig: num_experts = 2 hidden_size = 8 - moe_intermediate_size = 4 + moe_intermediate_size = 6 hidden_act = "silu" _experts_implementation = "eager"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 32 - 37, The parity test setup in TinyQwen3VLMoeTextConfig uses symmetric dimensions, which can hide transpose/layout bugs in the fused gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3 VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size, and keep the same config class/reference points so the regression coverage still exercises the eager expert implementation.modelopt/torch/quantization/plugins/huggingface.py (1)
887-915: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winWeight slicing is correct for the current HF layout; persist
hidden_dimto avoid a divergent fallback inforward.The layout handling here is right: HF stores
gate_up_projas(num_experts, 2*expert_dim, hidden_dim)anddown_projas(num_experts, hidden_dim, expert_dim), soshape[-1]/shape[1]//2and the[:expert_dim]/[expert_dim:]row slices map cleanly onto thenn.Linearweight shapes.However,
_setupresolveshidden_dimwith a 3-level fallback (hidden_dim→hidden_size→gate_up_proj.shape[-1]), whileforward(Line 930) uses only a 2-level fallback andgate_up_projis already deleted. If a module exposes neither attribute,forwardgetshidden_dim=Noneandreshape(-1, None)crashes. Store the resolved value once and reuse it.♻️ Persist the resolved hidden dimension
if hidden_dim is None: hidden_dim = self.gate_up_proj.shape[-1] + self.hidden_dim = hidden_dimAnd in
forward(Line 930-931):- hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None)) - hidden_states = hidden_states.reshape(-1, hidden_dim) + hidden_states = hidden_states.reshape(-1, self.hidden_dim)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 887 - 915, Persist the resolved hidden dimension from the setup logic in the MoE plugin so `forward` does not depend on a different fallback path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in `huggingface.py` to store the computed `hidden_dim` once on the module and have `forward` reuse that saved value instead of re-deriving it from `hidden_size` or `gate_up_proj`. This should be applied around the existing `hidden_dim` resolution and the `forward` reshape path for the same class handling `gate_up_proj` and `down_proj`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 929-948: The Qwen3VLMoeTextExperts forward path is using the wrong
positional argument order, which swaps routing_weights and router_indices on the
production call. Update the Qwen3VLMoeTextExperts.forward signature and its
internal tensor usage to match HF’s call order from experts(hidden_states,
top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and
weighted_output logic to read the correct tensors. Also update the parity test
to use the same HF-aligned argument order so it validates the real dispatch
path.
In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 66-67: The test in qwen3_vl_moe_experts.py is masking the
signature mismatch by calling original and wrapped with different argument
orders. Update the assertions so both use the same HF order for hidden_states,
top_k_index, and top_k_weights, and verify the wrapper via the expert module
path that the parent MoE block uses. This should align with the wrapped forward
implementation in huggingface.py and catch any future signature regressions.
---
Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 887-915: Persist the resolved hidden dimension from the setup
logic in the MoE plugin so `forward` does not depend on a different fallback
path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in
`huggingface.py` to store the computed `hidden_dim` once on the module and have
`forward` reuse that saved value instead of re-deriving it from `hidden_size` or
`gate_up_proj`. This should be applied around the existing `hidden_dim`
resolution and the `forward` reshape path for the same class handling
`gate_up_proj` and `down_proj`.
In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 16-17: The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
- Around line 32-37: The parity test setup in TinyQwen3VLMoeTextConfig uses
symmetric dimensions, which can hide transpose/layout bugs in the fused
gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3
VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size,
and keep the same config class/reference points so the regression coverage still
exercises the eager expert implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 13833783-0784-4c47-82ce-a9e27324bdfe
📒 Files selected for processing (2)
modelopt/torch/quantization/plugins/huggingface.pytests/unit/torch/quantization/test_qwen3_vl_moe_experts.py
| orig_shape = hidden_states.shape | ||
| hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None)) | ||
| hidden_states = hidden_states.reshape(-1, hidden_dim) | ||
| next_states = torch.zeros_like(hidden_states) | ||
| with torch.no_grad(): | ||
| expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts) | ||
| expert_mask = expert_mask.permute(2, 1, 0) | ||
| expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero() | ||
| for expert_idx in expert_hit: | ||
| expert_idx = expert_idx[0] | ||
| with torch.no_grad(): | ||
| _, token_idx = torch.where(expert_mask[expert_idx[0]]) | ||
| top_k_pos, token_idx = torch.where(expert_mask[expert_idx]) | ||
| current_state = hidden_states[token_idx] | ||
| gate = self.gate_proj[expert_idx](current_state) | ||
| up = self.up_proj[expert_idx](current_state) | ||
| gated_output = up * self.act_fn(gate) | ||
| out = self.down_proj[expert_idx](gated_output) | ||
| weighted_output = out * routing_weights[token_idx, expert_idx, None] | ||
| weighted_output = out * routing_weights[token_idx, top_k_pos, None] | ||
| next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype)) | ||
| next_states = next_states.view(batch_size, -1, self.hidden_size) | ||
|
|
||
| return next_states | ||
| return next_states.view(orig_shape) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
Qwen3VLMoeTextExperts forward signature top_k_index top_k_weights transformers 4.56
💡 Result:
In the Hugging Face Transformers library, specifically within the Qwen3-VL-Moe model implementation, the Qwen3VLMoeTextExperts.forward method is designed to process token hidden states through selected Mixture-of-Experts (MoE) layers [1][2][3]. The signature of this forward method is: def forward( self, hidden_states: torch.Tensor, top_k_index: torch.Tensor, top_k_weights: torch.Tensor,) -> torch.Tensor: Parameters: - hidden_states: The input tensor containing the hidden representations of the tokens [1][2]. - top_k_index: A tensor specifying the indices of the experts selected for each token (typically derived from the model's router/gate) [1][2][3]. - top_k_weights: A tensor containing the routing weights corresponding to the selected experts, used to scale the output of each expert before aggregation [1][2][3]. The method computes expert outputs, often by iterating over the active experts identified by top_k_index, applying the expert's gate/up/down projections, and then accumulating the weighted results back into the final hidden state using index_add_ [1][2][3]. Regarding transformers 4.56: This version was part of the development cycle where these MoE implementations were actively refined [4]. The variable names top_k_index and top_k_weights became the standard naming convention in Transformers MoE modules to ensure consistency across different model architectures (e.g., Qwen3, Mixtral) [5][6]. Users encountering memory issues or integration needs with this method often reference these specific parameters when troubleshooting or customizing expert execution [7][8].
Citations:
- 1: https://github.com/huggingface/transformers/blob/v5.8.1/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py
- 2: https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py
- 3: https://github.com/huggingface/transformers/blob/24a4dc22/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py
- 4: huggingface/transformers@357f414
- 5: Variable name mismatch on
MixtralExperts.forward()huggingface/transformers#41984 - 6: https://docs.vllm.ai/en/v0.11.2/models/supported_models/
- 7: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/moe.py
- 8: Uneven GPU memory usage and CUDA OOM during multi-GPU inference with Qwen3-VL-30B-A3B-Thinking huggingface/transformers#41570
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant implementation and test around the reported lines.
git ls-files modelopt/torch/quantization/plugins/huggingface.py tests | sed -n '1,200p'
echo '--- huggingface.py around reported lines ---'
sed -n '900,970p' modelopt/torch/quantization/plugins/huggingface.py
echo '--- search for Qwen3VLMoeTextSparseMoeBlock and experts call sites ---'
rg -n "Qwen3VLMoeTextSparseMoeBlock|self\.experts\(|top_k_index|top_k_weights|routing_weights|router_indices" modelopt tests -g '!**/__pycache__/**'
echo '--- test file cited in review comment ---'
sed -n '1,140p' tests/unit/torch/quantization/plugins/test_qwen3_vl_moe_experts.pyRepository: NVIDIA/Model-Optimizer
Length of output: 22730
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- huggingface.py around the second forward wrapper ---'
sed -n '1500,1545p' modelopt/torch/quantization/plugins/huggingface.py
echo '--- expert-removal hook call sites ---'
sed -n '360,410p' modelopt/torch/prune/importance_hooks/expert_removal_hooks.py
echo '--- any test file for qwen3 vl moe experts ---'
fd -a 'test_qwen3*' tests modelopt || true
fd -a '*qwen3*experts*' tests modelopt || trueRepository: NVIDIA/Model-Optimizer
Length of output: 4698
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- class header and surrounding methods for the reported wrapper ---'
sed -n '860,930p' modelopt/torch/quantization/plugins/huggingface.py
echo '--- test file for qwen3 vl moe experts ---'
sed -n '1,140p' tests/unit/torch/quantization/test_qwen3_vl_moe_experts.pyRepository: NVIDIA/Model-Optimizer
Length of output: 5443
Align the Qwen3-VL-MoE wrapper signature with HF’s call order
modelopt/torch/quantization/plugins/huggingface.py: Qwen3VLMoeTextExperts is called positionally as experts(hidden_states, top_k_index, top_k_weights), but this wrapper declares forward(self, hidden_states, routing_weights, router_indices). That swaps the tensors on the production path and breaks the expert dispatch. The parity test uses the same swapped order, so it doesn’t catch the mismatch.
Align the signature with HF
def forward(
self,
hidden_states: torch.Tensor,
- routing_weights: torch.Tensor,
router_indices: torch.Tensor,
+ routing_weights: torch.Tensor,
) -> torch.Tensor:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| orig_shape = hidden_states.shape | |
| hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None)) | |
| hidden_states = hidden_states.reshape(-1, hidden_dim) | |
| next_states = torch.zeros_like(hidden_states) | |
| with torch.no_grad(): | |
| expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts) | |
| expert_mask = expert_mask.permute(2, 1, 0) | |
| expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero() | |
| for expert_idx in expert_hit: | |
| expert_idx = expert_idx[0] | |
| with torch.no_grad(): | |
| _, token_idx = torch.where(expert_mask[expert_idx[0]]) | |
| top_k_pos, token_idx = torch.where(expert_mask[expert_idx]) | |
| current_state = hidden_states[token_idx] | |
| gate = self.gate_proj[expert_idx](current_state) | |
| up = self.up_proj[expert_idx](current_state) | |
| gated_output = up * self.act_fn(gate) | |
| out = self.down_proj[expert_idx](gated_output) | |
| weighted_output = out * routing_weights[token_idx, expert_idx, None] | |
| weighted_output = out * routing_weights[token_idx, top_k_pos, None] | |
| next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype)) | |
| next_states = next_states.view(batch_size, -1, self.hidden_size) | |
| return next_states | |
| return next_states.view(orig_shape) | |
| def forward( | |
| self, | |
| hidden_states: torch.Tensor, | |
| router_indices: torch.Tensor, | |
| routing_weights: torch.Tensor, | |
| ) -> torch.Tensor: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 929 - 948,
The Qwen3VLMoeTextExperts forward path is using the wrong positional argument
order, which swaps routing_weights and router_indices on the production call.
Update the Qwen3VLMoeTextExperts.forward signature and its internal tensor usage
to match HF’s call order from experts(hidden_states, top_k_index,
top_k_weights), then adjust the expert_mask, expert_hit, and weighted_output
logic to read the correct tensors. Also update the parity test to use the same
HF-aligned argument order so it validates the real dispatch path.
| expected = original(hidden_states, top_k_index, top_k_weights) | ||
| actual = wrapped(hidden_states, top_k_weights, top_k_index) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Test passes different argument orders to original vs wrapped, masking the production signature mismatch.
original is called (hidden_states, top_k_index, top_k_weights) while wrapped is called (hidden_states, top_k_weights, top_k_index). This only "matches" because the wrapper's forward signature is reversed vs HF (see the critical comment in huggingface.py Lines 929-948). As written, the test does not exercise how the parent MoE block actually calls the experts module, so it cannot protect against the real defect. After fixing the wrapper signature, both calls should use the same (HF) order.
💚 Call both with HF's argument order
- expected = original(hidden_states, top_k_index, top_k_weights)
- actual = wrapped(hidden_states, top_k_weights, top_k_index)
+ expected = original(hidden_states, top_k_index, top_k_weights)
+ actual = wrapped(hidden_states, top_k_index, top_k_weights)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| expected = original(hidden_states, top_k_index, top_k_weights) | |
| actual = wrapped(hidden_states, top_k_weights, top_k_index) | |
| expected = original(hidden_states, top_k_index, top_k_weights) | |
| actual = wrapped(hidden_states, top_k_index, top_k_weights) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 66 -
67, The test in qwen3_vl_moe_experts.py is masking the signature mismatch by
calling original and wrapped with different argument orders. Update the
assertions so both use the same HF order for hidden_states, top_k_index, and
top_k_weights, and verify the wrapper via the expert module path that the parent
MoE block uses. This should align with the wrapped forward implementation in
huggingface.py and catch any future signature regressions.
|
CodeRabbit reported that it does not have sufficient GitHub permissions to request changes on this PR. I do not have admin access to update the app permissions for this repository, so leaving this for maintainers.
|
What does this PR do?
Type of change: Bug fix
Fixes Qwen3-VL-MoE expert quantization wrapper handling for Hugging Face
Qwen3VLMoeTextExperts.The wrapper previously assumed the expert module exposed
hidden_size, but the current Hugging Face implementation exposeshidden_dim. It also splitgate_up_projalong the wrong dimension and transposed fused expert weights incorrectly for the current HF layout. This causedAttributeError: hidden_sizeand could later lead to matmul shape mismatches during expert forward.This PR:
hidden_dim,hidden_size, or the fused weight shape.gate_up_projalong the fused gate/up output dimension.down_projin the layout expected bynn.Linear.top_k_posinstead of expert id when indexing routing weights.Usage
No user-facing API changes.
This fix is exercised internally when ModelOpt wraps Hugging Face
Qwen3VLMoeTextExpertsduring quantization.Testing
Before your PR is "Ready for review"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (
git commit -s -S).Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/AAdditional Information
Fixes #1720.
Summary by CodeRabbit
Bug Fixes
Tests