fix: handle Qwen3-VL-MoE expert dimensions by harshtripathi6 · Pull Request #1885 · NVIDIA/Model-Optimizer

harshtripathi6 · 2026-07-02T02:42:11Z

What does this PR do?

Type of change: Bug fix

Fixes Qwen3-VL-MoE expert quantization wrapper handling for Hugging Face Qwen3VLMoeTextExperts.

The wrapper previously assumed the expert module exposed hidden_size, but the current Hugging Face implementation exposes hidden_dim. It also split gate_up_proj along the wrong dimension and transposed fused expert weights incorrectly for the current HF layout. This caused AttributeError: hidden_size and could later lead to matmul shape mismatches during expert forward.

This PR:

Resolves the hidden dimension from hidden_dim, hidden_size, or the fused weight shape.
Splits gate_up_proj along the fused gate/up output dimension.
Copies down_proj in the layout expected by nn.Linear.
Uses top_k_pos instead of expert id when indexing routing weights.
Preserves the original hidden state shape in the wrapper output.
Adds a small offline CPU unit test comparing the quant wrapper output against the HF expert forward.

Usage

No user-facing API changes.

This fix is exercised internally when ModelOpt wraps Hugging Face Qwen3VLMoeTextExperts during quantization.

Testing

pytest tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py -q
pytest tests/unit/torch/quantization -q
pre-commit run --files modelopt/torch/quantization/plugins/huggingface.py tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

Before your PR is "Ready for review"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (git commit -s -S).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
Did you get Claude approval on this PR?: N/A

Additional Information

Fixes #1720.

Summary by CodeRabbit

Bug Fixes
- Improved quantized routing for Mixture-of-Experts text layers so outputs now match the expected model behavior more closely.
- Fixed tensor reshaping and expert weight handling to produce consistent results across small and larger routing setups.
Tests
- Added coverage to verify the quantized wrapper matches the reference model’s outputs and shapes for a representative MoE configuration.

Signed-off-by: harshtripathi6 <htripathi6@gmail.com>

copy-pr-bot · 2026-07-02T02:42:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-07-02T02:43:26Z

📝 Walkthrough

Walkthrough

Reworks _QuantQwen3VLMoeTextExperts in the Hugging Face quantization plugin to derive dimensions from hidden_dim/gate_up_proj shape instead of hidden_size, changes weight slice copying to avoid transpose assumptions, and fixes forward routing to index routing_weights by token and top-k position. Adds a unit test comparing outputs against Hugging Face's implementation.

Changes

Qwen3-VL MoE experts quantization fix

Layer / File(s)	Summary
Expert weight setup and dimension inference `modelopt/torch/quantization/plugins/huggingface.py`	`_setup` derives `hidden_dim`/`expert_dim` from `hidden_dim`/`hidden_size` and `gate_up_proj` shape, builds gate/up/down Linear layers using `hidden_dim`, and copies weight slices without transpose-based indexing.
Forward routing correction `modelopt/torch/quantization/plugins/huggingface.py`	`forward` reshapes tensors using `hidden_dim`, accumulates outputs in a flat buffer, and indexes `routing_weights` by `(token_idx, top_k_pos)` instead of an expert-based dimension, then reshapes back to `orig_shape`.
Parity unit test `tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`	Adds `TinyQwen3VLMoeTextConfig`, `_convert_to_quant_wrapper`, and a skip-guarded test that seeds RNG, initializes expert weights, and asserts the quant wrapper's outputs match Hugging Face's `Qwen3VLMoeTextExperts` on fixed inputs.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Test as Unit Test
  participant QuantExperts as _QuantQwen3VLMoeTextExperts
  participant HFExperts as Qwen3VLMoeTextExperts

  Test->>QuantExperts: _convert_to_quant_wrapper + _setup()
  Test->>HFExperts: forward(hidden_states, top_k_index, top_k_weights)
  Test->>QuantExperts: forward(hidden_states, top_k_index, top_k_weights)
  QuantExperts->>QuantExperts: reshape using hidden_dim
  QuantExperts->>QuantExperts: route by (token_idx, top_k_pos)
  QuantExperts-->>Test: output reshaped to orig_shape
  HFExperts-->>Test: output
  Test->>Test: assert outputs match

Suggested reviewers: shengliangxu, jingyu-ml, kevalmorabia97

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title concisely describes the main fix: Qwen3-VL-MoE expert dimension handling.
Linked Issues check	✅ Passed	The changes fix hidden_dim handling, correct fused gate/up routing, and add a matching HF parity test for `#1720`.
Out of Scope Changes check	✅ Passed	The added unit test and wrapper fixes stay within the Qwen3-VL-MoE quantization bug fix scope.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns	✅ Passed	The Qwen3-MoE wrapper/test changes add no unsafe load, remote-code, eval/exec, or nosec patterns.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (3)

tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py (2)
16-17: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate license header.

Lines 1-14 already carry the full NVIDIA Apache-2.0 header; the extra SPDX-FileCopyrightText/SPDX-License-Identifier pair on Lines 16-17 is redundant and may cause the insert-license pre-commit hook to flag the file. Remove the duplicate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 16 -
17, The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
32-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use distinct dimensions so the parity test can catch layout/transpose regressions.

With hidden_size=8 and moe_intermediate_size=4, the fused gate_up_proj has shape (num_experts, 2*expert_dim=8, hidden_dim=8) — a square inner block. A transpose or wrong split dimension (the exact class of bug this PR fixes) would still produce shape-compatible tensors and pass, so the test provides weak regression protection. Pick dims where 2*moe_intermediate_size != hidden_size (e.g. moe_intermediate_size = 6).
♻️ Make dimensions asymmetric
 class TinyQwen3VLMoeTextConfig:
     num_experts = 2
     hidden_size = 8
-    moe_intermediate_size = 4
+    moe_intermediate_size = 6
     hidden_act = "silu"
     _experts_implementation = "eager"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 32 -
37, The parity test setup in TinyQwen3VLMoeTextConfig uses symmetric dimensions,
which can hide transpose/layout bugs in the fused gate_up_proj path. Change the
test config to use asymmetric sizes in the Qwen3 VL MoE quantization test so
2*moe_intermediate_size does not equal hidden_size, and keep the same config
class/reference points so the regression coverage still exercises the eager
expert implementation.
modelopt/torch/quantization/plugins/huggingface.py (1)
887-915: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Weight slicing is correct for the current HF layout; persist hidden_dim to avoid a divergent fallback in forward.

The layout handling here is right: HF stores gate_up_proj as (num_experts, 2*expert_dim, hidden_dim) and down_proj as (num_experts, hidden_dim, expert_dim), so shape[-1]/shape[1]//2 and the [:expert_dim]/[expert_dim:] row slices map cleanly onto the nn.Linear weight shapes.

However, _setup resolves hidden_dim with a 3-level fallback (hidden_dim → hidden_size → gate_up_proj.shape[-1]), while forward (Line 930) uses only a 2-level fallback and gate_up_proj is already deleted. If a module exposes neither attribute, forward gets hidden_dim=None and reshape(-1, None) crashes. Store the resolved value once and reuse it.
♻️ Persist the resolved hidden dimension
         if hidden_dim is None:
             hidden_dim = self.gate_up_proj.shape[-1]
+        self.hidden_dim = hidden_dim
And in forward (Line 930-931):
-        hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
-        hidden_states = hidden_states.reshape(-1, hidden_dim)
+        hidden_states = hidden_states.reshape(-1, self.hidden_dim)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 887 - 915,
Persist the resolved hidden dimension from the setup logic in the MoE plugin so
`forward` does not depend on a different fallback path after `gate_up_proj` is
deleted. Update the `_setup`/initialization flow in `huggingface.py` to store
the computed `hidden_dim` once on the module and have `forward` reuse that saved
value instead of re-deriving it from `hidden_size` or `gate_up_proj`. This
should be applied around the existing `hidden_dim` resolution and the `forward`
reshape path for the same class handling `gate_up_proj` and `down_proj`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 929-948: The Qwen3VLMoeTextExperts forward path is using the wrong
positional argument order, which swaps routing_weights and router_indices on the
production call. Update the Qwen3VLMoeTextExperts.forward signature and its
internal tensor usage to match HF’s call order from experts(hidden_states,
top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and
weighted_output logic to read the correct tensors. Also update the parity test
to use the same HF-aligned argument order so it validates the real dispatch
path.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 66-67: The test in qwen3_vl_moe_experts.py is masking the
signature mismatch by calling original and wrapped with different argument
orders. Update the assertions so both use the same HF order for hidden_states,
top_k_index, and top_k_weights, and verify the wrapper via the expert module
path that the parent MoE block uses. This should align with the wrapped forward
implementation in huggingface.py and catch any future signature regressions.

---

Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 887-915: Persist the resolved hidden dimension from the setup
logic in the MoE plugin so `forward` does not depend on a different fallback
path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in
`huggingface.py` to store the computed `hidden_dim` once on the module and have
`forward` reuse that saved value instead of re-deriving it from `hidden_size` or
`gate_up_proj`. This should be applied around the existing `hidden_dim`
resolution and the `forward` reshape path for the same class handling
`gate_up_proj` and `down_proj`.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 16-17: The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
- Around line 32-37: The parity test setup in TinyQwen3VLMoeTextConfig uses
symmetric dimensions, which can hide transpose/layout bugs in the fused
gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3
VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size,
and keep the same config class/reference points so the regression coverage still
exercises the eager expert implementation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 13833783-0784-4c47-82ce-a9e27324bdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and 86512a9.

📒 Files selected for processing (2)

modelopt/torch/quantization/plugins/huggingface.py
tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

coderabbitai · 2026-07-02T02:53:05Z

+        orig_shape = hidden_states.shape
+        hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
+        hidden_states = hidden_states.reshape(-1, hidden_dim)
        next_states = torch.zeros_like(hidden_states)
        with torch.no_grad():
            expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts)
            expert_mask = expert_mask.permute(2, 1, 0)
            expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
        for expert_idx in expert_hit:
+            expert_idx = expert_idx[0]
            with torch.no_grad():
-                _, token_idx = torch.where(expert_mask[expert_idx[0]])
+                top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
            current_state = hidden_states[token_idx]
            gate = self.gate_proj[expert_idx](current_state)
            up = self.up_proj[expert_idx](current_state)
            gated_output = up * self.act_fn(gate)
            out = self.down_proj[expert_idx](gated_output)
-            weighted_output = out * routing_weights[token_idx, expert_idx, None]
+            weighted_output = out * routing_weights[token_idx, top_k_pos, None]
            next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
-        next_states = next_states.view(batch_size, -1, self.hidden_size)
-
-        return next_states
+        return next_states.view(orig_shape)


🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Qwen3VLMoeTextExperts forward signature top_k_index top_k_weights transformers 4.56

💡 Result:

In the Hugging Face Transformers library, specifically within the Qwen3-VL-Moe model implementation, the Qwen3VLMoeTextExperts.forward method is designed to process token hidden states through selected Mixture-of-Experts (MoE) layers [1][2][3]. The signature of this forward method is: def forward( self, hidden_states: torch.Tensor, top_k_index: torch.Tensor, top_k_weights: torch.Tensor,) -> torch.Tensor: Parameters: - hidden_states: The input tensor containing the hidden representations of the tokens [1][2]. - top_k_index: A tensor specifying the indices of the experts selected for each token (typically derived from the model's router/gate) [1][2][3]. - top_k_weights: A tensor containing the routing weights corresponding to the selected experts, used to scale the output of each expert before aggregation [1][2][3]. The method computes expert outputs, often by iterating over the active experts identified by top_k_index, applying the expert's gate/up/down projections, and then accumulating the weighted results back into the final hidden state using index_add_ [1][2][3]. Regarding transformers 4.56: This version was part of the development cycle where these MoE implementations were actively refined [4]. The variable names top_k_index and top_k_weights became the standard naming convention in Transformers MoE modules to ensure consistency across different model architectures (e.g., Qwen3, Mixtral) [5][6]. Users encountering memory issues or integration needs with this method often reference these specific parameters when troubleshooting or customizing expert execution [7][8].

Citations:

1: https://github.com/huggingface/transformers/blob/v5.8.1/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py

2: https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py

3: https://github.com/huggingface/transformers/blob/24a4dc22/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py

4: huggingface/transformers@357f414

5: Variable name mismatch on MixtralExperts.forward() huggingface/transformers#41984

6: https://docs.vllm.ai/en/v0.11.2/models/supported_models/

7: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/moe.py

8: Uneven GPU memory usage and CUDA OOM during multi-GPU inference with Qwen3-VL-30B-A3B-Thinking huggingface/transformers#41570

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant implementation and test around the reported lines. git ls-files modelopt/torch/quantization/plugins/huggingface.py tests | sed -n '1,200p' echo '--- huggingface.py around reported lines ---' sed -n '900,970p' modelopt/torch/quantization/plugins/huggingface.py echo '--- search for Qwen3VLMoeTextSparseMoeBlock and experts call sites ---' rg -n "Qwen3VLMoeTextSparseMoeBlock|self\.experts\(|top_k_index|top_k_weights|routing_weights|router_indices" modelopt tests -g '!**/__pycache__/**' echo '--- test file cited in review comment ---' sed -n '1,140p' tests/unit/torch/quantization/plugins/test_qwen3_vl_moe_experts.py

Repository: NVIDIA/Model-Optimizer

Length of output: 22730

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- huggingface.py around the second forward wrapper ---' sed -n '1500,1545p' modelopt/torch/quantization/plugins/huggingface.py echo '--- expert-removal hook call sites ---' sed -n '360,410p' modelopt/torch/prune/importance_hooks/expert_removal_hooks.py echo '--- any test file for qwen3 vl moe experts ---' fd -a 'test_qwen3*' tests modelopt || true fd -a '*qwen3*experts*' tests modelopt || true

Repository: NVIDIA/Model-Optimizer

Length of output: 4698

🏁 Script executed:

#!/bin/bash set -euo pipefail echo '--- class header and surrounding methods for the reported wrapper ---' sed -n '860,930p' modelopt/torch/quantization/plugins/huggingface.py echo '--- test file for qwen3 vl moe experts ---' sed -n '1,140p' tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5443

Align the Qwen3-VL-MoE wrapper signature with HF’s call order

modelopt/torch/quantization/plugins/huggingface.py: Qwen3VLMoeTextExperts is called positionally as experts(hidden_states, top_k_index, top_k_weights), but this wrapper declares forward(self, hidden_states, routing_weights, router_indices). That swaps the tensors on the production path and breaks the expert dispatch. The parity test uses the same swapped order, so it doesn’t catch the mismatch.

Align the signature with HF

def forward( self, hidden_states: torch.Tensor, - routing_weights: torch.Tensor, router_indices: torch.Tensor, + routing_weights: torch.Tensor, ) -> torch.Tensor:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

orig_shape = hidden_states.shape

hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))

hidden_states = hidden_states.reshape(-1, hidden_dim)

next_states = torch.zeros_like(hidden_states)

with torch.no_grad():

expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts)

expert_mask = expert_mask.permute(2, 1, 0)

expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()

for expert_idx in expert_hit:

expert_idx = expert_idx[0]

with torch.no_grad():

_, token_idx = torch.where(expert_mask[expert_idx[0]])

top_k_pos, token_idx = torch.where(expert_mask[expert_idx])

current_state = hidden_states[token_idx]

gate = self.gate_proj[expert_idx](current_state)

up = self.up_proj[expert_idx](current_state)

gated_output = up * self.act_fn(gate)

out = self.down_proj[expert_idx](gated_output)

weighted_output = out * routing_weights[token_idx, expert_idx, None]

weighted_output = out * routing_weights[token_idx, top_k_pos, None]

next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

next_states = next_states.view(batch_size, -1, self.hidden_size)

return next_states

return next_states.view(orig_shape)

def forward(

self,

hidden_states: torch.Tensor,

router_indices: torch.Tensor,

routing_weights: torch.Tensor,

) -> torch.Tensor:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 929 - 948, The Qwen3VLMoeTextExperts forward path is using the wrong positional argument order, which swaps routing_weights and router_indices on the production call. Update the Qwen3VLMoeTextExperts.forward signature and its internal tensor usage to match HF’s call order from experts(hidden_states, top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and weighted_output logic to read the correct tensors. Also update the parity test to use the same HF-aligned argument order so it validates the real dispatch path.

coderabbitai · 2026-07-02T02:53:05Z

+    expected = original(hidden_states, top_k_index, top_k_weights)
+    actual = wrapped(hidden_states, top_k_weights, top_k_index)


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Test passes different argument orders to original vs wrapped, masking the production signature mismatch.

original is called (hidden_states, top_k_index, top_k_weights) while wrapped is called (hidden_states, top_k_weights, top_k_index). This only "matches" because the wrapper's forward signature is reversed vs HF (see the critical comment in huggingface.py Lines 929-948). As written, the test does not exercise how the parent MoE block actually calls the experts module, so it cannot protect against the real defect. After fixing the wrapper signature, both calls should use the same (HF) order.

💚 Call both with HF's argument order

- expected = original(hidden_states, top_k_index, top_k_weights) - actual = wrapped(hidden_states, top_k_weights, top_k_index) + expected = original(hidden_states, top_k_index, top_k_weights) + actual = wrapped(hidden_states, top_k_index, top_k_weights)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

expected = original(hidden_states, top_k_index, top_k_weights)

actual = wrapped(hidden_states, top_k_weights, top_k_index)

expected = original(hidden_states, top_k_index, top_k_weights)

actual = wrapped(hidden_states, top_k_index, top_k_weights)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 66 - 67, The test in qwen3_vl_moe_experts.py is masking the signature mismatch by calling original and wrapped with different argument orders. Update the assertions so both use the same HF order for hidden_states, top_k_index, and top_k_weights, and verify the wrapper via the expert module path that the parent MoE block uses. This should align with the wrapped forward implementation in huggingface.py and catch any future signature regressions.

harshtripathi6 · 2026-07-02T03:00:29Z

CodeRabbit reported that it does not have sufficient GitHub permissions to request changes on this PR. I do not have admin access to update the app permissions for this repository, so leaving this for maintainers.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py (2)> 16-17: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Duplicate license header.
Lines 1-14 already carry the full NVIDIA Apache-2.0 header; the extra SPDX-FileCopyrightText/SPDX-License-Identifier pair on Lines 16-17 is redundant and may cause the insert-license pre-commit hook to flag the file. Remove the duplicate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 16 -
17, The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
32-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Use distinct dimensions so the parity test can catch layout/transpose regressions.
With hidden_size=8 and moe_intermediate_size=4, the fused gate_up_proj has shape (num_experts, 2*expert_dim=8, hidden_dim=8) — a square inner block. A transpose or wrong split dimension (the exact class of bug this PR fixes) would still produce shape-compatible tensors and pass, so the test provides weak regression protection. Pick dims where 2*moe_intermediate_size != hidden_size (e.g. moe_intermediate_size = 6).

♻️ Make dimensions asymmetric
 class TinyQwen3VLMoeTextConfig:
     num_experts = 2
     hidden_size = 8
-    moe_intermediate_size = 4
+    moe_intermediate_size = 6
     hidden_act = "silu"
     _experts_implementation = "eager"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 32 -
37, The parity test setup in TinyQwen3VLMoeTextConfig uses symmetric dimensions,
which can hide transpose/layout bugs in the fused gate_up_proj path. Change the
test config to use asymmetric sizes in the Qwen3 VL MoE quantization test so
2*moe_intermediate_size does not equal hidden_size, and keep the same config
class/reference points so the regression coverage still exercises the eager
expert implementation.
modelopt/torch/quantization/plugins/huggingface.py (1)> 887-915: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Weight slicing is correct for the current HF layout; persist hidden_dim to avoid a divergent fallback in forward.
The layout handling here is right: HF stores gate_up_proj as (num_experts, 2*expert_dim, hidden_dim) and down_proj as (num_experts, hidden_dim, expert_dim), so shape[-1]/shape[1]//2 and the [:expert_dim]/[expert_dim:] row slices map cleanly onto the nn.Linear weight shapes.
However, _setup resolves hidden_dim with a 3-level fallback (hidden_dim → hidden_size → gate_up_proj.shape[-1]), while forward (Line 930) uses only a 2-level fallback and gate_up_proj is already deleted. If a module exposes neither attribute, forward gets hidden_dim=None and reshape(-1, None) crashes. Store the resolved value once and reuse it.

♻️ Persist the resolved hidden dimension
         if hidden_dim is None:
             hidden_dim = self.gate_up_proj.shape[-1]
+        self.hidden_dim = hidden_dim
And in forward (Line 930-931):
-        hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
-        hidden_states = hidden_states.reshape(-1, hidden_dim)
+        hidden_states = hidden_states.reshape(-1, self.hidden_dim)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 887 - 915,
Persist the resolved hidden dimension from the setup logic in the MoE plugin so
`forward` does not depend on a different fallback path after `gate_up_proj` is
deleted. Update the `_setup`/initialization flow in `huggingface.py` to store
the computed `hidden_dim` once on the module and have `forward` reuse that saved
value instead of re-deriving it from `hidden_size` or `gate_up_proj`. This
should be applied around the existing `hidden_dim` resolution and the `forward`
reshape path for the same class handling `gate_up_proj` and `down_proj`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 929-948: The Qwen3VLMoeTextExperts forward path is using the wrong
positional argument order, which swaps routing_weights and router_indices on the
production call. Update the Qwen3VLMoeTextExperts.forward signature and its
internal tensor usage to match HF’s call order from experts(hidden_states,
top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and
weighted_output logic to read the correct tensors. Also update the parity test
to use the same HF-aligned argument order so it validates the real dispatch
path.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 66-67: The test in qwen3_vl_moe_experts.py is masking the
signature mismatch by calling original and wrapped with different argument
orders. Update the assertions so both use the same HF order for hidden_states,
top_k_index, and top_k_weights, and verify the wrapper via the expert module
path that the parent MoE block uses. This should align with the wrapped forward
implementation in huggingface.py and catch any future signature regressions.

---

Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 887-915: Persist the resolved hidden dimension from the setup
logic in the MoE plugin so `forward` does not depend on a different fallback
path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in
`huggingface.py` to store the computed `hidden_dim` once on the module and have
`forward` reuse that saved value instead of re-deriving it from `hidden_size` or
`gate_up_proj`. This should be applied around the existing `hidden_dim`
resolution and the `forward` reshape path for the same class handling
`gate_up_proj` and `down_proj`.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 16-17: The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
- Around line 32-37: The parity test setup in TinyQwen3VLMoeTextConfig uses
symmetric dimensions, which can hide transpose/layout bugs in the fused
gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3
VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size,
and keep the same config class/reference points so the regression coverage still
exercises the eager expert implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)

Create a new PR with the fixes

ℹ️ Review info

fix: handle Qwen3-VL-MoE expert dimensions

86512a9

Signed-off-by: harshtripathi6 <htripathi6@gmail.com>

harshtripathi6 requested review from a team as code owners July 2, 2026 02:42

harshtripathi6 requested a review from kaix-nv July 2, 2026 02:42

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: handle Qwen3-VL-MoE expert dimensions#1885

fix: handle Qwen3-VL-MoE expert dimensions#1885
harshtripathi6 wants to merge 1 commit into
NVIDIA:mainfrom
harshtripathi6:fix/qwen3-vl-moe-hidden-size

harshtripathi6 commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

harshtripathi6 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		expected = original(hidden_states, top_k_index, top_k_weights)
		actual = wrapped(hidden_states, top_k_weights, top_k_index)

Uh oh!

Conversation

harshtripathi6 commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

harshtripathi6 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

harshtripathi6 commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading