Skip to content

fix: handle Qwen3-VL-MoE expert dimensions#1885

Open
harshtripathi6 wants to merge 1 commit into
NVIDIA:mainfrom
harshtripathi6:fix/qwen3-vl-moe-hidden-size
Open

fix: handle Qwen3-VL-MoE expert dimensions#1885
harshtripathi6 wants to merge 1 commit into
NVIDIA:mainfrom
harshtripathi6:fix/qwen3-vl-moe-hidden-size

Conversation

@harshtripathi6

@harshtripathi6 harshtripathi6 commented Jul 2, 2026

Copy link
Copy Markdown

What does this PR do?

Type of change: Bug fix

Fixes Qwen3-VL-MoE expert quantization wrapper handling for Hugging Face Qwen3VLMoeTextExperts.

The wrapper previously assumed the expert module exposed hidden_size, but the current Hugging Face implementation exposes hidden_dim. It also split gate_up_proj along the wrong dimension and transposed fused expert weights incorrectly for the current HF layout. This caused AttributeError: hidden_size and could later lead to matmul shape mismatches during expert forward.

This PR:

  • Resolves the hidden dimension from hidden_dim, hidden_size, or the fused weight shape.
  • Splits gate_up_proj along the fused gate/up output dimension.
  • Copies down_proj in the layout expected by nn.Linear.
  • Uses top_k_pos instead of expert id when indexing routing weights.
  • Preserves the original hidden state shape in the wrapper output.
  • Adds a small offline CPU unit test comparing the quant wrapper output against the HF expert forward.

Usage

No user-facing API changes.

This fix is exercised internally when ModelOpt wraps Hugging Face Qwen3VLMoeTextExperts during quantization.

Testing

pytest tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py -q
pytest tests/unit/torch/quantization -q
pre-commit run --files modelopt/torch/quantization/plugins/huggingface.py tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

Before your PR is "Ready for review"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (git commit -s -S).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Additional Information

Fixes #1720.

Summary by CodeRabbit

  • Bug Fixes

    • Improved quantized routing for Mixture-of-Experts text layers so outputs now match the expected model behavior more closely.
    • Fixed tensor reshaping and expert weight handling to produce consistent results across small and larger routing setups.
  • Tests

    • Added coverage to verify the quantized wrapper matches the reference model’s outputs and shapes for a representative MoE configuration.

Signed-off-by: harshtripathi6 <htripathi6@gmail.com>
@harshtripathi6 harshtripathi6 requested review from a team as code owners July 2, 2026 02:42
@harshtripathi6 harshtripathi6 requested a review from kaix-nv July 2, 2026 02:42
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Reworks _QuantQwen3VLMoeTextExperts in the Hugging Face quantization plugin to derive dimensions from hidden_dim/gate_up_proj shape instead of hidden_size, changes weight slice copying to avoid transpose assumptions, and fixes forward routing to index routing_weights by token and top-k position. Adds a unit test comparing outputs against Hugging Face's implementation.

Changes

Qwen3-VL MoE experts quantization fix

Layer / File(s) Summary
Expert weight setup and dimension inference
modelopt/torch/quantization/plugins/huggingface.py
_setup derives hidden_dim/expert_dim from hidden_dim/hidden_size and gate_up_proj shape, builds gate/up/down Linear layers using hidden_dim, and copies weight slices without transpose-based indexing.
Forward routing correction
modelopt/torch/quantization/plugins/huggingface.py
forward reshapes tensors using hidden_dim, accumulates outputs in a flat buffer, and indexes routing_weights by (token_idx, top_k_pos) instead of an expert-based dimension, then reshapes back to orig_shape.
Parity unit test
tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py
Adds TinyQwen3VLMoeTextConfig, _convert_to_quant_wrapper, and a skip-guarded test that seeds RNG, initializes expert weights, and asserts the quant wrapper's outputs match Hugging Face's Qwen3VLMoeTextExperts on fixed inputs.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Test as Unit Test
  participant QuantExperts as _QuantQwen3VLMoeTextExperts
  participant HFExperts as Qwen3VLMoeTextExperts

  Test->>QuantExperts: _convert_to_quant_wrapper + _setup()
  Test->>HFExperts: forward(hidden_states, top_k_index, top_k_weights)
  Test->>QuantExperts: forward(hidden_states, top_k_index, top_k_weights)
  QuantExperts->>QuantExperts: reshape using hidden_dim
  QuantExperts->>QuantExperts: route by (token_idx, top_k_pos)
  QuantExperts-->>Test: output reshaped to orig_shape
  HFExperts-->>Test: output
  Test->>Test: assert outputs match
Loading

Suggested reviewers: shengliangxu, jingyu-ml, kevalmorabia97

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title concisely describes the main fix: Qwen3-VL-MoE expert dimension handling.
Linked Issues check ✅ Passed The changes fix hidden_dim handling, correct fused gate/up routing, and add a matching HF parity test for #1720.
Out of Scope Changes check ✅ Passed The added unit test and wrapper fixes stay within the Qwen3-VL-MoE quantization bug fix scope.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed The Qwen3-MoE wrapper/test changes add no unsafe load, remote-code, eval/exec, or nosec patterns.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py (2)

16-17: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate license header.

Lines 1-14 already carry the full NVIDIA Apache-2.0 header; the extra SPDX-FileCopyrightText/SPDX-License-Identifier pair on Lines 16-17 is redundant and may cause the insert-license pre-commit hook to flag the file. Remove the duplicate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 16 -
17, The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.

32-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use distinct dimensions so the parity test can catch layout/transpose regressions.

With hidden_size=8 and moe_intermediate_size=4, the fused gate_up_proj has shape (num_experts, 2*expert_dim=8, hidden_dim=8) — a square inner block. A transpose or wrong split dimension (the exact class of bug this PR fixes) would still produce shape-compatible tensors and pass, so the test provides weak regression protection. Pick dims where 2*moe_intermediate_size != hidden_size (e.g. moe_intermediate_size = 6).

♻️ Make dimensions asymmetric
 class TinyQwen3VLMoeTextConfig:
     num_experts = 2
     hidden_size = 8
-    moe_intermediate_size = 4
+    moe_intermediate_size = 6
     hidden_act = "silu"
     _experts_implementation = "eager"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 32 -
37, The parity test setup in TinyQwen3VLMoeTextConfig uses symmetric dimensions,
which can hide transpose/layout bugs in the fused gate_up_proj path. Change the
test config to use asymmetric sizes in the Qwen3 VL MoE quantization test so
2*moe_intermediate_size does not equal hidden_size, and keep the same config
class/reference points so the regression coverage still exercises the eager
expert implementation.
modelopt/torch/quantization/plugins/huggingface.py (1)

887-915: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Weight slicing is correct for the current HF layout; persist hidden_dim to avoid a divergent fallback in forward.

The layout handling here is right: HF stores gate_up_proj as (num_experts, 2*expert_dim, hidden_dim) and down_proj as (num_experts, hidden_dim, expert_dim), so shape[-1]/shape[1]//2 and the [:expert_dim]/[expert_dim:] row slices map cleanly onto the nn.Linear weight shapes.

However, _setup resolves hidden_dim with a 3-level fallback (hidden_dimhidden_sizegate_up_proj.shape[-1]), while forward (Line 930) uses only a 2-level fallback and gate_up_proj is already deleted. If a module exposes neither attribute, forward gets hidden_dim=None and reshape(-1, None) crashes. Store the resolved value once and reuse it.

♻️ Persist the resolved hidden dimension
         if hidden_dim is None:
             hidden_dim = self.gate_up_proj.shape[-1]
+        self.hidden_dim = hidden_dim

And in forward (Line 930-931):

-        hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
-        hidden_states = hidden_states.reshape(-1, hidden_dim)
+        hidden_states = hidden_states.reshape(-1, self.hidden_dim)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 887 - 915,
Persist the resolved hidden dimension from the setup logic in the MoE plugin so
`forward` does not depend on a different fallback path after `gate_up_proj` is
deleted. Update the `_setup`/initialization flow in `huggingface.py` to store
the computed `hidden_dim` once on the module and have `forward` reuse that saved
value instead of re-deriving it from `hidden_size` or `gate_up_proj`. This
should be applied around the existing `hidden_dim` resolution and the `forward`
reshape path for the same class handling `gate_up_proj` and `down_proj`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 929-948: The Qwen3VLMoeTextExperts forward path is using the wrong
positional argument order, which swaps routing_weights and router_indices on the
production call. Update the Qwen3VLMoeTextExperts.forward signature and its
internal tensor usage to match HF’s call order from experts(hidden_states,
top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and
weighted_output logic to read the correct tensors. Also update the parity test
to use the same HF-aligned argument order so it validates the real dispatch
path.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 66-67: The test in qwen3_vl_moe_experts.py is masking the
signature mismatch by calling original and wrapped with different argument
orders. Update the assertions so both use the same HF order for hidden_states,
top_k_index, and top_k_weights, and verify the wrapper via the expert module
path that the parent MoE block uses. This should align with the wrapped forward
implementation in huggingface.py and catch any future signature regressions.

---

Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 887-915: Persist the resolved hidden dimension from the setup
logic in the MoE plugin so `forward` does not depend on a different fallback
path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in
`huggingface.py` to store the computed `hidden_dim` once on the module and have
`forward` reuse that saved value instead of re-deriving it from `hidden_size` or
`gate_up_proj`. This should be applied around the existing `hidden_dim`
resolution and the `forward` reshape path for the same class handling
`gate_up_proj` and `down_proj`.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 16-17: The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
- Around line 32-37: The parity test setup in TinyQwen3VLMoeTextConfig uses
symmetric dimensions, which can hide transpose/layout bugs in the fused
gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3
VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size,
and keep the same config class/reference points so the regression coverage still
exercises the eager expert implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 13833783-0784-4c47-82ce-a9e27324bdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and 86512a9.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

Comment on lines +929 to +948
orig_shape = hidden_states.shape
hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
hidden_states = hidden_states.reshape(-1, hidden_dim)
next_states = torch.zeros_like(hidden_states)
with torch.no_grad():
expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts)
expert_mask = expert_mask.permute(2, 1, 0)
expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
for expert_idx in expert_hit:
expert_idx = expert_idx[0]
with torch.no_grad():
_, token_idx = torch.where(expert_mask[expert_idx[0]])
top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
current_state = hidden_states[token_idx]
gate = self.gate_proj[expert_idx](current_state)
up = self.up_proj[expert_idx](current_state)
gated_output = up * self.act_fn(gate)
out = self.down_proj[expert_idx](gated_output)
weighted_output = out * routing_weights[token_idx, expert_idx, None]
weighted_output = out * routing_weights[token_idx, top_k_pos, None]
next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
next_states = next_states.view(batch_size, -1, self.hidden_size)

return next_states
return next_states.view(orig_shape)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Qwen3VLMoeTextExperts forward signature top_k_index top_k_weights transformers 4.56

💡 Result:

In the Hugging Face Transformers library, specifically within the Qwen3-VL-Moe model implementation, the Qwen3VLMoeTextExperts.forward method is designed to process token hidden states through selected Mixture-of-Experts (MoE) layers [1][2][3]. The signature of this forward method is: def forward( self, hidden_states: torch.Tensor, top_k_index: torch.Tensor, top_k_weights: torch.Tensor,) -> torch.Tensor: Parameters: - hidden_states: The input tensor containing the hidden representations of the tokens [1][2]. - top_k_index: A tensor specifying the indices of the experts selected for each token (typically derived from the model's router/gate) [1][2][3]. - top_k_weights: A tensor containing the routing weights corresponding to the selected experts, used to scale the output of each expert before aggregation [1][2][3]. The method computes expert outputs, often by iterating over the active experts identified by top_k_index, applying the expert's gate/up/down projections, and then accumulating the weighted results back into the final hidden state using index_add_ [1][2][3]. Regarding transformers 4.56: This version was part of the development cycle where these MoE implementations were actively refined [4]. The variable names top_k_index and top_k_weights became the standard naming convention in Transformers MoE modules to ensure consistency across different model architectures (e.g., Qwen3, Mixtral) [5][6]. Users encountering memory issues or integration needs with this method often reference these specific parameters when troubleshooting or customizing expert execution [7][8].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant implementation and test around the reported lines.
git ls-files modelopt/torch/quantization/plugins/huggingface.py tests | sed -n '1,200p'
echo '--- huggingface.py around reported lines ---'
sed -n '900,970p' modelopt/torch/quantization/plugins/huggingface.py
echo '--- search for Qwen3VLMoeTextSparseMoeBlock and experts call sites ---'
rg -n "Qwen3VLMoeTextSparseMoeBlock|self\.experts\(|top_k_index|top_k_weights|routing_weights|router_indices" modelopt tests -g '!**/__pycache__/**'
echo '--- test file cited in review comment ---'
sed -n '1,140p' tests/unit/torch/quantization/plugins/test_qwen3_vl_moe_experts.py

Repository: NVIDIA/Model-Optimizer

Length of output: 22730


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- huggingface.py around the second forward wrapper ---'
sed -n '1500,1545p' modelopt/torch/quantization/plugins/huggingface.py

echo '--- expert-removal hook call sites ---'
sed -n '360,410p' modelopt/torch/prune/importance_hooks/expert_removal_hooks.py

echo '--- any test file for qwen3 vl moe experts ---'
fd -a 'test_qwen3*' tests modelopt || true
fd -a '*qwen3*experts*' tests modelopt || true

Repository: NVIDIA/Model-Optimizer

Length of output: 4698


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- class header and surrounding methods for the reported wrapper ---'
sed -n '860,930p' modelopt/torch/quantization/plugins/huggingface.py

echo '--- test file for qwen3 vl moe experts ---'
sed -n '1,140p' tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5443


Align the Qwen3-VL-MoE wrapper signature with HF’s call order

modelopt/torch/quantization/plugins/huggingface.py: Qwen3VLMoeTextExperts is called positionally as experts(hidden_states, top_k_index, top_k_weights), but this wrapper declares forward(self, hidden_states, routing_weights, router_indices). That swaps the tensors on the production path and breaks the expert dispatch. The parity test uses the same swapped order, so it doesn’t catch the mismatch.

Align the signature with HF
     def forward(
         self,
         hidden_states: torch.Tensor,
-        routing_weights: torch.Tensor,
         router_indices: torch.Tensor,
+        routing_weights: torch.Tensor,
     ) -> torch.Tensor:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
orig_shape = hidden_states.shape
hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
hidden_states = hidden_states.reshape(-1, hidden_dim)
next_states = torch.zeros_like(hidden_states)
with torch.no_grad():
expert_mask = torch.nn.functional.one_hot(router_indices, num_classes=self.num_experts)
expert_mask = expert_mask.permute(2, 1, 0)
expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
for expert_idx in expert_hit:
expert_idx = expert_idx[0]
with torch.no_grad():
_, token_idx = torch.where(expert_mask[expert_idx[0]])
top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
current_state = hidden_states[token_idx]
gate = self.gate_proj[expert_idx](current_state)
up = self.up_proj[expert_idx](current_state)
gated_output = up * self.act_fn(gate)
out = self.down_proj[expert_idx](gated_output)
weighted_output = out * routing_weights[token_idx, expert_idx, None]
weighted_output = out * routing_weights[token_idx, top_k_pos, None]
next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
next_states = next_states.view(batch_size, -1, self.hidden_size)
return next_states
return next_states.view(orig_shape)
def forward(
self,
hidden_states: torch.Tensor,
router_indices: torch.Tensor,
routing_weights: torch.Tensor,
) -> torch.Tensor:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 929 - 948,
The Qwen3VLMoeTextExperts forward path is using the wrong positional argument
order, which swaps routing_weights and router_indices on the production call.
Update the Qwen3VLMoeTextExperts.forward signature and its internal tensor usage
to match HF’s call order from experts(hidden_states, top_k_index,
top_k_weights), then adjust the expert_mask, expert_hit, and weighted_output
logic to read the correct tensors. Also update the parity test to use the same
HF-aligned argument order so it validates the real dispatch path.

Comment on lines +66 to +67
expected = original(hidden_states, top_k_index, top_k_weights)
actual = wrapped(hidden_states, top_k_weights, top_k_index)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Test passes different argument orders to original vs wrapped, masking the production signature mismatch.

original is called (hidden_states, top_k_index, top_k_weights) while wrapped is called (hidden_states, top_k_weights, top_k_index). This only "matches" because the wrapper's forward signature is reversed vs HF (see the critical comment in huggingface.py Lines 929-948). As written, the test does not exercise how the parent MoE block actually calls the experts module, so it cannot protect against the real defect. After fixing the wrapper signature, both calls should use the same (HF) order.

💚 Call both with HF's argument order
-    expected = original(hidden_states, top_k_index, top_k_weights)
-    actual = wrapped(hidden_states, top_k_weights, top_k_index)
+    expected = original(hidden_states, top_k_index, top_k_weights)
+    actual = wrapped(hidden_states, top_k_index, top_k_weights)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
expected = original(hidden_states, top_k_index, top_k_weights)
actual = wrapped(hidden_states, top_k_weights, top_k_index)
expected = original(hidden_states, top_k_index, top_k_weights)
actual = wrapped(hidden_states, top_k_index, top_k_weights)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 66 -
67, The test in qwen3_vl_moe_experts.py is masking the signature mismatch by
calling original and wrapped with different argument orders. Update the
assertions so both use the same HF order for hidden_states, top_k_index, and
top_k_weights, and verify the wrapper via the expert module path that the parent
MoE block uses. This should align with the wrapped forward implementation in
huggingface.py and catch any future signature regressions.

@harshtripathi6

Copy link
Copy Markdown
Author

CodeRabbit reported that it does not have sufficient GitHub permissions to request changes on this PR. I do not have admin access to update the app permissions for this repository, so leaving this for maintainers.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🧹 Nitpick comments (3)

tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py (2)> 16-17: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate license header.
Lines 1-14 already carry the full NVIDIA Apache-2.0 header; the extra SPDX-FileCopyrightText/SPDX-License-Identifier pair on Lines 16-17 is redundant and may cause the insert-license pre-commit hook to flag the file. Remove the duplicate.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 16 -
17, The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.

32-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Use distinct dimensions so the parity test can catch layout/transpose regressions.
With hidden_size=8 and moe_intermediate_size=4, the fused gate_up_proj has shape (num_experts, 2*expert_dim=8, hidden_dim=8) — a square inner block. A transpose or wrong split dimension (the exact class of bug this PR fixes) would still produce shape-compatible tensors and pass, so the test provides weak regression protection. Pick dims where 2*moe_intermediate_size != hidden_size (e.g. moe_intermediate_size = 6).

♻️ Make dimensions asymmetric

 class TinyQwen3VLMoeTextConfig:
     num_experts = 2
     hidden_size = 8
-    moe_intermediate_size = 4
+    moe_intermediate_size = 6
     hidden_act = "silu"
     _experts_implementation = "eager"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py` around lines 32 -
37, The parity test setup in TinyQwen3VLMoeTextConfig uses symmetric dimensions,
which can hide transpose/layout bugs in the fused gate_up_proj path. Change the
test config to use asymmetric sizes in the Qwen3 VL MoE quantization test so
2*moe_intermediate_size does not equal hidden_size, and keep the same config
class/reference points so the regression coverage still exercises the eager
expert implementation.

modelopt/torch/quantization/plugins/huggingface.py (1)> 887-915: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Weight slicing is correct for the current HF layout; persist hidden_dim to avoid a divergent fallback in forward.
The layout handling here is right: HF stores gate_up_proj as (num_experts, 2*expert_dim, hidden_dim) and down_proj as (num_experts, hidden_dim, expert_dim), so shape[-1]/shape[1]//2 and the [:expert_dim]/[expert_dim:] row slices map cleanly onto the nn.Linear weight shapes.
However, _setup resolves hidden_dim with a 3-level fallback (hidden_dimhidden_sizegate_up_proj.shape[-1]), while forward (Line 930) uses only a 2-level fallback and gate_up_proj is already deleted. If a module exposes neither attribute, forward gets hidden_dim=None and reshape(-1, None) crashes. Store the resolved value once and reuse it.

♻️ Persist the resolved hidden dimension

         if hidden_dim is None:
             hidden_dim = self.gate_up_proj.shape[-1]
+        self.hidden_dim = hidden_dim

And in forward (Line 930-931):

-        hidden_dim = getattr(self, "hidden_dim", getattr(self, "hidden_size", None))
-        hidden_states = hidden_states.reshape(-1, hidden_dim)
+        hidden_states = hidden_states.reshape(-1, self.hidden_dim)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 887 - 915,
Persist the resolved hidden dimension from the setup logic in the MoE plugin so
`forward` does not depend on a different fallback path after `gate_up_proj` is
deleted. Update the `_setup`/initialization flow in `huggingface.py` to store
the computed `hidden_dim` once on the module and have `forward` reuse that saved
value instead of re-deriving it from `hidden_size` or `gate_up_proj`. This
should be applied around the existing `hidden_dim` resolution and the `forward`
reshape path for the same class handling `gate_up_proj` and `down_proj`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 929-948: The Qwen3VLMoeTextExperts forward path is using the wrong
positional argument order, which swaps routing_weights and router_indices on the
production call. Update the Qwen3VLMoeTextExperts.forward signature and its
internal tensor usage to match HF’s call order from experts(hidden_states,
top_k_index, top_k_weights), then adjust the expert_mask, expert_hit, and
weighted_output logic to read the correct tensors. Also update the parity test
to use the same HF-aligned argument order so it validates the real dispatch
path.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 66-67: The test in qwen3_vl_moe_experts.py is masking the
signature mismatch by calling original and wrapped with different argument
orders. Update the assertions so both use the same HF order for hidden_states,
top_k_index, and top_k_weights, and verify the wrapper via the expert module
path that the parent MoE block uses. This should align with the wrapped forward
implementation in huggingface.py and catch any future signature regressions.

---

Nitpick comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 887-915: Persist the resolved hidden dimension from the setup
logic in the MoE plugin so `forward` does not depend on a different fallback
path after `gate_up_proj` is deleted. Update the `_setup`/initialization flow in
`huggingface.py` to store the computed `hidden_dim` once on the module and have
`forward` reuse that saved value instead of re-deriving it from `hidden_size` or
`gate_up_proj`. This should be applied around the existing `hidden_dim`
resolution and the `forward` reshape path for the same class handling
`gate_up_proj` and `down_proj`.

In `@tests/unit/torch/quantization/test_qwen3_vl_moe_experts.py`:
- Around line 16-17: The file has a duplicate SPDX license header, with an extra
SPDX-FileCopyrightText and SPDX-License-Identifier pair appended after the
existing full NVIDIA Apache-2.0 header. Remove the redundant license lines from
test_qwen3_vl_moe_experts.py so only the original header remains and the
insert-license hook won’t flag it.
- Around line 32-37: The parity test setup in TinyQwen3VLMoeTextConfig uses
symmetric dimensions, which can hide transpose/layout bugs in the fused
gate_up_proj path. Change the test config to use asymmetric sizes in the Qwen3
VL MoE quantization test so 2*moe_intermediate_size does not equal hidden_size,
and keep the same config class/reference points so the regression coverage still
exercises the eager expert implementation.

🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Qwen3-VL-30B-A3B NVFP4 quantization fails: hidden_size AttributeError in QuantQwen3VLMoeTextExperts

1 participant