Add TensorQuantizer rotate-back mode by realAsma · Pull Request #1879 · NVIDIA/Model-Optimizer

realAsma · 2026-07-01T20:47:47Z

Summary

add RotateConfig.mode with the existing rotate behavior as the default
support rotate={"enable": true, "mode": "rotate_back"} for rotate -> quantize -> rotate back in fake quantization
reject rotate-back mode for real quantization instead of silently skipping inverse rotation
add NVFP4_W4A4_ROTATE eval config with the config-level embed-token rotate workaround and TODO for the longer-term fix
add focused CPU coverage for config serialization, default rotation, rotate-back, and real-quant rejection

Latest Update

fixed a critical fakequant eval/export bug where fold_weight applied rotation into the stored weight, disabled the quantizer, but left rotation enabled, causing the folded weight to be rotated a second time on later use
fused the Hadamard rotation scale into the quantization path and removed the redundant autograd wrapper
added a config-level workaround for W4A4 rotate: *embed_tokens*quantizer has rotate explicitly disabled because the embedding token/input path creates the rotate issue; a TODO tracks replacing this with a better fix

Example

quant_cfg = {
    "quant_cfg": [
        {
            "quantizer_name": "*weight_quantizer",
            "cfg": {
                "num_bits": 4,
                "block_sizes": {-1: 128},
                "fake_quant": True,
                "rotate": {
                    "enable": True,
                    "mode": "rotate_back",
                },
            },
        },
    ],
    "algorithm": "max",
}

rotate_back selects rotate -> quantize -> rotate-back. The existing behavior stays as rotate=True or rotate={"enable": True, "mode": "rotate"}. rotate_back is supported for fake quant only; real quantization raises clearly instead of silently skipping the inverse rotation.

Validation

pre-commit run --files modelopt/torch/quantization/config.py modelopt/torch/quantization/nn/modules/tensor_quantizer.py tests/unit/torch/quantization/test_tensor_quant_cpu.py
pytest_pwd tests/unit/torch/quantization/test_tensor_quant_cpu.py -q --tb=short -x
python_pwd -m py_compile examples/llm_eval/quantization_utils.py
pre-commit run --files examples/llm_eval/quantization_utils.py

Eval Results

Qwen/Qwen3-8B MMLU eval from examples/llm_eval/lm_eval_hf.py using lm-eval==0.4.10, mmlu, 5-shot, batch_size=4, calib_size=512, and calib_batch_size=4 for quantized runs.

Configuration	Quant config	MMLU acc	Stderr	Runtime
BF16	-	0.7488249537102977	0.0034768059948852644	28m 42s
FP4 W/A	`NVFP4_DEFAULT_CFG`	0.7097	0.0036	-
FP4 W/A rotate, embed-token rotate disabled	`NVFP4_W4A4_ROTATE`	0.6916	0.0037	-
FP4 W/A rotate + rotate-back, embed-token rotate disabled	`NVFP4_W4A4_ROTATE_ROTATE_BACK`	0.6945	0.0036	-

The previous near-random W4A4 rotate diagnostics were removed from this table. The corrected W4A4 rotate result is 0.6916 +/- 0.0037; split checks isolated the issue to embed_tokens retaining rotate in the token/input path.

Full run logs, status files, GPU snapshots, and JSON outputs are retained in the private BeeBot artifact bundles for the Qwen3-8B eval/debug runs.

Review

SE implementation: complete
SR review: clean on iteration 2
ST validation: pass on iteration 2

Summary by CodeRabbit

New Features
- Added support for input rotation modes, including a new rotate-back flow for fake quantization.
- Improved weight folding so rotation is baked into weights and won’t be reapplied on later passes.
- Updated quantizer display text to better reflect rotation and disabled states.
Bug Fixes
- Fixed rotation handling when quantizers are disabled or partially updated.
- Improved consistency and validation for Hadamard transform behavior across CPU and GPU paths.

copy-pr-bot · 2026-07-01T20:47:51Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-07-01T20:47:54Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds a mode field ("rotate"/"rotate_back") to RotateConfig, updates config validation to skip nested rotate dicts, reworks normalized_hadamard_transform's block-size handling, and extends TensorQuantizer with rotate-back inverse rotation support, a disable_rotate() method, and a shared weight-folding helper used across export and plugin paths, with corresponding tests.

Changes

Rotate-back mode implementation

Layer / File(s)	Summary
Config and Hadamard support `modelopt/torch/quantization/config.py`, `modelopt/torch/quantization/nn/functional.py`, `tests/gpu/torch/quantization/test_hadamard.py`, `CHANGELOG.rst`	`RotateConfig` adds a `mode` field, recursive config validation skips the rotate subtree, `normalized_hadamard_transform` reworks block-size selection/scaling, and Hadamard tests assert roundtrip reconstruction.
TensorQuantizer rotate-back path `modelopt/torch/quantization/nn/modules/tensor_quantizer.py`, `tests/unit/torch/quantization/test_print.py`, `tests/unit/torch/quantization/test_quant_embedding.py`, `tests/unit/torch/quantization/test_tensor_quant_cpu.py`	Adds `rotate_back_is_enabled`, `disable_rotate()`, `_rotate_inputs()`; `forward()` gates rotate-back with a `fake_quant` guard and applies inverse rotation; `extra_repr()` and `SequentialQuantizer` delegation updated; extensive unit tests added.
Weight folding helpers and export wiring `modelopt/torch/quantization/nn/modules/quant_module.py`, `modelopt/torch/quantization/plugins/huggingface.py`, `modelopt/torch/quantization/plugins/vllm.py`, `modelopt/torch/export/plugins/vllm_fakequant_hf.py`, `modelopt/torch/quantization/model_quant.py`, `tests/unit/torch/quantization/test_tensor_quant_cpu.py`	Shared `_fold_weight_quantizer` helper folds and disables rotation for weights; huggingface/vllm plugins and vLLM fakequant export delegate to quantizer disable methods; `fold_weight` docstring updated; tests validate the fold path.
Fused-experts plugin tests `tests/unit/torch/quantization/plugins/test_fused_experts.py`	Adds tests verifying `fold_weight()` disables per-expert quantizers and rotation, with amax retention controlled by `keep_attrs`.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant TensorQuantizer
  participant normalized_hadamard_transform

  Caller->>TensorQuantizer: forward(inputs)
  alt rotate_back_is_enabled and fake_quant is false
    TensorQuantizer-->>Caller: raise ValueError
  else fake_quant is true
    TensorQuantizer->>normalized_hadamard_transform: _rotate_inputs(inputs)
    TensorQuantizer->>TensorQuantizer: quantize(inputs)
    opt rotate_back_is_enabled
      TensorQuantizer->>normalized_hadamard_transform: _rotate_inputs(outputs)
    end
    TensorQuantizer-->>Caller: return outputs
  end

sequenceDiagram
  participant QuantModule
  participant TensorQuantizer
  participant ExportFlow

  QuantModule->>TensorQuantizer: _fold_weight_quantizer(weights, keep_attrs)
  TensorQuantizer->>TensorQuantizer: fold weight values into tensors
  TensorQuantizer->>TensorQuantizer: disable_rotate()
  TensorQuantizer->>TensorQuantizer: disable()
  ExportFlow->>TensorQuantizer: disable_rotate() during export

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding rotate-back support to TensorQuantizer.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No new torch.load/allow_pickle/trust_remote_code/eval/exec/nosec/dependency issues appear in the PR diff.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch asma/tensorquantizer-rotation-modes

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-07-01T20:52:12Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1879/
Built to branch `gh-pages` at 2026-07-04 13:05 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-07-01T20:58:20Z

Codecov Report

❌ Patch coverage is 81.69014% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.30%. Comparing base (75b5803) to head (f5451f2).

Files with missing lines	Patch %	Lines
modelopt/torch/quantization/nn/functional.py	0.00%	7 Missing ⚠️
modelopt/torch/export/plugins/vllm_fakequant_hf.py	0.00%	4 Missing ⚠️
modelopt/torch/quantization/plugins/vllm.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1879      +/-   ##
==========================================
+ Coverage   61.21%   61.30%   +0.08%     
==========================================
  Files         515      515              
  Lines       57245    57243       -2     
==========================================
+ Hits        35043    35092      +49     
+ Misses      22202    22151      -51

Flag	Coverage Δ
examples	`32.63% <50.70%> (-0.24%)`	⬇️
gpu	`20.55% <50.70%> (+0.03%)`	⬆️
regression	`14.83% <23.94%> (+0.07%)`	⬆️
unit	`55.02% <81.69%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 1110-1111: The rotate_back validation in TensorQuantizer should
happen before the first rotation path is entered, since _rotate_inputs() can
invoke normalized_hadamard_transform() and fail with lower-level errors first.
Move the unsupported rotate_back check in the TensorQuantizer flow ahead of the
initial rotation branch so the intended ValueError is raised consistently, and
update the related fake_quant=False path and test coverage to assert the early
rejection via the rotate_back/config validation rather than relying on a stubbed
transform.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f9377bc-7e6a-4144-8fd9-109d398c314f

📥 Commits

Reviewing files that changed from the base of the PR and between fbbc598 and 3c6b394.

📒 Files selected for processing (3)

modelopt/torch/quantization/config.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
tests/unit/torch/quantization/test_tensor_quant_cpu.py

cjluo-nv

Bot review — DM the bot to share feedback.

Small, cohesive feature (+106/-9, 3 files) adding a rotate_back mode to TensorQuantizer (rotate → quantize → rotate-back for fake quant), with a clear rejection for real quantization. Code is clean and reuses existing machinery: a new _rotate_inputs helper deduplicates the two rotation call-sites, and rotate_back_is_enabled mirrors the existing rotate_is_* property pattern (dict/RotateConfig/bool backward-compat handled consistently). The field_name == "rotate" short-circuit in validate_config correctly stops the "only {'enable': True}" check from mis-firing on rotate={"enable": True} (a latent bug), and RotateConfig's own validators still cover the sub-fields.

CPU test coverage is solid: config serialization round-trip, default-path (transform called once), rotate-back path (transform called twice with matching fp32/block_size args), and the real-quant ValueError. Tests monkeypatch normalized_hadamard_transform so they don't need the fast_hadamard_transform package. No licensing concerns, no oversized diff, and no prompt-injection content in the PR fields.

Why nudge rather than approve:

Correctness rests on the Hadamard transform being self-inverse. rotate_back applies the same normalized_hadamard_transform a second time rather than an explicit inverse; this is only correct because the normalized FHT is orthogonal/symmetric (HᵀH = I). That holds for the full-dim and fixed-block paths, but the auto-selected block-size path (block_size=None → _largest_pow2_divisor(dim)) picks the block size from the tensor shape at call time — worth a human confirming the forward and rotate-back calls always resolve to the same block size so the round-trip is truly identity. The CPU tests stub out the transform, so this numerical property is not actually exercised (GPU-only).
CHANGELOG not updated. This is a user-facing QuantizerAttributeConfig.rotate.mode addition; the 0.46 "New Features" section has no entry for it.

realAsma

Simplify all the fold_weight doctoring -> no need of mentioning rotate functionality in particular - it is a Lower level details obvious in the code.

realAsma

There are a lot of unittests; can we simplify or consolidate any of them?

are there any low level tests which are already covered by high-level tests - if yes should we remove them?

We can keep the unittests that are essential for testing the correctness or expected behavior

realAsma · 2026-07-03T18:10:44Z

    enable: bool = False
+    mode: Literal["rotate", "rotate_back"] = "rotate"
    rotate_fp32: bool = False
    block_size: int | None = None


Can we add doc for these fields? Not clear what these are for

coderabbitai

🧹 Nitpick comments (1)

examples/llm_eval/quantization_utils.py (1)
66-71: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Track the embed_tokens rotate workaround.

The TODO documents a known rotate issue for the embedding token path that's being worked around by force-disabling rotate/quantization for *embed_tokens*quantizer rather than fixing the root cause.

Do you want me to open a follow-up issue to investigate the root cause of the rotate incompatibility with the embedding path?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_eval/quantization_utils.py` around lines 66 - 71, The
embed_tokens rotate workaround in quantization_utils is currently only
documented with a TODO, so it should be tracked with a real follow-up issue
instead of an inline note. Update the embed_tokens quantizer entry in the
quantization config to reference the tracking ticket or issue ID, and keep the
workaround description concise so the intent is clear for the *embed_tokens*
quantizer path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/llm_eval/quantization_utils.py`:
- Around line 66-71: The embed_tokens rotate workaround in quantization_utils is
currently only documented with a TODO, so it should be tracked with a real
follow-up issue instead of an inline note. Update the embed_tokens quantizer
entry in the quantization config to reference the tracking ticket or issue ID,
and keep the workaround description concise so the intent is clear for the
*embed_tokens* quantizer path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4f889424-843f-4693-8d5f-656c5e1c9110

📥 Commits

Reviewing files that changed from the base of the PR and between afcbf78 and 80be688.

📒 Files selected for processing (12)

examples/llm_eval/quantization_utils.py
modelopt/torch/export/plugins/vllm_fakequant_hf.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_quant.py
modelopt/torch/quantization/nn/functional.py
modelopt/torch/quantization/nn/modules/quant_module.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
modelopt/torch/quantization/plugins/huggingface.py
modelopt/torch/quantization/plugins/vllm.py
tests/unit/torch/quantization/plugins/test_fused_experts.py
tests/unit/torch/quantization/test_print.py
tests/unit/torch/quantization/test_tensor_quant_cpu.py

✅ Files skipped from review due to trivial changes (1)

modelopt/torch/quantization/model_quant.py

🚧 Files skipped from review as they are similar to previous changes (2)

modelopt/torch/quantization/config.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma force-pushed the asma/tensorquantizer-rotation-modes branch from a51f4da to 3c6b394 Compare July 1, 2026 20:48

realAsma marked this pull request as ready for review July 1, 2026 21:26

realAsma requested review from a team as code owners July 1, 2026 21:26

realAsma requested review from kinjalpatel27 and mxinO July 1, 2026 21:26

realAsma marked this pull request as draft July 1, 2026 21:28

realAsma marked this pull request as ready for review July 1, 2026 21:30

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread modelopt/torch/quantization/nn/modules/tensor_quantizer.py

realAsma requested a review from sugunav14 July 1, 2026 23:10

cjluo-nv reviewed Jul 1, 2026

View reviewed changes

realAsma commented Jul 2, 2026

View reviewed changes

Comment thread tests/gpu/torch/quantization/test_hadamard.py

coderabbitai Bot approved these changes Jul 2, 2026

View reviewed changes

realAsma mentioned this pull request Jul 2, 2026

Add TensorQuantizer Random Hadamard rotation seed #1883

Draft

realAsma marked this pull request as draft July 2, 2026 11:57

realAsma commented Jul 3, 2026

View reviewed changes

realAsma marked this pull request as ready for review July 3, 2026 21:35

realAsma requested review from a team as code owners July 3, 2026 21:35

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

realAsma commented Jul 4, 2026

View reviewed changes

Comment thread examples/llm_eval/quantization_utils.py

realAsma force-pushed the asma/tensorquantizer-rotation-modes branch 2 times, most recently from 52b366a to 960716f Compare July 4, 2026 02:37

Add TensorQuantizer rotate-back mode

6a1f830

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma added 7 commits July 4, 2026 13:01

Add Hadamard roundtrip GPU test

874dbd8

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Address rotate-back review feedback

c0287fc

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Refine TensorQuantizer rotate-back handling

5da851f

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Address tensor quantizer rotation review comments

2d0ed6d

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Add W4A4 rotate eval config workaround

df16ddf

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Clear rotation when disabling tensor quantizers

da8eb4a

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Remove temporary W4A4 rotate eval config

f5451f2

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma force-pushed the asma/tensorquantizer-rotation-modes branch from 960716f to f5451f2 Compare July 4, 2026 13:02

Uh oh!

Conversation

realAsma commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Latest Update

Example

Validation

Eval Results

Review

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-07-04 13:05 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

realAsma left a comment

Choose a reason for hiding this comment

Uh oh!

realAsma left a comment

Choose a reason for hiding this comment

Uh oh!

realAsma Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

realAsma commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-07-04 13:05 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jul 1, 2026 •

edited

Loading