Skip to content

Add TensorQuantizer rotate-back mode#1879

Open
realAsma wants to merge 8 commits into
mainfrom
asma/tensorquantizer-rotation-modes
Open

Add TensorQuantizer rotate-back mode#1879
realAsma wants to merge 8 commits into
mainfrom
asma/tensorquantizer-rotation-modes

Conversation

@realAsma

@realAsma realAsma commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • add RotateConfig.mode with the existing rotate behavior as the default
  • support rotate={"enable": true, "mode": "rotate_back"} for rotate -> quantize -> rotate back in fake quantization
  • reject rotate-back mode for real quantization instead of silently skipping inverse rotation
  • add NVFP4_W4A4_ROTATE eval config with the config-level embed-token rotate workaround and TODO for the longer-term fix
  • add focused CPU coverage for config serialization, default rotation, rotate-back, and real-quant rejection

Latest Update

  • fixed a critical fakequant eval/export bug where fold_weight applied rotation into the stored weight, disabled the quantizer, but left rotation enabled, causing the folded weight to be rotated a second time on later use
  • fused the Hadamard rotation scale into the quantization path and removed the redundant autograd wrapper
  • added a config-level workaround for W4A4 rotate: *embed_tokens*quantizer has rotate explicitly disabled because the embedding token/input path creates the rotate issue; a TODO tracks replacing this with a better fix

Example

quant_cfg = {
    "quant_cfg": [
        {
            "quantizer_name": "*weight_quantizer",
            "cfg": {
                "num_bits": 4,
                "block_sizes": {-1: 128},
                "fake_quant": True,
                "rotate": {
                    "enable": True,
                    "mode": "rotate_back",
                },
            },
        },
    ],
    "algorithm": "max",
}

rotate_back selects rotate -> quantize -> rotate-back. The existing behavior stays as rotate=True or rotate={"enable": True, "mode": "rotate"}. rotate_back is supported for fake quant only; real quantization raises clearly instead of silently skipping the inverse rotation.

Validation

  • pre-commit run --files modelopt/torch/quantization/config.py modelopt/torch/quantization/nn/modules/tensor_quantizer.py tests/unit/torch/quantization/test_tensor_quant_cpu.py
  • pytest_pwd tests/unit/torch/quantization/test_tensor_quant_cpu.py -q --tb=short -x
  • python_pwd -m py_compile examples/llm_eval/quantization_utils.py
  • pre-commit run --files examples/llm_eval/quantization_utils.py

Eval Results

Qwen/Qwen3-8B MMLU eval from examples/llm_eval/lm_eval_hf.py using lm-eval==0.4.10, mmlu, 5-shot, batch_size=4, calib_size=512, and calib_batch_size=4 for quantized runs.

Configuration Quant config MMLU acc Stderr Runtime
BF16 - 0.7488249537102977 0.0034768059948852644 28m 42s
FP4 W/A NVFP4_DEFAULT_CFG 0.7097 0.0036 -
FP4 W/A rotate, embed-token rotate disabled NVFP4_W4A4_ROTATE 0.6916 0.0037 -
FP4 W/A rotate + rotate-back, embed-token rotate disabled NVFP4_W4A4_ROTATE_ROTATE_BACK 0.6945 0.0036 -

The previous near-random W4A4 rotate diagnostics were removed from this table. The corrected W4A4 rotate result is 0.6916 +/- 0.0037; split checks isolated the issue to embed_tokens retaining rotate in the token/input path.

Full run logs, status files, GPU snapshots, and JSON outputs are retained in the private BeeBot artifact bundles for the Qwen3-8B eval/debug runs.

Review

  • SE implementation: complete
  • SR review: clean on iteration 2
  • ST validation: pass on iteration 2

Summary by CodeRabbit

  • New Features

    • Added support for input rotation modes, including a new rotate-back flow for fake quantization.
    • Improved weight folding so rotation is baked into weights and won’t be reapplied on later passes.
    • Updated quantizer display text to better reflect rotation and disabled states.
  • Bug Fixes

    • Fixed rotation handling when quantizers are disabled or partially updated.
    • Improved consistency and validation for Hadamard transform behavior across CPU and GPU paths.

@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a mode field ("rotate"/"rotate_back") to RotateConfig, updates config validation to skip nested rotate dicts, reworks normalized_hadamard_transform's block-size handling, and extends TensorQuantizer with rotate-back inverse rotation support, a disable_rotate() method, and a shared weight-folding helper used across export and plugin paths, with corresponding tests.

Changes

Rotate-back mode implementation

Layer / File(s) Summary
Config and Hadamard support
modelopt/torch/quantization/config.py, modelopt/torch/quantization/nn/functional.py, tests/gpu/torch/quantization/test_hadamard.py, CHANGELOG.rst
RotateConfig adds a mode field, recursive config validation skips the rotate subtree, normalized_hadamard_transform reworks block-size selection/scaling, and Hadamard tests assert roundtrip reconstruction.
TensorQuantizer rotate-back path
modelopt/torch/quantization/nn/modules/tensor_quantizer.py, tests/unit/torch/quantization/test_print.py, tests/unit/torch/quantization/test_quant_embedding.py, tests/unit/torch/quantization/test_tensor_quant_cpu.py
Adds rotate_back_is_enabled, disable_rotate(), _rotate_inputs(); forward() gates rotate-back with a fake_quant guard and applies inverse rotation; extra_repr() and SequentialQuantizer delegation updated; extensive unit tests added.
Weight folding helpers and export wiring
modelopt/torch/quantization/nn/modules/quant_module.py, modelopt/torch/quantization/plugins/huggingface.py, modelopt/torch/quantization/plugins/vllm.py, modelopt/torch/export/plugins/vllm_fakequant_hf.py, modelopt/torch/quantization/model_quant.py, tests/unit/torch/quantization/test_tensor_quant_cpu.py
Shared _fold_weight_quantizer helper folds and disables rotation for weights; huggingface/vllm plugins and vLLM fakequant export delegate to quantizer disable methods; fold_weight docstring updated; tests validate the fold path.
Fused-experts plugin tests
tests/unit/torch/quantization/plugins/test_fused_experts.py
Adds tests verifying fold_weight() disables per-expert quantizers and rotation, with amax retention controlled by keep_attrs.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant TensorQuantizer
  participant normalized_hadamard_transform

  Caller->>TensorQuantizer: forward(inputs)
  alt rotate_back_is_enabled and fake_quant is false
    TensorQuantizer-->>Caller: raise ValueError
  else fake_quant is true
    TensorQuantizer->>normalized_hadamard_transform: _rotate_inputs(inputs)
    TensorQuantizer->>TensorQuantizer: quantize(inputs)
    opt rotate_back_is_enabled
      TensorQuantizer->>normalized_hadamard_transform: _rotate_inputs(outputs)
    end
    TensorQuantizer-->>Caller: return outputs
  end
Loading
sequenceDiagram
  participant QuantModule
  participant TensorQuantizer
  participant ExportFlow

  QuantModule->>TensorQuantizer: _fold_weight_quantizer(weights, keep_attrs)
  TensorQuantizer->>TensorQuantizer: fold weight values into tensors
  TensorQuantizer->>TensorQuantizer: disable_rotate()
  TensorQuantizer->>TensorQuantizer: disable()
  ExportFlow->>TensorQuantizer: disable_rotate() during export
Loading
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding rotate-back support to TensorQuantizer.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No new torch.load/allow_pickle/trust_remote_code/eval/exec/nosec/dependency issues appear in the PR diff.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch asma/tensorquantizer-rotation-modes

Comment @coderabbitai help to get the list of available commands.

@realAsma realAsma force-pushed the asma/tensorquantizer-rotation-modes branch from a51f4da to 3c6b394 Compare July 1, 2026 20:48
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1879/

Built to branch gh-pages at 2026-07-04 13:05 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.69014% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.30%. Comparing base (75b5803) to head (f5451f2).

Files with missing lines Patch % Lines
modelopt/torch/quantization/nn/functional.py 0.00% 7 Missing ⚠️
modelopt/torch/export/plugins/vllm_fakequant_hf.py 0.00% 4 Missing ⚠️
modelopt/torch/quantization/plugins/vllm.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1879      +/-   ##
==========================================
+ Coverage   61.21%   61.30%   +0.08%     
==========================================
  Files         515      515              
  Lines       57245    57243       -2     
==========================================
+ Hits        35043    35092      +49     
+ Misses      22202    22151      -51     
Flag Coverage Δ
examples 32.63% <50.70%> (-0.24%) ⬇️
gpu 20.55% <50.70%> (+0.03%) ⬆️
regression 14.83% <23.94%> (+0.07%) ⬆️
unit 55.02% <81.69%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@realAsma realAsma marked this pull request as ready for review July 1, 2026 21:26
@realAsma realAsma requested review from a team as code owners July 1, 2026 21:26
@realAsma realAsma requested review from kinjalpatel27 and mxinO July 1, 2026 21:26
@realAsma realAsma marked this pull request as draft July 1, 2026 21:28
@realAsma realAsma marked this pull request as ready for review July 1, 2026 21:30

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 1110-1111: The rotate_back validation in TensorQuantizer should
happen before the first rotation path is entered, since _rotate_inputs() can
invoke normalized_hadamard_transform() and fail with lower-level errors first.
Move the unsupported rotate_back check in the TensorQuantizer flow ahead of the
initial rotation branch so the intended ValueError is raised consistently, and
update the related fake_quant=False path and test coverage to assert the early
rejection via the rotate_back/config validation rather than relying on a stubbed
transform.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f9377bc-7e6a-4144-8fd9-109d398c314f

📥 Commits

Reviewing files that changed from the base of the PR and between fbbc598 and 3c6b394.

📒 Files selected for processing (3)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • tests/unit/torch/quantization/test_tensor_quant_cpu.py

Comment thread modelopt/torch/quantization/nn/modules/tensor_quantizer.py
@realAsma realAsma requested a review from sugunav14 July 1, 2026 23:10

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Small, cohesive feature (+106/-9, 3 files) adding a rotate_back mode to TensorQuantizer (rotate → quantize → rotate-back for fake quant), with a clear rejection for real quantization. Code is clean and reuses existing machinery: a new _rotate_inputs helper deduplicates the two rotation call-sites, and rotate_back_is_enabled mirrors the existing rotate_is_* property pattern (dict/RotateConfig/bool backward-compat handled consistently). The field_name == "rotate" short-circuit in validate_config correctly stops the "only {'enable': True}" check from mis-firing on rotate={"enable": True} (a latent bug), and RotateConfig's own validators still cover the sub-fields.

CPU test coverage is solid: config serialization round-trip, default-path (transform called once), rotate-back path (transform called twice with matching fp32/block_size args), and the real-quant ValueError. Tests monkeypatch normalized_hadamard_transform so they don't need the fast_hadamard_transform package. No licensing concerns, no oversized diff, and no prompt-injection content in the PR fields.

Why nudge rather than approve:

  • Correctness rests on the Hadamard transform being self-inverse. rotate_back applies the same normalized_hadamard_transform a second time rather than an explicit inverse; this is only correct because the normalized FHT is orthogonal/symmetric (HᵀH = I). That holds for the full-dim and fixed-block paths, but the auto-selected block-size path (block_size=None_largest_pow2_divisor(dim)) picks the block size from the tensor shape at call time — worth a human confirming the forward and rotate-back calls always resolve to the same block size so the round-trip is truly identity. The CPU tests stub out the transform, so this numerical property is not actually exercised (GPU-only).
  • CHANGELOG not updated. This is a user-facing QuantizerAttributeConfig.rotate.mode addition; the 0.46 "New Features" section has no entry for it.

Comment thread tests/gpu/torch/quantization/test_hadamard.py
@realAsma realAsma marked this pull request as draft July 2, 2026 11:57

@realAsma realAsma left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify all the fold_weight doctoring -> no need of mentioning rotate functionality in particular - it is a Lower level details obvious in the code.

@realAsma realAsma left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of unittests; can we simplify or consolidate any of them?

are there any low level tests which are already covered by high-level tests - if yes should we remove them?

We can keep the unittests that are essential for testing the correctness or expected behavior

Comment thread modelopt/torch/quantization/config.py Outdated
Comment on lines 289 to 292
enable: bool = False
mode: Literal["rotate", "rotate_back"] = "rotate"
rotate_fp32: bool = False
block_size: int | None = None

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add doc for these fields? Not clear what these are for

@realAsma realAsma marked this pull request as ready for review July 3, 2026 21:35
@realAsma realAsma requested review from a team as code owners July 3, 2026 21:35

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
examples/llm_eval/quantization_utils.py (1)

66-71: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Track the embed_tokens rotate workaround.

The TODO documents a known rotate issue for the embedding token path that's being worked around by force-disabling rotate/quantization for *embed_tokens*quantizer rather than fixing the root cause.

Do you want me to open a follow-up issue to investigate the root cause of the rotate incompatibility with the embedding path?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_eval/quantization_utils.py` around lines 66 - 71, The
embed_tokens rotate workaround in quantization_utils is currently only
documented with a TODO, so it should be tracked with a real follow-up issue
instead of an inline note. Update the embed_tokens quantizer entry in the
quantization config to reference the tracking ticket or issue ID, and keep the
workaround description concise so the intent is clear for the *embed_tokens*
quantizer path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/llm_eval/quantization_utils.py`:
- Around line 66-71: The embed_tokens rotate workaround in quantization_utils is
currently only documented with a TODO, so it should be tracked with a real
follow-up issue instead of an inline note. Update the embed_tokens quantizer
entry in the quantization config to reference the tracking ticket or issue ID,
and keep the workaround description concise so the intent is clear for the
*embed_tokens* quantizer path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4f889424-843f-4693-8d5f-656c5e1c9110

📥 Commits

Reviewing files that changed from the base of the PR and between afcbf78 and 80be688.

📒 Files selected for processing (12)
  • examples/llm_eval/quantization_utils.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_quant.py
  • modelopt/torch/quantization/nn/functional.py
  • modelopt/torch/quantization/nn/modules/quant_module.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/plugins/vllm.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/quantization/test_print.py
  • tests/unit/torch/quantization/test_tensor_quant_cpu.py
✅ Files skipped from review due to trivial changes (1)
  • modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py

Comment thread examples/llm_eval/quantization_utils.py
@realAsma realAsma force-pushed the asma/tensorquantizer-rotation-modes branch 2 times, most recently from 52b366a to 960716f Compare July 4, 2026 02:37
Signed-off-by: realAsma <akuriparambi@nvidia.com>
realAsma added 7 commits July 4, 2026 13:01
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma force-pushed the asma/tensorquantizer-rotation-modes branch from 960716f to f5451f2 Compare July 4, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants