Add nvfp4 attention support for vLLM serving by kaix-nv · Pull Request #1898 · NVIDIA/Model-Optimizer

kaix-nv · 2026-07-03T18:29:08Z

What does this PR do?

Type of change: New feature

Applies NVFP4 fake quantization to Q, K, P, and V:
- Q uses dynamic scaling and an FP32 quantize-dequantize (QDQ) carrier.
- K and V use a global scale of 1.0.
- P uses an amax of 1.0, is rounded to the model dtype before packing, and retains FP32 QDQ accumulation.
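To make the term "fake quantization" concrete, here is a minimal pure-Python sketch of a quantize-dequantize round trip on the NVFP4 value grid. It is not the PR's Triton kernel. It assumes the usual NVFP4 layout (E2M1 elements, one scale per 16-element block) and simplifies two things: the block scale stays a Python float instead of being stored as FP8 (E4M3), and ties round toward the smaller magnitude. The function names are hypothetical.

```python
# Magnitudes representable by the 4-bit E2M1 element format.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one scale is shared by 16 consecutive elements


def round_to_e2m1(x: float) -> float:
    """Snap x to the nearest E2M1 magnitude, saturating at +/-6."""
    mag = min(abs(x), E2M1_GRID[-1])
    # Simplification: ties go to the smaller magnitude.
    nearest = min(E2M1_GRID, key=lambda g: (abs(g - mag), g))
    return -nearest if x < 0 else nearest


def fake_quant_block(block):
    """Quantize-dequantize one block using a dynamic scale derived from its amax."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Simplification: the scale is kept in full precision, not rounded to FP8.
    scale = amax / E2M1_GRID[-1]
    return [round_to_e2m1(v / scale) * scale for v in block]


def fake_quant(values):
    """Apply block-wise fake quantization to a flat list of values."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quant_block(values[start:start + BLOCK_SIZE]))
    return out


print(fake_quant_block([6.0, 3.0, 2.4, -0.7, 0.0]))  # [6.0, 3.0, 2.0, -0.5, 0.0]
```

The output keeps the input's floating-point type, which is what makes it "fake" quantization: only the values are restricted to the 4-bit grid.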
Adds paged prefill and a dedicated split-K decode kernel for long-context serving.
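For readers unfamiliar with split-K decode, the sketch below shows the standard log-sum-exp merge that lets independent K/V splits be combined into one attention output. It is a plain-Python illustration of the general technique, not the kernel in this PR: it ignores paging, quantization, and the softmax scale, and all function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attend over one K/V split; return (max score, exp-sum, unnormalized output)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc


def combine_splits(partials):
    """Merge per-split results by rescaling each to the global max score."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, exp_sum, acc in partials:
        w = math.exp(m - m_all)
        denom += w * exp_sum
        out = [o + w * a for o, a in zip(out, acc)]
    return [o / denom for o in out]


def split_k_decode(q, keys, values, num_splits):
    """Single-query attention computed over num_splits contiguous K/V chunks."""
    chunk = -(-len(keys) // num_splits)  # ceiling division
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    return combine_splits(partials)
```

Because each split only needs its own maximum and exponent sum, the splits can run in parallel and the final merge is cheap, which is why the approach suits long-context decode.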
Avoids V-cache re-quantization:
- Complete 16-token groups are finalized once.
- The incomplete group remains pristine (unquantized) in FP16/BF16 and goes through QDQ on read.
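The bookkeeping behind "finalized once" can be sketched with simple index arithmetic. This is an illustration under the assumption that groups are contiguous runs of 16 tokens along the sequence axis; the helper names are hypothetical and do not come from the PR.

```python
GROUP = 16  # tokens per quantization group, as stated in the description


def split_finalized_and_tail(num_tokens: int):
    """Return (tokens covered by complete groups, tokens in the incomplete tail)."""
    tail = num_tokens % GROUP
    return num_tokens - tail, tail


def newly_finalized_groups(old_len: int, new_len: int) -> range:
    """Group indices that become complete when the cache grows from old_len to new_len."""
    return range(old_len // GROUP, new_len // GROUP)


print(split_finalized_and_tail(37))            # (32, 5)
print(list(newly_finalized_groups(15, 16)))    # [0]
print(list(newly_finalized_groups(16, 31)))    # []
```

Each group index is produced by exactly one append, so a group is quantized at most once, and the tail never has to be re-quantized as it grows.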
Validates the complete attention plan before modifying the model and fails loudly for unsupported configurations.
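The validate-then-apply pattern described above can be sketched as follows. Everything here is hypothetical: `LayerPlan`, the supported-backend set (taken from the usage command), and the head-dimension check are illustrative stand-ins, not the PR's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPlan:
    """Hypothetical per-layer record of what would be patched."""
    name: str
    backend: str
    head_dim: int


SUPPORTED_BACKENDS = {"FLASH_ATTN"}  # assumption for illustration


def validate_plan(plans):
    """Collect every problem first, then raise once with the full list."""
    errors = []
    for p in plans:
        if p.backend not in SUPPORTED_BACKENDS:
            errors.append(f"{p.name}: unsupported backend {p.backend!r}")
        if p.head_dim % 16 != 0:  # illustrative check, not from the PR
            errors.append(f"{p.name}: head_dim {p.head_dim} is not a multiple of 16")
    if errors:
        raise ValueError("unsupported attention configuration:\n" + "\n".join(errors))


def apply_plan(model: dict, plans):
    """Patch the model only after the whole plan has been accepted."""
    validate_plan(plans)  # raises before any layer is touched
    for p in plans:
        model[p.name] = "nvfp4"
```

Validating the whole plan up front means a rejected configuration leaves the model exactly as it was, rather than half-patched.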

Usage

 cd examples/vllm_serve

 python vllm_serve_sparse_attn.py <MODEL_PATH> -tp 8 \
   --attention-backend FLASH_ATTN \
   --no-enable-prefix-caching \
   --worker-cls quant_sparse_attn_worker.QuantSparseAttnWorker

Testing

Focused B200 kernel tests:

 PYTEST_VERSION=1 PYTHONPATH=$PWD python -m pytest -q \
   tests/gpu/torch/kernels/common/attention/test_triton_fa_p_qdq.py \
   tests/gpu/torch/kernels/common/attention/test_decode_attention.py \
   tests/gpu/torch/kernels/common/attention/test_triton_fa_paged.py

Result: 47 passed.

Focused vLLM integration tests:

 PYTHONPATH=$PWD python -m pytest -q \
   tests/gpu_vllm/torch/sparsity/attention_sparsity/test_sparse_attn_worker.py \
   tests/gpu_vllm/torch/sparsity/attention_sparsity/test_quant_sparse_attn_worker.py

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

copy-pr-bot · 2026-07-03T18:29:12Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-07-03T18:29:15Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0503d1f-6c5d-431e-b7d3-28f99935ff24

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

github-actions · 2026-07-03T18:32:45Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1898/
Built to branch `gh-pages` at 2026-07-04 05:06 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-07-03T18:38:08Z

Codecov Report

❌ Patch coverage is 1.50943% with 261 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.92%. Comparing base (75b5803) to head (46744e1).

Files with missing lines	Patch %	Lines
...torch/kernels/common/attention/decode_attention.py	0.00%	118 Missing ⚠️
...delopt/torch/kernels/common/attention/triton_fa.py	9.09%	40 Missing ⚠️
...lopt/torch/kernels/quantization/attention/v_qdq.py	0.00%	40 Missing ⚠️
.../torch/sparsity/attention_sparsity/plugins/vllm.py	0.00%	40 Missing ⚠️
modelopt/torch/quantization/plugins/vllm.py	0.00%	19 Missing ⚠️
...t/torch/kernels/quantization/common/nvfp4_quant.py	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1898      +/-   ##
==========================================
- Coverage   61.21%   60.92%   -0.30%     
==========================================
  Files         515      517       +2     
  Lines       57245    57477     +232     
==========================================
- Hits        35043    35017      -26     
- Misses      22202    22460     +258
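The figures in the diff can be cross-checked with integer arithmetic. The sketch below assumes percentages are truncated (not rounded) to two decimals, which reproduces the values shown; the helper name is hypothetical.

```python
def pct_truncated(hits: int, lines: int) -> str:
    """Coverage percentage truncated to two decimal places."""
    hundredths = hits * 10000 // lines
    return f"{hundredths // 100}.{hundredths % 100:02d}"


base = {"lines": 57245, "hits": 35043, "misses": 22202}
head = {"lines": 57477, "hits": 35017, "misses": 22460}

print(pct_truncated(base["hits"], base["lines"]))  # 61.21
print(pct_truncated(head["hits"], head["lines"]))  # 60.92
print(head["lines"] - base["lines"])               # 232
print(head["misses"] - base["misses"])             # 258
```

Hits and misses sum to the line count on both sides, so the report contains no partially covered lines.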

Flag	Coverage Δ
unit	`54.69% <1.50%> (-0.24%)`	⬇️

kaix-nv added 7 commits July 3, 2026 18:57

Add compact vLLM attention quant carrier

21f697f

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add compact NVFP4 V attention path

7436ba6

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add compact NVFP4 vLLM attention worker

d392f96

Signed-off-by: Kai Xu <kaix@nvidia.com>

Document compact NVFP4 vLLM attention serving

08369ba

Signed-off-by: Kai Xu <kaix@nvidia.com>

Harden compact NVFP4 attention numerics

081e7e3

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add split-K NVFP4 decode attention

94c60cc

Adapt the per-head split-K and FP32 combine structure from internal ModelOpt commit 6c08d08. Reuse the compact branch paged loaders and NVFP4 P/V QDQ helpers, while preserving the Option-3 pristine V tail.

Signed-off-by: Kai Xu <kaix@nvidia.com>

Match NVFP4 attention fake quant to native numerics

50e1e0e

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/sparse_attn_quant_compact branch from dcf4d2a to 50e1e0e on July 4, 2026 01:57

Fix NVFP4 attention tests

46744e1

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv wants to merge 8 commits into main from kaix/sparse_attn_quant_compact
