
Add AutoQuantize recipe support #1856

Open
juhi10071998 wants to merge 17 commits into main from juhim/autoquant-recipe-v2


@juhi10071998

@juhi10071998 juhi10071998 commented Jun 29, 2026

Contributor

What does this PR do?

Type of change: New feature (AutoQuantize recipes). The --auto_quantize_* CLI flags are deprecated but still work (kept as a thin backward-compat shim) — not removed.

Makes AutoQuantize recipe-driven: mtq.auto_quantize is configured by a declarative YAML recipe (--recipe). The old --auto_quantize_* flags are converted into an AutoQuantizeConfig on the fly and run the exact same recipe path (emitting a DeprecationWarning), so old commands keep working. The recipe path is verified byte-identical to the CLI.

  • Cost model (quantization/config.py, algorithms.py): new effective_bits field on QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format default). estimate_quant_compression resolves recipe-level → per-entry → num_bits heuristic. configs/numerics/nvfp4.yaml ships effective_bits: 4.5 (block-scale-accurate) as the single source of truth.
  • Recipe schema (recipe/config.py, recipe/loader.py): RecipeType.AUTO_QUANTIZE + AutoQuantizeConfig / AutoQuantizeConstraints / AutoQuantizeCost. Fields: constraints (effective_bits, cost_model, cost.active_moe_expert_ratio), candidate_formats, auto_quantize_method (gradient/kl_div), score_size, disabled_layers, cost_excluded_layers (e.g. VL vision towers), kv_cache.
  • Dispatch (examples/hf_ptq/hf_ptq.py): recipe → mtq inputs via _mtq_inputs_from_auto_quantize_config; _match_candidate_to_preset resolves candidates to shipped presets and guards export-compatibility (rejects export-unsafe presets before the search).
  • Deprecated CLI shim: _auto_quantize_config_from_cli builds an AutoQuantizeConfig from the old flags and appends the shared base disabled_layers / cost_excluded_layers (loaded once as module constants in recipe/config.py, mirroring _default_disabled_quantizer_cfg). No model introspection, no new user flags.
  • Shipped recipes: general/auto_quantize/ (nvfp4_fp8_at_5p4bits, nvfp4_fp8_kl_div_at_5p4bits, nvfp4_mse_fp8_at_6p0bits, w4a8_awq_beta_fp8_at_6p0bits, w4a16_nvfp4_fp8_at_6p0bits-active_moe) and model-specific huggingface/qwen3_6_moe/auto_quantize/.... Shared configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers spliced via $import.
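The precedence described in the cost-model bullet (recipe-level override, then per-entry default, then a num_bits heuristic) can be sketched as below. This is a minimal illustration and not the actual estimate_quant_compression code: the names `resolve_effective_bits` and `bits_from_num_bits`, their arguments, and the fallback rule for tuple-valued num_bits are assumptions made for the example. The 4.5 figure for NVFP4 is consistent with 4-bit elements plus one 8-bit scale shared by each 16-element block.

```python
from typing import Optional, Tuple, Union

NumBits = Union[int, Tuple[int, int]]


def bits_from_num_bits(num_bits: NumBits) -> float:
    """Assumed fallback heuristic: an int is used as-is; an
    (exponent, mantissa) tuple counts exponent + mantissa + 1 sign bit."""
    if isinstance(num_bits, tuple):
        exponent, mantissa = num_bits
        return float(exponent + mantissa + 1)
    return float(num_bits)


def resolve_effective_bits(
    recipe_level: Optional[float],
    per_entry: Optional[float],
    num_bits: NumBits,
) -> float:
    """Hypothetical resolver: recipe-level override first, then the
    per-entry (per-format) default, then the num_bits heuristic."""
    if recipe_level is not None:
        return recipe_level
    if per_entry is not None:
        return per_entry
    return bits_from_num_bits(num_bits)


# NVFP4: 4-bit elements plus one 8-bit scale per 16-element block.
nvfp4_bits = 4 + 8 / 16

print(resolve_effective_bits(None, nvfp4_bits, (2, 1)))  # per-entry default is used
print(resolve_effective_bits(5.0, nvfp4_bits, (2, 1)))   # recipe-level override is used
print(resolve_effective_bits(None, None, (4, 3)))        # heuristic fallback is used
```

Keeping the per-format value next to the numerics definition means a recipe only needs to state an override when it deliberately wants a different cost than the shipped default.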

Migration (deprecated flag → recipe field): --auto_quantize_bits → constraints.effective_bits · --auto_quantize_method → auto_quantize_method · --auto_quantize_score_size → score_size · --auto_quantize_cost_model → constraints.cost_model · --auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio · --qformat fp8,nvfp4 → candidate_formats. --auto_quantize_checkpoint is unchanged.
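To make the flag-to-field mapping concrete, here is a small sketch of how a shim could turn the deprecated flags into a nested, recipe-shaped dictionary while emitting a DeprecationWarning. It is not the PR's _auto_quantize_config_from_cli: the function name `config_from_deprecated_flags`, the plain-dict output, and the use of format names as candidates (the real config holds QuantizeConfig entries and also appends the base layer patterns) are simplifying assumptions.

```python
import warnings
from typing import Any, Dict

# Deprecated flag name -> dotted recipe field, as listed in the migration note.
FLAG_TO_FIELD = {
    "auto_quantize_bits": "constraints.effective_bits",
    "auto_quantize_method": "auto_quantize_method",
    "auto_quantize_score_size": "score_size",
    "auto_quantize_cost_model": "constraints.cost_model",
    "auto_quantize_active_moe_expert_ratio": "constraints.cost.active_moe_expert_ratio",
}


def config_from_deprecated_flags(flags: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical shim: build a nested recipe-style dict from the old flags."""
    config: Dict[str, Any] = {}
    used = [name for name in FLAG_TO_FIELD if flags.get(name) is not None]
    if used:
        warnings.warn(
            "--auto_quantize_* flags are deprecated; use --recipe instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    for name in used:
        # Walk (and create) the parent dicts named by the dotted path.
        *parents, leaf = FLAG_TO_FIELD[name].split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = flags[name]
    # --qformat is a comma-separated list of candidate formats.
    if flags.get("qformat"):
        config["candidate_formats"] = flags["qformat"].split(",")
    return config


legacy = {"qformat": "nvfp4,fp8", "auto_quantize_bits": 5.4}
print(config_from_deprecated_flags(legacy))
```

Because the shim only produces a config object, everything downstream can assume a single recipe-shaped input regardless of how the user invoked the script.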

Usage

# Recipe (preferred)
python examples/hf_ptq/hf_ptq.py --pyt_ckpt_path <model> --recipe general/auto_quantize/nvfp4_fp8_at_5p4bits --export_path <out>

# Deprecated CLI (converted to a recipe on the fly, still works)
python examples/hf_ptq/hf_ptq.py --pyt_ckpt_path <model> --qformat nvfp4,fp8 --auto_quantize_bits 5.4 --export_path <out>

Testing

  • GPU-free unit tests: recipe loader; recipe→mtq.auto_quantize mapping incl. cost_excluded_layers; export-compat guard (reject/warn/no-bypass); deprecated-CLI→AutoQuantizeConfig conversion; effective_bits resolver + validators.
  • Byte-identical export smoke: recipe path on Qwen3.6-35B-A3B (fp8 + w4a16_nvfp4 @ 6.0, active_moe) → identical hf_quant_config.json across CLI/recipe; also confirmed the deprecated CLI shim ≡ recipe on the same VL MoE.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ Yes. The --auto_quantize_* flags are deprecated but still work (converted to a recipe on the fly, with a DeprecationWarning). The plain PTQ CLI is unaffected.
  • New PIP dependency / copied code: N/A
  • New tests?: ✅
  • Updated Changelog?: ✅ (Deprecations)
  • Claude approval?: pending /claude review

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Jun 29, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Contributor


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR replaces CLI-driven AutoQuantize configuration with declarative YAML recipes. It adds RecipeType.AUTO_QUANTIZE/AutoQuantizeConfig schema and loader validation, effective_bits overrides in quantization config and compression estimation, shipped recipe YAMLs, refactored hf_ptq.py recipe-driven flow, removed legacy CLI flags, and updated docs/tests.

Changes

Recipe-driven AutoQuantize

| Layer / File(s) | Summary |
|---|---|
| AutoQuantize recipe schema and loader: modelopt/recipe/config.py, modelopt/recipe/loader.py | Adds RecipeType.AUTO_QUANTIZE, AutoQuantizeCost, AutoQuantizeConstraints, AutoQuantizeConfig, ModelOptAutoQuantizeRecipe, registers it in RECIPE_TYPE_TO_CLASS, and updates loader required-section validation and error naming. |
| effective_bits config and compression estimation: modelopt/torch/quantization/config.py, modelopt/torch/quantization/algorithms.py, tests/unit/torch/quantization/test_autoquant.py | Adds validated effective_bits fields to QuantizerAttributeConfig/QuantizeConfig, updates estimate_quant_compression precedence logic, and updates/adds tests for the new cost behavior. |
| Shipped AutoQuantize recipe YAMLs: modelopt_recipes/configs/auto_quantize/units/base_disabled_layers.yaml, modelopt_recipes/configs/numerics/nvfp4.yaml, modelopt_recipes/general/auto_quantize/*, modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/* | Adds shared disabled-layer patterns, NVFP4 effective_bits: 4.5 override, and several general/model-specific AutoQuantize recipe YAMLs with candidate formats, effective_bits targets, and cost model settings. |
| hf_ptq.py recipe-driven AutoQuantize flow: examples/hf_ptq/example_utils.py, examples/hf_ptq/hf_ptq.py | Removes legacy Qwen/VLM disabled-layer helpers, adds recipe-to-MTQ mapping helpers, redesigns auto_quantize()/make_calib_dataloader() signatures, and reroutes quantize_main() execution/validation based on ModelOptAutoQuantizeRecipe. |
| CLI cleanup and documentation updates: examples/hf_ptq/scripts/parser.sh, examples/hf_ptq/scripts/huggingface_example.sh, examples/hf_ptq/README.md, CHANGELOG.rst | Removes legacy --auto_quantize_bits/method/score_size flags, adds checkpoint-path passthrough/generation logic, and updates README/changelog to describe recipe-based AutoQuantize and effective_bits terminology. |
| Test updates for recipe-driven AutoQuantize: tests/_test_utils/examples/hf_ptq_utils.py, tests/_test_utils/examples/run_command.py, tests/examples/hf_ptq/test_hf_ptq_args.py, tests/examples/hf_ptq/test_llm_ptq.py, tests/unit/recipe/test_loader.py | Makes quant optional and adds recipe field with mutual-exclusivity validation, replaces CLI-argument tests with recipe-based MTQ input tests, updates PTQ parametrization to use recipes, and adds loader coverage for AutoQuantize recipes. |

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant quantize_main
  participant load_recipe
  participant auto_quantize
  participant mtq

  User->>quantize_main: run hf_ptq.py --recipe <auto_quantize.yaml>
  quantize_main->>load_recipe: load_recipe(path)
  load_recipe-->>quantize_main: ModelOptAutoQuantizeRecipe
  quantize_main->>auto_quantize: auto_quantize(args, model, calib_dataloader, aq_config)
  auto_quantize->>auto_quantize: build constraints, candidate_formats, disabled_layers, kv_cache config
  auto_quantize->>mtq: mtq.auto_quantize(model, constraints, candidates, disabled_layers)
  mtq-->>auto_quantize: searched model
  auto_quantize->>auto_quantize: apply KV-cache quantization post-step
  auto_quantize-->>quantize_main: quantized model

Suggested reviewers: kevalmorabia97, meenchen, cjluo-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly matches the main change: adding declarative AutoQuantize recipe support. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No touched Python file hardcodes weights_only=False, allow_pickle=True, trust_remote_code=True, eval/exec, or # nosec; only harmless model.eval() and caller-controlled trust_remote_code params appear. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Comment @coderabbitai help to get the list of available commands.

@juhi10071998 juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch 2 times, most recently from 6e4d430 to 0d85360 Compare June 30, 2026 19:26
@github-actions

github-actions Bot commented Jun 30, 2026

Contributor
PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1856/

Built to branch gh-pages at 2026-07-02 23:16 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@juhi10071998 juhi10071998 marked this pull request as ready for review June 30, 2026 19:33
@juhi10071998 juhi10071998 requested review from a team as code owners June 30, 2026 19:33
@codecov

codecov Bot commented Jun 30, 2026


Codecov Report

❌ Patch coverage is 95.08197% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.99%. Comparing base (4b9225b) to head (c53ae9f).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/torch/quantization/algorithms.py | 57.14% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1856      +/-   ##
==========================================
+ Coverage   70.21%   76.99%   +6.77%     
==========================================
  Files         515      515              
  Lines       57244    57303      +59     
==========================================
+ Hits        40196    44122    +3926     
+ Misses      17048    13181    -3867     
| Flag | Coverage Δ |
|---|---|
| examples | 43.00% <91.80%> (+10.14%) ⬆️ |
| gpu | 57.88% <81.96%> (+7.83%) ⬆️ |
| regression | 14.89% <73.77%> (+0.12%) ⬆️ |
| unit | 54.94% <95.08%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.

@kevalmorabia97 kevalmorabia97 requested a review from jenchen13 June 30, 2026 19:45

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/_test_utils/examples/hf_ptq_utils.py (1)

27-28: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Enforce the quant/recipe invariant in the helper.

Making both fields optional leaves PTQCommand() and PTQCommand(quant=..., recipe=...) as valid constructions, so bad test inputs now fail downstream in the shell layer instead of at this boundary. A small __post_init__ or run() check that requires exactly one of them would keep the matrix honest.

Suggested guard
 class PTQCommand:
     quant: str | None = None
     recipe: str | None = None
@@
+    def __post_init__(self):
+        if (self.quant is None) == (self.recipe is None):
+            raise ValueError("Exactly one of `quant` or `recipe` must be set.")
+
     def run(self, model_path: str):

As per coding guidelines, "Validate external input once at the interface boundary; internal code can trust those checks and avoid redundant assertions."

Also applies to: 64-65

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/_test_utils/examples/hf_ptq_utils.py` around lines 27 - 28, The
PTQCommand helper currently allows both quant and recipe to be missing or both
to be set, so add a boundary check in PTQCommand itself to enforce that exactly
one of those fields is provided. Implement the validation in PTQCommand’s
__post_init__ or run() method so invalid test inputs fail immediately, and keep
the rest of the helper logic assuming the invariant holds.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 318-322: The image-calibration guard in hf_ptq.py is being
triggered unconditionally for Nemotron-VL AutoQuantize because the default
calibration flag is set before the recipe type is resolved. Update the control
flow around load_model() and the args.calib_with_images assignment so this
default only applies to plain PTQ, or move recipe loading earlier and branch on
the resolved recipe type in the AutoQuantize path. Use the existing
args.calib_with_images check and the recipe-loading logic near the AutoQuantize
setup to keep AutoQuantize on the text-only calibration path.

In `@modelopt/recipe/config.py`:
- Around line 134-138: The `active_moe_expert_ratio` field in `config.py` is
documented as being in (0, 1], but it currently accepts any float. Add
schema-level validation on the `ModeloptField`/config model so invalid values
are rejected at parse time, using the `active_moe_expert_ratio` symbol to locate
the field. Enforce the lower and upper bounds directly at the boundary (for
example, via field constraints or a validator on the owning config class) so
malformed recipes fail fast before the `active_moe` cost model uses them.
- Around line 175-179: The `candidate_formats` field in `ModeloptField` is
currently using a default empty list without validating that default, so an
omitted AutoQuantize config can pass schema validation incorrectly. Update the
`candidate_formats` definition in `config.py` to enable default validation
(using `validate_default=True` or the equivalent in the surrounding model/field
setup) so the empty default is rejected immediately. Keep the change localized
to the `candidate_formats` field and ensure the existing “at least 2 required”
constraint is enforced even when the field is not explicitly provided.

In `@tests/examples/hf_ptq/test_hf_ptq_args.py`:
- Around line 41-45: The test module has imports for load_recipe and
QUANT_CFG_CHOICES inside test functions, which should be moved to module scope
so import errors fail during collection. Update the import placement at the top
of the file in test_hf_ptq_args, and remove the redundant in-test imports from
the affected test helpers such as test_autoquant_recipe_builds_mtq_inputs and
the other test block referenced by the review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 645d2f45-fa86-458e-815b-54966bd80497

📥 Commits

Reviewing files that changed from the base of the PR and between d70c48c and 0d85360.

📒 Files selected for processing (23)
  • CHANGELOG.rst
  • examples/hf_ptq/README.md
  • examples/hf_ptq/example_utils.py
  • examples/hf_ptq/hf_ptq.py
  • examples/hf_ptq/scripts/huggingface_example.sh
  • examples/hf_ptq/scripts/parser.sh
  • modelopt/recipe/config.py
  • modelopt/recipe/loader.py
  • modelopt/torch/quantization/algorithms.py
  • modelopt/torch/quantization/config.py
  • modelopt_recipes/configs/auto_quantize/units/base_disabled_layers.yaml
  • modelopt_recipes/configs/numerics/nvfp4.yaml
  • modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_4p8bits.yaml
  • modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
  • modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
  • modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • tests/_test_utils/examples/hf_ptq_utils.py
  • tests/_test_utils/examples/run_command.py
  • tests/examples/hf_ptq/test_hf_ptq_args.py
  • tests/examples/hf_ptq/test_llm_ptq.py
  • tests/unit/recipe/test_loader.py
  • tests/unit/torch/quantization/test_autoquant.py
💤 Files with no reviewable changes (1)
  • examples/hf_ptq/example_utils.py

Comment thread examples/hf_ptq/hf_ptq.py
Comment thread modelopt/recipe/config.py
Comment thread modelopt/recipe/config.py
Comment thread tests/examples/hf_ptq/test_hf_ptq_args.py Outdated
PTQ_ARGS+=" --low_memory_mode "
fi

if [ -n "$AUTO_QUANTIZE_BITS" ]; then

Collaborator

Can we keep the old arguments for one release and emit a deprecation warning when a user passes them instead of the new recipe argument? Otherwise we will break backward compatibility without notice.

Contributor Author

I see, that makes sense. I think hf_ptq.py would need significant changes if we add it back. If I understand correctly, are you suggesting we keep the flags with a warning but remove the functionality?
From my discussions with @shengliangxu and @realAsma, we don't have a lot of AutoQuant users, so it should be safe to deprecate the CLI.

@kevalmorabia97 kevalmorabia97 Jun 30, 2026

Collaborator

Ideally, the previous CLI args keep working: we convert them internally into a YAML recipe on the fly, and the rest of the example logic operates on the YAML directly.

@juhi10071998 juhi10071998 Jun 30, 2026

Contributor Author

Yes, agreed on not breaking backward compatibility silently. I took a deeper look at why this is not just a CLI → recipe remapping.

The scalar flags map 1:1 onto the new AutoQuantizeConfig, so wrapping those is trivial:

  • --auto_quantize_bits → constraints.effective_bits,
  • --auto_quantize_method → auto_quantize_method,
  • --auto_quantize_score_size → num_score_steps,
  • --auto_quantize_cost_model → constraints.cost_model,
  • --auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio,
  • and --qformat fp8,nvfp4 → candidate_formats.

The wrinkle is disabled_layers / cost_excluded_layers, plus the newly added ability to override the per-candidate effective bits when needed.

On main these were never CLI inputs; they come from model introspection that branches on the Qwen model class.
At hf_ptq.py:418, the disabled layers and cost-excluded patterns come from the exclusion-list helpers in example_utils, which are removed in this PR.

I intentionally removed that (per @meenchen's earlier point about moving arch knowledge out of hf_ptq.py/example_utils.py and into the recipes). So a flags→recipe wrapper can't reconstruct the old behavior from the flags alone; it has to source disabled_layers from somewhere.

I would also appreciate input from @meenchen and @shengliangxu on this. I think that if we add the hardcoded exclusion list back, we will have two sources of truth. The current path is also not purely CLI-driven, because some of this information is baked into hf_ptq.py.

Contributor

I agree with Juhi's reasoning.

AutoQuantize now has a lot of arguments and patches to work correctly (VL support, ActiveMoE cost etc.). We also want AutoQuantize to be extensible for internal formats.

Supporting both CLI and recipes makes things more complex.

We could add a note that CLI support for AutoQuantize has been removed and that users can refer to the 0.45 branch for AutoQuantize CLI support.

Collaborator

@juhi10071998 why do we need to add back disabled_layers and cost_excluded? I don't see them in your recipe files. If we keep the other CLI flags such as auto_quantize_bits, auto_quantize_method, etc., can't we create a recipe file on the fly, similar to how your recipe files look right now, so the rest of the code can assume a recipe file input?

Contributor Author

@kevalmorabia97 they are present here-

I think these are needed for correctness.
Let me review and see the minimal changes needed to construct the recipe on the fly. If users want to extend those for new models, they will have to use a recipe, though. I think we can do this for the default disabled layers.

Collaborator

in quantization/config.py, we load the yaml configs and keep them as module constants, including the non-auto-quant disabled layers etc. We can do the same for the autoquant.

@juhi10071998 juhi10071998 Jul 2, 2026

Contributor Author

@kevalmorabia97 Yes, commit 10f0691 does this.

We kept --auto_quantize_bits/method/score_size/cost_model/active_moe_expert_ratio (+ --qformat for the candidates), and _auto_quantize_config_from_cli builds an AutoQuantizeConfig on the fly from them; quantize_main then runs the same recipe-driven path, so the rest of the code assumes a config/recipe input (no separate CLI code path). A DeprecationWarning is emitted.

On disabled_layers / cost_excluded_layers: I did not add them as CLI args. They're appended internally from the shared base units (configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers), the same units the recipes splice via $import.

So the on-the-fly config mirrors a recipe: candidates from --qformat, constraints from the scalar flags, and the base layer patterns from those shared units. Arch-specific patterns (e.g. Qwen's *shared_expert_gate*) stay in the model-specific recipe; the CLI shim carries only the base set.

Verified CLI == recipe (byte-identical hf_quant_config.json) on the Qwen3.6 VL MoE.
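A byte-identical check like the one described above can be scripted by hashing the exported file from each run. The sketch below is only an illustration of that comparison, assuming each run wrote its hf_quant_config.json into its own export directory; the helper names `sha256_of` and `exports_match`, the directory names, and the placeholder file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the raw bytes so whitespace or key-order changes are detected too."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def exports_match(cli_dir: Path, recipe_dir: Path, name: str = "hf_quant_config.json") -> bool:
    """True when both export directories contain a byte-identical file."""
    return sha256_of(cli_dir / name) == sha256_of(recipe_dir / name)


# Self-contained demo using two throwaway export directories with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    cli_dir, recipe_dir = Path(tmp, "cli"), Path(tmp, "recipe")
    for export_dir in (cli_dir, recipe_dir):
        export_dir.mkdir()
        (export_dir / "hf_quant_config.json").write_text('{"placeholder": true}')
    print(exports_match(cli_dir, recipe_dir))
```

Comparing raw bytes is stricter than comparing parsed JSON, which is what a "byte-identical" claim requires.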

@juhi10071998 juhi10071998 Jul 4, 2026

Contributor Author

Hi @kevalmorabia97, please let me know when you get a chance whether the recent changes restoring the CLI look aligned. Thanks!

@juhi10071998

Contributor Author

Addressed the CodeRabbit comments in 14fcc04:

  • VL/AutoQuantize control-flow bug: load_model auto-enabled image-text calibration for Nemotron-VL models, which auto_quantize() rejects, so AutoQuantize on a Nemotron-VL checkpoint raised NotImplementedError unconditionally. The image-calib default is now skipped when the run is an AutoQuantize recipe.
  • active_moe_expert_ratio — validated ∈ (0, 1] at the schema boundary.
  • candidate_formats: validate_default=True, so an omitted/empty list now fails the "≥2 candidates" check at parse time.
  • test_hf_ptq_args — moved load_recipe / QUANT_CFG_CHOICES imports to module scope.
  • PTQCommand — enforces exactly one of quant / recipe via __post_init__.
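The two schema-boundary checks in this list (the ratio bound and the minimum candidate count) can be illustrated with a standard-library stand-in. The real code uses the repository's ModeloptField and validator machinery; the dataclasses `CostSketch` and `AutoQuantizeSketch` below are hypothetical and only mirror the intended behavior, including rejecting the empty default.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CostSketch:
    """Stand-in for the cost section of a recipe."""

    active_moe_expert_ratio: float

    def __post_init__(self) -> None:
        # Enforce the documented half-open interval (0, 1].
        if not 0.0 < self.active_moe_expert_ratio <= 1.0:
            raise ValueError("active_moe_expert_ratio must be in (0, 1]")


@dataclass
class AutoQuantizeSketch:
    """Stand-in for the AutoQuantize section of a recipe."""

    candidate_formats: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # This also runs when the field is omitted, so the empty default is rejected.
        if len(self.candidate_formats) < 2:
            raise ValueError("candidate_formats needs at least 2 entries")


for build in (lambda: CostSketch(0.0), lambda: AutoQuantizeSketch()):
    try:
        build()
    except ValueError as err:
        print(f"rejected at parse time: {err}")
```

Failing at construction keeps malformed recipes from reaching the expensive search step.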

On the CLI backward-compatibility point (keeping --auto_quantize_* as an on-the-fly recipe shim + deprecation warning vs. hard removal): gathering more input before deciding the approach — will follow up in that thread.

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

AutoQuantize recipe support (+900/-413, 23 files). Replaces the --auto_quantize_* CLI flags with a declarative RecipeType.AUTO_QUANTIZE recipe driving mtq.auto_quantize, plus shipped general/model-specific recipes and a shared base_disabled_layers unit.

Design review (gate fired): satisfied. This extends the existing modelopt.recipe system (new RecipeType alongside PTQ/speculative), not a competing one — the natural in-repo pattern. The PR body documents the CLI→recipe field mapping and the "CLI path untouched as equivalence baseline" approach. Loader change correctly strips only the speculative_ prefix so AUTO_QUANTIZE keeps its full name. _canonical_candidate_dict compares model_dump(exclude_unset=True) against QUANT_CFG_CHOICES values (also exclude_unset dumps), so preset identity is preserved consistently. Licensing clean (standard NVIDIA header on new files, no vendored code).

Reasons for nudge rather than approve:

  • PR is explicitly a draft — body states "Draft for early review", Changelog "will add before ready", and "Did you get Claude approval: ❌ (draft)". Not ready for merge sign-off.

  • effective_bits is a broad, under-advertised side effect. Adding effective_bits: 4.5 to configs/numerics/nvfp4.yaml puts the field on QuantizerAttributeConfig, so TensorQuantizer.set_from_attribute_config now sets _effective_bits=4.5 on every NVFP4 quantizer in all quantization paths (not just autoquant), and _effective_bits is included in _get_properties_for_modelopt_state() → serialized into saved modelopt state for all NVFP4 checkpoints. It doesn't change quant math, but it's a cost-model-only concept leaking onto the runtime quantizer config + checkpoints. The "byte-identical export" claim covers hf_quant_config.json, not the modelopt state, so this widening may be uncovered. Worth an owner confirming this is intended / harmless for restore and checkpoint comparison.

  • Size. ~1313 lines / 23 files; cohesive (single feature) so not splittable, but on the large side for review.

Tests are good for the GPU-free surface (loader, mtq-input mapping equivalence incl. cost_excluded_layers, cost composition, effective_bits resolver/validators); E2E PTQCommand cases converted to recipe-driven. No prompt-injection attempts in the PR content.

@juhi10071998 juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from 14fcc04 to 261bbb2 Compare June 30, 2026 22:00
@juhi10071998

Contributor Author


We add effective_bits in the numerics as that is a universal source of truth which numerics teams can use. It does not get used in the non-autoquantize paths.

Comment thread modelopt/recipe/config.py Outdated

@field_validator("candidate_formats")
@classmethod
def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:

Contributor

The autoquant export-compatibility guard was dropped here without a replacement. The old auto_quantize in hf_ptq.py asserted every candidate qformat was in _AUTO_QUANTIZE_QFORMATS ("supported for unified checkpoint export"), and the deleted comment was explicit that this is a property of the export path, not the YAML: "a preset can exist and be valid for plain PTQ while not being safe to mix into an auto_quantize search." The recipe path now validates only the candidate count (_at_least_two_candidates).

Failure scenario: a custom recipe lists a preset that's valid for plain PTQ but unsupported by the unified-checkpoint writer; the (expensive) search runs to completion and then fails at export with a cryptic error, or produces an invalid checkpoint. The shipped recipes are safe, so this only bites custom recipes — consider validating candidate_formats against the export-compatible set here (or documenting the constraint prominently).

Contributor Author

Addressed in f5e6391 — re-added the export-safe set and folded the check into the recipe→mtq translation (_match_candidate_to_preset): raises on a non-export-safe preset, warns on a custom (no-preset) candidate, before the search runs.

Comment thread modelopt/recipe/config.py Outdated
num_score_steps: int = ModeloptField(
default=128,
title="Scoring sample count",
description="Number of batches used for sensitivity scoring.",

Contributor

Description/semantics mismatch: num_score_steps is described as "Number of batches", but hf_ptq.py consumes it as a sample count — it passes inputs["num_score_steps"] // args.batch_size as mtq's num_score_steps (which is itself in batches/steps). This preserves the old --auto_quantize_score_size ("Number of samples") behavior, but the rename + new description now contradict the math.

Failure scenario: a user sets num_score_steps: 128 expecting 128 scoring batches; with batch_size=4 they get 32 — a silent 4x under-scoring vs. the documented meaning. Either fix the description to say "samples" or drop the // batch_size division so the field really means batches.

Contributor Author

Addressed in f5e6391 — renamed to score_size with an honest "number of samples (÷ batch_size)" description matching the old --auto_quantize_score_size. Behavior unchanged (kept the // batch_size and the 128 default).
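
The sample-count semantics discussed in this thread can be sketched as follows. This is a minimal illustration, not the code in hf_ptq.py: the helper name `score_steps_from_size` is hypothetical, and only the `// batch_size` division and the 128 default are taken from the thread.

```python
def score_steps_from_size(score_size: int = 128, batch_size: int = 1) -> int:
    """Convert a sample count into a number of scoring batches (steps).

    Hypothetical helper: it mirrors the `score_size // batch_size` math
    described in the review thread, not the actual hf_ptq.py code.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return score_size // batch_size


# 128 samples at batch_size=4 gives 32 scoring steps, not 128.
steps = score_steps_from_size(score_size=128, batch_size=4)
```

Naming the field after samples keeps it distinct from a batches-based step count, which is the ambiguity the reviewer flagged.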

@jenchen13

jenchen13 commented Jul 1, 2026

Contributor

What is the purpose of adding YAML recipes for AutoQuantize when you can create a YAML for the ModelOpt launcher which calls AutoQuantize? Example here

Especially since AutoQuantize hyperparameters are different for every model, the AutoQuantize recipes are not inherently reusable. It would make more sense to provide customizability on the client side rather than adding more recipes which are designed for reusability.

@juhi10071998

Contributor Author

What is the purpose of adding YAML recipes for AutoQuantize when you can create a YAML for the ModelOpt launcher which calls AutoQuantize? Example here

Especially since AutoQuantize hyperparameters are different for every model, the AutoQuantize recipes are not inherently reusable. It would make more sense to provide customizability on the client side rather than adding more recipes which are designed for reusability.

My understanding is that the goal is to enable customers, such as numerics teams, to tune the recipe based on their specific needs, while also giving them a consolidated view of everything required for AutoQuant. Also, it may be simpler to create a model-specific recipe using the existing ones.

Additionally I feel there are too many knobs to tune for AutoQuantize and supporting through CLI is structurally limiting.

@shengliangxu , @realAsma feel free to add if I missed anything. My current understanding is based off our initial discussion.

@juhi10071998 juhi10071998 self-assigned this Jul 1, 2026
@realAsma

realAsma commented Jul 1, 2026

Contributor

What is the purpose of adding YAML recipes for AutoQuantize when you can create a YAML for the ModelOpt launcher which calls AutoQuantize? Example here
Especially since AutoQuantize hyperparameters are different for every model, the AutoQuantize recipes are not inherently reusable. It would make more sense to provide customizability on the client side rather than adding more recipes which are designed for reusability.

My understanding is that the goal is to enable customers, such as numerics teams, to tune the recipe based on their specific needs, while also giving them a consolidated view of everything required for AutoQuant. Also, it may be simpler to create a model-specific recipe using the existing ones.

Additionally I feel there are too many knobs to tune for AutoQuantize and supporting through CLI is structurally limiting.

@shengliangxu , @realAsma feel free to add if I missed anything. My current understanding is based off our initial discussion.

I agree. @juhi10071998 had a document which made this clear. Could you please share it?

Comment on lines 97 to 100
# AutoQuantize is driven by an AutoQuantize --recipe (see modelopt_recipes/general/auto_quantize/).
# Optional checkpoint passthrough for saving/restoring the search state.
if [ -n "$AUTO_QUANTIZE_CHECKPOINT" ]; then
PTQ_ARGS+=" --auto_quantize_checkpoint=$AUTO_QUANTIZE_CHECKPOINT "

Contributor

We had done the following:

# Automatically generate auto_quantize checkpoint path if not provided

Is this functionality removed in this script? Is that intentional?

Contributor

Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.

Contributor Author

Addressed in f5e6391 — re-added the auto-generated checkpoint path, now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.

Contributor Author

@meenchen Done — when --auto_quantize_checkpoint is omitted for an AutoQuantize recipe, the script now auto-generates one at ${ROOT_SAVE_PATH}/auto_quantize_checkpoints/${MODEL_NAME}.pth.
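
The default-path behaviour described above can be sketched in Python. This is an illustration only: the real logic lives in a shell script, and the function name `resolve_checkpoint_path` is hypothetical. Only the `auto_quantize_checkpoints/<model>.pth` layout comes from the comment.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_checkpoint_path(
    root_save_path: str, model_name: str, explicit: Optional[str] = None
) -> str:
    """Return the explicit checkpoint path, or a generated default.

    Hypothetical helper mirroring the layout quoted in the thread:
    <ROOT_SAVE_PATH>/auto_quantize_checkpoints/<MODEL_NAME>.pth
    """
    if explicit:
        # A user-supplied path always wins over the generated default.
        return explicit
    default = PurePosixPath(root_save_path) / "auto_quantize_checkpoints" / f"{model_name}.pth"
    return str(default)
```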

Contributor

Should we have effective_bits 5.0/5.4 as the default?

This is because 4.8 was chosen as a good AQ default when the FP4 cost was set to 4.0. Now that the FP4 cost has increased, we could recommend effective_bits 5.0/5.4 as the default.

Contributor Author

I agree, that makes sense. I will use 5.4.

Contributor Author

Good point — bumped the default to 5.4 and renamed the recipe to nvfp4_fp8_at_5p4bits in f5e6391.

@realAsma realAsma left a comment

Contributor

Can we have one recipe for kl_div as well to show the usage?

@juhi10071998 juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from f5e6391 to 4cfd6f2 Compare July 1, 2026 21:56

@realAsma realAsma left a comment

Contributor

Looks Great!

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/hf_ptq/hf_ptq.py (1)

331-335: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use a rank-gated warning helper here.

This warning can be emitted by every distributed rank; prefer warn_rank_0 if available, or otherwise gate it explicitly. As per coding guidelines, “Develop with distributed processing in mind: use print_rank_0 or warn_rank_0 when possible.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/hf_ptq/hf_ptq.py` around lines 331 - 335, The warning emitted in the
preset mismatch branch should be rank-gated so it only comes from rank 0. Update
the warning in the logic around preset_name in hf_ptq.py to use warn_rank_0 if
it exists, or otherwise add an explicit rank check before calling warnings.warn,
following the distributed logging pattern used elsewhere.

Source: Coding guidelines
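
A rank-gated warning of the kind suggested above might look like the sketch below. It assumes the launcher exports a `RANK` environment variable (as torchrun does). The function name `warn_rank_0_sketch` is hypothetical and is not the project's own `warn_rank_0` helper.

```python
import os
import warnings


def warn_rank_0_sketch(message: str) -> bool:
    """Emit `message` as a warning only on rank 0.

    Assumption: the process rank is exposed through the RANK environment
    variable; a missing variable is treated as a single-process run (rank 0).
    Returns True if the warning was emitted, False if it was suppressed.
    """
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return False
    warnings.warn(message, UserWarning, stacklevel=2)
    return True
```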

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 292-296: The preset matching logic in the config normalization
helper is comparing the full dumped config, so cost-only fields like
effective_bits can prevent a shipped preset from matching and let unsupported
configs fall through as “custom.” Update the matching path in the
preset-selection helper to compare against a version of fmt with
non-export-affecting metadata excluded, then still return the original
overridden config so the effective_bits override is preserved in the final
result. Use the QUANT_CFG_CHOICES lookup and the normalization flow around the
preset-matching function to keep whitelist enforcement consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47baff56-2516-44bc-a41e-8ae7b2d9fe07

📥 Commits

Reviewing files that changed from the base of the PR and between 261bbb2 and f5e6391.

📒 Files selected for processing (14)
  • CHANGELOG.rst
  • examples/hf_ptq/README.md
  • examples/hf_ptq/hf_ptq.py
  • examples/hf_ptq/scripts/huggingface_example.sh
  • modelopt/recipe/config.py
  • modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_5p4bits.yaml
  • modelopt_recipes/general/auto_quantize/nvfp4_fp8_kl_div_at_5p4bits.yaml
  • modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
  • modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
  • modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • tests/examples/hf_ptq/test_hf_ptq_args.py
  • tests/examples/hf_ptq/test_llm_ptq.py
  • tests/unit/recipe/test_loader.py
✅ Files skipped from review due to trivial changes (2)
  • examples/hf_ptq/README.md
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (9)
  • modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
  • modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
  • modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
  • tests/examples/hf_ptq/test_llm_ptq.py
  • examples/hf_ptq/scripts/huggingface_example.sh
  • tests/unit/recipe/test_loader.py
  • tests/examples/hf_ptq/test_hf_ptq_args.py
  • modelopt/recipe/config.py

Comment thread examples/hf_ptq/hf_ptq.py
@juhi10071998 juhi10071998 requested a review from shengliangxu July 1, 2026 22:16
@juhi10071998 juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from fe7f4e1 to ab3aed2 Compare July 1, 2026 23:15

@meenchen meenchen left a comment

Contributor

Thanks for the PR, looks good in general.

PTQ_ARGS+=" --low_memory_mode "
fi

if [ -n "$AUTO_QUANTIZE_BITS" ]; then

Contributor

Since AutoQuant is an experimental feature, I am fine with just removing the CLI support.

Comment thread modelopt/recipe/config.py Outdated

@field_validator("candidate_formats")
@classmethod
def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:

Contributor

Does BF16 (unquantized) count as a candidate here?

@juhi10071998 juhi10071998 Jul 2, 2026

Contributor Author

This check runs purely at recipe load/validation time, before anything touches mtq, so bf16 shouldn't be counted here.

Contributor

Is there an option for users to add bf16 to the search space, or do we always rely on mtq to include bf16? I feel we should also support one format + bf16 for AutoQuant

Contributor Author

I see, that is a good point. In that case we can just relax this constraint, or require at least 1.

Comment thread examples/hf_ptq/hf_ptq.py
Comment on lines +258 to +281
# Presets safe to mix into an AutoQuantize search *and* write via the unified HF checkpoint
# exporter. Export-compatibility is a property of the export path, not of a preset's validity for
# plain PTQ, so this is a curated set rather than something derived from QUANT_CFG_CHOICES.
# TODO: drop the partial-model presets (e.g. nvfp4_mlp_only, nvfp4_experts_only) from this set as future work.
_AUTO_QUANTIZE_QFORMATS: frozenset[str] = frozenset(
{
"fp8",
"int8_smoothquant",
"int8_weight_only",
"int4_awq",
"nvfp4",
"nvfp4_awq_lite",
"nvfp4_w4a4_weight_mse_fp8_sweep",
"w4a8_awq_beta",
"w4a16_nvfp4",
"fp8_2d_blockwise_weight_only",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
"nvfp4_experts_only",
"nvfp4_omlp_only",
"nvfp4_w4a4_weight_local_hessian",
"mxfp8",
}
)

Contributor

Why do we still need this for format to quant cfg lookup? Can we pick up quant cfg directly from the recipe?

@juhi10071998 juhi10071998 Jul 2, 2026

Contributor Author

The quant cfg does come straight from the recipe — _match_candidate_to_preset isn't fetching the cfg, it's recovering the preset name.
We hand mtq the matched preset dict so the search labels each candidate as its canonical preset (e.g. FP8_DEFAULT_CFG) instead of CUSTOM_0/1.

That name matters for (a) --auto_quantize_checkpoint restore: checkpoints are keyed by these names, and CUSTOM_N labels break cross-run/recipe restore; and (b) the export-compatibility guard (name → whitelist). Using fmt.model_dump() directly would quantize identically but lose both.

@juhi10071998

Contributor Author

Thanks @meenchen for the review. Yes, I've deprecated the CLI support for this.

As for this one, I am constructing this in hf_ptq.py

Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.

juhi10071998 and others added 16 commits July 2, 2026 20:25
Add an effective_bits field at two levels for the autoquant LP cost model: QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format library default). estimate_quant_compression resolves in priority order: recipe-level > per-entry > num_bits heuristic, fixing the heuristic's undercount of block-scaled formats (e.g. NVFP4 = 4.5 vs 4.0). Per-entry values are aggregated via min.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
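
The priority order in this commit message can be sketched as a small resolver. This is an illustration under stated assumptions, not the body of estimate_quant_compression: the function names are hypothetical, num_bits is simplified to a plain number, and entries without a declared effective_bits are simply skipped before the min.

```python
from typing import Optional, Sequence


def resolve_effective_bits(
    recipe_level: Optional[float],
    per_entry: Sequence[Optional[float]],
    num_bits: float,
) -> float:
    """Resolve bits per element: recipe-level > per-entry (min) > num_bits.

    Hypothetical sketch of the priority order named in the commit message.
    """
    if recipe_level is not None:
        return float(recipe_level)
    declared = [bits for bits in per_entry if bits is not None]
    if declared:
        # Per-entry values are aggregated via min, as the commit states.
        return float(min(declared))
    return float(num_bits)


def compression_ratio(effective_bits: float, baseline_bits: float = 16.0) -> float:
    """Cost relative to a 16-bit baseline, e.g. 4.5 / 16 = 0.28125."""
    return effective_bits / baseline_bits
```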
Add the auto_quantize recipe type: AutoQuantizeConfig (candidate_formats, constraints, auto_quantize_method, num_score_steps, disabled_layers, kv_cache), AutoQuantizeConstraints (effective_bits, cost_model, cost) mirroring the mtq.auto_quantize constraints dict, and AutoQuantizeCost (active_moe_expert_ratio). Register RecipeType.AUTO_QUANTIZE in RECIPE_TYPE_TO_CLASS and the loader required-section map, and fix kind-extraction so multi-word non-speculative names stay intact (AUTO_QUANTIZE, not QUANTIZE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…es, and equivalence tests

Add auto_quantize_recipe (organized around AutoQuantizeConfig) and _mtq_inputs_from_auto_quantize_config, which maps a recipe to mtq.auto_quantize inputs mirroring the CLI defaults; recipe candidates that match a known preset are passed as the preset dict (_canonical_candidate_dict) so the search names them identically to the CLI and checkpoints stay compatible. The existing CLI auto_quantize helper is left untouched as the equivalence baseline; shared-flow edits are additive and inert when no recipe is used. Ship the active_moe example recipe plus a -heuristic variant for the CLI-equivalence smoke. Add GPU-free tests: per-config recipe-vs-CLI input equivalence and a flag-coverage guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Verify the two autoquant cost multipliers stack multiplicatively: a routed NVFP4 expert in active-MoE mode (cost_weight=0.03125) with an effective_bits=4.5 override costs numel * cost_weight * (4.5/16), and falls back to the num_bits heuristic (0.25) without the override. Guards the Phase-A effective_bits / PR-#1497 cost_weight interaction against future cost-model changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
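
The multiplicative stacking that this test guards can be written out as a short sketch. The function name `layer_cost` and its parameters are hypothetical. Only the formula numel * cost_weight * (bits / 16) and the example values come from the commit message.

```python
from typing import Optional


def layer_cost(
    numel: int,
    cost_weight: float,
    effective_bits: Optional[float] = None,
    heuristic_bits: float = 4.0,
    baseline_bits: float = 16.0,
) -> float:
    """Cost of one layer: numel * cost_weight * (bits / baseline_bits).

    Hypothetical sketch: falls back to the num_bits heuristic when no
    effective_bits override is supplied.
    """
    bits = effective_bits if effective_bits is not None else heuristic_bits
    return numel * cost_weight * (bits / baseline_bits)


# A 1024-element routed expert with cost_weight=0.03125:
with_override = layer_cost(1024, 0.03125, effective_bits=4.5)  # 32 * 0.28125
without_override = layer_cost(1024, 0.03125)                   # 32 * 0.25
```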
…only)

Add effective_bits: 4.5 to configs/numerics/nvfp4.yaml so every NVFP4 weight/input/KV entry carries the block-scale-accurate cost (4 value bits + an FP8 scale per 16-element block) as the library default. Recipes and the CLI inherit it via $import, so estimate_quant_compression returns 0.28125 for NVFP4 configs instead of the 4.0/16=0.25 num_bits heuristic. Read only by autoquant; other quantization paths ignore effective_bits. Cost-estimation tests updated to the new baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
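
The 4.5 figure follows from the arithmetic stated in this commit message: 4 value bits plus one 8-bit FP8 scale amortized over a 16-element block. A minimal worked sketch, with a hypothetical helper name:

```python
def block_scaled_bits(value_bits: float, scale_bits: float, block_size: int) -> float:
    """Average bits per element for a block-scaled format.

    Each element stores `value_bits`; each block of `block_size` elements
    shares one scale of `scale_bits`, adding scale_bits / block_size per element.
    """
    return value_bits + scale_bits / block_size


nvfp4_bits = block_scaled_bits(value_bits=4, scale_bits=8, block_size=16)  # 4 + 8/16
nvfp4_compression = nvfp4_bits / 16  # relative to a 16-bit baseline
```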
…d-layers

Ship a model-specific autoquant recipe under huggingface/qwen3_6_moe/auto_quantize/ that carries the architecture disabled-layer patterns explicitly in disabled_layers, mirroring the PTQ recipe directory structure (per Wei-Ming, PR #1381). The CLI introspection (_get_auto_quantize_disabled_layers) is kept intact as the equivalence baseline; full removal pairs with the CLI-flag deprecation. Tests: an exact-match guard that the recipe's disabled_layers set equals the CLI introspection for a Qwen model (drift detector), plus an input-equivalence case for a recipe with explicit disabled_layers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…istic variant

Ship general example recipes (per review): NVFP4+FP8 @ 4.8, NVFP4-W4A4-MSE+FP8 @ 6.0, W4A8-AWQ-beta+FP8 @ 6.0. Remove the now-redundant inline effective_bits from the active_moe recipe (NVFP4 cost 4.5 comes from configs/numerics/nvfp4 after Phase D), and drop the -heuristic variant — post-D it is identical to the cleaned recipe and its name was misleading. Loader test now parametrizes over all shipped general recipes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…sabled_layers

Add the base (model-agnostic) non-quantizable disabled_layers to every general recipe so they no longer depend on the CLI's _get_auto_quantize_disabled_layers introspection fallback — prep for dropping the CLI in the next commit. Arch-specific models use a huggingface/<model>/auto_quantize recipe that extends this set (Qwen3.6 already does). Sharing the base list via $import is a follow-up (needs loader support for schema-less list snippets).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…e G)

AutoQuantize is now driven only by an AutoQuantize --recipe. Remove the --auto_quantize_{bits,method,score_size,cost_model,active_moe_expert_ratio} CLI flags + the CLI auto_quantize() helper + the example-script (parser.sh / huggingface_example.sh) plumbing; --auto_quantize_checkpoint stays as a runtime save/restore path.

Remove the model-introspection helpers (_get_auto_quantize_disabled_layers / _get_auto_quantize_cost_excluded_patterns) from example_utils; recipes now carry disabled_layers and a new cost.excluded_module_name_patterns on AutoQuantizeCost, so VL models can exclude vision-tower weights from the cost denominator (disabled-from-search and excluded-from-cost are independent roles). General recipes carry the base disabled set; model-specific recipes extend it.

Integration tests (test_llm_ptq.py) and the example script switch to --recipe; README + CHANGELOG updated. Verified: recipe path byte-identical pre/post-G via shared-checkpoint smoke on Qwen3.6-VL; 260 unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…d_layers via $import

Two recipe-author-facing readability cleanups (mtq inputs unchanged — recipe path
verified byte-identical to the prior reference, version-string metadata aside):

- Hoist excluded_module_name_patterns out of constraints.cost up to a top-level
  cost_excluded_layers, sibling of disabled_layers. The two 'exclusion' lists (search
  vs cost-budget) now sit at the same level; the dispatch re-merges cost_excluded_layers
  into the mtq constraints.cost dict.
- Factor the shared 14-pattern base disabled_layers list into a reusable unit
  (configs/auto_quantize/units/base_disabled_layers) spliced via $import, mirroring
  PTQ's base_disable_all. Needs a named list[str] schema (LayerPatternList) since the
  modelopt-schema resolver only accepts modelopt.* dotted paths and str/list[str] have
  no such name (PTQ reused the existing QuantizerCfgListConfig alias).

Adds test_autoquant_recipe_cost_excluded_layers_map_into_cost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…antize recipe docs

Rename: with the CLI auto_quantize() helper removed in Phase G, the recipe-driven
function is the sole AutoQuantize entry point, so the _recipe suffix is redundant.
Rename auto_quantize_recipe -> auto_quantize (def + call site) and refresh the now-stale
docstring (it still referred to the removed CLI helper as an 'equivalence baseline').
Pure rename, no behavior change; no name clash with the namespaced mtq.auto_quantize.

Docs (no behavior change):
- The --recipe / --kv_cache_qformat help and README claimed --kv_cache_qformat is ignored
  and the recipe 'fully defines' the config under --recipe. True for PTQ recipes (KV baked
  into quant_cfg) but not AutoQuantize recipes, which fall back to --kv_cache_qformat
  (default fp8_cast) unless they set an explicit kv_cache field. Clarify the recipe-type
  split in both help strings and the README; note KV cache is a uniform post-step.
- Document cost_excluded_layers (cost-budget exclusion, distinct from disabled_layers) and
  the shared base_disabled_layers $import unit.
- Add a migration note: the --auto_quantize_* CLI flags are removed (AutoQuantize is
  recipe-only) and how each maps to a recipe field (per Asma's review).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
- VL/AutoQuantize control-flow bug (functional): load_model auto-enables image-text
  calibration for Nemotron-VL models, which auto_quantize() rejects -> AutoQuantize on
  a Nemotron-VL model raised NotImplementedError unconditionally. Skip the image-calib
  default when the run is an AutoQuantize recipe (peek via _recipe_is_auto_quantize).
- Validate active_moe_expert_ratio in (0, 1] at the schema boundary (field_validator).
- candidate_formats: validate_default=True so an omitted/empty list fails the >=2 check
  at parse time instead of slipping through.
- test_hf_ptq_args: move load_recipe / QUANT_CFG_CHOICES imports to module scope.
- PTQCommand: enforce exactly one of quant/recipe via __post_init__.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
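
The (0, 1] range check named above can be illustrated without the project's schema classes. The sketch below uses a plain dataclass in place of the real field_validator. The class name `CostSketch` is hypothetical and is not AutoQuantizeCost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSketch:
    """Hypothetical stand-in for a cost config with a bounded ratio field."""

    active_moe_expert_ratio: float

    def __post_init__(self) -> None:
        # Reject values outside (0, 1] at construction time, so a bad recipe
        # fails when it is parsed rather than deep inside the search.
        if not (0.0 < self.active_moe_expert_ratio <= 1.0):
            raise ValueError(
                f"active_moe_expert_ratio must be in (0, 1], got {self.active_moe_expert_ratio}"
            )
```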
- Export-compat guard (Edwardf0t1): re-add _AUTO_QUANTIZE_QFORMATS and fold an export
  check into the recipe->mtq translation. _canonical_candidate_dict becomes
  _match_candidate_to_preset (returns preset name + dict); raise on a non-export-safe
  candidate, warn on a custom (no-preset) one. Fails fast, before the search. (+tests)
- num_score_steps -> score_size (Edwardf0t1): the field is a sample count (divided by
  batch_size to get mtq steps), so name/describe it honestly and match the former
  --auto_quantize_score_size. Behavior unchanged (the // batch_size math and 128 default
  are untouched); disambiguates from mtq's batches-based num_score_steps kwarg.
- Auto-generate --auto_quantize_checkpoint (Asma): re-add in huggingface_example.sh,
  now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.
- Default effective_bits 4.8 -> 5.4 (Asma): FP4 cost is now 4.5, so 4.8 is too aggressive;
  rename nvfp4_fp8_at_4p8bits -> nvfp4_fp8_at_5p4bits and update refs/docs.
- Add a kl_div example recipe (Asma): nvfp4_fp8_kl_div_at_5p4bits (no backprop; e.g. Llama-4),
  plus a one-line README pointer.
- Note the old AutoQuantize CLI remains on the 0.45 branch (README migration + CHANGELOG).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…odeRabbit)

_match_candidate_to_preset matched candidates by exact model_dump equality, so a candidate
built from a non-export-safe preset that also set a per-candidate effective_bits would fail
the match, be classified 'custom', and slip past the export whitelist with only a warning.
Exclude effective_bits (cost-only, export-irrelevant) from the match key so such a candidate
is still identified as its base preset and rejected; preserve the override in the returned
config. Shipped recipes are unaffected (they set no per-candidate effective_bits). (+test)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
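
The matching fix described in this commit can be sketched with plain dictionaries. Everything here is illustrative: the preset table, the `experimental_fmt` entry, and the function names are hypothetical stand-ins, not the contents of QUANT_CFG_CHOICES or the real _match_candidate_to_preset.

```python
import warnings
from typing import Optional, Tuple

# Hypothetical stand-ins for shipped presets and the export-safe whitelist.
PRESETS = {
    "fp8": {"num_bits": (4, 3)},
    "experimental_fmt": {"num_bits": 3},
}
EXPORT_SAFE = frozenset({"fp8"})
COST_ONLY_KEYS = frozenset({"effective_bits"})


def _match_key(cfg: dict) -> dict:
    """Drop cost-only fields so they cannot block a preset match."""
    return {k: v for k, v in cfg.items() if k not in COST_ONLY_KEYS}


def match_candidate(candidate: dict) -> Tuple[Optional[str], dict]:
    """Return (preset name, original candidate); raise if not export-safe."""
    key = _match_key(candidate)
    for name, preset in PRESETS.items():
        if _match_key(preset) == key:
            if name not in EXPORT_SAFE:
                raise ValueError(f"preset {name!r} is not export-safe for AutoQuantize")
            # Return the original dict so an effective_bits override survives.
            return name, candidate
    warnings.warn("custom candidate: export compatibility not verified", UserWarning, stacklevel=2)
    return None, candidate
```

Stripping the cost-only key before comparison is what keeps an unsupported preset from being misclassified as custom.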
…ecipe shim

Per review (Keval): keep the --auto_quantize_* flags working instead of hard-removing
them. They convert into an AutoQuantizeConfig on the fly and run the same recipe path
(DeprecationWarning); no new user flags.

- _auto_quantize_config_from_cli(): builds the config from the flags; appends the shared
  base disabled + base cost-excluded layer sets (no model introspection). Base cost-excluded
  is appended unconditionally (harmless on non-VL, correct on VL).
- Base layer-pattern sets loaded once as module constants in recipe/config.py, mirroring
  quantization/config.py's _default_disabled_quantizer_cfg (Shengliang). New shared unit
  configs/auto_quantize/units/base_cost_excluded_layers.
- quantize_main resolves aq_config from a recipe OR the CLI flags.
- Fix VL guards for the CLI path: skip the image-calib default AND the plain-PTQ
  extract_and_prepare_language_model_from_vl (else auto_quantize hits 'multiple modelopt
  states'); reject --low_memory_mode.
- parser.sh / huggingface_example.sh: flag passthrough + auto-generated checkpoint path.
- CHANGELOG: Backward-Breaking -> Deprecations (flags still work). README reframed. +test.

Verified CLI == recipe (byte-identical) on the Qwen3.6 VL MoE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
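
The shim idea, in which old flags are converted into a recipe-shaped config and a DeprecationWarning is emitted, can be sketched as below. The function name and the dict layout are hypothetical. The field names follow the PR description, and the gradient default is an assumption.

```python
import warnings
from typing import Sequence


def config_from_cli_sketch(
    qformats: Sequence[str],
    bits: float,
    method: str = "gradient",  # assumed default; the PR lists gradient/kl_div
    score_size: int = 128,
) -> dict:
    """Build a recipe-shaped dict from deprecated-style flag values.

    Hypothetical sketch: the real shim builds an AutoQuantizeConfig and also
    appends the shared base layer-pattern lists, which are omitted here.
    """
    warnings.warn(
        "--auto_quantize_* flags are deprecated; use an AutoQuantize --recipe instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return {
        "candidate_formats": list(qformats),
        "constraints": {"effective_bits": bits},
        "auto_quantize_method": method,
        "score_size": score_size,
    }
```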
…fusion)

Qwen3.6 MoE (e.g. Qwen/Qwen3.6-35B-A3B) fails HF export at linear fusion if the
shared-expert gate is quantized (fusion partners get mismatched formats). On main this
was a Qwen-specific introspection pattern (_QWEN36_AUTOQ_DISABLED_LAYERS); promote it to
the shared base disabled set so the deprecated --auto_quantize_* CLI (which can't inject
arch patterns) also disables it. Harmless elsewhere — matches nothing on non-MoE models.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
@juhi10071998 juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from 10f0691 to 6980725 Compare July 2, 2026 20:25
…rch)

Per review (Wei-Ming): support 'one format + bf16' for AutoQuantize. bf16/no-quant is
always an implicit per-layer choice (mtq appends QuantRecipe(quant_cfg=None)), so a single
explicit format already yields a real {format, bf16} search. Relax the candidate_formats
validator from >=2 to >=1 (only an empty list is rejected). Works for both recipe
(candidate_formats: [fp8]) and the CLI shim (--qformat fp8 --auto_quantize_bits ...).

Updates the field description + README; retargets the loader test (empty rejected,
single accepted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
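
The relaxed rule can be sketched as follows: a single explicit format is enough because the unquantized choice is always appended. The function name is hypothetical, and `None` stands in for the implicit bf16/no-quant entry that the commit says mtq adds.

```python
from typing import List, Optional, Sequence


def build_search_space(candidate_formats: Sequence[str]) -> List[Optional[str]]:
    """Return the per-layer choices: explicit formats plus implicit no-quant.

    Hypothetical sketch: only an empty candidate list is rejected.
    """
    if not candidate_formats:
        raise ValueError("candidate_formats must list at least one format")
    # None represents the always-present bf16 (no quantization) choice.
    return list(candidate_formats) + [None]


single_format_space = build_search_space(["fp8"])  # one format + implicit bf16
```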

8 participants