Add AutoQuantize recipe support by juhi10071998 · Pull Request #1856 · NVIDIA/Model-Optimizer

juhi10071998 · 2026-06-29T18:58:28Z

What does this PR do?

Type of change: New feature (AutoQuantize recipes). The --auto_quantize_* CLI flags are deprecated but still work (kept as a thin backward-compat shim) — not removed.

Makes AutoQuantize recipe-driven: mtq.auto_quantize is configured by a declarative YAML recipe (--recipe). The old --auto_quantize_* flags are converted into an AutoQuantizeConfig on the fly and run the exact same recipe path (emitting a DeprecationWarning), so old commands keep working. The recipe path is verified to export a byte-identical hf_quant_config.json to the CLI path.

Cost model (quantization/config.py, algorithms.py): new effective_bits field on QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format default). estimate_quant_compression resolves recipe-level → per-entry → num_bits heuristic. configs/numerics/nvfp4.yaml ships effective_bits: 4.5 (block-scale-accurate) as the single source of truth.
Recipe schema (recipe/config.py, recipe/loader.py): RecipeType.AUTO_QUANTIZE + AutoQuantizeConfig / AutoQuantizeConstraints / AutoQuantizeCost. Fields: constraints (effective_bits, cost_model, cost.active_moe_expert_ratio), candidate_formats, auto_quantize_method (gradient/kl_div), score_size, disabled_layers, cost_excluded_layers (e.g. VL vision towers), kv_cache.
Dispatch (examples/hf_ptq/hf_ptq.py): recipe → mtq inputs via _mtq_inputs_from_auto_quantize_config; _match_candidate_to_preset resolves candidates to shipped presets and guards export-compatibility (rejects export-unsafe presets before the search).
Deprecated CLI shim: _auto_quantize_config_from_cli builds an AutoQuantizeConfig from the old flags and appends the shared base disabled_layers / cost_excluded_layers (loaded once as module constants in recipe/config.py, mirroring _default_disabled_quantizer_cfg). No model introspection, no new user flags.
Shipped recipes: general/auto_quantize/ (nvfp4_fp8_at_5p4bits, nvfp4_fp8_kl_div_at_5p4bits, nvfp4_mse_fp8_at_6p0bits, w4a8_awq_beta_fp8_at_6p0bits, w4a16_nvfp4_fp8_at_6p0bits-active_moe) and model-specific huggingface/qwen3_6_moe/auto_quantize/.... Shared configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers spliced via $import.
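The effective_bits precedence in the cost-model item above (recipe-level override, then the per-format default, then a num_bits heuristic) can be sketched as follows. This is a minimal illustration, not the modelopt implementation: `resolve_effective_bits`, `average_bits`, and the fallback heuristic (sign + exponent + mantissa bits for float formats) are hypothetical, and the parameter-weighted average is an assumed simplification of how a target like 5.4 bits is checked.

```python
from typing import Optional, Union

# num_bits is an int for integer formats (e.g. 4, 8) or an
# (exponent_bits, mantissa_bits) tuple for float formats, e.g. (4, 3) for FP8 E4M3.
NumBits = Union[int, tuple[int, int]]


def bits_from_num_bits(num_bits: NumBits) -> float:
    """Hypothetical fallback heuristic: bits per element, ignoring scale overhead."""
    if isinstance(num_bits, tuple):
        exponent_bits, mantissa_bits = num_bits
        return float(1 + exponent_bits + mantissa_bits)  # +1 for the sign bit
    return float(num_bits)


def resolve_effective_bits(
    recipe_bits: Optional[float],
    entry_bits: Optional[float],
    num_bits: NumBits,
) -> float:
    """Recipe-level override wins, then the per-format default, then num_bits."""
    if recipe_bits is not None:
        return recipe_bits
    if entry_bits is not None:
        return entry_bits
    return bits_from_num_bits(num_bits)


def average_bits(layers: list[tuple[int, float]]) -> float:
    """Parameter-weighted average bits over (num_params, effective_bits) pairs."""
    total_params = sum(n for n, _ in layers)
    return sum(n * bits for n, bits in layers) / total_params


# NVFP4's shipped per-format default (4.5) takes precedence over the (2, 1) heuristic (4 bits).
nvfp4_bits = resolve_effective_bits(None, 4.5, (2, 1))
# FP8 has no override here, so it falls back to 1 + 4 + 3 = 8 bits.
fp8_bits = resolve_effective_bits(None, None, (4, 3))
```

For example, assigning three parameters to NVFP4 for every one in FP8 gives (3 × 4.5 + 1 × 8) / 4 = 5.375 bits, just under a 5.4-bit target.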

Migration (deprecated flag → recipe field): --auto_quantize_bits → constraints.effective_bits · --auto_quantize_method → auto_quantize_method · --auto_quantize_score_size → score_size · --auto_quantize_cost_model → constraints.cost_model · --auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio · --qformat fp8,nvfp4 → candidate_formats. --auto_quantize_checkpoint unchanged.
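The migration table above amounts to a flag-to-field mapping. A minimal sketch of such a shim, assuming an argparse-style namespace and a plain dict standing in for AutoQuantizeConfig; `cli_flags_to_recipe_dict` and the two `BASE_*` constants are hypothetical, and their pattern strings are placeholders, not the contents of the shipped base units.

```python
import argparse
import warnings

# Hypothetical placeholders for the shared base units
# (configs/auto_quantize/units/base_disabled_layers / base_cost_excluded_layers).
BASE_DISABLED_LAYERS = ["*example_disabled_pattern*"]
BASE_COST_EXCLUDED_LAYERS = ["*example_cost_excluded_pattern*"]

# Deprecated flag (argparse dest) -> dotted recipe field path, per the migration table.
FLAG_TO_FIELD = {
    "auto_quantize_bits": "constraints.effective_bits",
    "auto_quantize_method": "auto_quantize_method",
    "auto_quantize_score_size": "score_size",
    "auto_quantize_cost_model": "constraints.cost_model",
    "auto_quantize_active_moe_expert_ratio": "constraints.cost.active_moe_expert_ratio",
}


def _set_dotted(target: dict, dotted: str, value) -> None:
    """Assign value at a dotted path, creating nested dicts as needed."""
    *parents, leaf = dotted.split(".")
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value


def cli_flags_to_recipe_dict(args: argparse.Namespace) -> dict:
    """Convert deprecated --auto_quantize_* flags into a recipe-shaped dict."""
    warnings.warn(
        "--auto_quantize_* flags are deprecated; use --recipe instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    recipe: dict = {"candidate_formats": args.qformat.split(",")}
    for flag, field in FLAG_TO_FIELD.items():
        value = getattr(args, flag, None)
        if value is not None:  # unset flags fall back to the recipe defaults
            _set_dotted(recipe, field, value)
    # Only the base patterns are appended; arch-specific ones need a model recipe.
    recipe["disabled_layers"] = list(BASE_DISABLED_LAYERS)
    recipe["cost_excluded_layers"] = list(BASE_COST_EXCLUDED_LAYERS)
    return recipe
```

Because the shim only produces recipe-shaped data, everything downstream can consume a single config type regardless of how the run was invoked.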

Usage

# Recipe (preferred)
python examples/hf_ptq/hf_ptq.py --pyt_ckpt_path <model> --recipe general/auto_quantize/nvfp4_fp8_at_5p4bits --export_path <out>

# Deprecated CLI (converted to a recipe on the fly, still works)
python examples/hf_ptq/hf_ptq.py --pyt_ckpt_path <model> --qformat nvfp4,fp8 --auto_quantize_bits 5.4 --export_path <out>

Testing

GPU-free unit tests: recipe loader; recipe→mtq.auto_quantize mapping incl. cost_excluded_layers; export-compat guard (reject/warn/no-bypass); deprecated-CLI→AutoQuantizeConfig conversion; effective_bits resolver + validators.
Byte-identical export smoke: recipe path on Qwen3.6-35B-A3B (fp8 + w4a16_nvfp4 @ 6.0, active_moe) → identical hf_quant_config.json across CLI/recipe; also confirmed the deprecated CLI shim ≡ recipe on the same VL MoE.
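A byte-identical check like the one above can be done by hashing the exported file from each run. A minimal standard-library sketch; `same_bytes` is a hypothetical helper, and the `<out>/hf_quant_config.json` layout is assumed from the usage commands. It covers only that file, not the saved modelopt state.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def same_bytes(export_a: Path, export_b: Path, name: str = "hf_quant_config.json") -> bool:
    """True if `name` is byte-identical in both export directories."""
    return sha256_of(export_a / name) == sha256_of(export_b / name)
```

Hashing rather than comparing in memory makes it easy to log the digests from each run for later comparison.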

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ Yes — --auto_quantize_* flags are deprecated but still work (converted to a recipe on the fly + DeprecationWarning). Plain PTQ CLI unaffected.
New PIP dependency / copied code: N/A
New tests?: ✅
Updated Changelog?: ✅ (Deprecations)
Claude approval?: pending /claude review

🤖 Generated with Claude Code

copy-pr-bot · 2026-06-29T18:58:31Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-29T18:58:34Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


📝 Walkthrough


This PR replaces CLI-driven AutoQuantize configuration with declarative YAML recipes. It adds RecipeType.AUTO_QUANTIZE/AutoQuantizeConfig schema and loader validation, effective_bits overrides in quantization config and compression estimation, shipped recipe YAMLs, refactored hf_ptq.py recipe-driven flow, removed legacy CLI flags, and updated docs/tests.

Changes

Recipe-driven AutoQuantize

Layer / File(s)	Summary
AutoQuantize recipe schema and loader `modelopt/recipe/config.py`, `modelopt/recipe/loader.py`	Adds `RecipeType.AUTO_QUANTIZE`, `AutoQuantizeCost`, `AutoQuantizeConstraints`, `AutoQuantizeConfig`, `ModelOptAutoQuantizeRecipe`, registers it in `RECIPE_TYPE_TO_CLASS`, and updates loader required-section validation and error naming.
effective_bits config and compression estimation `modelopt/torch/quantization/config.py`, `modelopt/torch/quantization/algorithms.py`, `tests/unit/torch/quantization/test_autoquant.py`	Adds validated `effective_bits` fields to `QuantizerAttributeConfig`/`QuantizeConfig`, updates `estimate_quant_compression` precedence logic, and updates/adds tests for the new cost behavior.
Shipped AutoQuantize recipe YAMLs `modelopt_recipes/configs/auto_quantize/units/base_disabled_layers.yaml`, `modelopt_recipes/configs/numerics/nvfp4.yaml`, `modelopt_recipes/general/auto_quantize/`, `modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/`	Adds shared disabled-layer patterns, NVFP4 `effective_bits: 4.5` override, and several general/model-specific AutoQuantize recipe YAMLs with candidate formats, effective_bits targets, and cost model settings.
hf_ptq.py recipe-driven AutoQuantize flow `examples/hf_ptq/example_utils.py`, `examples/hf_ptq/hf_ptq.py`	Removes legacy Qwen/VLM disabled-layer helpers, adds recipe-to-MTQ mapping helpers, redesigns `auto_quantize()`/`make_calib_dataloader()` signatures, and reroutes `quantize_main()` execution/validation based on `ModelOptAutoQuantizeRecipe`.
CLI cleanup and documentation updates `examples/hf_ptq/scripts/parser.sh`, `examples/hf_ptq/scripts/huggingface_example.sh`, `examples/hf_ptq/README.md`, `CHANGELOG.rst`	Removes legacy `--auto_quantize_bits/method/score_size` flags, adds checkpoint-path passthrough/generation logic, and updates README/changelog to describe recipe-based AutoQuantize and `effective_bits` terminology.
Test updates for recipe-driven AutoQuantize `tests/_test_utils/examples/hf_ptq_utils.py`, `tests/_test_utils/examples/run_command.py`, `tests/examples/hf_ptq/test_hf_ptq_args.py`, `tests/examples/hf_ptq/test_llm_ptq.py`, `tests/unit/recipe/test_loader.py`	Makes `quant` optional and adds `recipe` field with mutual-exclusivity validation, replaces CLI-argument tests with recipe-based MTQ input tests, updates PTQ parametrization to use recipes, and adds loader coverage for AutoQuantize recipes.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant quantize_main
  participant load_recipe
  participant auto_quantize
  participant mtq

  User->>quantize_main: run hf_ptq.py --recipe <auto_quantize.yaml>
  quantize_main->>load_recipe: load_recipe(path)
  load_recipe-->>quantize_main: ModelOptAutoQuantizeRecipe
  quantize_main->>auto_quantize: auto_quantize(args, model, calib_dataloader, aq_config)
  auto_quantize->>auto_quantize: build constraints, candidate_formats, disabled_layers, kv_cache config
  auto_quantize->>mtq: mtq.auto_quantize(model, constraints, candidates, disabled_layers)
  mtq-->>auto_quantize: searched model
  auto_quantize->>auto_quantize: apply KV-cache quantization post-step
  auto_quantize-->>quantize_main: quantized model

Suggested reviewers: kevalmorabia97, meenchen, cjluo-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main change: adding declarative AutoQuantize recipe support.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No touched Python file hardcodes weights_only=False, allow_pickle=True, trust_remote_code=True, eval/exec, or # nosec; only harmless model.eval() and caller-controlled trust_remote_code params appear.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


github-actions · 2026-06-30T19:29:50Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1856/
Built to branch `gh-pages` at 2026-07-02 23:16 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-06-30T19:36:22Z

Codecov Report

❌ Patch coverage is 95.08197% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.99%. Comparing base (4b9225b) to head (c53ae9f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
modelopt/torch/quantization/algorithms.py	57.14%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1856      +/-   ##
==========================================
+ Coverage   70.21%   76.99%   +6.77%     
==========================================
  Files         515      515              
  Lines       57244    57303      +59     
==========================================
+ Hits        40196    44122    +3926     
+ Misses      17048    13181    -3867

Flag	Coverage Δ
examples	`43.00% <91.80%> (+10.14%)`	⬆️
gpu	`57.88% <81.96%> (+7.83%)`	⬆️
regression	`14.89% <73.77%> (+0.12%)`	⬆️
unit	`54.94% <95.08%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown.


coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 4

🧹 Nitpick comments (1)

tests/_test_utils/examples/hf_ptq_utils.py (1)
27-28: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Enforce the quant/recipe invariant in the helper.

Making both fields optional leaves PTQCommand() and PTQCommand(quant=..., recipe=...) as valid constructions, so bad test inputs now fail downstream in the shell layer instead of at this boundary. A small __post_init__ or run() check that requires exactly one of them would keep the matrix honest.
Suggested guard
 class PTQCommand:
     quant: str | None = None
     recipe: str | None = None
@@
+    def __post_init__(self):
+        if (self.quant is None) == (self.recipe is None):
+            raise ValueError("Exactly one of `quant` or `recipe` must be set.")
+
     def run(self, model_path: str):
As per coding guidelines, "Validate external input once at the interface boundary; internal code can trust those checks and avoid redundant assertions."

Also applies to: 64-65
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 318-322: The image-calibration guard in hf_ptq.py is being
triggered unconditionally for Nemotron-VL AutoQuantize because the default
calibration flag is set before the recipe type is resolved. Update the control
flow around load_model() and the args.calib_with_images assignment so this
default only applies to plain PTQ, or move recipe loading earlier and branch on
the resolved recipe type in the AutoQuantize path. Use the existing
args.calib_with_images check and the recipe-loading logic near the AutoQuantize
setup to keep AutoQuantize on the text-only calibration path.

In `@modelopt/recipe/config.py`:
- Around line 134-138: The `active_moe_expert_ratio` field in `config.py` is
documented as being in (0, 1], but it currently accepts any float. Add
schema-level validation on the `ModeloptField`/config model so invalid values
are rejected at parse time, using the `active_moe_expert_ratio` symbol to locate
the field. Enforce the lower and upper bounds directly at the boundary (for
example, via field constraints or a validator on the owning config class) so
malformed recipes fail fast before the `active_moe` cost model uses them.
- Around line 175-179: The `candidate_formats` field in `ModeloptField` is
currently using a default empty list without validating that default, so an
omitted AutoQuantize config can pass schema validation incorrectly. Update the
`candidate_formats` definition in `config.py` to enable default validation
(using `validate_default=True` or the equivalent in the surrounding model/field
setup) so the empty default is rejected immediately. Keep the change localized
to the `candidate_formats` field and ensure the existing “at least 2 required”
constraint is enforced even when the field is not explicitly provided.

In `@tests/examples/hf_ptq/test_hf_ptq_args.py`:
- Around line 41-45: The test module has imports for load_recipe and
QUANT_CFG_CHOICES inside test functions, which should be moved to module scope
so import errors fail during collection. Update the import placement at the top
of the file in test_hf_ptq_args, and remove the redundant in-test imports from
the affected test helpers such as test_autoquant_recipe_builds_mtq_inputs and
the other test block referenced by the review.

---

Nitpick comments:
In `@tests/_test_utils/examples/hf_ptq_utils.py`:
- Around line 27-28: The PTQCommand helper currently allows both quant and
recipe to be missing or both to be set, so add a boundary check in PTQCommand
itself to enforce that exactly one of those fields is provided. Implement the
validation in PTQCommand’s __post_init__ or run() method so invalid test inputs
fail immediately, and keep the rest of the helper logic assuming the invariant
holds.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 645d2f45-fa86-458e-815b-54966bd80497

📥 Commits

Reviewing files that changed from the base of the PR and between d70c48c and 0d85360.

📒 Files selected for processing (23)

CHANGELOG.rst
examples/hf_ptq/README.md
examples/hf_ptq/example_utils.py
examples/hf_ptq/hf_ptq.py
examples/hf_ptq/scripts/huggingface_example.sh
examples/hf_ptq/scripts/parser.sh
modelopt/recipe/config.py
modelopt/recipe/loader.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/config.py
modelopt_recipes/configs/auto_quantize/units/base_disabled_layers.yaml
modelopt_recipes/configs/numerics/nvfp4.yaml
modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_4p8bits.yaml
modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
tests/_test_utils/examples/hf_ptq_utils.py
tests/_test_utils/examples/run_command.py
tests/examples/hf_ptq/test_hf_ptq_args.py
tests/examples/hf_ptq/test_llm_ptq.py
tests/unit/recipe/test_loader.py
tests/unit/torch/quantization/test_autoquant.py

💤 Files with no reviewable changes (1)

examples/hf_ptq/example_utils.py

kevalmorabia97 · 2026-06-30T19:51:34Z

    PTQ_ARGS+=" --low_memory_mode "
 fi

-if [ -n "$AUTO_QUANTIZE_BITS" ]; then


Can we keep the old arguments for one release and emit a deprecation warning if users pass them instead of the new recipe argument? Otherwise we will break backward compatibility without notice.

I see, makes sense. I think hf_ptq.py would need significant changes if we add it back. If I understand correctly, are you suggesting we keep the flag with the warning but remove the functionality?
From what I understood from discussions with @shengliangxu and @realAsma, we don't have a lot of AutoQuant users, so it should be safe to deprecate the CLI.

Ideally we want to make sure the previous CLI args still work: we internally convert them into a YAML recipe on the fly, and the rest of the example logic operates on the YAML directly.

Yes, agreed on not breaking BW compat silently. I took a deeper look at why this is not just a CLI-to-recipe remapping.

The scalar flags map 1:1 onto the new AutoQuantizeConfig, so wrapping those is trivial:

--auto_quantize_bits → constraints.effective_bits,

--auto_quantize_method → auto_quantize_method,

--auto_quantize_score_size → num_score_steps,

--auto_quantize_cost_model → constraints.cost_model,

--auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio,

and --qformat fp8,nvfp4 → candidate_formats.

The wrinkle is disabled_layers / cost_excluded_layers, plus newly added functionality to optionally specify per-candidate effective bits that override the existing defaults.

On main these were never CLI inputs; they come from model introspection that branches on the Qwen model class.
At hf_ptq.py:418, disabled layers and cost_excluded_patterns come from the example_utils helpers (the get_excluded list and the excluded-cost helper), which are removed in this PR.

I intentionally removed that (per @meenchen's earlier point about moving arch knowledge out of hf_ptq.py/example_utils.py and into the recipes). So a flags→recipe wrapper can't reconstruct the old behavior from the flags alone; it has to source disabled_layers from somewhere.

Would also appreciate input from @meenchen and @shengliangxu on this. If we add the hardcoded exclusion list back, we will have two sources of truth, and today the flow is not actually pure CLI anyway, since some model-specific info leaks into hf_ptq.

I agree with Juhi's reasoning.

AutoQuantize now has a lot of arguments and patches to work correctly (VL support, ActiveMoE cost etc.). We also want AutoQuantize to be extensible for internal formats.

Supporting both CLI and recipes makes things more complex.

We could add a note that CLI support for AutoQuantize has been removed and that users can refer to the 0.45 branch for AutoQuantize CLI support.

@juhi10071998 why do we need to add back disabled_layers and cost_excluded? I don't see them in your recipe files. If we keep the other CLI args like auto_quantize_bits, auto_quantize_method, etc., can't we create a recipe file on the fly that looks like your recipe files do now, so the rest of the code can assume a recipe input?

@kevalmorabia97 they are present here:

Model-Optimizer/modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml, line 48 in ab3aed2 (disabled_layers:).

I think these are needed for correctness.
Let me look at the minimal changes needed to construct the recipe on the fly. If users want to extend those patterns for new models, they will have to use a recipe, though. For the default disabled layers, I think we can do this.

In quantization/config.py, we load the YAML configs and keep them as module constants, including the non-AutoQuantize disabled layers. We can do the same for AutoQuantize.

@kevalmorabia97 Yes, commit 10f0691 does this.

We kept --auto_quantize_bits/method/score_size/cost_model/active_moe_expert_ratio (+ --qformat for the candidates), and _auto_quantize_config_from_cli builds an AutoQuantizeConfig on the fly from them; quantize_main then runs the same recipe-driven path, so the rest of the code assumes a config/recipe input (no separate CLI code path). A DeprecationWarning is emitted.

On disabled_layers / cost_excluded_layers: I did not add them as CLI args. They're appended internally from the shared base units (configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers), the same units the recipes splice in via $import.

So the on-the-fly config mirrors a recipe: candidates from --qformat, constraints from the scalar flags, and the base layer patterns from those shared units. Arch-specific patterns (e.g. Qwen's *shared_expert_gate*) stay in the model-specific recipe; the CLI shim carries only the base set.

Verified CLI == recipe (byte-identical hf_quant_config.json) on the Qwen3.6 VL MoE.

Hi @kevalmorabia97, when you get a chance, please let me know whether the recent changes restoring the CLI align with what you had in mind. Thanks!

juhi10071998 · 2026-06-30T20:45:18Z

Addressed the CodeRabbit comments in 14fcc04:

VL/AutoQuantize control-flow bug — load_model auto-enabled image-text calibration for Nemotron-VL models, which auto_quantize() rejects, so AutoQuantize on a Nemotron-VL checkpoint raised NotImplementedError unconditionally. The image-calib default is now skipped when the run is an AutoQuantize recipe.
active_moe_expert_ratio — validated ∈ (0, 1] at the schema boundary.
candidate_formats — validate_default=True, so an omitted/empty list now fails the "≥2 candidates" check at parse time.
test_hf_ptq_args — moved load_recipe / QUANT_CFG_CHOICES imports to module scope.
PTQCommand — enforces exactly one of quant / recipe via __post_init__.
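The two schema-boundary fixes above (bounding active_moe_expert_ratio and validating the candidate_formats default) can be illustrated with plain pydantic v2. A minimal sketch: it uses pydantic's `Field` directly rather than ModeloptField, and the `*Sketch` classes are hypothetical reductions of the recipe schema with candidates simplified to strings.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class AutoQuantizeCostSketch(BaseModel):
    # gt/le enforce the documented (0, 1] range at parse time.
    active_moe_expert_ratio: float = Field(default=1.0, gt=0.0, le=1.0)


class AutoQuantizeConfigSketch(BaseModel):
    # validate_default=True runs the validator below on the empty default too,
    # so an omitted candidate_formats is rejected instead of silently accepted.
    candidate_formats: list[str] = Field(default_factory=list, validate_default=True)
    cost: AutoQuantizeCostSketch = Field(default_factory=AutoQuantizeCostSketch)

    @field_validator("candidate_formats")
    @classmethod
    def _at_least_two_candidates(cls, v: list[str]) -> list[str]:
        if len(v) < 2:
            raise ValueError("AutoQuantize needs at least 2 candidate formats.")
        return v
```

Without `validate_default=True`, pydantic skips validators for defaults, which is exactly how an omitted list slipped past the "≥2 candidates" check.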

On the CLI backward-compatibility point (keeping --auto_quantize_* as an on-the-fly recipe shim + deprecation warning vs. hard removal): gathering more input before deciding the approach — will follow up in that thread.

cjluo-nv

Bot review — DM the bot to share feedback.

AutoQuantize recipe support (+900/-413, 23 files). Replaces the --auto_quantize_* CLI flags with a declarative RecipeType.AUTO_QUANTIZE recipe driving mtq.auto_quantize, plus shipped general/model-specific recipes and a shared base_disabled_layers unit.

Design review (gate fired): satisfied. This extends the existing modelopt.recipe system (new RecipeType alongside PTQ/speculative), not a competing one — the natural in-repo pattern. The PR body documents the CLI→recipe field mapping and the "CLI path untouched as equivalence baseline" approach. Loader change correctly strips only the speculative_ prefix so AUTO_QUANTIZE keeps its full name. _canonical_candidate_dict compares model_dump(exclude_unset=True) against QUANT_CFG_CHOICES values (also exclude_unset dumps), so preset identity is preserved consistently. Licensing clean (standard NVIDIA header on new files, no vendored code).

Reasons for nudge rather than approve:

PR is explicitly a draft — body states "Draft for early review", Changelog "will add before ready", and "Did you get Claude approval: ❌ (draft)". Not ready for merge sign-off.
effective_bits is a broad, under-advertised side effect. Adding effective_bits: 4.5 to configs/numerics/nvfp4.yaml puts the field on QuantizerAttributeConfig, so TensorQuantizer.set_from_attribute_config now sets _effective_bits=4.5 on every NVFP4 quantizer in all quantization paths (not just autoquant), and _effective_bits is included in _get_properties_for_modelopt_state() → serialized into saved modelopt state for all NVFP4 checkpoints. It doesn't change quant math, but it's a cost-model-only concept leaking onto the runtime quantizer config + checkpoints. The "byte-identical export" claim covers hf_quant_config.json, not the modelopt state, so this widening may be uncovered. Worth an owner confirming this is intended / harmless for restore and checkpoint comparison.
Size. ~1313 lines / 23 files; cohesive (single feature) so not splittable, but on the large side for review.

Tests are good for the GPU-free surface (loader, mtq-input mapping equivalence incl. cost_excluded_layers, cost composition, effective_bits resolver/validators); E2E PTQCommand cases converted to recipe-driven. No prompt-injection attempts in the PR content.

juhi10071998 · 2026-06-30T22:10:35Z

Re: "effective_bits is a broad, under-advertised side effect."

We added effective_bits to the numerics config because it is a universal source of truth that numerics teams can use. It is not used in the non-AutoQuantize paths.
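For context on the 4.5 value: NVFP4 stores 4-bit E2M1 elements with one 8-bit E4M3 scale per 16-element block, so the per-element cost is 4 + 8/16 = 4.5 bits (the per-tensor FP32 scale is negligible for large tensors). A small sketch of that arithmetic; `block_scaled_bits` is a hypothetical helper, not a modelopt API, and MXFP4 (32-element blocks, 8-bit E8M0 scale) is included only for comparison.

```python
def block_scaled_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage bits per element for a block-scaled format.

    Ignores per-tensor scales, whose cost is negligible for large tensors.
    """
    return element_bits + scale_bits / block_size


# NVFP4: 4-bit E2M1 elements, one 8-bit E4M3 scale per 16 elements.
nvfp4 = block_scaled_bits(element_bits=4, scale_bits=8, block_size=16)
# MXFP4: 4-bit E2M1 elements, one 8-bit E8M0 scale per 32 elements.
mxfp4 = block_scaled_bits(element_bits=4, scale_bits=8, block_size=32)
```

This is why a num_bits-only estimate (4 bits) would understate NVFP4's footprint in the AutoQuantize cost model.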

Edwardf0t1 · 2026-06-30T23:53:51Z

+
+    @field_validator("candidate_formats")
+    @classmethod
+    def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:


The autoquant export-compatibility guard was dropped here without a replacement. The old auto_quantize in hf_ptq.py asserted every candidate qformat was in _AUTO_QUANTIZE_QFORMATS ("supported for unified checkpoint export"), and the deleted comment was explicit that this is a property of the export path, not the YAML: "a preset can exist and be valid for plain PTQ while not being safe to mix into an auto_quantize search." The recipe path now validates only the candidate count (_at_least_two_candidates).

Failure scenario: a custom recipe lists a preset that's valid for plain PTQ but unsupported by the unified-checkpoint writer; the (expensive) search runs to completion and then fails at export with a cryptic error, or produces an invalid checkpoint. The shipped recipes are safe, so this only bites custom recipes — consider validating candidate_formats against the export-compatible set here (or documenting the constraint prominently).

Addressed in f5e6391 — re-added the export-safe set and folded the check into the recipe→mtq translation (_match_candidate_to_preset): raises on a non-export-safe preset, warns on a custom (no-preset) candidate, before the search runs.
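The guard described in this reply can be sketched as a lookup against known presets: an exact match to an export-unsafe preset raises before any search runs, and a candidate that matches no preset only warns. A minimal sketch under stated assumptions: presets are plain dicts, `PRESETS`, `EXPORT_SAFE`, and `match_candidate_to_preset` are hypothetical, and dict equality stands in for the `model_dump(exclude_unset=True)` comparison described in the bot review.

```python
import warnings
from typing import Optional

# Hypothetical presets and export-safe subset, for illustration only.
PRESETS = {
    "PRESET_A": {"weight": {"num_bits": 8}},
    "PRESET_B": {"weight": {"num_bits": 4, "block_size": 16}},
    "PRESET_PTQ_ONLY": {"weight": {"num_bits": 4, "block_size": 128}},
}
EXPORT_SAFE = {"PRESET_A", "PRESET_B"}


def match_candidate_to_preset(candidate: dict) -> Optional[str]:
    """Return the matching preset name, rejecting export-unsafe presets up front."""
    for name, preset in PRESETS.items():
        if candidate == preset:
            if name not in EXPORT_SAFE:
                raise ValueError(
                    f"Preset {name} is valid for plain PTQ but not for "
                    "unified-checkpoint export; remove it from candidate_formats."
                )
            return name
    # Custom candidates cannot be verified against the export path, so warn only.
    warnings.warn("Candidate matches no shipped preset; export compatibility is unchecked.")
    return None
```

Running this check during the recipe-to-mtq translation means a bad candidate fails in seconds rather than after an expensive search.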

Edwardf0t1 · 2026-06-30T23:53:58Z

+    num_score_steps: int = ModeloptField(
+        default=128,
+        title="Scoring sample count",
+        description="Number of batches used for sensitivity scoring.",


Description/semantics mismatch: num_score_steps is described as "Number of batches", but hf_ptq.py consumes it as a sample count — it passes inputs["num_score_steps"] // args.batch_size as mtq's num_score_steps (which is itself in batches/steps). This preserves the old --auto_quantize_score_size ("Number of samples") behavior, but the rename + new description now contradict the math.

Failure scenario: a user sets num_score_steps: 128 expecting 128 scoring batches; with batch_size=4 they get 32 — a silent 4x under-scoring vs. the documented meaning. Either fix the description to say "samples" or drop the // batch_size division so the field really means batches.

Addressed in f5e6391 — renamed to score_size with an honest "number of samples (÷ batch_size)" description matching the old --auto_quantize_score_size. Behavior unchanged (kept the // batch_size and the 128 default).
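The samples-versus-steps distinction above comes down to a floor division. A sketch; `scoring_steps` is a hypothetical helper mirroring the `// batch_size` conversion described in the comment.

```python
def scoring_steps(score_size: int, batch_size: int) -> int:
    """Convert a sample count (score_size) into scoring steps (batches).

    Floor division: leftover samples are dropped, and a score_size
    smaller than batch_size yields 0 steps.
    """
    return score_size // batch_size


# The default score_size=128 with batch_size=4 gives 32 scoring batches, not 128.
steps = scoring_steps(128, 4)
```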

jenchen13 · 2026-07-01T18:32:49Z

What is the purpose of adding YAML recipes for AutoQuantize when you can create a YAML for the ModelOpt launcher which calls AutoQuantize? Example here

Especially since AutoQuantize hyperparameters are different for every model, the AutoQuantize recipes are not inherently reusable. It would make more sense to provide customizability on the client side rather than adding more recipes which are designed for reusability.

juhi10071998 · 2026-07-01T18:43:12Z


My understanding is that the goal is to let customers, such as numerics teams, tune the recipe to their specific needs, while also giving them a consolidated view of everything AutoQuantize requires. It may also be simpler to create a model-specific recipe starting from the existing ones.

Additionally, I feel AutoQuantize has too many knobs to tune, and supporting them all through the CLI is structurally limiting.

@shengliangxu, @realAsma feel free to add anything I missed. My current understanding is based on our initial discussion.

realAsma · 2026-07-01T19:50:49Z

I agree. @juhi10071998, Juhi had a document that made these points clear. Could you please share it?

realAsma · 2026-07-01T19:52:27Z

+# AutoQuantize is driven by an AutoQuantize --recipe (see modelopt_recipes/general/auto_quantize/).
+# Optional checkpoint passthrough for saving/restoring the search state.
+if [ -n "$AUTO_QUANTIZE_CHECKPOINT" ]; then
    PTQ_ARGS+=" --auto_quantize_checkpoint=$AUTO_QUANTIZE_CHECKPOINT "


We had done the following:

# Automatically generate auto_quantize checkpoint path if not provided

Was this functionality removed from this script? Is that intentional?

Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.

Addressed in f5e6391 — re-added the auto-generated checkpoint path, now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.

@meenchen Done — when --auto_quantize_checkpoint is omitted for an AutoQuantize recipe, the script now auto-generates one at ${ROOT_SAVE_PATH}/auto_quantize_checkpoints/${MODEL_NAME}.pth.
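The defaulting rule from the reply can be restated as a short Python sketch. The real logic lives in huggingface_example.sh as shell. default_auto_quantize_checkpoint and its parameters are hypothetical names.

```python
import os
from typing import Optional


def default_auto_quantize_checkpoint(
    root_save_path: str,
    model_name: str,
    explicit_checkpoint: Optional[str] = None,
    is_auto_quantize: bool = True,
) -> Optional[str]:
    """Pick the AutoQuantize checkpoint path following the rule described above (sketch)."""
    if explicit_checkpoint:
        # A user-supplied --auto_quantize_checkpoint always wins.
        return explicit_checkpoint
    if not is_auto_quantize:
        # Runs that are not AutoQuantize get no auto-generated path.
        return None
    # ${ROOT_SAVE_PATH}/auto_quantize_checkpoints/${MODEL_NAME}.pth
    return os.path.join(root_save_path, "auto_quantize_checkpoints", f"{model_name}.pth")
```

The path depends only on the save root and the model name. As a result, repeated runs of the same model resolve to the same file, which lets a later run restore the saved search state.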

realAsma · 2026-07-01T19:57:34Z

Should we have effective_bits 5.0/5.4 as the default?

This is because 4.8 was a good AQ default when the FP4 cost was 4.0. Now that the FP4 cost has increased, we could recommend effective_bits 5.0/5.4 as the default.

I agree, that makes sense; I will use 5.4.

Good point — bumped the default to 5.4 and renamed the recipe to nvfp4_fp8_at_5p4bits in f5e6391.
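A back-of-the-envelope sketch shows why 4.8 became too tight. It assumes the following:

- there are only two candidate formats;
- every weight element is quantized (no bf16 layers and no cost-excluded layers);
- FP8 costs 8 bits;
- the budget equals the element-weighted average of effective bits.

high_precision_fraction is a hypothetical helper, not the AutoQuantize solver.

```python
def high_precision_fraction(budget_bits: float, low_bits: float, high_bits: float = 8.0) -> float:
    """Share of weight elements that can use the higher-precision format
    while the element-weighted average stays exactly at budget_bits."""
    if not low_bits <= budget_bits <= high_bits:
        raise ValueError("budget must lie between the two formats' costs")
    # Solve budget = f * high_bits + (1 - f) * low_bits for f.
    return (budget_bits - low_bits) / (high_bits - low_bits)


for budget in (4.8, 5.4):
    old = high_precision_fraction(budget, low_bits=4.0)  # NVFP4 costed at 4.0 bits
    new = high_precision_fraction(budget, low_bits=4.5)  # NVFP4 costed at 4.5 bits
    print(f"budget {budget}: FP8 share {old:.1%} (NVFP4 = 4.0) vs {new:.1%} (NVFP4 = 4.5)")
```

Under these assumptions, a 4.8-bit budget leaves room for 20% FP8 with the old 4.0 cost but under 9% with the 4.5 cost. A 5.4-bit budget allows roughly 26%.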

realAsma

Can we have one recipe for kl_div as well to show the usage?

realAsma

Looks Great!

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

🧹 Nitpick comments (1)

examples/hf_ptq/hf_ptq.py (1)
331-335: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use a rank-gated warning helper here.

This warning can be emitted by every distributed rank; prefer warn_rank_0 if available, or otherwise gate it explicitly. As per coding guidelines, “Develop with distributed processing in mind: use print_rank_0 or warn_rank_0 when possible.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/hf_ptq/hf_ptq.py` around lines 331 - 335, The warning emitted in the
preset mismatch branch should be rank-gated so it only comes from rank 0. Update
the warning in the logic around preset_name in hf_ptq.py to use warn_rank_0 if
it exists, or otherwise add an explicit rank check before calling warnings.warn,
following the distributed logging pattern used elsewhere.
Source: Coding guidelines
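The reviewer recommends warn_rank_0 where it is available. For illustration only, here is a self-contained sketch of the same idea. warn_on_rank_0 is a hypothetical name. Reading the RANK environment variable that torchrun sets is an assumption; once a process group exists, torch.distributed.get_rank() is the authoritative source.

```python
import os
import warnings


def _global_rank() -> int:
    # Assumption: torchrun exports RANK for every worker; a plain single-process run
    # has no RANK, so treat it as rank 0.
    return int(os.environ.get("RANK", "0"))


def warn_on_rank_0(message: str, category: type = UserWarning) -> None:
    """Emit a warning from rank 0 only, so N ranks do not print N copies."""
    if _global_rank() == 0:
        # stacklevel=2 attributes the warning to the caller, not this helper.
        warnings.warn(message, category, stacklevel=2)
```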

🤖 Prompt for all review comments with AI agents

Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 292-296: The preset matching logic in the config normalization
helper is comparing the full dumped config, so cost-only fields like
effective_bits can prevent a shipped preset from matching and let unsupported
configs fall through as “custom.” Update the matching path in the
preset-selection helper to compare against a version of fmt with
non-export-affecting metadata excluded, then still return the original
overridden config so the effective_bits override is preserved in the final
result. Use the QUANT_CFG_CHOICES lookup and the normalization flow around the
preset-matching function to keep whitelist enforcement consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47baff56-2516-44bc-a41e-8ae7b2d9fe07

📥 Commits

Reviewing files that changed from the base of the PR and between 261bbb2 and f5e6391.

📒 Files selected for processing (14)

CHANGELOG.rst
examples/hf_ptq/README.md
examples/hf_ptq/hf_ptq.py
examples/hf_ptq/scripts/huggingface_example.sh
modelopt/recipe/config.py
modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_5p4bits.yaml
modelopt_recipes/general/auto_quantize/nvfp4_fp8_kl_div_at_5p4bits.yaml
modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
tests/examples/hf_ptq/test_hf_ptq_args.py
tests/examples/hf_ptq/test_llm_ptq.py
tests/unit/recipe/test_loader.py

✅ Files skipped from review due to trivial changes (2)

examples/hf_ptq/README.md
CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (9)

modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
tests/examples/hf_ptq/test_llm_ptq.py
examples/hf_ptq/scripts/huggingface_example.sh
tests/unit/recipe/test_loader.py
tests/examples/hf_ptq/test_hf_ptq_args.py
modelopt/recipe/config.py

meenchen

Thanks for the PR, looks good in general.

meenchen · 2026-07-01T21:46:52Z

    PTQ_ARGS+=" --low_memory_mode "
 fi

-if [ -n "$AUTO_QUANTIZE_BITS" ]; then


Since AutoQuant is an experimental feature, I am fine with just removing the CLI support.

meenchen · 2026-07-01T21:48:49Z

+# AutoQuantize is driven by an AutoQuantize --recipe (see modelopt_recipes/general/auto_quantize/).
+# Optional checkpoint passthrough for saving/restoring the search state.
+if [ -n "$AUTO_QUANTIZE_CHECKPOINT" ]; then
    PTQ_ARGS+=" --auto_quantize_checkpoint=$AUTO_QUANTIZE_CHECKPOINT "


Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.

meenchen · 2026-07-01T23:50:33Z

+
+    @field_validator("candidate_formats")
+    @classmethod
+    def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:


Does BF16 (unquantized) count as a candidate here?

This is purely a recipe load/validation-time check that runs before anything touches mtq, so bf16 shouldn't be counted here.

Is there an option for users to add bf16 to the search space, or do we always rely on mtq to include bf16? I feel we should also support one format + bf16 for AutoQuant.

I see, that is a good point. In that case, I think we can just relax this constraint to require at least one format.

Addressed here: https://github.com/NVIDIA/Model-Optimizer/pull/1856/commits#:~:text=auto_quantize%3A%20allow%20a%20single%20candidate_format%20(one%2Dformat%20%2B%20bf16%20search)
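Here is a plain-Python sketch of the relaxed rule. The PR implements it as a pydantic field_validator; validate_candidate_formats and per_layer_choices are hypothetical names. bf16 is modeled as None, standing in for the no-quantization choice that mtq is described as appending on its own.

```python
from typing import Optional, Sequence


def validate_candidate_formats(formats: Sequence[str]) -> list[str]:
    """Relaxed rule: at least one explicit format; only an empty list is rejected."""
    if len(formats) < 1:
        raise ValueError("candidate_formats must list at least one format")
    return list(formats)


def per_layer_choices(formats: Sequence[str]) -> list[Optional[str]]:
    """Per-layer options the search sees: the explicit formats plus bf16 (None = no quantization)."""
    return validate_candidate_formats(formats) + [None]
```

With a single format, each layer still faces a real two-way choice, {format, bf16}, so a one-entry candidate_formats list is a meaningful search.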

meenchen · 2026-07-02T00:02:19Z

+# Presets safe to mix into an AutoQuantize search *and* write via the unified HF checkpoint
+# exporter. Export-compatibility is a property of the export path, not of a preset's validity for
+# plain PTQ, so this is a curated set rather than something derived from QUANT_CFG_CHOICES.
+# TODO: drop the partial-model presets (e.g. nvfp4_mlp_only, nvfp4_experts_only) from this set as future work.
+_AUTO_QUANTIZE_QFORMATS: frozenset[str] = frozenset(
+    {
+        "fp8",
+        "int8_smoothquant",
+        "int8_weight_only",
+        "int4_awq",
+        "nvfp4",
+        "nvfp4_awq_lite",
+        "nvfp4_w4a4_weight_mse_fp8_sweep",
+        "w4a8_awq_beta",
+        "w4a16_nvfp4",
+        "fp8_2d_blockwise_weight_only",
+        "w4a8_mxfp4_fp8",
+        "nvfp4_mlp_only",
+        "nvfp4_experts_only",
+        "nvfp4_omlp_only",
+        "nvfp4_w4a4_weight_local_hessian",
+        "mxfp8",
+    }
+)


Why do we still need this for format to quant cfg lookup? Can we pick up quant cfg directly from the recipe?

The quant cfg does come straight from the recipe — _match_candidate_to_preset isn't fetching the cfg, it's recovering the preset name.
We hand mtq the matched preset dict so the search labels each candidate as its canonical preset (e.g. FP8_DEFAULT_CFG) instead of CUSTOM_0/1.

That name matters for two reasons:

(a) --auto_quantize_checkpoint restore: checkpoints are keyed by these names, and CUSTOM_N labels break cross-run/recipe restore.

(b) the export-compatibility guard (name → whitelist).

Using fmt.model_dump() directly would quantize identically but lose both.

juhi10071998 · 2026-07-02T00:28:24Z

Thanks @meenchen for the review. Yes, I've deprecated the CLI support for this.

As for this one, I am constructing this in hf_ptq.py

Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.

Add an effective_bits field at two levels for the autoquant LP cost model: QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format library default). estimate_quant_compression resolves in priority order: recipe-level > per-entry > num_bits heuristic, fixing the heuristic's undercount of block-scaled formats (e.g. NVFP4 = 4.5 vs 4.0). Per-entry values are aggregated via min. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
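The resolution order in this commit message can be sketched as follows. The function names are hypothetical and num_bits is simplified to a single number. The 16-bit reference is inferred from the "4.0/16=0.25" heuristic quoted in a later commit.

```python
from typing import Iterable, Optional


def resolve_effective_bits(
    recipe_level: Optional[float],
    per_entry: Iterable[Optional[float]],
    num_bits: float,
) -> float:
    """Priority order from the commit: recipe-level > per-entry (min) > num_bits heuristic."""
    if recipe_level is not None:
        return recipe_level
    entry_values = [b for b in per_entry if b is not None]
    if entry_values:
        # Several quantizer entries may carry a value; the commit aggregates them via min.
        return min(entry_values)
    return float(num_bits)


def compression_ratio(effective_bits: float, reference_bits: float = 16.0) -> float:
    """Cost relative to an unquantized 16-bit weight."""
    return effective_bits / reference_bits
```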

Add the auto_quantize recipe type: AutoQuantizeConfig (candidate_formats, constraints, auto_quantize_method, num_score_steps, disabled_layers, kv_cache), AutoQuantizeConstraints (effective_bits, cost_model, cost) mirroring the mtq.auto_quantize constraints dict, and AutoQuantizeCost (active_moe_expert_ratio). Register RecipeType.AUTO_QUANTIZE in RECIPE_TYPE_TO_CLASS and the loader required-section map, and fix kind-extraction so multi-word non-speculative names stay intact (AUTO_QUANTIZE, not QUANTIZE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…es, and equivalence tests Add auto_quantize_recipe (organized around AutoQuantizeConfig) and _mtq_inputs_from_auto_quantize_config, which maps a recipe to mtq.auto_quantize inputs mirroring the CLI defaults; recipe candidates that match a known preset are passed as the preset dict (_canonical_candidate_dict) so the search names them identically to the CLI and checkpoints stay compatible. The existing CLI auto_quantize helper is left untouched as the equivalence baseline; shared-flow edits are additive and inert when no recipe is used. Ship the active_moe example recipe plus a -heuristic variant for the CLI-equivalence smoke. Add GPU-free tests: per-config recipe-vs-CLI input equivalence and a flag-coverage guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

Verify the two autoquant cost multipliers stack multiplicatively: a routed NVFP4 expert in active-MoE mode (cost_weight=0.03125) with an effective_bits=4.5 override costs numel * cost_weight * (4.5/16), and falls back to the num_bits heuristic (0.25) without the override. Guards the Phase-A effective_bits / PR-#1497 cost_weight interaction against future cost-model changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
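The multiplicative stacking this test guards can be written as a one-line cost function. weight_cost is a hypothetical name, and the 1024 x 1024 weight is an arbitrary example size. cost_weight=0.03125 and the 4.5-bit override come from the commit message.

```python
def weight_cost(numel: int, cost_weight: float, effective_bits: float, reference_bits: float = 16.0) -> float:
    """Cost contribution of one weight: the active-MoE weight and the bits ratio multiply."""
    return numel * cost_weight * (effective_bits / reference_bits)


numel = 1024 * 1024  # arbitrary example weight size
with_override = weight_cost(numel, cost_weight=0.03125, effective_bits=4.5)  # effective_bits override
heuristic = weight_cost(numel, cost_weight=0.03125, effective_bits=4.0)      # num_bits fallback
```

With these numbers, the override gives 9216 cost units versus 8192 under the 4.0-bit heuristic, a 12.5% increase (4.5/4.0).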

…only) Add effective_bits: 4.5 to configs/numerics/nvfp4.yaml so every NVFP4 weight/input/KV entry carries the block-scale-accurate cost (4 value bits + an FP8 scale per 16-element block) as the library default. Recipes and the CLI inherit it via $import, so estimate_quant_compression returns 0.28125 for NVFP4 configs instead of the 4.0/16=0.25 num_bits heuristic. Read only by autoquant; other quantization paths ignore effective_bits. Cost-estimation tests updated to the new baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
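The 4.5-bit figure amortizes one 8-bit (FP8) scale over each 16-element block. This derivation ignores any per-tensor scale as negligible, which is an assumption here:

```latex
b_{\text{eff}} = 4 + \frac{8}{16} = 4.5,
\qquad
\frac{b_{\text{eff}}}{16} = \frac{4.5}{16} = 0.28125
```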

…d-layers Ship a model-specific autoquant recipe under huggingface/qwen3_6_moe/auto_quantize/ that carries the architecture disabled-layer patterns explicitly in disabled_layers, mirroring the PTQ recipe directory structure (per Wei-Ming, PR #1381). The CLI introspection (_get_auto_quantize_disabled_layers) is kept intact as the equivalence baseline; full removal pairs with the CLI-flag deprecation. Tests: an exact-match guard that the recipe's disabled_layers set equals the CLI introspection for a Qwen model (drift detector), plus an input-equivalence case for a recipe with explicit disabled_layers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…istic variant Ship general example recipes (per review): NVFP4+FP8 @ 4.8, NVFP4-W4A4-MSE+FP8 @ 6.0, W4A8-AWQ-beta+FP8 @ 6.0. Remove the now-redundant inline effective_bits from the active_moe recipe (NVFP4 cost 4.5 comes from configs/numerics/nvfp4 after Phase D), and drop the -heuristic variant — post-D it is identical to the cleaned recipe and its name was misleading. Loader test now parametrizes over all shipped general recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…sabled_layers Add the base (model-agnostic) non-quantizable disabled_layers to every general recipe so they no longer depend on the CLI's _get_auto_quantize_disabled_layers introspection fallback — prep for dropping the CLI in the next commit. Arch-specific models use a huggingface/<model>/auto_quantize recipe that extends this set (Qwen3.6 already does). Sharing the base list via $import is a follow-up (needs loader support for schema-less list snippets). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…e G) AutoQuantize is now driven only by an AutoQuantize --recipe. Remove the --auto_quantize_{bits,method,score_size,cost_model,active_moe_expert_ratio} CLI flags + the CLI auto_quantize() helper + the example-script (parser.sh / huggingface_example.sh) plumbing; --auto_quantize_checkpoint stays as a runtime save/restore path. Remove the model-introspection helpers (_get_auto_quantize_disabled_layers / _get_auto_quantize_cost_excluded_patterns) from example_utils; recipes now carry disabled_layers and a new cost.excluded_module_name_patterns on AutoQuantizeCost, so VL models can exclude vision-tower weights from the cost denominator (disabled-from-search and excluded-from-cost are independent roles). General recipes carry the base disabled set; model-specific recipes extend it. Integration tests (test_llm_ptq.py) and the example script switch to --recipe; README + CHANGELOG updated. Verified: recipe path byte-identical pre/post-G via shared-checkpoint smoke on Qwen3.6-VL; 260 unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…d_layers via $import Two recipe-author-facing readability cleanups (mtq inputs unchanged — recipe path verified byte-identical to the prior reference, version-string metadata aside):
- Hoist excluded_module_name_patterns out of constraints.cost up to a top-level cost_excluded_layers, sibling of disabled_layers. The two 'exclusion' lists (search vs cost-budget) now sit at the same level; the dispatch re-merges cost_excluded_layers into the mtq constraints.cost dict.
- Factor the shared 14-pattern base disabled_layers list into a reusable unit (configs/auto_quantize/units/base_disabled_layers) spliced via $import, mirroring PTQ's base_disable_all. Needs a named list[str] schema (LayerPatternList) since the modelopt-schema resolver only accepts modelopt.* dotted paths and str/list[str] have no such name (PTQ reused the existing QuantizerCfgListConfig alias).
Adds test_autoquant_recipe_cost_excluded_layers_map_into_cost.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
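Here is a sketch of the re-merge step this commit describes. The key names effective_bits, active_moe_expert_ratio, and excluded_module_name_patterns come from the commit messages in this PR. The exact dict mtq.auto_quantize accepts is not shown in this thread, so build_mtq_constraints and its output shape are assumptions, and "*visual*" is a placeholder pattern.

```python
from typing import Any, Optional, Sequence


def build_mtq_constraints(
    effective_bits: float,
    cost_excluded_layers: Sequence[str] = (),
    active_moe_expert_ratio: Optional[float] = None,
) -> dict[str, Any]:
    """Fold the top-level cost_excluded_layers back into a constraints["cost"] sub-dict (sketch)."""
    constraints: dict[str, Any] = {"effective_bits": effective_bits}
    cost: dict[str, Any] = {}
    if active_moe_expert_ratio is not None:
        cost["active_moe_expert_ratio"] = active_moe_expert_ratio
    if cost_excluded_layers:
        # Excluded from the cost budget only; disabled_layers (search exclusion) is separate.
        cost["excluded_module_name_patterns"] = list(cost_excluded_layers)
    if cost:
        constraints["cost"] = cost
    return constraints
```

Keeping the two exclusion lists separate means a vision tower can stay out of the bit budget without also being removed from the search, or the reverse.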

…antize recipe docs Rename: with the CLI auto_quantize() helper removed in Phase G, the recipe-driven function is the sole AutoQuantize entry point, so the _recipe suffix is redundant. Rename auto_quantize_recipe -> auto_quantize (def + call site) and refresh the now-stale docstring (it still referred to the removed CLI helper as an 'equivalence baseline'). Pure rename, no behavior change; no name clash with the namespaced mtq.auto_quantize.
Docs (no behavior change):
- The --recipe / --kv_cache_qformat help and README claimed --kv_cache_qformat is ignored and the recipe 'fully defines' the config under --recipe. True for PTQ recipes (KV baked into quant_cfg) but not AutoQuantize recipes, which fall back to --kv_cache_qformat (default fp8_cast) unless they set an explicit kv_cache field. Clarify the recipe-type split in both help strings and the README; note KV cache is a uniform post-step.
- Document cost_excluded_layers (cost-budget exclusion, distinct from disabled_layers) and the shared base_disabled_layers $import unit.
- Add a migration note: the --auto_quantize_* CLI flags are removed (AutoQuantize is recipe-only) and how each maps to a recipe field (per Asma's review).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

- VL/AutoQuantize control-flow bug (functional): load_model auto-enables image-text calibration for Nemotron-VL models, which auto_quantize() rejects -> AutoQuantize on a Nemotron-VL model raised NotImplementedError unconditionally. Skip the image-calib default when the run is an AutoQuantize recipe (peek via _recipe_is_auto_quantize).
- Validate active_moe_expert_ratio in (0, 1] at the schema boundary (field_validator).
- candidate_formats: validate_default=True so an omitted/empty list fails the >=2 check at parse time instead of slipping through.
- test_hf_ptq_args: move load_recipe / QUANT_CFG_CHOICES imports to module scope.
- PTQCommand: enforce exactly one of quant/recipe via __post_init__.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
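Here is the (0, 1] bound from this commit, sketched as a plain function rather than the pydantic field_validator the PR uses. The function name is hypothetical.

```python
def validate_active_moe_expert_ratio(ratio: float) -> float:
    """Accept only the half-open interval (0, 1]; raise otherwise."""
    # Chained comparison: strictly above 0 and at most 1 (NaN also fails this check).
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"active_moe_expert_ratio must be in (0, 1], got {ratio}")
    return ratio
```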

- Export-compat guard (Edwardf0t1): re-add _AUTO_QUANTIZE_QFORMATS and fold an export check into the recipe->mtq translation. _canonical_candidate_dict becomes _match_candidate_to_preset (returns preset name + dict); raise on a non-export-safe candidate, warn on a custom (no-preset) one. Fails fast, before the search. (+tests)
- num_score_steps -> score_size (Edwardf0t1): the field is a sample count (divided by batch_size to get mtq steps), so name/describe it honestly and match the former --auto_quantize_score_size. Behavior unchanged (the // batch_size math and 128 default are untouched); disambiguates from mtq's batches-based num_score_steps kwarg.
- Auto-generate --auto_quantize_checkpoint (Asma): re-add in huggingface_example.sh, now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.
- Default effective_bits 4.8 -> 5.4 (Asma): FP4 cost is now 4.5, so 4.8 is too aggressive; rename nvfp4_fp8_at_4p8bits -> nvfp4_fp8_at_5p4bits and update refs/docs.
- Add a kl_div example recipe (Asma): nvfp4_fp8_kl_div_at_5p4bits (no backprop; e.g. Llama-4), plus a one-line README pointer.
- Note the old AutoQuantize CLI remains on the 0.45 branch (README migration + CHANGELOG).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…odeRabbit) _match_candidate_to_preset matched candidates by exact model_dump equality, so a candidate built from a non-export-safe preset that also set a per-candidate effective_bits would fail the match, be classified 'custom', and slip past the export whitelist with only a warning. Exclude effective_bits (cost-only, export-irrelevant) from the match key so such a candidate is still identified as its base preset and rejected; preserve the override in the returned config. Shipped recipes are unaffected (they set no per-candidate effective_bits). (+test) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…ecipe shim Per review (Keval): keep the --auto_quantize_* flags working instead of hard-removing them. They convert into an AutoQuantizeConfig on the fly and run the same recipe path (DeprecationWarning); no new user flags.
- _auto_quantize_config_from_cli(): builds the config from the flags; appends the shared base disabled + base cost-excluded layer sets (no model introspection). Base cost-excluded is appended unconditionally (harmless on non-VL, correct on VL).
- Base layer-pattern sets loaded once as module constants in recipe/config.py, mirroring quantization/config.py's _default_disabled_quantizer_cfg (Shengliang). New shared unit configs/auto_quantize/units/base_cost_excluded_layers.
- quantize_main resolves aq_config from a recipe OR the CLI flags.
- Fix VL guards for the CLI path: skip the image-calib default AND the plain-PTQ extract_and_prepare_language_model_from_vl (else auto_quantize hits 'multiple modelopt states'); reject --low_memory_mode.
- parser.sh / huggingface_example.sh: flag passthrough + auto-generated checkpoint path.
- CHANGELOG: Backward-Breaking -> Deprecations (flags still work). README reframed. +test.
Verified CLI == recipe (byte-identical) on the Qwen3.6 VL MoE.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
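The flag-to-config translation can be sketched as below. The flag names are the ones listed in the earlier commit that removed them, and the target fields match the AutoQuantizeConfig fields named in this PR. The following parts are assumptions, not the PR's implementation:

- auto_quantize_config_from_cli returning a plain dict;
- parsing --qformat as a comma-separated list;
- the placeholder pattern sets.

```python
import argparse
import warnings
from typing import Any

# Placeholder pattern sets; the PR loads the real ones from
# configs/auto_quantize/units/base_disabled_layers and base_cost_excluded_layers.
BASE_DISABLED_LAYERS = ("*example_disabled*",)
BASE_COST_EXCLUDED_LAYERS = ("*example_vision_tower*",)


def auto_quantize_config_from_cli(args: argparse.Namespace) -> dict[str, Any]:
    """Translate deprecated --auto_quantize_* flags into a recipe-shaped dict (sketch)."""
    warnings.warn(
        "--auto_quantize_* flags are deprecated; pass an AutoQuantize --recipe instead",
        DeprecationWarning,
        stacklevel=2,
    )
    cost: dict[str, Any] = {}
    if args.auto_quantize_active_moe_expert_ratio is not None:
        cost["active_moe_expert_ratio"] = args.auto_quantize_active_moe_expert_ratio
    return {
        # Assumption: --qformat carries a comma-separated candidate list.
        "candidate_formats": [f.strip() for f in args.qformat.split(",") if f.strip()],
        "constraints": {
            "effective_bits": args.auto_quantize_bits,
            "cost_model": args.auto_quantize_cost_model,
            "cost": cost,
        },
        "auto_quantize_method": args.auto_quantize_method,
        "score_size": args.auto_quantize_score_size,
        # Base sets are appended unconditionally, with no model introspection.
        "disabled_layers": list(BASE_DISABLED_LAYERS),
        "cost_excluded_layers": list(BASE_COST_EXCLUDED_LAYERS),
    }
```

Because the shim produces the same config object a recipe would, old commands and new recipes converge on one code path. That is what makes a byte-identical CLI-vs-recipe check possible.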

…fusion) Qwen3.6 MoE (e.g. Qwen/Qwen3.6-35B-A3B) fails HF export at linear fusion if the shared-expert gate is quantized (fusion partners get mismatched formats). On main this was a Qwen-specific introspection pattern (_QWEN36_AUTOQ_DISABLED_LAYERS); promote it to the shared base disabled set so the deprecated --auto_quantize_* CLI (which can't inject arch patterns) also disables it. Harmless elsewhere — matches nothing on non-MoE models. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

…rch) Per review (Wei-Ming): support 'one format + bf16' for AutoQuantize. bf16/no-quant is always an implicit per-layer choice (mtq appends QuantRecipe(quant_cfg=None)), so a single explicit format already yields a real {format, bf16} search. Relax the candidate_formats validator from >=2 to >=1 (only an empty list is rejected). Works for both recipe (candidate_formats: [fp8]) and the CLI shim (--qformat fp8 --auto_quantize_bits ...). Updates the field description + README; retargets the loader test (empty rejected, single accepted). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>

juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch 2 times, most recently from 6e4d430 to 0d85360 June 30, 2026 19:26

juhi10071998 marked this pull request as ready for review June 30, 2026 19:33

juhi10071998 requested review from a team as code owners June 30, 2026 19:33

juhi10071998 requested review from kevalmorabia97 and meenchen June 30, 2026 19:33

kevalmorabia97 requested a review from jenchen13 June 30, 2026 19:45

coderabbitai Bot reviewed Jun 30, 2026

Comment thread examples/hf_ptq/hf_ptq.py

Comment thread modelopt/recipe/config.py

Comment thread modelopt/recipe/config.py

Comment thread tests/examples/hf_ptq/test_hf_ptq_args.py Outdated

kevalmorabia97 reviewed Jun 30, 2026

coderabbitai Bot approved these changes Jun 30, 2026

cjluo-nv reviewed Jun 30, 2026

juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from 14fcc04 to 261bbb2 June 30, 2026 22:00

Edwardf0t1 reviewed Jun 30, 2026

juhi10071998 self-assigned this Jul 1, 2026

realAsma reviewed Jul 1, 2026

juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from f5e6391 to 4cfd6f2 July 1, 2026 21:56

realAsma approved these changes Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

Comment thread examples/hf_ptq/hf_ptq.py

juhi10071998 requested a review from shengliangxu July 1, 2026 22:16

juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from fe7f4e1 to ab3aed2 July 1, 2026 23:15

meenchen reviewed Jul 2, 2026

juhi10071998 and others added 16 commits July 2, 2026 20:25

juhi10071998 force-pushed the juhim/autoquant-recipe-v2 branch from 10f0691 to 6980725 July 2, 2026 20:25

meenchen approved these changes Jul 2, 2026

Conversation

juhi10071998 commented Jun 29, 2026

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

copy-pr-bot Bot commented Jun 29, 2026

coderabbitai Bot commented Jun 29, 2026

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 warning)

github-actions Bot commented Jun 30, 2026

Built to branch gh-pages at 2026-07-02 23:16 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jun 30, 2026

Codecov Report

coderabbitai Bot left a comment

kevalmorabia97 Jun 30, 2026

juhi10071998 Jun 30, 2026

juhi10071998 Jul 2, 2026

juhi10071998 Jul 4, 2026

juhi10071998 commented Jun 30, 2026

cjluo-nv left a comment

juhi10071998 commented Jun 30, 2026

jenchen13 commented Jul 1, 2026