Add AutoQuantize recipe support #1856
📝 Walkthrough
This PR replaces CLI-driven AutoQuantize configuration with declarative YAML recipes.
Changes: Recipe-driven AutoQuantize
Estimated code review effort: 4 (Complex) | ~60 minutes
Sequence Diagram(s)
sequenceDiagram
participant User
participant quantize_main
participant load_recipe
participant auto_quantize
participant mtq
User->>quantize_main: run hf_ptq.py --recipe <auto_quantize.yaml>
quantize_main->>load_recipe: load_recipe(path)
load_recipe-->>quantize_main: ModelOptAutoQuantizeRecipe
quantize_main->>auto_quantize: auto_quantize(args, model, calib_dataloader, aq_config)
auto_quantize->>auto_quantize: build constraints, candidate_formats, disabled_layers, kv_cache config
auto_quantize->>mtq: mtq.auto_quantize(model, constraints, candidates, disabled_layers)
mtq-->>auto_quantize: searched model
auto_quantize->>auto_quantize: apply KV-cache quantization post-step
auto_quantize-->>quantize_main: quantized model
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
6e4d430 to
0d85360
Compare
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #1856 +/- ##
==========================================
+ Coverage 70.21% 76.99% +6.77%
==========================================
Files 515 515
Lines 57244 57303 +59
==========================================
+ Hits 40196 44122 +3926
+ Misses 17048 13181 -3867
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/_test_utils/examples/hf_ptq_utils.py (1)
27-28: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Enforce the quant/recipe invariant in the helper.
Making both fields optional leaves PTQCommand() and PTQCommand(quant=..., recipe=...) as valid constructions, so bad test inputs now fail downstream in the shell layer instead of at this boundary. A small __post_init__ or run() check that requires exactly one of them would keep the matrix honest.
Suggested guard:
     class PTQCommand:
         quant: str | None = None
         recipe: str | None = None
    @@
    +    def __post_init__(self):
    +        if (self.quant is None) == (self.recipe is None):
    +            raise ValueError("Exactly one of `quant` or `recipe` must be set.")
    +
         def run(self, model_path: str):
As per coding guidelines, "Validate external input once at the interface boundary; internal code can trust those checks and avoid redundant assertions."
Also applies to: 64-65
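The "exactly one of" invariant from this nitpick can be tried out in isolation. The following is a minimal sketch, assuming a stripped-down `PTQCommand` with only the two fields under discussion (the real helper has more fields and a `run()` method); `is_valid` is a hypothetical helper added only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PTQCommand:
    """Illustrative stand-in for the test helper; the real class has more fields."""

    quant: str | None = None
    recipe: str | None = None

    def __post_init__(self) -> None:
        # The two `is None` results are equal when both fields are set or
        # both are missing, which are exactly the two invalid cases.
        if (self.quant is None) == (self.recipe is None):
            raise ValueError("Exactly one of `quant` or `recipe` must be set.")


def is_valid(**kwargs: str) -> bool:
    """Hypothetical helper: report whether PTQCommand accepts the given fields."""
    try:
        PTQCommand(**kwargs)
    except ValueError:
        return False
    return True
```

With this guard, `is_valid(quant="fp8")` and `is_valid(recipe="some_recipe.yaml")` pass, while `is_valid()` and `is_valid(quant="fp8", recipe="some_recipe.yaml")` are rejected at construction time instead of in the shell layer.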
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 318-322: The image-calibration guard in hf_ptq.py is being
triggered unconditionally for Nemotron-VL AutoQuantize because the default
calibration flag is set before the recipe type is resolved. Update the control
flow around load_model() and the args.calib_with_images assignment so this
default only applies to plain PTQ, or move recipe loading earlier and branch on
the resolved recipe type in the AutoQuantize path. Use the existing
args.calib_with_images check and the recipe-loading logic near the AutoQuantize
setup to keep AutoQuantize on the text-only calibration path.
In `@modelopt/recipe/config.py`:
- Around line 134-138: The `active_moe_expert_ratio` field in `config.py` is
documented as being in (0, 1], but it currently accepts any float. Add
schema-level validation on the `ModeloptField`/config model so invalid values
are rejected at parse time, using the `active_moe_expert_ratio` symbol to locate
the field. Enforce the lower and upper bounds directly at the boundary (for
example, via field constraints or a validator on the owning config class) so
malformed recipes fail fast before the `active_moe` cost model uses them.
- Around line 175-179: The `candidate_formats` field in `ModeloptField` is
currently using a default empty list without validating that default, so an
omitted AutoQuantize config can pass schema validation incorrectly. Update the
`candidate_formats` definition in `config.py` to enable default validation
(using `validate_default=True` or the equivalent in the surrounding model/field
setup) so the empty default is rejected immediately. Keep the change localized
to the `candidate_formats` field and ensure the existing “at least 2 required”
constraint is enforced even when the field is not explicitly provided.
In `@tests/examples/hf_ptq/test_hf_ptq_args.py`:
- Around line 41-45: The test module has imports for load_recipe and
QUANT_CFG_CHOICES inside test functions, which should be moved to module scope
so import errors fail during collection. Update the import placement at the top
of the file in test_hf_ptq_args, and remove the redundant in-test imports from
the affected test helpers such as test_autoquant_recipe_builds_mtq_inputs and
the other test block referenced by the review.
---
Nitpick comments:
In `@tests/_test_utils/examples/hf_ptq_utils.py`:
- Around line 27-28: The PTQCommand helper currently allows both quant and
recipe to be missing or both to be set, so add a boundary check in PTQCommand
itself to enforce that exactly one of those fields is provided. Implement the
validation in PTQCommand’s __post_init__ or run() method so invalid test inputs
fail immediately, and keep the rest of the helper logic assuming the invariant
holds.
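The two `modelopt/recipe/config.py` findings above both ask for validation at parse time. The sketch below illustrates the idea with standard-library dataclasses only; it is not the `ModeloptField` API. `CostSketch`, `AutoQuantizeSketch`, `MIN_CANDIDATES`, and `rejects` are hypothetical names, and candidates are simplified to plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MIN_CANDIDATES = 2  # the "at least 2 required" rule quoted in the review


@dataclass
class CostSketch:
    """Hypothetical stand-in for the recipe's cost section."""

    active_moe_expert_ratio: float

    def __post_init__(self) -> None:
        # Enforce the documented (0, 1] interval; NaN fails the comparison too.
        if not 0.0 < self.active_moe_expert_ratio <= 1.0:
            raise ValueError(
                f"active_moe_expert_ratio must be in (0, 1], got {self.active_moe_expert_ratio}"
            )


@dataclass
class AutoQuantizeSketch:
    """Hypothetical stand-in for the AutoQuantize config section."""

    candidate_formats: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # __post_init__ also runs when the default is used, so an omitted
        # candidate_formats is rejected just like an explicit empty list.
        if len(self.candidate_formats) < MIN_CANDIDATES:
            raise ValueError(f"need at least {MIN_CANDIDATES} candidate formats")


def rejects(factory: Callable[[], object]) -> bool:
    """Return True if constructing the object raises ValueError."""
    try:
        factory()
    except ValueError:
        return True
    return False
```

The point of the second class is the one the review makes about default validation: a check that only runs on explicitly provided values would let the empty default through.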
📒 Files selected for processing (23)
- CHANGELOG.rst
- examples/hf_ptq/README.md
- examples/hf_ptq/example_utils.py
- examples/hf_ptq/hf_ptq.py
- examples/hf_ptq/scripts/huggingface_example.sh
- examples/hf_ptq/scripts/parser.sh
- modelopt/recipe/config.py
- modelopt/recipe/loader.py
- modelopt/torch/quantization/algorithms.py
- modelopt/torch/quantization/config.py
- modelopt_recipes/configs/auto_quantize/units/base_disabled_layers.yaml
- modelopt_recipes/configs/numerics/nvfp4.yaml
- modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_4p8bits.yaml
- modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
- modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
- modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- tests/_test_utils/examples/hf_ptq_utils.py
- tests/_test_utils/examples/run_command.py
- tests/examples/hf_ptq/test_hf_ptq_args.py
- tests/examples/hf_ptq/test_llm_ptq.py
- tests/unit/recipe/test_loader.py
- tests/unit/torch/quantization/test_autoquant.py
💤 Files with no reviewable changes (1)
- examples/hf_ptq/example_utils.py
    PTQ_ARGS+=" --low_memory_mode "
fi

if [ -n "$AUTO_QUANTIZE_BITS" ]; then
Can we keep the old arguments for one release and add a deprecation warning if a user passes them instead of the new recipe argument? Otherwise we will break backward compatibility without notice.

I see, that makes sense. I think hf_ptq.py would need significant changes if we add them back. If I understand correctly, are you suggesting we keep the flags with a warning but remove the functionality?
From what I understood from discussions with @shengliangxu and @realAsma, we don't have a lot of AutoQuant users, so it should be safe to deprecate the CLI.

Ideally, the previous CLI args keep working: we internally convert them into a YAML recipe on the fly, and the rest of the example logic operates on the YAML directly.

Yes, agreed on not breaking backward compatibility silently. I took a deeper look at why this is not just a CLI → recipe remapping.
The scalar flags map 1:1 onto the new AutoQuantizeConfig, so wrapping those is trivial:
- --auto_quantize_bits → constraints.effective_bits,
- --auto_quantize_method → auto_quantize_method,
- --auto_quantize_score_size → num_score_steps,
- --auto_quantize_cost_model → constraints.cost_model,
- --auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio,
- and --qformat fp8,nvfp4 → candidate_formats.
The wrinkle is disabled_layers / cost_excluded_layers, plus the newly added ability to override the per-candidate effective bits when needed.
On main these were never CLI inputs; they come from model introspection that branches on the Qwen model class.
At hf_ptq.py:418, the disabled layers and cost_excluded_patterns come from the excluded-layer and excluded-cost helpers in example_utils, which this PR removes.
I intentionally removed that (per @meenchen's earlier point about moving arch knowledge out of hf_ptq.py/example_utils.py and into the recipes). So a flags→recipe wrapper can't reconstruct the old behavior from the flags alone; it has to source disabled_layers from somewhere.
I would also appreciate input from @meenchen and @shengliangxu on this. If we add the hardcoded exclusion list back, we will have two sources of truth, and the current path is not pure CLI anyway (some of this information leaks into hf_ptq.py).
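The scalar-flag mapping listed in this comment can be written down as a small function. This is a sketch only: `recipe_dict_from_cli` is a hypothetical name, the output is a plain nested dict rather than the real `AutoQuantizeConfig`, and candidates are kept as format names rather than full quantization configs. It deliberately leaves out `disabled_layers` and `cost_excluded_layers`, which is the gap the comment describes.

```python
from argparse import Namespace


def recipe_dict_from_cli(args: Namespace) -> dict:
    """Hypothetical helper: build a recipe-shaped dict from the legacy scalar flags."""
    return {
        # --qformat fp8,nvfp4 -> candidate_formats
        "candidate_formats": [f.strip() for f in args.qformat.split(",") if f.strip()],
        # --auto_quantize_method -> auto_quantize_method
        "auto_quantize_method": args.auto_quantize_method,
        # --auto_quantize_score_size -> num_score_steps
        "num_score_steps": args.auto_quantize_score_size,
        "constraints": {
            # --auto_quantize_bits -> constraints.effective_bits
            "effective_bits": args.auto_quantize_bits,
            # --auto_quantize_cost_model -> constraints.cost_model
            "cost_model": args.auto_quantize_cost_model,
            # --auto_quantize_active_moe_expert_ratio -> constraints.cost.active_moe_expert_ratio
            "cost": {"active_moe_expert_ratio": args.auto_quantize_active_moe_expert_ratio},
        },
        # disabled_layers / cost_excluded_layers cannot be derived from the flags.
    }


# Illustrative flag values only; they are not defaults taken from the PR.
example_args = Namespace(
    qformat="fp8,nvfp4",
    auto_quantize_method="kl_div",
    auto_quantize_score_size=128,
    auto_quantize_bits=4.8,
    auto_quantize_cost_model="active_moe",
    auto_quantize_active_moe_expert_ratio=0.25,
)
example_recipe = recipe_dict_from_cli(example_args)
```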
I agree with Juhi's reasoning.
AutoQuantize now needs a lot of arguments and patches to work correctly (VL support, ActiveMoE cost, etc.). We also want AutoQuantize to be extensible for internal formats.
Supporting both the CLI and recipes makes things more complex.
We could add a note that CLI support for AutoQuantize has been removed and that users can refer to the 0.45 branch for AutoQuantize CLI support.

@juhi10071998 why do we need to add back disabled_layers and cost_excluded? I don't see them in your recipe files. If we keep the other CLI flags like auto_quantize_bits, auto_quantize_method, etc., can't we create a recipe file on the fly, similar to how your recipe files look right now, so the rest of the code can assume a recipe file as input?

@kevalmorabia97 they are present here.
Let me review and see the minimal changes needed to construct the recipe on the fly. If users want to extend those for new models, they will have to use a recipe, though. I think we can do this for the default disabled layers.

In quantization/config.py, we load the YAML configs and keep them as module constants, including the non-AutoQuant disabled layers, etc. We can do the same for AutoQuant.

@kevalmorabia97 Yes, commit 10f0691 does this.
We kept --auto_quantize_bits/method/score_size/cost_model/active_moe_expert_ratio (+ --qformat for the candidates), and _auto_quantize_config_from_cli builds an AutoQuantizeConfig on the fly from them; quantize_main then runs the same recipe-driven path, so the rest of the code assumes a config/recipe input (no separate CLI code path). A DeprecationWarning is emitted.
On disabled_layers / cost_excluded_layers — did not add them as CLI args. They're appended internally from the shared base units (configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers), the same units the recipes splice via $import.
So the on-the-fly config mirrors a recipe: candidates from --qformat, constraints from the scalar flags, and the base layer patterns from those shared units. Arch-specific patterns (e.g. Qwen's *shared_expert_gate*) stay in the model-specific recipe; the CLI shim carries only the base set.
Verified CLI == recipe (byte-identical hf_quant_config.json) on the Qwen3.6 VL MoE.
Hi @kevalmorabia97, please let me know when you get a chance whether the recent changes restoring the CLI look aligned. Thanks!
Addressed the CodeRabbit comments in 14fcc04:
On the CLI backward-compatibility point (keeping |
cjluo-nv left a comment
Bot review — DM the bot to share feedback.
AutoQuantize recipe support (+900/-413, 23 files). Replaces the --auto_quantize_* CLI flags with a declarative RecipeType.AUTO_QUANTIZE recipe driving mtq.auto_quantize, plus shipped general/model-specific recipes and a shared base_disabled_layers unit.
Design review (gate fired): satisfied. This extends the existing modelopt.recipe system (new RecipeType alongside PTQ/speculative), not a competing one — the natural in-repo pattern. The PR body documents the CLI→recipe field mapping and the "CLI path untouched as equivalence baseline" approach. Loader change correctly strips only the speculative_ prefix so AUTO_QUANTIZE keeps its full name. _canonical_candidate_dict compares model_dump(exclude_unset=True) against QUANT_CFG_CHOICES values (also exclude_unset dumps), so preset identity is preserved consistently. Licensing clean (standard NVIDIA header on new files, no vendored code).
Reasons for nudge rather than approve:
- PR is explicitly a draft: body states "Draft for early review", Changelog "will add before ready", and "Did you get Claude approval: ❌ (draft)". Not ready for merge sign-off.
- effective_bits is a broad, under-advertised side effect. Adding effective_bits: 4.5 to configs/numerics/nvfp4.yaml puts the field on QuantizerAttributeConfig, so TensorQuantizer.set_from_attribute_config now sets _effective_bits=4.5 on every NVFP4 quantizer in all quantization paths (not just autoquant), and _effective_bits is included in _get_properties_for_modelopt_state() → serialized into saved modelopt state for all NVFP4 checkpoints. It doesn't change quant math, but it's a cost-model-only concept leaking onto the runtime quantizer config + checkpoints. The "byte-identical export" claim covers hf_quant_config.json, not the modelopt state, so this widening may be uncovered. Worth an owner confirming this is intended / harmless for restore and checkpoint comparison.
- Size. ~1313 lines / 23 files; cohesive (single feature) so not splittable, but on the large side for review.
Tests are good for the GPU-free surface (loader, mtq-input mapping equivalence incl. cost_excluded_layers, cost composition, effective_bits resolver/validators); E2E PTQCommand cases converted to recipe-driven. No prompt-injection attempts in the PR content.
14fcc04 to
261bbb2
Compare
We add effective_bits in the numerics config because that is a universal source of truth that numerics teams can use. It is not used in the non-AutoQuantize paths.

@field_validator("candidate_formats")
@classmethod
def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:
The autoquant export-compatibility guard was dropped here without a replacement. The old auto_quantize in hf_ptq.py asserted every candidate qformat was in _AUTO_QUANTIZE_QFORMATS ("supported for unified checkpoint export"), and the deleted comment was explicit that this is a property of the export path, not the YAML: "a preset can exist and be valid for plain PTQ while not being safe to mix into an auto_quantize search." The recipe path now validates only the candidate count (_at_least_two_candidates).
Failure scenario: a custom recipe lists a preset that's valid for plain PTQ but unsupported by the unified-checkpoint writer; the (expensive) search runs to completion and then fails at export with a cryptic error, or produces an invalid checkpoint. The shipped recipes are safe, so this only bites custom recipes — consider validating candidate_formats against the export-compatible set here (or documenting the constraint prominently).
Addressed in f5e6391 — re-added the export-safe set and folded the check into the recipe→mtq translation (_match_candidate_to_preset): raises on a non-export-safe preset, warns on a custom (no-preset) candidate, before the search runs.
num_score_steps: int = ModeloptField(
    default=128,
    title="Scoring sample count",
    description="Number of batches used for sensitivity scoring.",
Description/semantics mismatch: num_score_steps is described as "Number of batches", but hf_ptq.py consumes it as a sample count — it passes inputs["num_score_steps"] // args.batch_size as mtq's num_score_steps (which is itself in batches/steps). This preserves the old --auto_quantize_score_size ("Number of samples") behavior, but the rename + new description now contradict the math.
Failure scenario: a user sets num_score_steps: 128 expecting 128 scoring batches; with batch_size=4 they get 32 — a silent 4x under-scoring vs. the documented meaning. Either fix the description to say "samples" or drop the // batch_size division so the field really means batches.
Addressed in f5e6391 — renamed to score_size with an honest "number of samples (÷ batch_size)" description matching the old --auto_quantize_score_size. Behavior unchanged (kept the // batch_size and the 128 default).
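The samples-versus-batches arithmetic behind this thread is easy to check directly. A minimal sketch, assuming a hypothetical helper name `scoring_steps`; only the floor division by the batch size comes from the discussion.

```python
def scoring_steps(score_size: int, batch_size: int) -> int:
    """Convert a scoring sample count into a number of scoring batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Floor division: leftover samples that do not fill a batch are dropped.
    return score_size // batch_size


steps_default = scoring_steps(128, 4)  # 128 samples at batch size 4 -> 32 batches
```

This is the 4x gap the reviewer describes: a value of 128 read as "batches" would mean 128 scoring steps, while the same value read as "samples" with a batch size of 4 yields 32.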
What is the purpose of adding YAML recipes for AutoQuantize when you can create a YAML for the ModelOpt launcher that calls AutoQuantize? Example here. Since AutoQuantize hyperparameters differ for every model, AutoQuantize recipes are not inherently reusable. It would make more sense to provide customizability on the client side than to add more recipes, which are designed for reusability.

My understanding is that the goal is to let customers, such as numerics teams, tune the recipe to their specific needs while giving them a consolidated view of everything AutoQuant requires. It may also be simpler to create a model-specific recipe from the existing ones. Additionally, I feel AutoQuantize has too many knobs to tune, and supporting them through the CLI is structurally limiting. @shengliangxu, @realAsma, feel free to add anything I missed. My current understanding is based on our initial discussion.

I agree. @juhi10071998 had a document that made this clear. Juhi, could you please share it?
# AutoQuantize is driven by an AutoQuantize --recipe (see modelopt_recipes/general/auto_quantize/).
# Optional checkpoint passthrough for saving/restoring the search state.
if [ -n "$AUTO_QUANTIZE_CHECKPOINT" ]; then
    PTQ_ARGS+=" --auto_quantize_checkpoint=$AUTO_QUANTIZE_CHECKPOINT "

We had done the following:
# Automatically generate auto_quantize checkpoint path if not provided
Is this functionality removed in this script? Is that intentional?
Can we have a default checkpoint path if it is not provided, e.g., <output_path>/.autoquant? I find the autoquant checkpoint is pretty handy.
Addressed in f5e6391 — re-added the auto-generated checkpoint path, now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.
@meenchen Done — when --auto_quantize_checkpoint is omitted for an AutoQuantize recipe, the script now auto-generates one at ${ROOT_SAVE_PATH}/auto_quantize_checkpoints/${MODEL_NAME}.pth.
Should we have effective_bits 5.0/5.4 as the default?
4.8 was a good AQ default when the FP4 cost was set to 4.0. Now that the FP4 cost has increased, we could recommend effective_bits 5.0/5.4 as the default.

I agree, that makes sense. I will use 5.4.
Good point — bumped the default to 5.4 and renamed the recipe to nvfp4_fp8_at_5p4bits in f5e6391.
realAsma left a comment
Can we have one recipe for kl_div as well to show the usage?
f5e6391 to
4cfd6f2
Compare
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/hf_ptq/hf_ptq.py (1)
331-335: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Use a rank-gated warning helper here.
This warning can be emitted by every distributed rank; prefer warn_rank_0 if available, or otherwise gate it explicitly. As per coding guidelines, "Develop with distributed processing in mind: use print_rank_0 or warn_rank_0 when possible."
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 292-296: The preset matching logic in the config normalization
helper is comparing the full dumped config, so cost-only fields like
effective_bits can prevent a shipped preset from matching and let unsupported
configs fall through as “custom.” Update the matching path in the
preset-selection helper to compare against a version of fmt with
non-export-affecting metadata excluded, then still return the original
overridden config so the effective_bits override is preserved in the final
result. Use the QUANT_CFG_CHOICES lookup and the normalization flow around the
preset-matching function to keep whitelist enforcement consistent.
---
Nitpick comments:
In `@examples/hf_ptq/hf_ptq.py`:
- Around line 331-335: The warning emitted in the preset mismatch branch should
be rank-gated so it only comes from rank 0. Update the warning in the logic
around preset_name in hf_ptq.py to use warn_rank_0 if it exists, or otherwise
add an explicit rank check before calling warnings.warn, following the
distributed logging pattern used elsewhere.
📒 Files selected for processing (14)
- CHANGELOG.rst
- examples/hf_ptq/README.md
- examples/hf_ptq/hf_ptq.py
- examples/hf_ptq/scripts/huggingface_example.sh
- modelopt/recipe/config.py
- modelopt_recipes/general/auto_quantize/nvfp4_fp8_at_5p4bits.yaml
- modelopt_recipes/general/auto_quantize/nvfp4_fp8_kl_div_at_5p4bits.yaml
- modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
- modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
- modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- tests/examples/hf_ptq/test_hf_ptq_args.py
- tests/examples/hf_ptq/test_llm_ptq.py
- tests/unit/recipe/test_loader.py
✅ Files skipped from review due to trivial changes (2)
- examples/hf_ptq/README.md
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (9)
- modelopt_recipes/general/auto_quantize/nvfp4_mse_fp8_at_6p0bits.yaml
- modelopt_recipes/huggingface/qwen3_6_moe/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- modelopt_recipes/general/auto_quantize/w4a8_awq_beta_fp8_at_6p0bits.yaml
- modelopt_recipes/general/auto_quantize/w4a16_nvfp4_fp8_at_6p0bits-active_moe.yaml
- tests/examples/hf_ptq/test_llm_ptq.py
- examples/hf_ptq/scripts/huggingface_example.sh
- tests/unit/recipe/test_loader.py
- tests/examples/hf_ptq/test_hf_ptq_args.py
- modelopt/recipe/config.py
fe7f4e1 to
ab3aed2
Compare
meenchen left a comment
Thanks for the PR, looks good in general.
    PTQ_ARGS+=" --low_memory_mode "
fi

if [ -n "$AUTO_QUANTIZE_BITS" ]; then
Since AutoQuant is an experimental feature, I am fine with just removing the CLI support.
@field_validator("candidate_formats")
@classmethod
def _at_least_two_candidates(cls, v: list[QuantizeConfig]) -> list[QuantizeConfig]:

Does BF16 (unquantized) count as a candidate here?

This check runs purely at recipe load/validation time, before anything touches mtq, so BF16 shouldn't be counted here.

Is there an option for users to add BF16 to the search space, or do we always rely on mtq to include BF16? I feel we should also support one format + BF16 for AutoQuant.

I see, that is a good point. In that case we can relax this constraint to require at least one candidate.
# Presets safe to mix into an AutoQuantize search *and* write via the unified HF checkpoint
# exporter. Export-compatibility is a property of the export path, not of a preset's validity for
# plain PTQ, so this is a curated set rather than something derived from QUANT_CFG_CHOICES.
# TODO: drop the partial-model presets (e.g. nvfp4_mlp_only, nvfp4_experts_only) from this set as future work.
_AUTO_QUANTIZE_QFORMATS: frozenset[str] = frozenset(
    {
        "fp8",
        "int8_smoothquant",
        "int8_weight_only",
        "int4_awq",
        "nvfp4",
        "nvfp4_awq_lite",
        "nvfp4_w4a4_weight_mse_fp8_sweep",
        "w4a8_awq_beta",
        "w4a16_nvfp4",
        "fp8_2d_blockwise_weight_only",
        "w4a8_mxfp4_fp8",
        "nvfp4_mlp_only",
        "nvfp4_experts_only",
        "nvfp4_omlp_only",
        "nvfp4_w4a4_weight_local_hessian",
        "mxfp8",
    }
)
Why do we still need this for format to quant cfg lookup? Can we pick up quant cfg directly from the recipe?
The quant cfg does come straight from the recipe — _match_candidate_to_preset isn't fetching the cfg, it's recovering the preset name.
We hand mtq the matched preset dict so the search labels each candidate as its canonical preset (e.g. FP8_DEFAULT_CFG) instead of CUSTOM_0/1.
That name matters for:
(a) --auto_quantize_checkpoint restore: checkpoints are keyed by these names, and CUSTOM_N labels break cross-run/recipe restore;
(b) the export-compatibility guard (name → whitelist).
Using fmt.model_dump() directly would quantize identically but lose both.
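The name-recovery step described in this reply can be sketched as a reverse lookup followed by a whitelist check. Everything below is illustrative: `PRESETS`, `EXPORT_SAFE`, and `match_candidate_to_preset` are hypothetical stand-ins with toy config dicts, not the real `QUANT_CFG_CHOICES`, `_AUTO_QUANTIZE_QFORMATS`, or `_match_candidate_to_preset`.

```python
import warnings
from typing import Optional

# Hypothetical miniature preset table; real presets are much larger configs.
PRESETS = {
    "fp8": {"weights": "e4m3", "activations": "e4m3"},
    "nvfp4": {"weights": "e2m1-block16", "activations": "e2m1-block16"},
    "ptq_only_format": {"weights": "int3", "activations": "none"},
}

# Hypothetical curated set of presets the exporter can write.
EXPORT_SAFE = frozenset({"fp8", "nvfp4"})


def match_candidate_to_preset(candidate: dict) -> Optional[str]:
    """Recover the preset name for a candidate config, or None if it is custom."""
    for name, preset in PRESETS.items():
        if candidate == preset:
            if name not in EXPORT_SAFE:
                # Fail before the expensive search rather than at export time.
                raise ValueError(f"preset {name!r} is not export-safe for AutoQuantize")
            return name
    # No preset matched: allowed, but the search will label it as custom.
    warnings.warn("candidate does not match any known preset", UserWarning, stacklevel=2)
    return None
```

The design choice mirrored here is ordering: both the raise and the warning happen during recipe translation, so a bad candidate is reported before any search time is spent.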
Thanks @meenchen for the review. Yes, I've deprecated the CLI support for this. As for this one, I am constructing this in hf_ptq.py
Add an effective_bits field at two levels for the autoquant LP cost model: QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format library default). estimate_quant_compression resolves in priority order: recipe-level > per-entry > num_bits heuristic, fixing the heuristic's undercount of block-scaled formats (e.g. NVFP4 = 4.5 vs 4.0). Per-entry values are aggregated via min. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
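The priority order in this commit message (recipe-level override, then per-entry values aggregated via min, then the num_bits heuristic) can be sketched as follows. `resolve_effective_bits` is a hypothetical function, not the body of `estimate_quant_compression`; in particular, ignoring entries that carry no value before taking the min is an assumption of this sketch.

```python
from typing import Optional, Sequence


def resolve_effective_bits(
    recipe_level: Optional[float],
    per_entry: Sequence[Optional[float]],
    heuristic_bits: float,
) -> float:
    """Pick the effective bits for a candidate using the stated priority order."""
    # 1. A recipe-level override wins outright.
    if recipe_level is not None:
        return recipe_level
    # 2. Otherwise aggregate the per-entry defaults via min
    #    (assumption: entries without a value are skipped).
    explicit = [bits for bits in per_entry if bits is not None]
    if explicit:
        return min(explicit)
    # 3. Fall back to the num_bits heuristic.
    return heuristic_bits
```

For an NVFP4 candidate whose entries carry 4.5, the result is 4.5 rather than the heuristic 4.0, unless the recipe sets its own value.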
Add the auto_quantize recipe type: AutoQuantizeConfig (candidate_formats, constraints, auto_quantize_method, num_score_steps, disabled_layers, kv_cache), AutoQuantizeConstraints (effective_bits, cost_model, cost) mirroring the mtq.auto_quantize constraints dict, and AutoQuantizeCost (active_moe_expert_ratio). Register RecipeType.AUTO_QUANTIZE in RECIPE_TYPE_TO_CLASS and the loader required-section map, and fix kind-extraction so multi-word non-speculative names stay intact (AUTO_QUANTIZE, not QUANTIZE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…es, and equivalence tests Add auto_quantize_recipe (organized around AutoQuantizeConfig) and _mtq_inputs_from_auto_quantize_config, which maps a recipe to mtq.auto_quantize inputs mirroring the CLI defaults; recipe candidates that match a known preset are passed as the preset dict (_canonical_candidate_dict) so the search names them identically to the CLI and checkpoints stay compatible. The existing CLI auto_quantize helper is left untouched as the equivalence baseline; shared-flow edits are additive and inert when no recipe is used. Ship the active_moe example recipe plus a -heuristic variant for the CLI-equivalence smoke. Add GPU-free tests: per-config recipe-vs-CLI input equivalence and a flag-coverage guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Verify the two autoquant cost multipliers stack multiplicatively: a routed NVFP4 expert in active-MoE mode (cost_weight=0.03125) with an effective_bits=4.5 override costs numel * cost_weight * (4.5/16), and falls back to the num_bits heuristic (0.25) without the override. Guards the Phase-A effective_bits / PR-#1497 cost_weight interaction against future cost-model changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…only) Add effective_bits: 4.5 to configs/numerics/nvfp4.yaml so every NVFP4 weight/input/KV entry carries the block-scale-accurate cost (4 value bits + an FP8 scale per 16-element block) as the library default. Recipes and the CLI inherit it via $import, so estimate_quant_compression returns 0.28125 for NVFP4 configs instead of the 4.0/16=0.25 num_bits heuristic. Read only by autoquant; other quantization paths ignore effective_bits. Cost-estimation tests updated to the new baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
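The numbers quoted in the two commit messages above can be reproduced with a few lines of arithmetic. This is a worked sketch, not library code: `block_scaled_bits` and `layer_cost` are hypothetical helper names, and the 1024-element tensor is an arbitrary illustration.

```python
def block_scaled_bits(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits per element when each block of elements shares one scale factor."""
    return value_bits + scale_bits / block_size


def layer_cost(
    numel: int,
    cost_weight: float,
    effective_bits: float,
    full_precision_bits: float = 16.0,
) -> float:
    """Cost with both multipliers applied: numel * cost_weight * (bits / 16)."""
    return numel * cost_weight * (effective_bits / full_precision_bits)


# NVFP4: 4 value bits plus one 8-bit (FP8) scale per 16-element block.
nvfp4_bits = block_scaled_bits(value_bits=4, scale_bits=8, block_size=16)

# Compression ratio relative to 16-bit weights, versus the old heuristic.
nvfp4_ratio = nvfp4_bits / 16.0
heuristic_ratio = 4.0 / 16.0

# A routed expert in active-MoE mode (cost_weight=0.03125) with 1024 elements.
expert_cost = layer_cost(numel=1024, cost_weight=0.03125, effective_bits=nvfp4_bits)
```

Here `nvfp4_bits` is 4.5 and `nvfp4_ratio` is 0.28125, compared with 0.25 from the num_bits heuristic, which is the undercount the commits set out to fix.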
…d-layers Ship a model-specific autoquant recipe under huggingface/qwen3_6_moe/auto_quantize/ that carries the architecture disabled-layer patterns explicitly in disabled_layers, mirroring the PTQ recipe directory structure (per Wei-Ming, PR #1381). The CLI introspection (_get_auto_quantize_disabled_layers) is kept intact as the equivalence baseline; full removal pairs with the CLI-flag deprecation. Tests: an exact-match guard that the recipe's disabled_layers set equals the CLI introspection for a Qwen model (drift detector), plus an input-equivalence case for a recipe with explicit disabled_layers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…istic variant Ship general example recipes (per review): NVFP4+FP8 @ 4.8, NVFP4-W4A4-MSE+FP8 @ 6.0, W4A8-AWQ-beta+FP8 @ 6.0. Remove the now-redundant inline effective_bits from the active_moe recipe (NVFP4 cost 4.5 comes from configs/numerics/nvfp4 after Phase D), and drop the -heuristic variant — post-D it is identical to the cleaned recipe and its name was misleading. Loader test now parametrizes over all shipped general recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…sabled_layers Add the base (model-agnostic) non-quantizable disabled_layers to every general recipe so they no longer depend on the CLI's _get_auto_quantize_disabled_layers introspection fallback — prep for dropping the CLI in the next commit. Arch-specific models use a huggingface/<model>/auto_quantize recipe that extends this set (Qwen3.6 already does). Sharing the base list via $import is a follow-up (needs loader support for schema-less list snippets). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…e G)
AutoQuantize is now driven only by an AutoQuantize --recipe. Remove the --auto_quantize_{bits,method,score_size,cost_model,active_moe_expert_ratio} CLI flags + the CLI auto_quantize() helper + the example-script (parser.sh / huggingface_example.sh) plumbing; --auto_quantize_checkpoint stays as a runtime save/restore path.
Remove the model-introspection helpers (_get_auto_quantize_disabled_layers / _get_auto_quantize_cost_excluded_patterns) from example_utils; recipes now carry disabled_layers and a new cost.excluded_module_name_patterns on AutoQuantizeCost, so VL models can exclude vision-tower weights from the cost denominator (disabled-from-search and excluded-from-cost are independent roles). General recipes carry the base disabled set; model-specific recipes extend it.
Integration tests (test_llm_ptq.py) and the example script switch to --recipe; README + CHANGELOG updated. Verified: recipe path byte-identical pre/post-G via shared-checkpoint smoke on Qwen3.6-VL; 260 unit tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
…d_layers via $import
Two recipe-author-facing readability cleanups (mtq inputs unchanged; recipe path verified byte-identical to the prior reference, version-string metadata aside):
- Hoist excluded_module_name_patterns out of constraints.cost up to a top-level cost_excluded_layers, sibling of disabled_layers. The two 'exclusion' lists (search vs cost-budget) now sit at the same level; the dispatch re-merges cost_excluded_layers into the mtq constraints.cost dict.
- Factor the shared 14-pattern base disabled_layers list into a reusable unit (configs/auto_quantize/units/base_disabled_layers) spliced via $import, mirroring PTQ's base_disable_all. Needs a named list[str] schema (LayerPatternList) since the modelopt-schema resolver only accepts modelopt.* dotted paths and str/list[str] have no such name (PTQ reused the existing QuantizerCfgListConfig alias).
Adds test_autoquant_recipe_cost_excluded_layers_map_into_cost.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
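The re-merge step mentioned in this commit can be pictured as a small dictionary transform. The sketch below is an assumption about its shape, not the code in hf_ptq.py: `build_mtq_constraints` and the `"*visual*"` pattern are hypothetical, while the key names `cost`, `excluded_module_name_patterns`, `effective_bits` and `active_moe_expert_ratio` are taken from the commit messages.

```python
def build_mtq_constraints(constraints, cost_excluded_layers=None):
    """Fold the top-level cost_excluded_layers list back into constraints['cost']."""
    merged = dict(constraints)            # shallow copy, leave the input untouched
    cost = dict(merged.get("cost") or {})  # copy the nested cost dict as well
    if cost_excluded_layers:
        cost["excluded_module_name_patterns"] = list(cost_excluded_layers)
    if cost:
        merged["cost"] = cost
    return merged


recipe_constraints = {"effective_bits": 6.0, "cost": {"active_moe_expert_ratio": 0.03125}}
mtq_constraints = build_mtq_constraints(recipe_constraints, ["*visual*"])

print(mtq_constraints)
```

Keeping the two lists separate in the recipe and merging only at dispatch time matches the stated goal: authors see the search exclusion and the cost exclusion side by side, while the mtq input shape stays unchanged.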
…antize recipe docs
Rename: with the CLI auto_quantize() helper removed in Phase G, the recipe-driven function is the sole AutoQuantize entry point, so the _recipe suffix is redundant. Rename auto_quantize_recipe -> auto_quantize (def + call site) and refresh the now-stale docstring (it still referred to the removed CLI helper as an 'equivalence baseline'). Pure rename, no behavior change; no name clash with the namespaced mtq.auto_quantize.
Docs (no behavior change):
- The --recipe / --kv_cache_qformat help and README claimed --kv_cache_qformat is ignored and the recipe 'fully defines' the config under --recipe. True for PTQ recipes (KV baked into quant_cfg) but not AutoQuantize recipes, which fall back to --kv_cache_qformat (default fp8_cast) unless they set an explicit kv_cache field. Clarify the recipe-type split in both help strings and the README; note KV cache is a uniform post-step.
- Document cost_excluded_layers (cost-budget exclusion, distinct from disabled_layers) and the shared base_disabled_layers $import unit.
- Add a migration note: the --auto_quantize_* CLI flags are removed (AutoQuantize is recipe-only) and how each maps to a recipe field (per Asma's review).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
- VL/AutoQuantize control-flow bug (functional): load_model auto-enables image-text calibration for Nemotron-VL models, which auto_quantize() rejects -> AutoQuantize on a Nemotron-VL model raised NotImplementedError unconditionally. Skip the image-calib default when the run is an AutoQuantize recipe (peek via _recipe_is_auto_quantize).
- Validate active_moe_expert_ratio in (0, 1] at the schema boundary (field_validator).
- candidate_formats: validate_default=True so an omitted/empty list fails the >=2 check at parse time instead of slipping through.
- test_hf_ptq_args: move load_recipe / QUANT_CFG_CHOICES imports to module scope.
- PTQCommand: enforce exactly one of quant/recipe via __post_init__.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
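Two of the checks listed in this commit are easy to illustrate with the standard library. The sketch below uses a plain dataclass as a stand-in: `PTQCommandSketch` and `check_active_moe_expert_ratio` are hypothetical names, and the real schema uses a field_validator rather than a free function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PTQCommandSketch:
    """Stand-in for a command object that takes either a quant format or a recipe."""
    quant: Optional[str] = None
    recipe: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two must be provided: reject both-set and none-set.
        if (self.quant is None) == (self.recipe is None):
            raise ValueError("exactly one of quant or recipe must be set")


def check_active_moe_expert_ratio(ratio):
    """Accept only values in the half-open interval (0, 1]."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"active_moe_expert_ratio must be in (0, 1], got {ratio}")
    return ratio


cmd = PTQCommandSketch(recipe="some_auto_quantize_recipe.yaml")  # placeholder path
ratio = check_active_moe_expert_ratio(0.03125)
```

Validating at construction time means a malformed command or recipe fails before any model is loaded.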
- Export-compat guard (Edwardf0t1): re-add _AUTO_QUANTIZE_QFORMATS and fold an export check into the recipe->mtq translation. _canonical_candidate_dict becomes _match_candidate_to_preset (returns preset name + dict); raise on a non-export-safe candidate, warn on a custom (no-preset) one. Fails fast, before the search. (+tests)
- num_score_steps -> score_size (Edwardf0t1): the field is a sample count (divided by batch_size to get mtq steps), so name/describe it honestly and match the former --auto_quantize_score_size. Behavior unchanged (the // batch_size math and 128 default are untouched); disambiguates from mtq's batches-based num_score_steps kwarg.
- Auto-generate --auto_quantize_checkpoint (Asma): re-add in huggingface_example.sh, now gated on an AutoQuantize recipe instead of the removed --auto_quantize_bits.
- Default effective_bits 4.8 -> 5.4 (Asma): FP4 cost is now 4.5, so 4.8 is too aggressive; rename nvfp4_fp8_at_4p8bits -> nvfp4_fp8_at_5p4bits and update refs/docs.
- Add a kl_div example recipe (Asma): nvfp4_fp8_kl_div_at_5p4bits (no backprop; e.g. Llama-4), plus a one-line README pointer.
- Note the old AutoQuantize CLI remains on the 0.45 branch (README migration + CHANGELOG).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
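The score_size rename hinges on the difference between a sample count and a step count. A minimal sketch of the conversion is given below; `score_steps_from_size` is a hypothetical helper, and only the integer division by batch_size and the default of 128 come from the commit message.

```python
def score_steps_from_size(score_size=128, batch_size=1):
    """Convert a sample count (score_size) into a number of scoring batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer division: leftover samples that do not fill a batch are dropped.
    return score_size // batch_size


steps_default = score_steps_from_size(batch_size=4)     # 128 samples, batches of 4
steps_uneven = score_steps_from_size(score_size=130, batch_size=4)

print(steps_default, steps_uneven)
```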
…odeRabbit) _match_candidate_to_preset matched candidates by exact model_dump equality, so a candidate built from a non-export-safe preset that also set a per-candidate effective_bits would fail the match, be classified 'custom', and slip past the export whitelist with only a warning. Exclude effective_bits (cost-only, export-irrelevant) from the match key so such a candidate is still identified as its base preset and rejected; preserve the override in the returned config. Shipped recipes are unaffected (they set no per-candidate effective_bits). (+test) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
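The bypass and its fix can be reproduced with plain dictionaries. The sketch below is an illustration under assumptions, not the helper in hf_ptq.py: the preset table, the preset names and the `EXPORT_SAFE` set are made up for the example, and only the idea of dropping effective_bits from the match key while preserving it in the returned config comes from the commit message.

```python
import warnings

# Hypothetical preset table and export whitelist, for illustration only.
PRESETS = {
    "safe_preset": {"num_bits": 8, "axis": None},
    "unsafe_preset": {"num_bits": 4, "block_size": 128},
}
EXPORT_SAFE = {"safe_preset"}


def match_candidate_to_preset(candidate):
    """Return (preset_name or None, candidate config with any override preserved)."""
    # effective_bits only affects cost, so it is left out of the match key.
    key = {k: v for k, v in candidate.items() if k != "effective_bits"}
    for name, preset in PRESETS.items():
        if preset == key:
            return name, dict(candidate)
    return None, dict(candidate)


def check_export_compat(candidate):
    name, cfg = match_candidate_to_preset(candidate)
    if name is None:
        warnings.warn("custom candidate: export compatibility not verified", UserWarning)
    elif name not in EXPORT_SAFE:
        raise ValueError(f"candidate matches non-export-safe preset {name!r}")
    return name, cfg


matched_name, matched_cfg = check_export_compat(
    {"num_bits": 8, "axis": None, "effective_bits": 8.5}
)
```

With exact equality on the full dictionary, the `effective_bits` key alone would make an unsafe candidate look custom and downgrade the error to a warning, which is the gap this commit closes.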
…ecipe shim
Per review (Keval): keep the --auto_quantize_* flags working instead of hard-removing them. They convert into an AutoQuantizeConfig on the fly and run the same recipe path (DeprecationWarning); no new user flags.
- _auto_quantize_config_from_cli(): builds the config from the flags; appends the shared base disabled + base cost-excluded layer sets (no model introspection). Base cost-excluded is appended unconditionally (harmless on non-VL, correct on VL).
- Base layer-pattern sets loaded once as module constants in recipe/config.py, mirroring quantization/config.py's _default_disabled_quantizer_cfg (Shengliang). New shared unit configs/auto_quantize/units/base_cost_excluded_layers.
- quantize_main resolves aq_config from a recipe OR the CLI flags.
- Fix VL guards for the CLI path: skip the image-calib default AND the plain-PTQ extract_and_prepare_language_model_from_vl (else auto_quantize hits 'multiple modelopt states'); reject --low_memory_mode.
- parser.sh / huggingface_example.sh: flag passthrough + auto-generated checkpoint path.
- CHANGELOG: Backward-Breaking -> Deprecations (flags still work). README reframed. +test.
Verified CLI == recipe (byte-identical) on the Qwen3.6 VL MoE.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
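The shim pattern described here (old flags in, recipe config out, with a DeprecationWarning) can be sketched as follows. This is a simplified stand-in that returns a plain dict instead of an AutoQuantizeConfig; the function name, the argument defaults other than the 128 score size, and the contents of the two base pattern tuples are assumptions made for the example.

```python
import warnings

# Placeholder base sets; the shipped lists live in the shared config units.
BASE_DISABLED_LAYERS = ("*lm_head*",)
BASE_COST_EXCLUDED_LAYERS = ("*visual*",)


def auto_quantize_config_from_cli(qformat, auto_quantize_bits,
                                  auto_quantize_method="gradient",
                                  auto_quantize_score_size=128):
    """Translate deprecated flag values into a recipe-shaped dict."""
    warnings.warn(
        "--auto_quantize_* flags are deprecated; use an AutoQuantize --recipe",
        DeprecationWarning,
        stacklevel=2,
    )
    return {
        "candidate_formats": [fmt.strip() for fmt in qformat.split(",")],
        "constraints": {"effective_bits": auto_quantize_bits},
        "auto_quantize_method": auto_quantize_method,
        "score_size": auto_quantize_score_size,
        # Base sets are appended unconditionally, with no model introspection.
        "disabled_layers": list(BASE_DISABLED_LAYERS),
        "cost_excluded_layers": list(BASE_COST_EXCLUDED_LAYERS),
    }


with warnings.catch_warnings(record=True) as caught_warnings:
    warnings.simplefilter("always")
    shim_config = auto_quantize_config_from_cli("fp8,nvfp4", 5.4)
```

Because both entry points end in the same config object, a single downstream code path serves recipe users and legacy command lines.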
…fusion) Qwen3.6 MoE (e.g. Qwen/Qwen3.6-35B-A3B) fails HF export at linear fusion if the shared-expert gate is quantized (fusion partners get mismatched formats). On main this was a Qwen-specific introspection pattern (_QWEN36_AUTOQ_DISABLED_LAYERS); promote it to the shared base disabled set so the deprecated --auto_quantize_* CLI (which can't inject arch patterns) also disables it. Harmless elsewhere — matches nothing on non-MoE models. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com>
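The claim that a promoted pattern is harmless on non-MoE models comes down to wildcard matching against module names. The sketch below assumes glob-style matching via the standard library's fnmatch; the patterns and module names are hypothetical and are not the shipped base set.

```python
from fnmatch import fnmatch

# Hypothetical patterns: one generic, one targeting a shared-expert gate.
BASE_DISABLED_PATTERNS = ["*lm_head*", "*shared_expert_gate*"]


def disabled_modules(module_names, patterns):
    """Return the module names matched by at least one wildcard pattern."""
    return [n for n in module_names if any(fnmatch(n, p) for p in patterns)]


moe_names = [
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.0.up_proj",
    "lm_head",
]
dense_names = ["model.layers.0.mlp.up_proj", "lm_head"]

moe_disabled = disabled_modules(moe_names, BASE_DISABLED_PATTERNS)
dense_disabled = disabled_modules(dense_names, BASE_DISABLED_PATTERNS)

print(moe_disabled, dense_disabled)
```

On the dense model the gate pattern matches nothing, so only the generic pattern has any effect.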
10f0691 to 6980725
…rch)
Per review (Wei-Ming): support 'one format + bf16' for AutoQuantize. bf16/no-quant is
always an implicit per-layer choice (mtq appends QuantRecipe(quant_cfg=None)), so a single
explicit format already yields a real {format, bf16} search. Relax the candidate_formats
validator from >=2 to >=1 (only an empty list is rejected). Works for both recipe
(candidate_formats: [fp8]) and the CLI shim (--qformat fp8 --auto_quantize_bits ...).
Updates the field description + README; retargets the loader test (empty rejected,
single accepted).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
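Why a single explicit format already gives a real search can be seen by listing the per-layer choices. In the sketch below `None` stands in for the implicit bf16/no-quant option that the commit says mtq appends; `search_choices` is a hypothetical helper, not the validator in recipe/config.py.

```python
def search_choices(candidate_formats):
    """Per-layer options: every explicit format plus the implicit no-quant choice."""
    if not candidate_formats:
        # Only an empty list is rejected after the >=2 -> >=1 relaxation.
        raise ValueError("candidate_formats must contain at least one format")
    return list(candidate_formats) + [None]  # None = keep the layer in bf16


single_format = search_choices(["fp8"])
two_formats = search_choices(["fp8", "nvfp4"])

print(single_format, two_formats)
```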
What does this PR do?
Type of change: New feature (AutoQuantize recipes). The --auto_quantize_* CLI flags are deprecated but still work (kept as a thin backward-compat shim); they are not removed.

Makes AutoQuantize recipe-driven: mtq.auto_quantize is configured by a declarative YAML recipe (--recipe). The old --auto_quantize_* flags are converted into an AutoQuantizeConfig on the fly and run the exact same recipe path (emitting a DeprecationWarning), so old commands keep working. The recipe path is verified byte-identical to the CLI.

- quantization/config.py, algorithms.py: new effective_bits field on QuantizeConfig (recipe-level override) and QuantizerAttributeConfig (per-format default). estimate_quant_compression resolves recipe-level → per-entry → num_bits heuristic. configs/numerics/nvfp4.yaml ships effective_bits: 4.5 (block-scale-accurate) as the single source of truth.
- recipe/config.py, recipe/loader.py: RecipeType.AUTO_QUANTIZE + AutoQuantizeConfig / AutoQuantizeConstraints / AutoQuantizeCost. Fields: constraints (effective_bits, cost_model, cost.active_moe_expert_ratio), candidate_formats, auto_quantize_method (gradient / kl_div), score_size, disabled_layers, cost_excluded_layers (e.g. VL vision towers), kv_cache.
- examples/hf_ptq/hf_ptq.py: recipe → mtq inputs via _mtq_inputs_from_auto_quantize_config; _match_candidate_to_preset resolves candidates to shipped presets and guards export-compatibility (rejects export-unsafe presets before the search).
- _auto_quantize_config_from_cli builds an AutoQuantizeConfig from the old flags and appends the shared base disabled_layers / cost_excluded_layers (loaded once as module constants in recipe/config.py, mirroring _default_disabled_quantizer_cfg). No model introspection, no new user flags.
- general/auto_quantize/ (nvfp4_fp8_at_5p4bits, nvfp4_fp8_kl_div_at_5p4bits, nvfp4_mse_fp8_at_6p0bits, w4a8_awq_beta_fp8_at_6p0bits, w4a16_nvfp4_fp8_at_6p0bits-active_moe) and model-specific huggingface/qwen3_6_moe/auto_quantize/.... Shared configs/auto_quantize/units/base_disabled_layers + base_cost_excluded_layers spliced via $import.

Migration (deprecated flag → recipe field):
--auto_quantize_bits → constraints.effective_bits · --auto_quantize_method → auto_quantize_method · --auto_quantize_score_size → score_size · --auto_quantize_cost_model → constraints.cost_model · --auto_quantize_active_moe_expert_ratio → constraints.cost.active_moe_expert_ratio · --qformat fp8,nvfp4 → candidate_formats. --auto_quantize_checkpoint unchanged.

Usage
Testing
- mtq.auto_quantize mapping incl. cost_excluded_layers; export-compat guard (reject/warn/no-bypass); deprecated-CLI → AutoQuantizeConfig conversion; effective_bits resolver + validators.
- (fp8 + w4a16_nvfp4 @ 6.0, active_moe) → identical hf_quant_config.json across CLI/recipe; also confirmed the deprecated CLI shim ≡ recipe on the same VL MoE.

Before your PR is "Ready for review"
- --auto_quantize_* flags are deprecated but still work (converted to a recipe on the fly + DeprecationWarning). Plain PTQ CLI unaffected.

/claude review

🤖 Generated with Claude Code