Puzzletron tutorial fixes for runtime optimization#1803

Open
grzegorz-k-karch wants to merge 8 commits into
mainfrom
gkarch/puzzletron-tutorial-fixes

@grzegorz-k-karch grzegorz-k-karch commented Jun 23, 2026

What does this PR do?

Type of change: Bug fix

Fixes several issues in the runtime optimization tutorial:

  • Fixed an out-of-memory (OOM) error by reducing GPU memory utilization.
  • Fixed the AnyModel config export for vLLM: the config is now handled as a namespace instead of a dict so that it is read correctly.
  • Fixed the "validate_model_defaults not found" error: runtime optimization now has its own config files instead of reusing the memory optimization files.
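
The namespace-versus-dict point can be illustrated with a minimal sketch. It assumes the consumer reads config fields by attribute (for example `config.hidden_size`), which a plain dict does not support. The helper `load_config_as_namespace` and the field names are hypothetical and are not taken from this PR.

```python
import json
from types import SimpleNamespace


def load_config_as_namespace(text: str) -> SimpleNamespace:
    """Parse JSON so every nested object becomes a SimpleNamespace (hypothetical helper)."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


# Illustrative config content; the keys are assumptions, not the PR's actual schema.
raw = '{"hidden_size": 4096, "block": {"num_heads": 32}}'

as_dict = json.loads(raw)
as_namespace = load_config_as_namespace(raw)

# Attribute access works on the namespace, including nested objects.
print(as_namespace.hidden_size, as_namespace.block.num_heads)

# The same attribute access on a dict raises AttributeError.
try:
    as_dict.hidden_size
except AttributeError as err:
    print("dict has no attribute access:", err)
```

Code that expects attribute access fails on a dict, which is one plausible way a config could be read incorrectly.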

Usage

(does not apply)

Testing

Tested by running the whole pipeline as described in the tutorial

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: ❌
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features
    • Added runtime pruning presets for attention heads, FFN channels, and hidden dimensions.
    • Added/updated default validation and solution-validation configs for the Llama 3.1 8B pruning workflow.
    • Added support for converting model configs to a vLLM-compatible “AnyModel” format and capping GPU memory usage during latency benchmarks.
  • Bug Fixes
    • Updated pruning/validation presets to use the new validation-based configuration flow.
    • Reduced scoring evaluation samples for faster runs.

@grzegorz-k-karch grzegorz-k-karch self-assigned this Jun 23, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52b0eff7-d1c9-470f-abb1-75d9ccc9f5b2

📥 Commits

Reviewing files that changed from the base of the PR and between bfb3619 and 32bd535.

📒 Files selected for processing (1)
  • modelopt/torch/puzzletron/subblock_stats/runtime_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/puzzletron/subblock_stats/runtime_utils.py

📝 Walkthrough

Adds vLLM GPU memory utilization support and config conversion helpers, and introduces new Llama-3.1-8B pruneffn runtime validation and pruning YAML defaults.

Changes

vLLM GPU Memory Utilization Support

| Layer / File(s) | Summary |
| --- | --- |
| RuntimeConfig field and calc_runtime_stats wiring: modelopt/torch/puzzletron/subblock_stats/runtime_utils.py, modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py | Adds gpu_memory_utilization to RuntimeConfig and passes the value from runtime_stats_config into construction. |
| convert_config_to_vllm_anymodel helper and runtime_vllm.py update: modelopt/torch/puzzletron/subblock_stats/runtime_utils.py, modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py | Adds config.json-to-AnyModel conversion, updates config serialization, and passes --gpu-memory-utilization to the vLLM benchmark command. |
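
As a rough illustration of the wiring described above, the sketch below carries a `gpu_memory_utilization` value on a config object and forwards it as `--gpu-memory-utilization`. The `RuntimeConfig` shown here is a hypothetical stand-in with an assumed `model_path` field, `build_benchmark_command` is an invented helper, and the `vllm bench latency` entry point is an assumption about how the benchmark is launched. None of it is copied from the PR.

```python
from dataclasses import dataclass


@dataclass
class RuntimeConfig:
    """Hypothetical stand-in for the PR's RuntimeConfig; only the relevant field is shown."""

    model_path: str  # assumed field, for illustration only
    gpu_memory_utilization: float = 0.5  # 0.5 is the value shown in the PR's sequence diagram


def build_benchmark_command(cfg: RuntimeConfig) -> list[str]:
    """Build an argv list for a vLLM latency benchmark (entry point is an assumption)."""
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "bench", "latency",
        "--model", cfg.model_path,
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
    ]


# The command is only printed here, not executed.
print(build_benchmark_command(RuntimeConfig(model_path="/path/to/checkpoint")))
```

Capping the fraction below vLLM's default leaves headroom for other processes on the same GPU, which is consistent with the OOM fix described in the PR summary.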

Llama-3.1-8B pruneffn Runtime YAML Configs

| Layer / File(s) | Summary |
| --- | --- |
| Validation model and solution defaults: examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml, examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml | Adds validation runtime defaults and solution-validation controls. |
| Pruning defaults base config: examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml | Adds shared pruning defaults and pruning mode configuration. |
| Attention, FFN, and hidden-dim pruning strategy configs: examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml, examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yaml, examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yaml | Adds three pruning strategy configs for attention heads, FFN channels, and hidden dimensions. |
| Top-level Llama-3_1-8B.yaml defaults update: examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml | Rewires defaults to the new validation defaults and lowers scoring eval samples. |

Estimated code review effort: 2 (Simple) | ~12 minutes

Sequence Diagram(s)

sequenceDiagram
  participant calc_runtime_for_subblocks
  participant RuntimeConfig
  participant run_vllm_latency_benchmark
  participant convert_config_to_vllm_anymodel
  participant vLLM subprocess

  calc_runtime_for_subblocks->>RuntimeConfig: construct with gpu_memory_utilization=0.5
  run_vllm_latency_benchmark->>convert_config_to_vllm_anymodel: load config.json and rewrite AnyModel config
  run_vllm_latency_benchmark->>vLLM subprocess: pass --gpu-memory-utilization
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is concise and clearly relates to the PR’s runtime optimization fixes, though it is broader than the specific config and vLLM changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | The PR diff adds no forbidden patterns; added Python lines contain no torch.load/numpy.load/trust_remote_code/eval/exec/nosec usage. |

Comment @coderabbitai help to get the list of available commands.

@grzegorz-k-karch grzegorz-k-karch marked this pull request as ready for review June 23, 2026 09:33
@grzegorz-k-karch grzegorz-k-karch requested a review from a team as a code owner June 23, 2026 09:33
@github-actions

github-actions Bot commented Jun 23, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1803/

Built to branch gh-pages at 2026-07-03 08:37 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 23, 2026

Codecov Report

❌ Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.54%. Comparing base (9038b71) to head (32bd535).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...t/torch/puzzletron/subblock_stats/runtime_utils.py | 0.00% | 20 Missing ⚠️ |
| ...pt/torch/puzzletron/subblock_stats/runtime_vllm.py | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   70.21%   62.54%   -7.68%     
==========================================
  Files         515      516       +1     
  Lines       57244    57511     +267     
==========================================
- Hits        40196    35970    -4226     
- Misses      17048    21541    +4493     
| Flag | Coverage Δ |
| --- | --- |
| unit | 54.89% <0.00%> (-0.01%) ⬇️ |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
modelopt/torch/puzzletron/subblock_stats/runtime_utils.py (1)

91-114: 📐 Maintainability & Code Quality | 🔵 Trivial

Add return type annotation and document (or parameterize) hardcoded Llama architecture assumption.

  1. Return type hint: Add -> None to the function signature (line 93). The function has no explicit return statement.

  2. Hardcoded base_architecture: Line 107 unconditionally sets base_architecture = "LlamaForCausalLM". This module is Llama-specific (imports LlamaForCausalLM and LlamaModelDescriptor), so the hardcoding appears intentional. Either add a docstring note clarifying this function is Llama-specific, or accept base_architecture as a parameter if broader model support is planned.
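
The parameterized alternative suggested in this comment might look like the minimal sketch below. Only the function name, the `-> None` annotation, and the `"LlamaForCausalLM"` default come from the review. The `config_path` parameter, the `base_architecture` JSON key, and the whole body are assumptions made for illustration, not the PR's implementation.

```python
import json
import tempfile
from pathlib import Path


def convert_config_to_vllm_anymodel(
    config_path: Path,  # hypothetical parameter name
    base_architecture: str = "LlamaForCausalLM",
) -> None:
    """Rewrite config.json in place (illustrative body only).

    The default keeps the Llama-specific behavior; callers may pass another
    architecture name. The key written below is an assumption for this sketch.
    """
    config = json.loads(config_path.read_text())
    config["base_architecture"] = base_architecture  # hypothetical key
    config_path.write_text(json.dumps(config, indent=2))


# Usage on a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"hidden_size": 4096}))
    convert_config_to_vllm_anymodel(path)
    print(path.read_text())
```

A default argument keeps existing call sites unchanged while making the Llama assumption visible in the signature.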

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/subblock_stats/runtime_utils.py` around lines 91 -
114, The function convert_config_to_vllm_anymodel is missing a return type
annotation and has a hardcoded assumption about the model architecture. First,
add the return type hint -> None to the function signature since the function
does not explicitly return any value. Second, address the hardcoded
base_architecture assignment that unconditionally sets it to "LlamaForCausalLM".
Either add documentation in the function's docstring to clarify that this
function is Llama-specific and explain why the architecture is hardcoded, or
alternatively, parameterize the base_architecture by accepting it as an optional
function parameter with a default value to allow for broader model support in
the future.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c26ffda4-6f74-455d-a721-7a5ed0be45e2

📥 Commits

Reviewing files that changed from the base of the PR and between c3b913b and 8b02d7a.

📒 Files selected for processing (10)
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml
  • modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py
  • modelopt/torch/puzzletron/subblock_stats/runtime_utils.py
  • modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Comment thread modelopt/torch/puzzletron/subblock_stats/runtime_utils.py Outdated
@grzegorz-k-karch grzegorz-k-karch requested a review from a team July 2, 2026 14:04
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
grzegorz-k-karch and others added 2 commits July 2, 2026 17:11
Added a TODO comment to extend support for other models.

Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>