fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update by dinhxuanvu · Pull Request #1860 · NVIDIA/Model-Optimizer

dinhxuanvu · 2026-06-30T02:41:58Z

What

fsdp2_aware_weight_update's finally block uses fsdp_param_mapping, fsdp_param_group, and root_module, which are only set after the unshard() inside the try. If unshard() fails first (e.g. a CUDA OOM on a large block), the finally raises UnboundLocalError — and since that happens inside finally, it replaces the real error. The user sees a confusing UnboundLocalError instead of the actual OOM.

Fix

Set fsdp_param_mapping = {}, fsdp_param_group = None, root_module = None before the try.
Guard the writeback/reshard so they're skipped when those values were never set.

On an early failure the finally is now a no-op and the original error propagates. The success path is unchanged.

Test

Added test_fsdp2_weight_update_context_error_propagates: forces the in-body unshard() to raise and asserts the original error surfaces (not UnboundLocalError). Fails on the old code, passes with the fix.

Summary by CodeRabbit

Bug Fixes
- Fixed a crash path in weight update handling where an earlier error could be hidden by a secondary cleanup error.
- Original failures now propagate properly, even if sharding/unsharding fails partway through.
Tests
- Added a regression test to confirm the original error is preserved and cleanup behavior remains safe.

…ght_update The context manager's finally block referenced fsdp_param_mapping, fsdp_param_group, and root_module unconditionally, but those locals are only bound after the in-body unshard(). When an exception (commonly a CUDA OOM from unsharding a large FSDP block onto one GPU) fired before they were bound, the finally raised UnboundLocalError, which replaced and masked the real error. Pre-bind the three locals and guard the FSDPParam writeback/reshard so an early failure leaves them as no-ops and the original exception propagates unchanged. Add a regression test that forces the in-body unshard to raise and asserts the original error surfaces instead of UnboundLocalError. Signed-off-by: Vu Dinh <vudinh@outlook.com>

copy-pr-bot · 2026-06-30T02:42:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-30T02:43:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0efde7e3-ec7f-47e4-841e-506d132d068d

📥 Commits

Reviewing files that changed from the base of the PR and between 72651b2 and 9d7f453.

📒 Files selected for processing (3)

CHANGELOG.rst
modelopt/torch/quantization/utils/core_utils.py
tests/gpu/torch/export/test_fsdp2_export.py

📝 Walkthrough

Walkthrough

fsdp2_aware_weight_update now pre-binds fsdp_param_mapping, fsdp_param_group, and root_module to None before the try block, and guards their use in the finally block with None checks. A regression test monkeypatches unshard to raise a sentinel error and asserts it propagates. A changelog entry documents the fix.

fsdp2_aware_weight_update error propagation fix

Layer / File(s)	Summary
Pre-bind locals and guard finally block `modelopt/torch/quantization/utils/core_utils.py`	`fsdp_param_mapping`, `fsdp_param_group`, and `root_module` are initialized to `None` before the `try` body. The `finally` block updates `fsdp_params` and calls `reshard()` only when those variables are non-`None`.
Regression test and changelog `tests/gpu/torch/export/test_fsdp2_export.py`, `CHANGELOG.rst`	`_update_weight_error_propagates_test` shards a `ToyModel`, patches `unshard` to raise `_SentinelError`, and asserts it propagates unchanged. `test_fsdp2_weight_update_context_error_propagates` wires this into the dist_workers harness. Changelog documents the fix.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Security Anti-Patterns	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main fix in fsdp2_aware_weight_update and matches the PR contents.
Linked Issues check	✅ Passed	The code change and regression test directly address issue `#1859` by preserving the original error when unshard fails.
Out of Scope Changes check	✅ Passed	The changes are scoped to the bug fix, regression test, and changelog note, with no unrelated additions.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

dinhxuanvu requested review from a team as code owners June 30, 2026 02:41

dinhxuanvu requested a review from chadvoegele June 30, 2026 02:41

coderabbitai Bot approved these changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860

fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860
dinhxuanvu wants to merge 1 commit into
NVIDIA:mainfrom
dinhxuanvu:vdinh/fix-fsdp2-weight-update-unboundlocal

dinhxuanvu commented Jun 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Jun 30, 2026

Uh oh!

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dinhxuanvu commented Jun 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Fix

Test

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 30, 2026

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dinhxuanvu commented Jun 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading