Skip to content

fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860

Open
dinhxuanvu wants to merge 1 commit into
NVIDIA:mainfrom
dinhxuanvu:vdinh/fix-fsdp2-weight-update-unboundlocal
Open

fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860
dinhxuanvu wants to merge 1 commit into
NVIDIA:mainfrom
dinhxuanvu:vdinh/fix-fsdp2-weight-update-unboundlocal

Conversation

@dinhxuanvu

@dinhxuanvu dinhxuanvu commented Jun 30, 2026

Copy link
Copy Markdown

Closes #1859

What

fsdp2_aware_weight_update's finally block uses fsdp_param_mapping, fsdp_param_group, and root_module, which are only set after the unshard() inside the try. If unshard() fails first (e.g. a CUDA OOM on a large block), the finally raises UnboundLocalError — and since that happens inside finally, it replaces the real error. The user sees a confusing UnboundLocalError instead of the actual OOM.

Fix

  • Set fsdp_param_mapping = {}, fsdp_param_group = None, root_module = None before the try.
  • Guard the writeback/reshard so they're skipped when those values were never set.

On an early failure the finally is now a no-op and the original error propagates. The success path is unchanged.

Test

Added test_fsdp2_weight_update_context_error_propagates: forces the in-body unshard() to raise and asserts the original error surfaces (not UnboundLocalError). Fails on the old code, passes with the fix.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a crash path in weight update handling where an earlier error could be hidden by a secondary cleanup error.
    • Original failures now propagate properly, even if sharding/unsharding fails partway through.
  • Tests

    • Added a regression test to confirm the original error is preserved and cleanup behavior remains safe.

…ght_update

The context manager's finally block referenced fsdp_param_mapping,
fsdp_param_group, and root_module unconditionally, but those locals are only
bound after the in-body unshard(). When an exception (commonly a CUDA OOM from
unsharding a large FSDP block onto one GPU) fired before they were bound, the
finally raised UnboundLocalError, which replaced and masked the real error.

Pre-bind the three locals and guard the FSDPParam writeback/reshard so an early
failure leaves them as no-ops and the original exception propagates unchanged.
Add a regression test that forces the in-body unshard to raise and asserts the
original error surfaces instead of UnboundLocalError.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
@dinhxuanvu dinhxuanvu requested review from a team as code owners June 30, 2026 02:41
@dinhxuanvu dinhxuanvu requested a review from chadvoegele June 30, 2026 02:41
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0efde7e3-ec7f-47e4-841e-506d132d068d

📥 Commits

Reviewing files that changed from the base of the PR and between 72651b2 and 9d7f453.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/torch/quantization/utils/core_utils.py
  • tests/gpu/torch/export/test_fsdp2_export.py

📝 Walkthrough

Walkthrough

fsdp2_aware_weight_update now pre-binds fsdp_param_mapping, fsdp_param_group, and root_module to None before the try block, and guards their use in the finally block with None checks. A regression test monkeypatches unshard to raise a sentinel error and asserts it propagates. A changelog entry documents the fix.

fsdp2_aware_weight_update error propagation fix

Layer / File(s) Summary
Pre-bind locals and guard finally block
modelopt/torch/quantization/utils/core_utils.py
fsdp_param_mapping, fsdp_param_group, and root_module are initialized to None before the try body. The finally block updates fsdp_params and calls reshard() only when those variables are non-None.
Regression test and changelog
tests/gpu/torch/export/test_fsdp2_export.py, CHANGELOG.rst
_update_weight_error_propagates_test shards a ToyModel, patches unshard to raise _SentinelError, and asserts it propagates unchanged. test_fsdp2_weight_update_context_error_propagates wires this into the dist_workers harness. Changelog documents the fix.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Security Anti-Patterns ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main fix in fsdp2_aware_weight_update and matches the PR contents.
Linked Issues check ✅ Passed The code change and regression test directly address issue #1859 by preserving the original error when unshard fails.
Out of Scope Changes check ✅ Passed The changes are scoped to the bug fix, regression test, and changelog note, with no unrelated additions.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fsdp2_aware_weight_update hides the real error with an UnboundLocalError

1 participant