fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860
Open
dinhxuanvu wants to merge 1 commit into
Open
fix: prevent UnboundLocalError masking real errors in fsdp2_aware_weight_update#1860dinhxuanvu wants to merge 1 commit into
dinhxuanvu wants to merge 1 commit into
Conversation
…ght_update The context manager's finally block referenced fsdp_param_mapping, fsdp_param_group, and root_module unconditionally, but those locals are only bound after the in-body unshard(). When an exception (commonly a CUDA OOM from unsharding a large FSDP block onto one GPU) fired before they were bound, the finally raised UnboundLocalError, which replaced and masked the real error. Pre-bind the three locals and guard the FSDPParam writeback/reshard so an early failure leaves them as no-ops and the original exception propagates unchanged. Add a regression test that forces the in-body unshard to raise and asserts the original error surfaces instead of UnboundLocalError. Signed-off-by: Vu Dinh <vudinh@outlook.com>
Contributor
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthrough
fsdp2_aware_weight_update error propagation fix
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1859
What
fsdp2_aware_weight_update'sfinallyblock usesfsdp_param_mapping,fsdp_param_group, androot_module, which are only set after theunshard()inside thetry. Ifunshard()fails first (e.g. a CUDA OOM on a large block), thefinallyraisesUnboundLocalError— and since that happens insidefinally, it replaces the real error. The user sees a confusingUnboundLocalErrorinstead of the actual OOM.Fix
fsdp_param_mapping = {},fsdp_param_group = None,root_module = Nonebefore thetry.On an early failure the
finallyis now a no-op and the original error propagates. The success path is unchanged.Test
Added
test_fsdp2_weight_update_context_error_propagates: forces the in-bodyunshard()to raise and asserts the original error surfaces (notUnboundLocalError). Fails on the old code, passes with the fix.Summary by CodeRabbit
Bug Fixes
Tests