HDDS-15651. Roll back DiskBalancer move when markContainerForDelete fails #10593

Draft
arunsarin85 wants to merge 2 commits into apache:master from arunsarin85:HDDS-15651
What changes were proposed in this pull request?

When markContainerForDelete() fails after a container has been copied to the destination volume, treat the move as a failure instead of success.
Restore ContainerSet to the source replica, revert destination volume accounting, delete the destination replica directory, and do not queue the source replica for delayed deletion. Add a regression test.

Please describe your PR in detail:
Bug: DiskBalancer reported a successful move even when markContainerForDelete() failed on the source replica.
Fix: On mark failure, the move is rolled back and counted as a failure.

| Before (bug) | After (fix) |
| --- | --- |
| moveSucceeded = true set before calling markContainerForDelete() | moveSucceeded = true only after mark succeeds |
| Success metrics updated regardless of mark outcome | Success metrics updated only on full success |
| ContainerSet kept pointing at destination replica | ContainerSet restored to source replica |
| Destination volume used space left incremented | Destination used space decremented |
| Destination replica directory left on disk | Destination replica directory deleted |
| Source replica queued for delayed deletion | Source replica not queued |
| Log: "It will be handled after DN restart" | Log: "Rolling back move" |
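
For reviewers who want the control flow at a glance, here is a minimal, self-contained sketch of the rollback described above. It is not the patch itself: `MoveRollbackSketch`, `Replica`, and the in-memory maps and sets are hypothetical stand-ins for ContainerSet, volume accounting, the on-disk replica directory, and the delayed-deletion queue. Only the ordering (success flag set after the mark, rollback on failure) is taken from this description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the move; these are not the Ozone classes. */
public class MoveRollbackSketch {

  /** Stand-in for one container replica on one volume. */
  static class Replica {
    final String volume;
    final boolean failOnMark; // simulates markContainerForDelete() throwing

    Replica(String volume, boolean failOnMark) {
      this.volume = volume;
      this.failOnMark = failOnMark;
    }

    void markContainerForDelete() {
      if (failOnMark) {
        throw new IllegalStateException("injected mark failure");
      }
    }
  }

  // In-memory stand-ins (assumptions) for the state the PR description mentions.
  final Map<Long, Replica> containerSet = new HashMap<>();
  final Map<String, Long> usedBytes = new HashMap<>();
  final Set<String> replicaDirs = new HashSet<>();
  final List<Replica> delayedDeletionQueue = new ArrayList<>();
  long successCount;
  long failureCount;

  boolean move(long id, Replica source, String destVolume, long size) {
    boolean moveSucceeded = false;

    // Copy to the destination and point the container set at the new replica.
    Replica dest = new Replica(destVolume, false);
    String destDir = destVolume + "/" + id;
    replicaDirs.add(destDir);
    usedBytes.merge(destVolume, size, Long::sum);
    containerSet.put(id, dest);

    try {
      source.markContainerForDelete();
      moveSucceeded = true;             // set only after the mark succeeds
      delayedDeletionQueue.add(source); // source is queued only on full success
    } catch (RuntimeException e) {
      // Rolling back move: undo everything done for the destination replica.
      containerSet.put(id, source);
      usedBytes.merge(destVolume, -size, Long::sum);
      replicaDirs.remove(destDir);
    }

    if (moveSucceeded) {
      successCount++;
    } else {
      failureCount++;
    }
    return moveSucceeded;
  }

  public static void main(String[] args) {
    MoveRollbackSketch sketch = new MoveRollbackSketch();
    Replica source = new Replica("vol-src", true);
    sketch.containerSet.put(1L, source);
    boolean moved = sketch.move(1L, source, "vol-dst", 100L);
    System.out.println("moved=" + moved + ", failures=" + sketch.failureCount);
  }
}
```

The sketch catches `RuntimeException` purely for illustration; which exception types the real code handles is not stated here.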

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15651

How was this patch tested?

Regression test
TestDiskBalancerTask.moveSucceedsDespiteMarkContainerForDeleteFailure (HDDS-15651):

Creates a CLOSED container on the source volume
Sets replicaDeletionDelay = 60_000 ms so delayed deletion does not hide duplicate-replica bugs
Spies on the source KeyValueContainer and makes markContainerForDelete() throw
Runs DiskBalancerTask.call()
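
The fault-injection step in the list above can be illustrated without any mocking library. The sketch below is an assumption-laden stand-in, not the actual test: `FakeContainer`, `failingOnMark`, and `MoveTask` are hypothetical names, and the override plays the role a spy would play on the real source container.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of injecting a mark failure; not the real test code. */
public class MarkFailureInjectionSketch {

  /** Stand-in for the source container; counts how often mark is invoked. */
  static class FakeContainer {
    final AtomicInteger markCalls = new AtomicInteger();

    void markContainerForDelete() {
      markCalls.incrementAndGet();
    }
  }

  /** Hand-rolled "spy": a real object with one method overridden to throw. */
  static FakeContainer failingOnMark() {
    return new FakeContainer() {
      @Override
      void markContainerForDelete() {
        markCalls.incrementAndGet();
        throw new IllegalStateException("injected mark failure");
      }
    };
  }

  /** Stand-in for the task under test: reports success only if the mark succeeds. */
  static class MoveTask {
    private final FakeContainer source;

    MoveTask(FakeContainer source) {
      this.source = source;
    }

    boolean call() {
      try {
        source.markContainerForDelete();
        return true;
      } catch (RuntimeException e) {
        return false; // the real task would roll back the move at this point
      }
    }
  }

  public static void main(String[] args) {
    FakeContainer source = failingOnMark();
    boolean moved = new MoveTask(source).call();
    System.out.println("moved=" + moved + ", markCalls=" + source.markCalls.get());
  }
}
```

A test built this way would then assert on the observable state after `call()` returns, for example that the move was reported as failed and that the mark was attempted exactly once.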

repro_HDDS_markContainerForDelete_BEFORE_fix.log

repro_HDDS_markContainerForDelete_AFTER_fix.log

arunsarin85 marked this pull request as draft June 24, 2026 15:40