Empty eof race edge#156
Open
novatechflow wants to merge 2 commits into
Open
Conversation
Adds TestForwardFetchRetriesOnBackendError which simulates the exact EOF/decode failure from the bug (first backend response closes conn) and asserts the fix: retries on fresh conn and returns non-TIMED_OUT response. Also includes comment cleanups.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a reliability issue in the proxy's Fetch handling where transport/decode errors from the backend (e.g.
read frame size: EOFordecode fetch response v13: ...) would immediately returnREQUEST_TIMED_OUTto clients with no retry on a fresh connection.This matches the symptoms in #155: after a routine proxy restart/rollout, consumers would see 0 records (while
ListOffsetscontinued to work) and would stall until a coordinated broker + proxy restart.Root cause
forwardFetch(andfanOutFetch) only retried onNOT_LEADER_OR_FOLLOWERerrors from successfully parsed responses. Any error fromforwardToBackend(ReadFrame) orparseFetchResponsewas treated as terminal.Changes
forwardFetch: on backend transport/decode errors, collect the affected partitions for regroup + retry (using the existingmaxRetriesloop). Only emitREQUEST_TIMED_OUTafter exhausting retries.triedBackendsensures we avoid immediately re-using a connection that just failed.NOT_LEADER_OR_FOLLOWERretry behavior and router invalidation unchanged.Testing
Added
TestForwardFetchRetriesOnBackendError(incmd/proxy/main_test.go).The test:
REQUEST_TIMED_OUTpartitions) instead of failing immediately.Also cleaned up verbose comments in the changed code.
Related
Fixes #155 (core "retries only the wrong error class" + lack of resilience to connection-level Fetch failures). If merged it closes #155
How to verify
go test ./cmd/proxy -run TestForwardFetchRetriesOnBackendError -count=1 -v