Skip to content

Empty eof race edge#156

Open
novatechflow wants to merge 2 commits into
KafScale:mainfrom
novatechflow:empty-eof-race-edge
Open

Empty eof race edge#156
novatechflow wants to merge 2 commits into
KafScale:mainfrom
novatechflow:empty-eof-race-edge

Conversation

@novatechflow

Copy link
Copy Markdown
Collaborator

Summary

Fixes a reliability issue in the proxy's Fetch handling where transport/decode errors from the backend (e.g. read frame size: EOF or decode fetch response v13: ...) would immediately return REQUEST_TIMED_OUT to clients with no retry on a fresh connection.

This matches the symptoms in #155: after a routine proxy restart/rollout, consumers would see 0 records (while ListOffsets continued to work) and would stall until a coordinated broker + proxy restart.

Root cause

forwardFetch (and fanOutFetch) only retried on NOT_LEADER_OR_FOLLOWER errors from successfully parsed responses. Any error from forwardToBackend (ReadFrame) or parseFetchResponse was treated as terminal.

Changes

  • In forwardFetch: on backend transport/decode errors, collect the affected partitions for regroup + retry (using the existing maxRetries loop). Only emit REQUEST_TIMED_OUT after exhausting retries. triedBackends ensures we avoid immediately re-using a connection that just failed.
  • This allows the proxy to discard a stale/half-open backend connection and retry on a fresh one without changing the wire protocol or requiring matched broker/proxy builds.
  • Kept the original NOT_LEADER_OR_FOLLOWER retry behavior and router invalidation unchanged.

Testing

Added TestForwardFetchRetriesOnBackendError (in cmd/proxy/main_test.go).

The test:

  • Spins up a fake backend that accepts a forwarded Fetch, then closes the connection on the first response (exactly reproducing the observed EOF/decode failure).
  • Verifies that the proxy internally retries (≥2 backend attempts) and returns a successful response (no REQUEST_TIMED_OUT partitions) instead of failing immediately.

Also cleaned up verbose comments in the changed code.

Related

Fixes #155 (core "retries only the wrong error class" + lack of resilience to connection-level Fetch failures). If merged it closes #155

How to verify

  • Unit test: go test ./cmd/proxy -run TestForwardFetchRetriesOnBackendError -count=1 -v
  • The existing fetch routing / fan-out tests continue to pass.
  • In a real deployment: restart the proxy while consumers are active; consume round-trips should continue to return records without requiring a broker restart.

Adds TestForwardFetchRetriesOnBackendError which simulates the exact EOF/decode failure from the bug (first backend response closes conn) and asserts the fix: retries on fresh conn and returns non-TIMED_OUT response.

Also includes comment cleanups.
@novatechflow novatechflow requested review from kamir and klaudworks June 10, 2026 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fetch returns empty/EOF (v13 decode) after a proxy restart — consume silently stalls until a coordinated restart

1 participant