fix(libev): guard handle_read/handle_write against close() race condition#889
fix(libev): guard handle_read/handle_write against close() race condition#889vponomaryov wants to merge 1 commit into
Conversation
|
Verified the test fix here: Test run without this fix (for comparison) is here: Comment from the "zeus" comparing the fixed test run and issue affected: @zeus posted:
@fruch , @soyacz ^ |
|
Note that the @scylladb/python-driver-maint please merge and release it. |
| elif self.is_closed or self.is_defunct: | ||
| # Socket was closed by another thread between the | ||
| # watcher firing and us calling send(). This is the | ||
| # race described in scylladb/python-driver#614. | ||
| return |
There was a problem hiding this comment.
I don't see why this is necessary. Without this, we will go into else branch, call self.defunct(err) which will do nothing, and then return.
There was a problem hiding this comment.
Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.
But, the guard is still valuable because of the following reasons:
-
Without the guard we will get missleading log message if the error is
EBADF(which is in_PEER_DISCONNECT_ERRNOS), the code falls into the_PEER_DISCONNECT_ERRNOSbranch first (it comes beforeelse), logging"Connection %s closed by peer during write".
But the socket wasn't closed by the peer - we closed it viaclose(). -
The
_PEER_DISCONNECT_ERRNOSbranch callsself.close(), which is a no-op when already closed but still acquiresself.lock- a minor unnecessary overhead. -
The guard makes the race condition handling self-documenting.
Without it, a reader must trace through multiple branches to confirm the behavior is safe.
So, I find it rather useful than redundant.
There was a problem hiding this comment.
Ah, thanks for the explanation, I did not pay much attention to this path. Now I see a different potential issue, mentioned by Avi ( #614 (comment) )
If we are receiving EBADF, then we performed some action (in this case READ / WRITE) on a closed file descriptor. As far as I know, file descriptors can be reused. So, if another connection was opened between this one being closed and watchers exiting, it makes it possible for us to accidentaly read / write to a totally unrelated connection (or even some file opened by a user app). Am I wrong?
If this is true, then this patch hides the immediate symptom of the problem, but also hides a much more dangerous problem that could cause data corruption.
I don't know the libev connection impl that well, but maybe the proper approach is to, in close, first stop the watchers somehow and make sure that libev won't touch the fd anymore, and only then close the fd?
There was a problem hiding this comment.
Situation where we accidentally read/write to a different connection is, probably, architecturally valid for C/C++ programs using raw fds, but doesn't apply in case of Python.
The Python socket abstraction prevents it.
Python's socket.socket object tracks state internally - after .close(), any operation on that object raises EBADF regardless of whether the OS reused the fd number for something else.
The self._socket reference always points to the old (closed) Python object, never to the new socket that happened to get the same fd.
There was a problem hiding this comment.
libev get a raw fd, not PYthon socket, but the watchers themselves use the Python socket so I guess this issue really can't happen.
The watchers also seem to be correctly removed from the loop in close (connection_destroyed), so they won't keep triggering forever. I think your change should be fine.
There was a problem hiding this comment.
@dkropachev
Take over the PR please.
I don't see a point in me guessing the expectations.
Update it as you wish and just make sure it fixes the issue.
There was a problem hiding this comment.
@dkropachev you have other fix to this issue ? or we'll need to revert to the last known working version, with this bug, across the broad ?
There was a problem hiding this comment.
What is the version that worked well?
There was a problem hiding this comment.
3.29.5 is the first version we started seeing it, and it was also when we moved to python 3.14
if you have better fix for this problem, go for it.
but @vponomaryov have proven that this PR is fixing this problem.
this problem is generating lots of noise, and the slowdowns we have by retrying on it from the client end, is wasting us lots of money (in some cases it's hours and hours of retries)
There was a problem hiding this comment.
It is not fixing it, socket gets broken by unknown reason and it is only hides it.
| elif self.is_closed or self.is_defunct: | ||
| # Socket was closed by another thread between the watcher | ||
| # firing and us calling recv(). This is the race described | ||
| # in scylladb/python-driver#614. | ||
| return |
There was a problem hiding this comment.
Without this, if the error is in _PEER_DISCONNECT_ERRNOS (EBADF), the code would execute following:
self.last_error = ConnectionShutdown(
"Connection to %s was closed by peer" % self.endpoint)
self.close()It will overwrite the last_error with an incorrect closed by peer message when in reality close() was already called by us.
Since last_error is used by factory() to detect why a connection died, this overwrite could mask the real error.
| class LibevPeerDisconnectErrnos(unittest.TestCase): | ||
| """Verify _PEER_DISCONNECT_ERRNOS contains expected values.""" | ||
|
|
||
| def setUp(self): | ||
| _skip_if_unavailable(self) | ||
|
|
||
| def test_contains_ebadf(self): | ||
| self.assertIn(errno.EBADF, _PEER_DISCONNECT_ERRNOS) | ||
|
|
||
| def test_contains_econnreset(self): | ||
| self.assertIn(errno.ECONNRESET, _PEER_DISCONNECT_ERRNOS) | ||
|
|
||
| def test_contains_econnaborted(self): | ||
| self.assertIn(errno.ECONNABORTED, _PEER_DISCONNECT_ERRNOS) | ||
|
|
||
| def test_contains_enotconn(self): | ||
| self.assertIn(errno.ENOTCONN, _PEER_DISCONNECT_ERRNOS) | ||
|
|
||
| def test_contains_eshutdown(self): | ||
| self.assertIn(errno.ESHUTDOWN, _PEER_DISCONNECT_ERRNOS) |
There was a problem hiding this comment.
I see no point in having such tests. It just cements the current code, not testing the behavior.
|
@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open? |
I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884 |
|
@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?
So, I need to rebase the PR? |
Tbh I don't care that much.
If you could then it would be great. I know you didn't touch asyncio, but let's just be sure. Change in one backend affecting another backend would not be the weirdest thing we saw in this driver. |
5dbc94b to
ad60335
Compare
|
Rebased. |
|
@dkropachev Please take a look at this. |
| elif self.is_closed or self.is_defunct: | ||
| # Socket was closed by another thread between the | ||
| # watcher firing and us calling send(). This is the | ||
| # race described in scylladb/python-driver#614. | ||
| return |
There was a problem hiding this comment.
Agreed, but you still need to route error handling to defunct
One of the tests was enabled that uses TLS with |
how asyncio is related to this fix ? |
|
@dkropachev see my answer in the respective comment thread for your comment. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughFixes a libevreactor race where watcher callbacks can run after close(), classifies peer-disconnect errno values as clean closes, adds early-return guards in read/write handlers, ensures ConnectionShutdown propagates to in-flight requests, and adds unit tests covering races and errno handling. ChangesLibev connection close race condition fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 59-82: The tearDown() must reset the module-global libev reactor
state so connections created by _make_connection() don't leak into other tests;
update tearDown() to access the LibevConnection module-global _global_loop and
clear its registries (e.g. set _global_loop._live_conns and
_global_loop._new_conns to empty lists or otherwise remove registered
connections) and then null out the global (e.g. set LibevConnection._global_loop
= None) so initialize_reactor() starts fresh for the next test run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9ad1e813-c716-48db-9847-2486199d0ab9
📒 Files selected for processing (2)
cassandra/io/libevreactor.pytests/unit/io/test_libevreactor_close_race.py
9530ab9 to
a5bb694
Compare
|
Updated:
Unchanged:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cassandra/io/libevreactor.py`:
- Around line 45-48: The _PEER_DISCONNECT_ERRNOS set is missing errno.EPIPE
which causes BrokenPipeError writes to be treated as failures instead of clean
peer disconnects; update the frozenset _PEER_DISCONNECT_ERRNOS to include
errno.EPIPE so LibevLoop.handle_write() will classify EPIPE as a clean
disconnect (avoiding the path that calls self.defunct(err)), and extend the
existing unit test to assert that errno.EPIPE is a member of
_PEER_DISCONNECT_ERRNOS.
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 120-126: The test currently sets c.is_closed before calling
c.handle_write/c.handle_read so the handler returns early and the socket
send()/recv() side_effects (socket.error EBADF) never execute; change the test
to leave c.is_closed False when invoking c.handle_write/handle_read so the
dispatch path runs, and simulate the post-dispatch race by either making the
socket side_effect set c.is_closed to True (or set c.is_closed immediately after
the side_effect triggers) so the EBADF path in cassandra/io/libevreactor.py is
exercised; update both the write test (references: c._socket.send,
c.handle_write, c.is_closed, defunct) and the read test (references:
c._socket.recv, c.handle_read, c.is_closed, defunct) accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: d6a6d7fd-a6a9-4587-8579-790c8a80af74
📒 Files selected for processing (2)
cassandra/io/libevreactor.pytests/unit/io/test_libevreactor_close_race.py
a5bb694 to
0275436
Compare
…tion When `close()` is called before the connection handshake completes, the original code signals waiters via `connected_event.set()` but doesn't populate `last_error`, allowing `factory()` to potentially return a dead connection. Additionally, peer disconnects (ECONNRESET, ENOTCONN, etc.) during I/O were routed through defunct(), which is heavier than necessary for a clean shutdown scenario. So, fix it applying following changes: - Preserve `last_error` in `close()` when `connected_event` is unset, so `factory()` can detect dead connections. - Peer disconnect detection (ECONNRESET, ENOTCONN, etc.) that calls `close()` cleanly instead of `defunct()`. - Defensive early-return guards in `handle_read()/handle_write()` to skip unnecessary work when connection is already closed/defunct. Fixes: scylladb#614
0275436 to
04eea80
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/unit/io/test_libevreactor_close_race.py (1)
116-130:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winExercise the post-dispatch EBADF race instead of the top-level guard.
These tests set
is_closedbefore calling the handler, so Line 337 and Line 391 incassandra/io/libevreactor.pyreturn beforesend()/recv()run. That duplicates the “already closed” coverage and leaves the new secondaryexceptguards untested.Suggested fix
- c._socket.send.side_effect = socket.error(errno.EBADF, "Bad file descriptor") - c.is_closed = True + def send_then_close(_): + c.is_closed = True + raise socket.error(errno.EBADF, "Bad file descriptor") + c._socket.send.side_effect = send_then_close- c._socket.recv.side_effect = socket.error(errno.EBADF, "Bad file descriptor") - c.is_closed = True + def recv_then_close(_): + c.is_closed = True + raise socket.error(errno.EBADF, "Bad file descriptor") + c._socket.recv.side_effect = recv_then_closeAlso applies to: 173-184
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/io/test_libevreactor_close_race.py` around lines 116 - 130, Test currently sets c.is_closed before calling c.handle_write so send() never runs; instead make the socket.send side_effect simulate the post-dispatch race by setting c.is_closed = True inside the side_effect and then raising socket.error(errno.EBADF, ...), then call c.handle_write and assert defunct was not called; apply the same pattern to the similar test around lines 173-184 (the read-side test) by modifying the c._socket.recv side_effect to set is_closed then raise EBADF so the new secondary except-guard paths in handle_write/handle_read are exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 34-38: The test should only skip when libev is absent, not on real
import errors: change the import try/except so it only catches
DependencyException (i.e. except DependencyException:) and lets ImportError
propagate; keep the fallback assignments to LibevConnection and
_PEER_DISCONNECT_ERRNOS when DependencyException is raised so the test is
skipped only for missing libev while real import problems in
cassandra.io.libevreactor surface.
---
Duplicate comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 116-130: Test currently sets c.is_closed before calling
c.handle_write so send() never runs; instead make the socket.send side_effect
simulate the post-dispatch race by setting c.is_closed = True inside the
side_effect and then raising socket.error(errno.EBADF, ...), then call
c.handle_write and assert defunct was not called; apply the same pattern to the
similar test around lines 173-184 (the read-side test) by modifying the
c._socket.recv side_effect to set is_closed then raise EBADF so the new
secondary except-guard paths in handle_write/handle_read are exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: fd6ea025-26c7-464a-a1a6-2c32614c08ac
📒 Files selected for processing (2)
cassandra/io/libevreactor.pytests/unit/io/test_libevreactor_close_race.py
| try: | ||
| from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS | ||
| except (ImportError, DependencyException): | ||
| LibevConnection = None | ||
| _PEER_DISCONNECT_ERRNOS = None |
There was a problem hiding this comment.
Only skip when libev is unavailable.
cassandra/io/libevreactor.py already translates a missing libev extension into DependencyException, so catching bare ImportError here will also turn real import regressions in the module under test into a silent skip.
Suggested fix
try:
from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS
-except (ImportError, DependencyException):
+except DependencyException:
LibevConnection = None
_PEER_DISCONNECT_ERRNOS = None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS | |
| except (ImportError, DependencyException): | |
| LibevConnection = None | |
| _PEER_DISCONNECT_ERRNOS = None | |
| try: | |
| from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS | |
| except DependencyException: | |
| LibevConnection = None | |
| _PEER_DISCONNECT_ERRNOS = None |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/io/test_libevreactor_close_race.py` around lines 34 - 38, The test
should only skip when libev is absent, not on real import errors: change the
import try/except so it only catches DependencyException (i.e. except
DependencyException:) and lets ImportError propagate; keep the fallback
assignments to LibevConnection and _PEER_DISCONNECT_ERRNOS when
DependencyException is raised so the test is skipped only for missing libev
while real import problems in cassandra.io.libevreactor surface.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
tests/unit/io/test_libevreactor_close_race.py (2)
34-38:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOnly skip on
DependencyException.
cassandra.io.libevreactoralready translates a missing libev extension intoDependencyException, so catchingImportErrorhere will also hide unrelated import regressions in the module under test by turning them into a skip.Suggested fix
try: from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS -except (ImportError, DependencyException): +except DependencyException: LibevConnection = None _PEER_DISCONNECT_ERRNOS = None#!/bin/bash set -euo pipefail echo "libevreactor import wrapper:" sed -n '24,34p' cassandra/io/libevreactor.py echo echo "test import guard:" sed -n '34,38p' tests/unit/io/test_libevreactor_close_race.pyExpected result:
cassandra/io/libevreactor.pyalready converts the missing-extension case intoDependencyException, soImportErrorin this test guard is broader than needed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/io/test_libevreactor_close_race.py` around lines 34 - 38, The test import guard is catching ImportError too broadly and can hide real import regressions; update the except clause so it only catches DependencyException (leave the assignments to LibevConnection and _PEER_DISCONNECT_ERRNOS as-is). Specifically, in tests/unit/io/test_libevreactor_close_race.py change the except (ImportError, DependencyException): to except DependencyException: so only the extension-missing case (DependencyException) is skipped while other import errors surface.
72-76:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReset the global reactor in
tearDown().
_make_connection()registers realLibevConnectioninstances in the module-global_global_loop. Callingclose()here only queues them into_closed_conns, and this suite patchesLibevLoop.maybe_start, so_loop_will_run()never drains that state. Leaving_global_loopalive makes later libev tests order-dependent.Suggested fix
def tearDown(self): for c in self._connections: c.close() + LibevConnection.handle_fork() for p in self.patchers: p.stop()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/io/test_libevreactor_close_race.py` around lines 72 - 76, TearDown must reset the module-global reactor so leftover LibevConnection instances don't persist; after closing connections and stopping patchers in tearDown(), clear any queued closed connections and reset the global loop by emptying _global_loop._closed_conns (if present) and then setting the module-global _global_loop to None (or a fresh LibevLoop) so _make_connection / LibevConnection registrations are not kept around when LibevLoop.maybe_start is patched and _loop_will_run never drains them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 34-38: The test import guard is catching ImportError too broadly
and can hide real import regressions; update the except clause so it only
catches DependencyException (leave the assignments to LibevConnection and
_PEER_DISCONNECT_ERRNOS as-is). Specifically, in
tests/unit/io/test_libevreactor_close_race.py change the except (ImportError,
DependencyException): to except DependencyException: so only the
extension-missing case (DependencyException) is skipped while other import
errors surface.
- Around line 72-76: TearDown must reset the module-global reactor so leftover
LibevConnection instances don't persist; after closing connections and stopping
patchers in tearDown(), clear any queued closed connections and reset the global
loop by emptying _global_loop._closed_conns (if present) and then setting the
module-global _global_loop to None (or a fresh LibevLoop) so _make_connection /
LibevConnection registrations are not kept around when LibevLoop.maybe_start is
patched and _loop_will_run never drains them.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 039ed698-3f52-4d73-a4ac-e1770681c2bc
📒 Files selected for processing (2)
cassandra/io/libevreactor.pytests/unit/io/test_libevreactor_close_race.py


When
close()is called before the connection handshake completes,the original code signals waiters via
connected_event.set()but doesn't populate
last_error, allowingfactory()to potentially return a dead connection.
Additionally, peer disconnects (ECONNRESET, ENOTCONN, etc.)
during I/O were routed through defunct(),
which is heavier than necessary for a clean shutdown scenario.
So, fix it applying following changes:
last_errorinclose()whenconnected_eventis unset, sofactory()can detect dead connections.close()cleanly instead ofdefunct().handle_read()/handle_write()to skip unnecessary work when connection is already closed/defunct.Fixes: #614
Pre-review checklist
./docs/source/.Fixes:annotations to PR description.Summary by CodeRabbit
Bug Fixes
Tests