fix(libev): guard handle_read/handle_write against close() race condition by vponomaryov · Pull Request #889 · scylladb/python-driver

vponomaryov · 2026-05-19T13:15:06Z

When close() is called before the connection handshake completes,
the original code signals waiters via connected_event.set()
but doesn't populate last_error, allowing factory()
to potentially return a dead connection.

Additionally, peer disconnects (ECONNRESET, ENOTCONN, etc.)
during I/O were routed through defunct(),
which is heavier than necessary for a clean shutdown scenario.

So, fix it applying following changes:

Preserve last_error in close() when connected_event is unset, so factory() can detect dead connections.
Peer disconnect detection (ECONNRESET, ENOTCONN, etc.) that calls close() cleanly instead of defunct().
Defensive early-return guards in handle_read()/handle_write() to skip unnecessary work when connection is already closed/defunct.

Fixes: #614

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

Summary by CodeRabbit

Bug Fixes
- Connection shutdown now records the shutdown error and reliably propagates it to all in‑flight requests.
- Peer-disconnect socket errors are treated as clean closes instead of defunct states.
- Read/write callbacks no longer act on sockets already closed or defunct, resolving close/watch race conditions and ensuring EOF/peer-close set the shutdown error before closing.
Tests
- Added unit tests covering close-race scenarios, EOF/peer-disconnect behavior, and peer-disconnect errno coverage.

vponomaryov · 2026-05-19T13:17:53Z

Ref: https://scylladb.atlassian.net/browse/SCT-83

vponomaryov · 2026-05-22T09:41:50Z

Verified the test fix here:

scylla-staging/valerii/vp-longevity-100gb-4h-oci-test#25

Test run without this fix (for comparison) is here:

scylla-master/longevity/longevity-100gb-4h-oci-test#34

Comment from the "zeus" comparing the fixed test run and issue affected:

@zeus posted:

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Aspect Reference run (SCT-83 bug) 8a5b983a Target run (fix branch) 0d335d55

Status FAILED PASSED

Branch origin/master origin/fix-pydrv-race-issue-614

Errors 6 ERROR events (3 SoftTimeoutEvent + 3 FailedResultEvent) 0 errors

Test longevity_test.LongevityTest.test_custom_time Same

Scylla version 2026.3.0~dev Same

Backend/Region OCI / us-ashburn-1 Same

Nodes 6 6

Healthcheck duration comparison

Reference run (with bug) — progressive degradation pattern:

Healthchecks №1-7: 9-10s (normal)

Healthcheck №8: 46s (degradation begins)

Healthcheck №9: 1m38s

Healthcheck №10: 3m20s (198s, exceeds 180s soft timeout → ERROR)

Healthcheck №11: 3m50s (228s, exceeds timeout → ERROR)

Healthcheck №12: 2m8s

Healthcheck №13: 1m38s

Healthcheck №14: 1m7s

Healthcheck №15: 2m33s

Healthcheck №16: 46s

Healthcheck №17: 22m33s (1351s, massively exceeds timeout → ERROR)

Target run (with fix) — consistently healthy:

All 20 healthchecks: 9-11s (every single one normal, no degradation)

Analysis

The SCT-83 / python-driver#614 bug manifests as a progressive increase in healthcheck duration over the course of the test, caused by a race condition in the python driver. In the reference run, healthcheck times escalated from ~10s to minutes, eventually hitting 22m33s — well beyond the 180s soft timeout.

The target run, running on branch fix-pydrv-race-issue-614, shows zero degradation — all 20 healthchecks completed in 9-11 seconds throughout the entire 4+ hour test. No SoftTimeoutEvents, no FailedResultEvents, no errors at all.

The fix completely eliminates the healthcheck timeout escalation pattern characteristic of SCT-83.

@fruch , @soyacz ^
This has proven to fix the https://scylladb.atlassian.net/browse/SCT-83 bug which is identical to the GH issue - #614

vponomaryov · 2026-05-22T09:50:31Z

Note that the libenv tests have passed successfully - related to the PR.
The failed integration tests are the asyncio-based - not related to this PR.

@scylladb/python-driver-maint please merge and release it.

Lorak-mmk · 2026-05-22T10:04:43Z

+                elif self.is_closed or self.is_defunct:
+                    # Socket was closed by another thread between the
+                    # watcher firing and us calling send().  This is the
+                    # race described in scylladb/python-driver#614.
+                    return


I don't see why this is necessary. Without this, we will go into else branch, call self.defunct(err) which will do nothing, and then return.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
But the socket wasn't closed by the peer - we closed it via close().

The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.

The guard makes the race condition handling self-documenting.
Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

Ah, thanks for the explanation, I did not pay much attention to this path. Now I see a different potential issue, mentioned by Avi ( #614 (comment) )

If we are receiving EBADF, then we performed some action (in this case READ / WRITE) on a closed file descriptor. As far as I know, file descriptors can be reused. So, if another connection was opened between this one being closed and watchers exiting, it makes it possible for us to accidentaly read / write to a totally unrelated connection (or even some file opened by a user app). Am I wrong?

If this is true, then this patch hides the immediate symptom of the problem, but also hides a much more dangerous problem that could cause data corruption.

I don't know the libev connection impl that well, but maybe the proper approach is to, in close, first stop the watchers somehow and make sure that libev won't touch the fd anymore, and only then close the fd?

Situation where we accidentally read/write to a different connection is, probably, architecturally valid for C/C++ programs using raw fds, but doesn't apply in case of Python.
The Python socket abstraction prevents it.

Python's socket.socket object tracks state internally - after .close(), any operation on that object raises EBADF regardless of whether the OS reused the fd number for something else.
The self._socket reference always points to the old (closed) Python object, never to the new socket that happened to get the same fd.

libev get a raw fd, not PYthon socket, but the watchers themselves use the Python socket so I guess this issue really can't happen.
The watchers also seem to be correctly removed from the loop in close (connection_destroyed), so they won't keep triggering forever. I think your change should be fine.

@dkropachev
Take over the PR please.
I don't see a point in me guessing the expectations.

Update it as you wish and just make sure it fixes the issue.

@dkropachev you have other fix to this issue ? or we'll need to revert to the last known working version, with this bug, across the broad ?

What is the version that worked well?

3.29.5 is the first version we started seeing it, and it was also when we moved to python 3.14

if you have better fix for this problem, go for it.

but @vponomaryov have proven that this PR is fixing this problem.

this problem is generating lots of noise, and the slowdowns we have by retrying on it from the client end, is wasting us lots of money (in some cases it's hours and hours of retries)

It is not fixing it, socket gets broken by unknown reason and it is only hides it.

Lorak-mmk · 2026-05-22T10:05:17Z

+            elif self.is_closed or self.is_defunct:
+                # Socket was closed by another thread between the watcher
+                # firing and us calling recv().  This is the race described
+                # in scylladb/python-driver#614.
+                return


Without this, if the error is in _PEER_DISCONNECT_ERRNOS (EBADF), the code would execute following:

self.last_error = ConnectionShutdown( "Connection to %s was closed by peer" % self.endpoint) self.close()

It will overwrite the last_error with an incorrect closed by peer message when in reality close() was already called by us.
Since last_error is used by factory() to detect why a connection died, this overwrite could mask the real error.

Lorak-mmk · 2026-05-22T10:07:36Z

+class LibevPeerDisconnectErrnos(unittest.TestCase):
+    """Verify _PEER_DISCONNECT_ERRNOS contains expected values."""
+
+    def setUp(self):
+        _skip_if_unavailable(self)
+
+    def test_contains_ebadf(self):
+        self.assertIn(errno.EBADF, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_econnreset(self):
+        self.assertIn(errno.ECONNRESET, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_econnaborted(self):
+        self.assertIn(errno.ECONNABORTED, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_enotconn(self):
+        self.assertIn(errno.ENOTCONN, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_eshutdown(self):
+        self.assertIn(errno.ESHUTDOWN, _PEER_DISCONNECT_ERRNOS)


I see no point in having such tests. It just cements the current code, not testing the behavior.

I don't mind removing it

vponomaryov · 2026-05-22T11:47:31Z

The effect of the python driver issue was slow health checks in SCT.
Example from a test run with the issue:

And here how much time it takes with the fix:

So, numbers tell for itself.

Lorak-mmk · 2026-05-22T11:48:18Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

sylwiaszunejko · 2026-05-22T12:17:45Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

vponomaryov · 2026-05-22T12:24:08Z

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

So, I need to rebase the PR?
Even if we know that this PR doesn't affect the failed tests?

Lorak-mmk · 2026-05-22T12:45:22Z

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

Tbh I don't care that much.

So, I need to rebase the PR?

If you could then it would be great. I know you didn't touch asyncio, but let's just be sure. Change in one backend affecting another backend would not be the weirdest thing we saw in this driver.

vponomaryov · 2026-05-22T13:37:05Z

Rebased.

Lorak-mmk · 2026-05-22T14:27:42Z

@dkropachev Please take a look at this.

dkropachev · 2026-05-22T21:52:19Z

+                elif self.is_closed or self.is_defunct:
+                    # Socket was closed by another thread between the
+                    # watcher firing and us calling send().  This is the
+                    # race described in scylladb/python-driver#614.
+                    return


Agreed, but you still need to route error handling to defunct

dkropachev · 2026-05-22T21:55:02Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

fruch · 2026-05-25T09:27:39Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

how asyncio is related to this fix ?

vponomaryov · 2026-05-25T10:56:00Z

@dkropachev see my answer in the respective comment thread for your comment.

coderabbitai · 2026-05-25T18:57:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Fixes a libevreactor race where watcher callbacks can run after close(), classifies peer-disconnect errno values as clean closes, adds early-return guards in read/write handlers, ensures ConnectionShutdown propagates to in-flight requests, and adds unit tests covering races and errno handling.

Changes

Libev connection close race condition fix

Layer / File(s)	Summary
Peer-disconnect errno constants and test header/skip helper `cassandra/io/libevreactor.py`, `tests/unit/io/test_libevreactor_close_race.py`	Imports `errno`, defines `_PEER_DISCONNECT_ERRNOS` (EBADF, ECONNRESET, ECONNABORTED, ENOTCONN, ESHUTDOWN, plus EPIPE checked in tests); adds test module header, guarded imports, and `_skip_if_unavailable()` helper.
Test setup and connection factory `tests/unit/io/test_libevreactor_close_race.py`	Test fixture setup/teardown, reactor init, patcher control, connection tracking, and `_make_connection()` that builds a `LibevConnection` with a mocked socket.
Connection shutdown and error propagation `cassandra/io/libevreactor.py`, `tests/unit/io/test_libevreactor_close_race.py`	`close()` now creates a `ConnectionShutdown` exception, calls `error_all_requests()` to propagate it, preserves it in `last_error` if `connected_event` is unset, then sets `connected_event`. Test verifies `last_error` is populated when closing before `connected_event`.
Write handler: race guards and peer-disconnect handling `cassandra/io/libevreactor.py`, `tests/unit/io/test_libevreactor_close_race.py`	`handle_write()` returns early if connection became closed/defunct between watcher firing and IO; treats peer-disconnect errnos as clean close (log + `close()`), not `defunct()`. Tests verify early-return, EBADF-after-close does not defunct, and ECONNRESET triggers `close()`.
Read handler: race guards, EOF and peer-disconnect handling `cassandra/io/libevreactor.py`, `tests/unit/io/test_libevreactor_close_race.py`	`handle_read()` returns early for closed/defunct, returns on nonblocking with no buffered data, treats peer-disconnect errnos and EOF as clean close by setting `last_error` and calling `close()`, and sets `last_error` on server-initiated closure. Tests cover these read races and behaviors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibble at sockets, tidy and spry,
Errnos now named, no panic nearby,
Watchers that race find a polite end,
Shutdowns told clearly to every friend,
A rabbit-approved close — gentle and spry. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the primary change: fixing a race condition in libev's handle_read/handle_write functions against close().
Description check	✅ Passed	The PR description addresses the root cause, explains the fix rationale, and includes a Fixes: annotation. All critical checklist items are marked complete.
Linked Issues check	✅ Passed	Changes implement all objectives from `#614`: preserve last_error in close(), route peer disconnects to clean close(), and add defensive guards in handle_read/handle_write to prevent operations on closed file descriptors.
Out of Scope Changes check	✅ Passed	All changes are scoped to libev I/O reactor improvements and corresponding unit tests addressing the `#614` race condition. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 59-82: The tearDown() must reset the module-global libev reactor
state so connections created by _make_connection() don't leak into other tests;
update tearDown() to access the LibevConnection module-global _global_loop and
clear its registries (e.g. set _global_loop._live_conns and
_global_loop._new_conns to empty lists or otherwise remove registered
connections) and then null out the global (e.g. set LibevConnection._global_loop
= None) so initialize_reactor() starts fresh for the next test run.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9ad1e813-c716-48db-9847-2486199d0ab9

📥 Commits

Reviewing files that changed from the base of the PR and between 44bc95a and ad60335.

📒 Files selected for processing (2)

cassandra/io/libevreactor.py
tests/unit/io/test_libevreactor_close_race.py

vponomaryov · 2026-05-26T10:28:37Z

Updated:

Commit description
[Unit tests] Reset the global libev reactor in tearDown()

Unchanged:

Main code changes

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cassandra/io/libevreactor.py`:
- Around line 45-48: The _PEER_DISCONNECT_ERRNOS set is missing errno.EPIPE
which causes BrokenPipeError writes to be treated as failures instead of clean
peer disconnects; update the frozenset _PEER_DISCONNECT_ERRNOS to include
errno.EPIPE so LibevLoop.handle_write() will classify EPIPE as a clean
disconnect (avoiding the path that calls self.defunct(err)), and extend the
existing unit test to assert that errno.EPIPE is a member of
_PEER_DISCONNECT_ERRNOS.

In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 120-126: The test currently sets c.is_closed before calling
c.handle_write/c.handle_read so the handler returns early and the socket
send()/recv() side_effects (socket.error EBADF) never execute; change the test
to leave c.is_closed False when invoking c.handle_write/handle_read so the
dispatch path runs, and simulate the post-dispatch race by either making the
socket side_effect set c.is_closed to True (or set c.is_closed immediately after
the side_effect triggers) so the EBADF path in cassandra/io/libevreactor.py is
exercised; update both the write test (references: c._socket.send,
c.handle_write, c.is_closed, defunct) and the read test (references:
c._socket.recv, c.handle_read, c.is_closed, defunct) accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d6a6d7fd-a6a9-4587-8579-790c8a80af74

📥 Commits

Reviewing files that changed from the base of the PR and between ad60335 and 9530ab9.

📒 Files selected for processing (2)

cassandra/io/libevreactor.py
tests/unit/io/test_libevreactor_close_race.py

…tion When `close()` is called before the connection handshake completes, the original code signals waiters via `connected_event.set()` but doesn't populate `last_error`, allowing `factory()` to potentially return a dead connection. Additionally, peer disconnects (ECONNRESET, ENOTCONN, etc.) during I/O were routed through defunct(), which is heavier than necessary for a clean shutdown scenario. So, fix it applying following changes: - Preserve `last_error` in `close()` when `connected_event` is unset, so `factory()` can detect dead connections. - Peer disconnect detection (ECONNRESET, ENOTCONN, etc.) that calls `close()` cleanly instead of `defunct()`. - Defensive early-return guards in `handle_read()/handle_write()` to skip unnecessary work when connection is already closed/defunct. Fixes: scylladb#614

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tests/unit/io/test_libevreactor_close_race.py (1)

116-130: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Exercise the post-dispatch EBADF race instead of the top-level guard.

These tests set is_closed before calling the handler, so Line 337 and Line 391 in cassandra/io/libevreactor.py return before send() / recv() run. That duplicates the “already closed” coverage and leaves the new secondary except guards untested.

Suggested fix

-        c._socket.send.side_effect = socket.error(errno.EBADF, "Bad file descriptor")
-        c.is_closed = True
+        def send_then_close(_):
+            c.is_closed = True
+            raise socket.error(errno.EBADF, "Bad file descriptor")
+        c._socket.send.side_effect = send_then_close

-        c._socket.recv.side_effect = socket.error(errno.EBADF, "Bad file descriptor")
-        c.is_closed = True
+        def recv_then_close(_):
+            c.is_closed = True
+            raise socket.error(errno.EBADF, "Bad file descriptor")
+        c._socket.recv.side_effect = recv_then_close

Also applies to: 173-184

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/io/test_libevreactor_close_race.py` around lines 116 - 130, Test
currently sets c.is_closed before calling c.handle_write so send() never runs;
instead make the socket.send side_effect simulate the post-dispatch race by
setting c.is_closed = True inside the side_effect and then raising
socket.error(errno.EBADF, ...), then call c.handle_write and assert defunct was
not called; apply the same pattern to the similar test around lines 173-184 (the
read-side test) by modifying the c._socket.recv side_effect to set is_closed
then raise EBADF so the new secondary except-guard paths in
handle_write/handle_read are exercised.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 34-38: The test should only skip when libev is absent, not on real
import errors: change the import try/except so it only catches
DependencyException (i.e. except DependencyException:) and lets ImportError
propagate; keep the fallback assignments to LibevConnection and
_PEER_DISCONNECT_ERRNOS when DependencyException is raised so the test is
skipped only for missing libev while real import problems in
cassandra.io.libevreactor surface.

---

Duplicate comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 116-130: Test currently sets c.is_closed before calling
c.handle_write so send() never runs; instead make the socket.send side_effect
simulate the post-dispatch race by setting c.is_closed = True inside the
side_effect and then raising socket.error(errno.EBADF, ...), then call
c.handle_write and assert defunct was not called; apply the same pattern to the
similar test around lines 173-184 (the read-side test) by modifying the
c._socket.recv side_effect to set is_closed then raise EBADF so the new
secondary except-guard paths in handle_write/handle_read are exercised.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fd6ea025-26c7-464a-a1a6-2c32614c08ac

📥 Commits

Reviewing files that changed from the base of the PR and between a5bb694 and 0275436.

📒 Files selected for processing (2)

cassandra/io/libevreactor.py
tests/unit/io/test_libevreactor_close_race.py

coderabbitai · 2026-05-26T10:55:30Z

+try:
+    from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS
+except (ImportError, DependencyException):
+    LibevConnection = None
+    _PEER_DISCONNECT_ERRNOS = None


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Only skip when libev is unavailable.

cassandra/io/libevreactor.py already translates a missing libev extension into DependencyException, so catching bare ImportError here will also turn real import regressions in the module under test into a silent skip.

Suggested fix

try: from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS -except (ImportError, DependencyException): +except DependencyException: LibevConnection = None _PEER_DISCONNECT_ERRNOS = None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS

except (ImportError, DependencyException):

LibevConnection = None

_PEER_DISCONNECT_ERRNOS = None

try:

from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS

except DependencyException:

LibevConnection = None

_PEER_DISCONNECT_ERRNOS = None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/io/test_libevreactor_close_race.py` around lines 34 - 38, The test should only skip when libev is absent, not on real import errors: change the import try/except so it only catches DependencyException (i.e. except DependencyException:) and lets ImportError propagate; keep the fallback assignments to LibevConnection and _PEER_DISCONNECT_ERRNOS when DependencyException is raised so the test is skipped only for missing libev while real import problems in cassandra.io.libevreactor surface.

coderabbitai

♻️ Duplicate comments (2)

tests/unit/io/test_libevreactor_close_race.py (2)
34-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only skip on DependencyException.

cassandra.io.libevreactor already translates a missing libev extension into DependencyException, so catching ImportError here will also hide unrelated import regressions in the module under test by turning them into a skip.
Suggested fix
 try:
     from cassandra.io.libevreactor import LibevConnection, _PEER_DISCONNECT_ERRNOS
-except (ImportError, DependencyException):
+except DependencyException:
     LibevConnection = None
     _PEER_DISCONNECT_ERRNOS = None
#!/bin/bash
set -euo pipefail

echo "libevreactor import wrapper:"
sed -n '24,34p' cassandra/io/libevreactor.py

echo
echo "test import guard:"
sed -n '34,38p' tests/unit/io/test_libevreactor_close_race.py
Expected result: cassandra/io/libevreactor.py already converts the missing-extension case into DependencyException, so ImportError in this test guard is broader than needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/io/test_libevreactor_close_race.py` around lines 34 - 38, The test
import guard is catching ImportError too broadly and can hide real import
regressions; update the except clause so it only catches DependencyException
(leave the assignments to LibevConnection and _PEER_DISCONNECT_ERRNOS as-is).
Specifically, in tests/unit/io/test_libevreactor_close_race.py change the except
(ImportError, DependencyException): to except DependencyException: so only the
extension-missing case (DependencyException) is skipped while other import
errors surface.
72-76: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset the global reactor in tearDown().

_make_connection() registers real LibevConnection instances in the module-global _global_loop. Calling close() here only queues them into _closed_conns, and this suite patches LibevLoop.maybe_start, so _loop_will_run() never drains that state. Leaving _global_loop alive makes later libev tests order-dependent.
Suggested fix
     def tearDown(self):
         for c in self._connections:
             c.close()
+        LibevConnection.handle_fork()
         for p in self.patchers:
             p.stop()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/io/test_libevreactor_close_race.py` around lines 72 - 76, TearDown
must reset the module-global reactor so leftover LibevConnection instances don't
persist; after closing connections and stopping patchers in tearDown(), clear
any queued closed connections and reset the global loop by emptying
_global_loop._closed_conns (if present) and then setting the module-global
_global_loop to None (or a fresh LibevLoop) so _make_connection /
LibevConnection registrations are not kept around when LibevLoop.maybe_start is
patched and _loop_will_run never drains them.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/unit/io/test_libevreactor_close_race.py`:
- Around line 34-38: The test import guard is catching ImportError too broadly
and can hide real import regressions; update the except clause so it only
catches DependencyException (leave the assignments to LibevConnection and
_PEER_DISCONNECT_ERRNOS as-is). Specifically, in
tests/unit/io/test_libevreactor_close_race.py change the except (ImportError,
DependencyException): to except DependencyException: so only the
extension-missing case (DependencyException) is skipped while other import
errors surface.
- Around line 72-76: TearDown must reset the module-global reactor so leftover
LibevConnection instances don't persist; after closing connections and stopping
patchers in tearDown(), clear any queued closed connections and reset the global
loop by emptying _global_loop._closed_conns (if present) and then setting the
module-global _global_loop to None (or a fresh LibevLoop) so _make_connection /
LibevConnection registrations are not kept around when LibevLoop.maybe_start is
patched and _loop_will_run never drains them.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 039ed698-3f52-4d73-a4ac-e1770681c2bc

📥 Commits

Reviewing files that changed from the base of the PR and between 0275436 and 04eea80.

📒 Files selected for processing (2)

cassandra/io/libevreactor.py
tests/unit/io/test_libevreactor_close_race.py

vponomaryov requested review from a team and dkropachev May 22, 2026 09:42

This was referenced May 22, 2026

Driver reported "[Errno 9] Bad file descriptor" #614

Open

fix(py-drv): use fixed python driver with proper reconnections scylladb/scylla-cluster-tests#14708

Draft

Lorak-mmk requested changes May 22, 2026

View reviewed changes

vponomaryov requested a review from Lorak-mmk May 22, 2026 11:00

vponomaryov force-pushed the fix-race-issue-614 branch from 5dbc94b to ad60335 Compare May 22, 2026 13:36

Lorak-mmk approved these changes May 22, 2026

View reviewed changes

dkropachev requested changes May 22, 2026

View reviewed changes

vponomaryov requested a review from dkropachev May 25, 2026 10:50

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Comment thread tests/unit/io/test_libevreactor_close_race.py

fruch approved these changes May 25, 2026

View reviewed changes

vponomaryov force-pushed the fix-race-issue-614 branch 2 times, most recently from 9530ab9 to a5bb694 Compare May 26, 2026 10:27

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Comment thread cassandra/io/libevreactor.py

Comment thread tests/unit/io/test_libevreactor_close_race.py Outdated

vponomaryov force-pushed the fix-race-issue-614 branch from a5bb694 to 0275436 Compare May 26, 2026 10:50

vponomaryov force-pushed the fix-race-issue-614 branch from 0275436 to 04eea80 Compare May 26, 2026 10:53

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Conversation

vponomaryov commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pre-review checklist

Summary by CodeRabbit

Uh oh!

vponomaryov commented May 19, 2026

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Healthcheck duration comparison

Analysis

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vponomaryov commented May 22, 2026

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

sylwiaszunejko commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

vponomaryov commented May 22, 2026

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkropachev commented May 22, 2026

Uh oh!

fruch commented May 25, 2026

Uh oh!

vponomaryov commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

vponomaryov commented May 19, 2026 •

edited by coderabbitai Bot

Loading

vponomaryov commented May 22, 2026 •

edited

Loading

vponomaryov commented May 22, 2026 •

edited

Loading

sylwiaszunejko commented May 22, 2026 •

edited

Loading

vponomaryov commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 25, 2026 •

edited

Loading