fix(bot): self-exit on sustained getUpdates network outage #135
Merged
A sustained network outage (long-poll getUpdates failing continuously) is the silent twin of the Conflict storm: the bot stays alive-but-deaf and may NOT recover even after the upstream returns (a wedged half-open poll socket). Observed in production: a night-time proxy/VPN outage left the bot running-but-deaf for ~6h, since the network-error branch only logged a debounced one-liner and never exited.

Mirror the existing sustained-Conflict exit: once NetworkError/TimedOut persist CONTIGUOUSLY longer than NETWORK_MAX_SECONDS (180s), log CRITICAL and exit non-zero so the supervisor (Docker `restart: unless-stopped` / ccbot-supervisor.sh's wait-for-net loop) respawns a clean instance with a fresh getUpdates connection.

Contiguity is judged by NETWORK_GAP_SECONDS (45s): a longer quiet gap proves a poll recovered, so the outage clock resets, and sporadic blips on a healthy idle bot never accumulate toward the threshold.

Reuses the _terminate_for_sustained_conflict() seam (stop_running + os._exit). Covered by 3 new tests in test_conflict_exit.py (single blip tolerated, sustained outage exits, gap resets the clock).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
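The contiguity rule described above can be sketched as a small clock object. This is only an illustration under stated assumptions, not the code in this PR: the class name `NetworkOutageClock`, its `record_error()` method, and the injectable `now` parameter are hypothetical; only the two constants and their values come from the description.

```python
import time
from typing import Callable, Optional

# Values taken from the PR description.
NETWORK_MAX_SECONDS = 180.0  # contiguous outage length that triggers the exit
NETWORK_GAP_SECONDS = 45.0   # a quiet gap longer than this resets the clock


class NetworkOutageClock:
    """Hypothetical helper: tracks how long network errors have been contiguous."""

    def __init__(
        self,
        max_seconds: float = NETWORK_MAX_SECONDS,
        gap_seconds: float = NETWORK_GAP_SECONDS,
        now: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self._max = max_seconds
        self._gap = gap_seconds
        self._now = now
        self._first_error: Optional[float] = None
        self._last_error: Optional[float] = None

    def record_error(self) -> bool:
        """Record one NetworkError/TimedOut.

        Returns True once errors have been contiguous for longer than
        max_seconds, i.e. when the caller should log CRITICAL and exit.
        """
        t = self._now()
        if self._last_error is None or t - self._last_error > self._gap:
            # First error, or a quiet gap long enough to imply that a poll
            # succeeded in between: start a new outage window.
            self._first_error = t
        self._last_error = t
        return t - self._first_error > self._max
```

In this sketch the error handler would call `record_error()` on every `NetworkError`/`TimedOut` and, when it returns `True`, hand off to the existing termination seam. Using a monotonic clock by default is a design choice of the sketch, so that wall-clock adjustments cannot shorten or lengthen the outage window.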
Problem

The bot's `_error_handler` exits the process on a sustained Conflict storm (a second poller owns the token, which is unrecoverable by retry) so the supervisor converges on one clean instance. But the network-error branch had no such escape hatch: on `NetworkError`/`TimedOut` it logged a single debounced one-liner and returned, trusting PTB's polling to self-recover.

In production that trust broke. A night-time proxy/VPN outage left the long-poll `getUpdates` wedged on a half-open socket. The bot stayed `running` but deaf: it never recovered even after the upstream came back, and sat silent for ~6h until a manual restart. The container's `restart: unless-stopped` couldn't help because the process never exited.

Fix
Add the symmetric self-heal for sustained network outages: if `NetworkError`/`TimedOut` persist contiguously longer than `NETWORK_MAX_SECONDS` (180s), log `CRITICAL` and exit non-zero via the existing `_terminate_for_sustained_conflict()` seam. The supervisor then respawns a clean instance that opens a fresh `getUpdates` connection.

Contiguity is judged by `NETWORK_GAP_SECONDS` (45s): a quiet gap longer than that proves a poll succeeded in between (recovery), so the outage clock resets. This means sporadic blips on a healthy idle bot never accumulate toward the threshold; only a genuinely stuck poll trips it.

Composes with both supervisors:

- Docker (`restart: unless-stopped`) → respawns the container.
- `ccbot-supervisor.sh` → `os._exit(1)` is `EXIT_CRASH`, and its wait-for-net loop gates the restart, so a still-down network can't cause a restart storm.

Tests
3 new cases in `test_conflict_exit.py`:

- single blip tolerated
- sustained outage exits
- gap resets the clock

Full suite green (732 passed locally + in an ephemeral docker build of this branch). `app.py` untouched since #111, so this applies cleanly on current `main`.

🤖 Generated with Claude Code