fix(bot): self-exit on sustained getUpdates network outage #135

Merged: Time4Mind merged 1 commit into main from fix/network-self-exit on Jun 17, 2026
@Time4Mind (Owner)

Problem

The bot's _error_handler exits the process on a sustained Conflict storm (a second poller owns the token — unrecoverable by retry) so the supervisor converges on one clean instance. But the network-error branch had no such escape hatch: on NetworkError/TimedOut it logged a single debounced one-liner and returned, trusting PTB's polling to self-recover.

In production that trust broke. A night-time proxy/VPN outage left the long-poll getUpdates wedged on a half-open socket. The bot stayed running but deaf — it never recovered even after the upstream came back, and sat silent for ~6h until a manual restart. The container's restart: unless-stopped couldn't help because the process never exited.

Fix

Add the symmetric self-heal for sustained network outages: if NetworkError/TimedOut persist contiguously longer than NETWORK_MAX_SECONDS (180s), log CRITICAL and exit non-zero via the existing _terminate_for_sustained_conflict() seam. The supervisor then respawns a clean instance that opens a fresh getUpdates connection.

Contiguity is judged by NETWORK_GAP_SECONDS (45s): a quiet gap longer than that proves a poll succeeded in between (recovery), so the outage clock resets. This means sporadic blips on a healthy idle bot never accumulate toward the threshold — only a genuinely stuck poll trips it.
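
The gap-and-budget rule above can be sketched as a small state holder. This is a minimal illustration, not the code in app.py: the `OutageClock` class, its `record_error` method, and the injectable `now` callable are hypothetical names, and only the two constants and their values come from this PR. It assumes the handler calls `record_error()` once per NetworkError/TimedOut and treats a `True` return as the signal to log CRITICAL and exit.

```python
import time
from typing import Callable, Optional

NETWORK_MAX_SECONDS = 180.0  # outage budget, value taken from this PR
NETWORK_GAP_SECONDS = 45.0   # contiguity window, value taken from this PR


class OutageClock:
    """Hypothetical helper: tracks how long network errors have been contiguous."""

    def __init__(
        self,
        max_seconds: float = NETWORK_MAX_SECONDS,
        gap_seconds: float = NETWORK_GAP_SECONDS,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_seconds = max_seconds
        self.gap_seconds = gap_seconds
        self.now = now
        self.first_error: Optional[float] = None  # start of the current outage
        self.last_error: Optional[float] = None   # most recent error seen

    def record_error(self) -> bool:
        """Record one network error; return True once the budget is exceeded."""
        t = self.now()
        if self.last_error is None or t - self.last_error > self.gap_seconds:
            # A quiet gap longer than gap_seconds means a poll succeeded in
            # between, so this error starts a new outage.
            self.first_error = t
        self.last_error = t
        return t - self.first_error > self.max_seconds
```

Using a monotonic clock rather than wall-clock time is a choice made for this sketch, so that a system clock adjustment during an outage cannot shorten or extend the budget.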

Composes with both supervisors:

  • Docker (restart: unless-stopped) → respawns the container.
  • ccbot-supervisor.sh → os._exit(1) is EXIT_CRASH, and its wait-for-net loop gates the restart, so a still-down network can't cause a restart storm.

Tests

3 new cases in test_conflict_exit.py:

  • a single network blip is tolerated (no exit),
  • a contiguous outage past the budget exits exactly once,
  • a quiet gap resets the outage clock so spread-out blips never trip it.
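
For illustration, the three scenarios can be replayed against a list of error timestamps. This is a sketch under stated assumptions and not the contents of test_conflict_exit.py: `first_exit_time` and the three test function names are hypothetical, and the real tests presumably drive _error_handler and stub the exit seam instead of calling a pure function.

```python
from typing import Iterable, Optional


def first_exit_time(
    error_times: Iterable[float],
    max_seconds: float = 180.0,  # NETWORK_MAX_SECONDS in this PR
    gap_seconds: float = 45.0,   # NETWORK_GAP_SECONDS in this PR
) -> Optional[float]:
    """Hypothetical replay: return the timestamp that would trigger the exit."""
    start: Optional[float] = None
    last: Optional[float] = None
    for t in error_times:
        if last is None or t - last > gap_seconds:
            start = t  # quiet gap: the outage clock restarts here
        last = t
        if t - start > max_seconds:
            return t  # the process would exit here, so replay stops
    return None


def test_single_blip_is_tolerated() -> None:
    assert first_exit_time([0.0]) is None


def test_contiguous_outage_exits_once() -> None:
    # Errors every 30s stay inside the 45s window, so the clock keeps running.
    times = [float(t) for t in range(0, 271, 30)]
    assert first_exit_time(times) == 210.0


def test_quiet_gap_resets_clock() -> None:
    # Errors every 60s are each preceded by a gap longer than 45s.
    times = [float(t) for t in range(0, 601, 60)]
    assert first_exit_time(times) is None
```

The functions follow the pytest naming convention, so they can be collected by pytest or called directly.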

Full suite green (732 passed, both locally and in an ephemeral Docker build of this branch). app.py is untouched since #111, so this applies cleanly on current main.

🤖 Generated with Claude Code

A sustained network outage (long-poll getUpdates failing continuously) is
the silent twin of the Conflict storm: the bot stays alive-but-deaf and may
NOT recover even after the upstream returns (a wedged half-open poll socket).
Observed in production — a night-time proxy/VPN outage left the bot
running-but-deaf for ~6h, since the network-error branch only logged a
debounced one-liner and never exited.

Mirror the existing sustained-Conflict exit: once NetworkError/TimedOut
persist CONTIGUOUSLY longer than NETWORK_MAX_SECONDS (180s), log CRITICAL and
exit non-zero so the supervisor (Docker `restart: unless-stopped` /
ccbot-supervisor.sh's wait-for-net loop) respawns a clean instance with a
fresh getUpdates connection. Contiguity is judged by NETWORK_GAP_SECONDS
(45s): a longer quiet gap proves a poll recovered, so the outage clock resets
— sporadic blips on a healthy idle bot never accumulate toward the threshold.

Reuses the _terminate_for_sustained_conflict() seam (stop_running + os._exit).
Covered by 3 new tests in test_conflict_exit.py (single blip tolerated,
sustained outage exits, gap resets the clock).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Time4Mind force-pushed the fix/network-self-exit branch from 422483a to a94cb3a on June 17, 2026 05:19
@Time4Mind merged commit 57f92c0 into main on Jun 17, 2026
4 checks passed
@Time4Mind deleted the fix/network-self-exit branch on June 17, 2026 06:50