The loopback-address pool only reclaims addresses held by dead processes when the pool is full, and it keys liveness on the test-runner PID. A runner killed mid-shard (CI job cancelled, OOM, Ctrl-C) never runs its teardown/release, so its addresses linger in the pool until something else exhausts the pool and triggers the dead-PID sweep.
This is a separate problem from the normal teardown/restart recycle race fixed in #9 — that PR makes the graceful teardown path correct; this is about the runner-died-without-teardown path.
Mechanism
In src/loopbackAddressPool.ts:
On acquire, the dead-process sweep runs only in the index === null branch — i.e. when no address is available:
    if (index === null) {
      // No available addresses - remove any dead processes from the pool and wait
      removeDeadProcessesFromPool(loopbackPool);
    } else {
      loopbackPool[index] = process.pid; // <- the *runner's* pid
    }
removeDeadProcessesFromPool decides liveness with process.kill(pid, 0) against that stored runner PID.
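For readers unfamiliar with the signal-0 probe, here is a minimal sketch of how such a liveness check typically behaves. The helper name `isProcessAlive` is hypothetical and is not taken from `src/loopbackAddressPool.ts`; the actual implementation may handle errors differently.

```typescript
// Hypothetical helper (not the pool's actual code): probe a PID with signal 0.
function isProcessAlive(pid: number): boolean {
  try {
    // Signal 0 delivers nothing; it only checks existence and permissions.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it, so count it as alive.
    // ESRCH (no such process) or any other error: count it as gone.
    return (err as { code?: string }).code === "EPERM";
  }
}

console.log(isProcessAlive(process.pid)); // the current process is alive
```

One caveat with any PID-based probe: the operating system can reuse a PID, so a slot owned by a long-dead runner may occasionally look alive if an unrelated process has since been given the same PID.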
A normal exit releases via releaseLoopbackAddress / releaseAllLoopbackAddressesForCurrentProcess. A runner SIGKILLed mid-shard runs neither, so its slots stay marked until the next full-pool sweep notices the PID is gone.
Impact
Under sharded CI with runner churn, addresses leaked by killed runners aren't reclaimed promptly: the pool drifts toward exhaustion and acquirers wait in RETRY_DELAY_MS loops. Compounding this, the runner-exit reaping added in #9 is registered on process exit/SIGINT/SIGTERM, none of which fire on SIGKILL. A hard-killed runner can therefore also orphan Harper process trees still bound to its address, so when the slot is eventually reclaimed and reused, the new tenant can hit EADDRINUSE.
Possible direction (not prescriptive)
Run removeDeadProcessesFromPool proactively on every acquire (before findAvailableIndex), not just when full — cheap, and reclaims dead-runner slots without waiting for exhaustion.
Consider verifying the address's fixed ports are actually free at acquire time (tie-in with the post-kill port assertion in #9, "fix: idle-based startup readiness and port-safe teardown recycle"), so a reclaimed-but-still-orphaned address is detected rather than handed out.
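The two directions above could look roughly like the sketch below. It is an illustration under stated assumptions, not the pool's actual code: the pool is modelled as an in-memory array where `null` means free, and `sweepDeadOwners`, `acquireSlot`, `isProcessAlive`, and `isPortFree` are all hypothetical names. The real `removeDeadProcessesFromPool` and `findAvailableIndex` may have different signatures and storage.

```typescript
import * as net from "node:net";

// Assumed model: each slot holds the owning runner's PID, or null when free.
type Pool = (number | null)[];

// Hypothetical liveness probe using signal 0 (EPERM still means "exists").
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

// Free every slot whose owner is no longer alive; returns how many were freed.
function sweepDeadOwners(
  pool: Pool,
  alive: (pid: number) => boolean = isProcessAlive,
): number {
  let freed = 0;
  for (let i = 0; i < pool.length; i++) {
    const owner = pool[i];
    if (typeof owner === "number" && !alive(owner)) {
      pool[i] = null;
      freed++;
    }
  }
  return freed;
}

// Sweep on every acquire, before looking for a free slot.
// Returns the claimed index, or null when the pool is full of live owners
// (the caller would then wait and retry).
function acquireSlot(
  pool: Pool,
  ownerPid: number,
  alive: (pid: number) => boolean = isProcessAlive,
): number | null {
  sweepDeadOwners(pool, alive);
  const index = pool.indexOf(null);
  if (index === -1) return null;
  pool[index] = ownerPid;
  return index;
}

// Optional second check: try to bind host:port briefly. Any bind error
// (for example EADDRINUSE from an orphaned process) counts as "not free".
function isPortFree(host: string, port: number): Promise<boolean> {
  return new Promise((resolve) => {
    const probe = net.createServer();
    probe.once("error", () => resolve(false));
    probe.once("listening", () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}
```

A caller could combine the two by acquiring a slot and then awaiting `isPortFree` for each of the address's fixed ports before treating the address as usable. A bind probe is inherently racy (the port can be taken between the probe and the real bind), so it would narrow the EADDRINUSE window rather than close it.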
Credit / refs
Surfaced by @Ethan-Arrowood's review of #9 (root-cause analysis of the teardown recycle race). Related: #8.
🤖 Generated with Claude Code