Loopback pool leaks addresses from runners killed mid-shard (dead-PID sweep only runs when the pool is full) #13

@heskew

Summary

The loopback-address pool only reclaims addresses held by dead processes when the pool is full, and it keys liveness on the test-runner PID. A runner killed mid-shard (CI job cancelled, OOM, Ctrl-C) never runs its teardown/release, so its addresses linger in the pool until something else exhausts the pool and triggers the dead-PID sweep.

This is a separate problem from the normal teardown/restart recycle race fixed in #9 — that PR makes the graceful teardown path correct; this is about the runner-died-without-teardown path.

Mechanism

In src/loopbackAddressPool.ts:

  • On acquire, the dead-process sweep runs only in the index === null branch — i.e. when no address is available:
    if (index === null) {
        // No available addresses - remove any dead processes from the pool and wait
        removeDeadProcessesFromPool(loopbackPool);
    } else {
        loopbackPool[index] = process.pid; // <- the *runner's* pid
    }
  • removeDeadProcessesFromPool decides liveness with process.kill(pid, 0) against that stored runner PID.
  • A normal exit releases via releaseLoopbackAddress / releaseAllLoopbackAddressesForCurrentProcess. A runner SIGKILLed mid-shard runs neither, so its slots stay marked until the next full-pool sweep notices the PID is gone.
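For reference, a minimal sketch of the signal-0 liveness probe that the sweep relies on. `isProcessAlive` is a hypothetical helper name, not the actual body of removeDeadProcessesFromPool, and treating EPERM as "alive" is an assumption about the desired behaviour rather than something the source confirms.

```typescript
// Hypothetical helper, not the pool's real implementation.
// process.kill(pid, 0) sends no signal; it only checks whether the target
// exists and whether we are allowed to signal it.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH means no such process. EPERM means the process exists but is
    // owned by someone else, so this sketch counts it as alive (assumption).
    return (err as { code?: string }).code === "EPERM";
  }
}
```

Note that the probe only tells you about the stored runner PID. It says nothing about child processes the runner may have left behind.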

Impact

Under sharded CI with runner churn, addresses leaked by killed runners aren't reclaimed promptly: the pool drifts toward exhaustion and acquirers sit in RETRY_DELAY_MS wait loops. Compounding this, the runner-exit reaping added in #9 is registered on process exit, SIGINT, and SIGTERM, none of which fire on SIGKILL. A hard-killed runner can therefore also orphan Harper process trees that are still bound to its address, so when the slot is eventually reclaimed and reused, the new tenant can hit EADDRINUSE.
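To make the EADDRINUSE case concrete, here is a rough sketch of a bind probe: it tries to listen on a host/port pair and reports whether something else still holds it. `canBind` is a hypothetical name, and the host and port in the usage note are placeholders, since the pool's actual fixed ports aren't stated in this issue.

```typescript
import * as net from "node:net";

// Hypothetical probe: resolves true if (host, port) can be bound right now,
// false if another process (e.g. an orphaned tree) is still listening there.
function canBind(host: string, port: number): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once("error", (err: Error) => {
      const code = (err as Error & { code?: string }).code;
      if (code === "EADDRINUSE") {
        resolve(false);
      } else {
        reject(err); // e.g. EADDRNOTAVAIL: not the case we are probing for
      }
    });
    probe.once("listening", () => {
      // Bind succeeded; release the port again before reporting.
      probe.close(() => resolve(true));
    });
    probe.listen({ host, port });
  });
}
```

Usage would look like `await canBind("127.0.0.2", 9925)` (placeholder values). A probe like this is inherently racy, since the port can be taken between the check and the real bind, so it can detect an orphan but not guarantee a clean start.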

Possible direction (not prescriptive)

  • Run removeDeadProcessesFromPool proactively on every acquire (before findAvailableIndex), not just when the pool is full. This is cheap and reclaims dead-runner slots without waiting for exhaustion.
  • Consider verifying that the address's fixed ports are actually free at acquire time (tie-in with the post-kill port assertion in #9, "fix: idle-based startup readiness and port-safe teardown recycle"), so a reclaimed-but-still-orphaned address is detected rather than handed out.
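The first bullet could look roughly like the sketch below. It assumes the pool is an array whose slots hold the owning runner's PID or null, which matches the snippet above only loosely. `PidPool`, `acquireSlot`, and the stand-in `findAvailableIndex` and `isProcessAlive` bodies are all hypothetical, not the code in src/loopbackAddressPool.ts.

```typescript
// Assumed pool shape: owning runner PID per slot, or null when free.
type PidPool = (number | null)[];

// Stand-in liveness probe (signal 0); EPERM is treated as alive (assumption).
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

// Stand-in for the pool's findAvailableIndex; the real one may differ.
function findAvailableIndex(pool: PidPool): number | null {
  const i = pool.indexOf(null);
  return i === -1 ? null : i;
}

// Sweep dead owners on every acquire, then look for a free slot.
function acquireSlot(
  pool: PidPool,
  ownerPid: number = process.pid,
  isAlive: (pid: number) => boolean = isProcessAlive,
): number | null {
  for (let i = 0; i < pool.length; i++) {
    const owner = pool[i];
    if (owner !== null && !isAlive(owner)) {
      pool[i] = null; // reclaim a slot leaked by a killed runner
    }
  }
  const index = findAvailableIndex(pool);
  if (index !== null) {
    pool[index] = ownerPid;
  }
  return index; // null: pool is genuinely full, caller waits and retries
}
```

The sweep costs one signal-0 probe per occupied slot, which should be negligible for a small pool. It does not address the second bullet: a reclaimed slot may still have orphaned listeners on its ports.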

Credit / refs

Surfaced by @Ethan-Arrowood's review of #9 (root-cause analysis of the teardown recycle race). Related: #8.


🤖 Generated with Claude Code
