Summary
Two races in the Harper lifecycle helpers (src/harperLifecycle.ts / dist/harperLifecycle.js) are a recurring source of intermittent failures for consumers that run sharded, concurrent integration suites — most visibly HarperFast/harper's Integration Tests workflow, which currently has to work around both. Filing here because the fix lives in this package.
Tracked downstream in HarperFast/harper#1139.
1. Fixed startup deadline (DEFAULT_STARTUP_TIMEOUT_MS)
startHarper() resolves only when Harper prints "successfully started" on stdout; otherwise it rejects after DEFAULT_STARTUP_TIMEOUT_MS, which defaults to 60s (harperLifecycle.js:27, read in runHarperCommand at :155-162). Under shard contention on shared CI runners, install + start regularly exceeds 60s, so the consumer works around it by raising the timeout through environment variables:
- HarperFast/harper sets HARPER_INTEGRATION_TEST_STARTUP_TIMEOUT_MS=120000 (Linux) and 180000 (Windows), with workflow comments explicitly chasing the deadline ("Slow runners ... may need more than the default 60s"; "Windows runners are noticeably slower than the Linux pool").
The fixed deadline is the flake lever: a slow-but-healthy boot is indistinguishable from a hang. Worth weighing a readiness poll / health-probe with backoff instead of a single wall-clock deadline, and/or a higher CI-aware default.
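A minimal sketch of the readiness-poll option, assuming a caller-supplied probe (for example a request against some health endpoint, which is not shown here). The names ReadinessOptions, backoffSchedule and waitUntilReady are hypothetical and do not exist in this package. The overall budget is still a ceiling, so a genuine hang fails; the difference is that readiness is decided by an active probe rather than by a line on stdout.

```typescript
// Hypothetical names throughout; nothing here is the package's current API.
export interface ReadinessOptions {
  initialDelayMs: number; // wait before the second probe
  factor: number;         // multiplier applied to the wait after each failed probe
  maxDelayMs: number;     // cap on any single wait
  budgetMs: number;       // total time allowed to be spent waiting
}

// Pure helper: the sequence of waits that fits inside the budget.
export function backoffSchedule(opts: ReadinessOptions): number[] {
  if (opts.initialDelayMs <= 0 || opts.factor < 1) {
    throw new RangeError('initialDelayMs must be > 0 and factor must be >= 1');
  }
  const delays: number[] = [];
  let delay = opts.initialDelayMs;
  let elapsed = 0;
  while (elapsed + delay <= opts.budgetMs) {
    delays.push(delay);
    elapsed += delay;
    delay = Math.min(delay * opts.factor, opts.maxDelayMs);
  }
  return delays;
}

// Probe once immediately, then after each scheduled wait.
// Resolves true on the first successful probe, false once the budget is spent.
export async function waitUntilReady(
  probe: () => Promise<boolean>,
  opts: ReadinessOptions,
): Promise<boolean> {
  // A probe that throws or rejects counts as "not ready yet".
  const safeProbe = async (): Promise<boolean> => {
    try {
      return await probe();
    } catch {
      return false;
    }
  };
  if (await safeProbe()) return true;
  for (const delay of backoffSchedule(opts)) {
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
    if (await safeProbe()) return true;
  }
  return false;
}
```

With initialDelayMs 100, factor 2, maxDelayMs 1000 and budgetMs 3000, the waits are 100, 200, 400, 800 and 1000 ms. A fast boot is detected on the first or second probe, while the budget can be set generously for slow runners.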
2. Teardown SIGKILL window + immediate loopback recycle
killHarper() (harperLifecycle.js:310-329) sends SIGTERM, then escalates to SIGKILL after only 200ms, then resolves. teardownHarper() (:351-358) then immediately calls releaseLoopbackAddress(), returning the loopback IP to the pool for the next suite to grab. Issues:
- 200ms is short for a graceful shutdown: teardownHarper's own comment notes "rocksdb may be flushing." The process is frequently SIGKILL'd mid-flush.
- Harper spawns worker threads / child processes; proc.kill() signals only the direct child. Lingering workers can keep holding the fixed ports (9925/9926/9927/1883/8883) on that loopback IP after the parent is gone.
- Because the ports are fixed and only the loopback address rotates between suites, the next suite that acquires the just-released IP can hit EADDRINUSE / connection races against sockets still in TIME_WAIT or held by an orphaned worker.
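To illustrate the grace-window point, here is a rough sketch of a teardown that waits for the real exit event and escalates to SIGKILL only after a configurable grace period. terminateWithGrace, StopOutcome and demoStop are hypothetical names, and the 5000ms default is an arbitrary assumption, not a measured value. It still signals only the direct child, so it does not address orphaned worker processes.

```typescript
import { ChildProcess, spawn } from 'node:child_process';

export type StopOutcome = 'exited' | 'killed';

// Hypothetical helper, not the package's killHarper().
// Sends SIGTERM, escalates to SIGKILL after graceMs, and resolves only
// once the 'exit' event has fired.
export function terminateWithGrace(
  proc: ChildProcess,
  graceMs = 5000, // assumed default; tune to how long a flush really takes
): Promise<StopOutcome> {
  return new Promise((resolve) => {
    // Already gone: nothing to signal.
    if (proc.exitCode !== null || proc.signalCode !== null) {
      resolve('exited');
      return;
    }
    let escalated = false;
    const timer = setTimeout(() => {
      escalated = true;
      proc.kill('SIGKILL');
    }, graceMs);
    proc.once('exit', () => {
      clearTimeout(timer);
      resolve(escalated ? 'killed' : 'exited');
    });
    proc.kill('SIGTERM');
  });
}

// Usage sketch: spawn a throwaway Node process that idles, then stop it.
export function demoStop(): Promise<StopOutcome> {
  const sleeper = spawn(process.execPath, ['-e', 'setInterval(() => {}, 1000)']);
  return terminateWithGrace(sleeper, 2000);
}
```

Resolving on the exit event rather than on a timer means the caller knows the direct child is gone before the loopback address is released. For the worker-process case, one POSIX-only option would be to spawn the child detached and signal its process group, which is left out here.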
Related observation: removeDeadProcessesFromPool (loopbackAddressPool.js:289-300) keys liveness on the test-runner PID stored in the pool, not the Harper child PID — so a runner killed while holding addresses (e.g. a timed-out shard) can leave its slot marked in-use until another process reaps it.
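As an illustration of what keying liveness on both PIDs could look like, here is a small sketch. PoolEntry, isPidAlive and isReclaimable are hypothetical, and the entry shape is an assumption rather than the pool's real on-disk format. It relies on process.kill(pid, 0), which performs the existence check without delivering a signal.

```typescript
// Hypothetical entry shape; the real pool file may store different fields.
export interface PoolEntry {
  address: string;    // loopback IP held by this slot
  runnerPid: number;  // test-runner process that claimed the slot
  harperPid?: number; // Harper child started on that address, if recorded
}

// Signal 0 tests for existence only; nothing is delivered to the target.
export function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it.
    return (err as { code?: string }).code === 'EPERM';
  }
}

// A slot is safe to reclaim only when neither the runner nor the recorded
// Harper child is still alive. `alive` is injectable to keep this testable.
export function isReclaimable(
  entry: PoolEntry,
  alive: (pid: number) => boolean = isPidAlive,
): boolean {
  const runnerGone = !alive(entry.runnerPid);
  const harperGone = entry.harperPid === undefined || !alive(entry.harperPid);
  return runnerGone && harperGone;
}
```

Under this assumption, a slot whose runner died but whose Harper child is still running would stay reserved instead of being handed to the next suite while its ports are held.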
Impact
Downstream, these contribute to the "ECONNREFUSED on restart" / "address already in use" class of intermittent failures and are the reason the startup-timeout env workarounds exist. (HarperFast/harper Integration Tests have been red on main 7 of the last 8 runs; the dominant single cause is a specific suite, tracked separately downstream, but the harness races are a real secondary contributor.)
Not proposing a specific fix here
Just capturing the mechanics. Options to weigh: readiness polling instead of a fixed deadline; a configurable, longer SIGTERM grace before SIGKILL; verifying the ports are actually free (not just the address bindable) before recycling; tree-kill of worker children on teardown.
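For the "ports are actually free" option, a rough sketch of a check that could run before an address goes back to the pool. isPortFree, allPortsFree and HARPER_PORTS are hypothetical names; the port list is the one quoted in this issue. The check tries to bind a listener on each port, so it detects a live listener (for example an orphaned worker) but is not guaranteed to notice sockets that are merely in TIME_WAIT.

```typescript
import { createServer } from 'node:net';

// Ports quoted in this issue; treat the list as an assumption.
export const HARPER_PORTS: number[] = [9925, 9926, 9927, 1883, 8883];

// Try to bind host:port. Any bind error (EADDRINUSE, EACCES, ...) is
// treated as "not free"; a successful bind is closed again immediately.
export function isPortFree(host: string, port: number): Promise<boolean> {
  return new Promise((resolve) => {
    const server = createServer();
    server.once('error', () => resolve(false));
    server.once('listening', () => {
      server.close(() => resolve(true));
    });
    server.listen(port, host);
  });
}

// True only when every listed port on the address can be bound.
export async function allPortsFree(
  host: string,
  ports: number[] = HARPER_PORTS,
): Promise<boolean> {
  const results = await Promise.all(ports.map((port) => isPortFree(host, port)));
  return results.every((free) => free);
}
```

A teardown could poll allPortsFree for the released address with a short retry loop and only then return the address to the pool, at the cost of a slightly longer teardown.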
🤖 Generated with Claude Code