
startHarper readiness deadline and teardown loopback recycle cause intermittent failures under sharded CI #8

@heskew


Summary

Two races in the Harper lifecycle helpers (src/harperLifecycle.ts / dist/harperLifecycle.js) are a recurring source of intermittent failures for consumers that run sharded, concurrent integration suites — most visibly HarperFast/harper's Integration Tests workflow, which currently has to work around both. Filing here because the fix lives in this package.

Tracked downstream in HarperFast/harper#1139.

1. Fixed startup deadline (DEFAULT_STARTUP_TIMEOUT_MS)

startHarper() resolves only when Harper prints successfully started on stdout; otherwise it rejects after DEFAULT_STARTUP_TIMEOUT_MS, which defaults to 60s (harperLifecycle.js:27, read in runHarperCommand at :155-162). Under shard contention on shared CI runners, install + start regularly exceeds 60s, so the consumer papers over it with env bumps:

  • HarperFast/harper sets HARPER_INTEGRATION_TEST_STARTUP_TIMEOUT_MS=120000 (Linux) and 180000 (Windows), with workflow comments explicitly chasing the deadline ("Slow runners ... may need more than the default 60s"; "Windows runners are noticeably slower than the Linux pool").

The fixed deadline is the flake lever: a slow-but-healthy boot is indistinguishable from a hang. Worth weighing a readiness poll / health-probe with backoff instead of a single wall-clock deadline, and/or a higher CI-aware default.
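To make the readiness-poll option concrete, here is a minimal sketch, not the package's implementation. Every name in it (`ReadinessOptions`, `backoffDelays`, `waitUntilReady`) is hypothetical, and the probe is left abstract because this issue does not settle what a health check against Harper should look like. It keeps an overall budget so a true hang still fails, but probes repeatedly instead of waiting on one stdout line.

```typescript
// Hypothetical sketch: none of these names exist in the package.
interface ReadinessOptions {
  initialDelayMs: number; // wait before the second probe
  factor: number;         // growth applied to the wait after each miss (>= 1)
  maxDelayMs: number;     // cap on any single wait
  budgetMs: number;       // ceiling on total waiting, so a true hang still fails
}

// Pure helper: the sequence of waits whose sum fits inside the budget.
function backoffDelays(o: ReadinessOptions): number[] {
  if (o.initialDelayMs <= 0 || o.maxDelayMs <= 0 || o.factor < 1) {
    throw new RangeError("delays must be positive and factor must be >= 1");
  }
  const delays: number[] = [];
  let next = Math.min(o.initialDelayMs, o.maxDelayMs);
  let total = 0;
  while (total + next <= o.budgetMs) {
    delays.push(next);
    total += next;
    next = Math.min(Math.ceil(next * o.factor), o.maxDelayMs);
  }
  return delays;
}

// Probes immediately, then after each scheduled wait. Resolves with the
// number of probes made, or rejects once the schedule is exhausted.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  o: ReadinessOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  let attempts = 0;
  for (const delay of [0, ...backoffDelays(o)]) {
    if (delay > 0) await sleep(delay);
    attempts++;
    let ready = false;
    try {
      ready = await probe();
    } catch {
      ready = false; // a failing probe counts as "not ready yet"
    }
    if (ready) return attempts;
  }
  throw new Error(`not ready after ${attempts} probes within ${o.budgetMs}ms`);
}
```

The `sleep` parameter is injectable only so the schedule can be exercised without real waiting. The budget bounds the waits between probes, not the time spent inside each probe, so a real probe would need its own timeout.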

2. Teardown SIGKILL window + immediate loopback recycle

killHarper() (harperLifecycle.js:310-329) sends SIGTERM, then escalates to SIGKILL after only 200ms, then resolves. teardownHarper() (:351-358) then immediately calls releaseLoopbackAddress(), returning the loopback IP to the pool for the next suite to grab. Issues:

  • 200ms is short for a graceful shutdown — teardownHarper's own comment notes "rocksdb may be flushing." The process is frequently SIGKILL'd mid-flush.
  • Harper spawns worker threads / child processes; proc.kill() signals only the direct child. Lingering workers can keep holding the fixed ports (9925/9926/9927/1883/8883) on that loopback IP after the parent is gone.
  • Because the ports are fixed and only the loopback address rotates between suites, the next suite that acquires the just-released IP can hit EADDRINUSE / connection races against sockets still in TIME_WAIT or held by an orphaned worker.
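One way to picture a longer, configurable grace period is the sketch below. It is an illustration under stated assumptions, not what killHarper() does: `Killable` and `stopWithGrace` are hypothetical names, and the function resolves only once an exit event is observed rather than after a fixed delay. It still signals only the direct child, so it does not address lingering workers.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical minimal shape; Node's ChildProcess satisfies it structurally.
interface Killable extends EventEmitter {
  kill(signal?: NodeJS.Signals | number): boolean;
  exitCode: number | null;
  signalCode: NodeJS.Signals | null;
}

// Sends SIGTERM, waits up to graceMs for the process to exit, then escalates
// to SIGKILL. Resolves only after "exit" fires, reporting which path was taken.
function stopWithGrace(proc: Killable, graceMs: number): Promise<"graceful" | "forced"> {
  return new Promise((resolve) => {
    // Already gone: nothing to signal.
    if (proc.exitCode !== null || proc.signalCode !== null) {
      resolve("graceful");
      return;
    }
    let forced = false;
    const timer = setTimeout(() => {
      forced = true;
      proc.kill("SIGKILL");
    }, graceMs);
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve(forced ? "forced" : "graceful");
    });
    proc.kill("SIGTERM");
  });
}
```

A caller could log the `"forced"` outcome to learn how often the grace period is actually too short on a given runner before settling on a default.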

Related observation: removeDeadProcessesFromPool (loopbackAddressPool.js:289-300) keys liveness on the test-runner PID stored in the pool, not the Harper child PID — so a runner killed while holding addresses (e.g. a timed-out shard) can leave its slot marked in-use until another process reaps it.
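As an illustration of keying liveness on more than the runner PID, the sketch below treats a slot as stale only when neither the runner nor the Harper child is alive. The `PoolEntry` shape, the `harperPid` field, and both function names are assumptions made for this example; the actual pool record format is not described in this issue.

```typescript
// Hypothetical record shape; the real pool format may differ.
interface PoolEntry {
  address: string;
  runnerPid: number;
  harperPid?: number; // assumed extra field, not known to exist in the pool today
}

// Signal 0 performs the existence/permission check without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// A slot is stale when the runner is dead and no Harper child is left behind.
function staleEntries(
  entries: PoolEntry[],
  alive: (pid: number) => boolean = isAlive,
): PoolEntry[] {
  return entries.filter(
    (e) => !alive(e.runnerPid) && (e.harperPid === undefined || !alive(e.harperPid)),
  );
}
```

Under this rule a slot whose runner died but whose Harper child survives stays reserved, which avoids handing out an address whose ports are still held.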

Impact

Downstream, these contribute to the "ECONNREFUSED on restart" / "address already in use" class of intermittent failures and are the reason the startup-timeout env workarounds exist. (HarperFast/harper Integration Tests have been red on main 7 of the last 8 runs; the dominant single cause is a specific suite, tracked separately downstream, but the harness races are a real secondary contributor.)

Not proposing a specific fix here

Just capturing the mechanics. Options to weigh: readiness polling instead of a fixed deadline; a configurable, longer SIGTERM grace before SIGKILL; verifying the ports are actually free (not just the address bindable) before recycling; tree-kill of worker children on teardown.
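For the port-verification option, a bind probe per fixed port might look like the sketch below. It is an assumption-laden illustration: `portIsFree` and `busyPorts` are hypothetical names, and the port list is simply the one quoted in this issue. A bind probe should catch a listener still held by an orphaned worker; it may not notice sockets in TIME_WAIT, since Node typically binds TCP servers with address reuse enabled on Unix.

```typescript
import { createServer } from "node:net";

// Ports listed in this issue as fixed per loopback address.
const HARPER_PORTS: readonly number[] = [9925, 9926, 9927, 1883, 8883];

// Tries to listen on host:port and closes again at once. Any listen error
// (EADDRINUSE, EADDRNOTAVAIL, EACCES, ...) is treated as "not free".
function portIsFree(host: string, port: number): Promise<boolean> {
  return new Promise((resolve) => {
    const server = createServer();
    server.once("error", () => resolve(false));
    server.once("listening", () => server.close(() => resolve(true)));
    server.listen({ host, port, exclusive: true });
  });
}

// Returns the subset of ports that could not be bound on the given address.
async function busyPorts(
  host: string,
  ports: readonly number[] = HARPER_PORTS,
): Promise<number[]> {
  const results = await Promise.all(
    ports.map(async (p) => ((await portIsFree(host, p)) ? null : p)),
  );
  return results.filter((p): p is number => p !== null);
}
```

A teardown path could hold the address back from the pool while `busyPorts(address)` is non-empty, retrying for a bounded time before giving up.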


🤖 Generated with Claude Code

Labels: bug (Something isn't working)
