startHarper readiness deadline and teardown loopback recycle cause intermittent failures under sharded CI

## Summary

Two races in the Harper lifecycle helpers (`src/harperLifecycle.ts` / `dist/harperLifecycle.js`) are a recurring source of intermittent failures for consumers that run sharded, concurrent integration suites — most visibly [HarperFast/harper](https://github.com/HarperFast/harper)'s Integration Tests workflow, which currently has to work around both. Filing here because the fix lives in this package.

> Tracked downstream in HarperFast/harper#1139.

## 1. Fixed startup deadline (`DEFAULT_STARTUP_TIMEOUT_MS`)

`startHarper()` resolves only when Harper prints `successfully started` on stdout; otherwise it rejects after `DEFAULT_STARTUP_TIMEOUT_MS`, which defaults to **60s** (`harperLifecycle.js:27`, read in `runHarperCommand` at `:155-162`). Under shard contention on shared CI runners, install + start regularly exceeds 60s, so the consumer papers over it with env bumps:

- HarperFast/harper sets `HARPER_INTEGRATION_TEST_STARTUP_TIMEOUT_MS=120000` (Linux) and `180000` (Windows), with workflow comments explicitly chasing the deadline ("Slow runners ... may need more than the default 60s"; "Windows runners are noticeably slower than the Linux pool").

The fixed deadline is what turns runner slowness into flakes: once it expires, a slow-but-healthy boot is treated exactly like a hang. Worth weighing a readiness poll / health probe with backoff instead of a single wall-clock deadline, and/or a higher CI-aware default.
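To make the readiness-poll option concrete, here is a minimal sketch, not a proposed patch. `PollOptions`, `backoffSchedule`, and `waitForReady` are hypothetical names that do not exist in this package, and the probe is left to the caller because what counts as "ready" for Harper is an open question here. An overall budget is still kept so a genuine hang eventually fails.

```typescript
// Hypothetical helper for illustration; none of these names exist in the package.
interface PollOptions {
  initialDelayMs: number; // wait before the second probe
  maxDelayMs: number; // cap on any single wait
  budgetMs: number; // overall ceiling, still needed to catch real hangs
}

// Pure function: the sequence of waits, doubling up to the cap, summing to the budget.
function backoffSchedule(opts: PollOptions): number[] {
  if (opts.initialDelayMs <= 0 || opts.maxDelayMs <= 0) {
    throw new RangeError("delays must be positive");
  }
  const delays: number[] = [];
  let delay = opts.initialDelayMs;
  let elapsed = 0;
  while (elapsed < opts.budgetMs) {
    const wait = Math.min(delay, opts.maxDelayMs, opts.budgetMs - elapsed);
    delays.push(wait);
    elapsed += wait;
    delay *= 2;
  }
  return delays;
}

// Probes immediately, then after each scheduled wait. Resolves true on the first
// successful probe and false once the budget is spent. The probe is expected to
// catch its own errors (e.g. a refused connection) and return false.
async function waitForReady(
  probe: () => Promise<boolean>,
  opts: PollOptions,
): Promise<boolean> {
  if (await probe()) return true;
  for (const wait of backoffSchedule(opts)) {
    await new Promise<void>((resolve) => setTimeout(resolve, wait));
    if (await probe()) return true;
  }
  return false;
}
```

With `{ initialDelayMs: 100, maxDelayMs: 400, budgetMs: 1000 }` the waits are 100, 200, 400, 300 ms. A real implementation would probably also watch for the child exiting, so a crashed boot fails fast instead of consuming the whole budget.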

## 2. Teardown SIGKILL window + immediate loopback recycle

`killHarper()` (`harperLifecycle.js:310-329`) sends SIGTERM, then escalates to **SIGKILL after only 200ms**, then resolves. `teardownHarper()` (`:351-358`) then *immediately* calls `releaseLoopbackAddress()`, returning the loopback IP to the pool for the next suite to grab. Issues:

- 200ms is short for a graceful shutdown — `teardownHarper`'s own comment notes "rocksdb may be flushing." The process is frequently SIGKILL'd mid-flush.
- Harper spawns worker threads / child processes; `proc.kill()` signals only the direct child. Lingering workers can keep holding the **fixed ports** (9925/9926/9927/1883/8883) on that loopback IP after the parent is gone.
- Because the ports are fixed and only the loopback *address* rotates between suites, the next suite that acquires the just-released IP can hit `EADDRINUSE` / connection races against sockets still in `TIME_WAIT` or held by an orphaned worker.
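The teardown sequence described above could be tightened along these lines. This is a sketch under assumptions: `terminateWithGrace`, `canBind`, `waitForPortsFree`, and `BindProbe` are hypothetical names, and the grace and timeout values are the caller's choice. It uses only Node's `child_process` and `net` modules. The bind probe reports whether a new listener can bind right now, which catches a listener held by an orphaned worker; it does not by itself reap that worker.

```typescript
import type { ChildProcess } from "node:child_process";
import { createServer } from "node:net";

type BindProbe = (host: string, port: number) => Promise<boolean>;

// SIGTERM first, SIGKILL only after a configurable grace period. Signals the
// direct child only; reaping worker children (tree-kill) is out of scope here.
function terminateWithGrace(proc: ChildProcess, graceMs: number): Promise<void> {
  return new Promise<void>((resolve) => {
    if (proc.exitCode !== null || proc.signalCode !== null) {
      resolve(); // already exited
      return;
    }
    const timer = setTimeout(() => proc.kill("SIGKILL"), graceMs);
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve();
    });
    proc.kill("SIGTERM");
  });
}

// Default probe: true if a new listener can bind host:port right now.
const canBind: BindProbe = (host, port) =>
  new Promise<boolean>((resolve) => {
    const server = createServer();
    server.once("error", () => resolve(false));
    server.once("listening", () => server.close(() => resolve(true)));
    server.listen(port, host);
  });

// Polls until every port is bindable or the timeout passes. Returns the ports
// still busy at that point; an empty array means the address is safe to recycle.
async function waitForPortsFree(
  host: string,
  ports: number[],
  timeoutMs: number,
  intervalMs = 100,
  probe: BindProbe = canBind,
): Promise<number[]> {
  const deadline = Date.now() + timeoutMs;
  let busy = [...ports];
  for (;;) {
    const free = await Promise.all(busy.map((port) => probe(host, port)));
    busy = busy.filter((_, i) => !free[i]);
    if (busy.length === 0 || Date.now() >= deadline) return busy;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In this shape, teardown would await `terminateWithGrace`, then `waitForPortsFree` for 9925/9926/9927/1883/8883 on the suite's loopback IP, and release the address only if the returned array is empty. A non-empty result is a useful diagnostic in itself, since it names the port that a lingering process still holds.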

Related observation: `removeDeadProcessesFromPool` (`loopbackAddressPool.js:289-300`) keys liveness on the **test-runner** PID stored in the pool, not the Harper child PID — so a runner killed while holding addresses (e.g. a timed-out shard) can leave its slot marked in-use until another process reaps it.
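One way to picture the liveness gap is a check that considers both PIDs. The sketch below is illustrative only: the `PoolEntry` shape, including the `harperPid` field, is an assumption and not what `loopbackAddressPool.js` stores today. `process.kill(pid, 0)` is the standard Node existence check and delivers no signal.

```typescript
// Hypothetical pool entry; harperPid is an assumed field, not the current schema.
interface PoolEntry {
  address: string;
  runnerPid: number;
  harperPid?: number;
}

type SlotState = "in-use" | "orphaned-child" | "reclaimable";

function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it; ESRCH means it is gone.
    return (err as { code?: string }).code === "EPERM";
  }
}

// A slot is only safe to hand out when neither the runner nor its Harper child
// is alive. A dead runner with a live child is the case that can still hold the
// fixed ports on the address.
function classifySlot(
  entry: PoolEntry,
  alive: (pid: number) => boolean = isProcessAlive,
): SlotState {
  if (alive(entry.runnerPid)) return "in-use";
  if (entry.harperPid !== undefined && alive(entry.harperPid)) return "orphaned-child";
  return "reclaimable";
}
```

The `orphaned-child` state is the interesting one: the runner-PID check alone would report that slot as dead and reclaimable while the Harper process may still be bound to the ports.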

## Impact

Downstream, these contribute to the "ECONNREFUSED on restart" / "address already in use" class of intermittent failures and are the reason the startup-timeout env workarounds exist. (HarperFast/harper Integration Tests have been red on `main` 7 of the last 8 runs; the dominant single cause is a specific suite, tracked separately downstream, but the harness races are a real secondary contributor.)

## Not proposing a specific fix here

Just capturing the mechanics. Options to weigh: readiness polling instead of a fixed deadline; a configurable, longer SIGTERM grace before SIGKILL; verifying the ports are actually free (not just the address bindable) before recycling; tree-kill of worker children on teardown.

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

