Skip to content

fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910)#1911

Open
fancyboi999 wants to merge 4 commits into
MemTensor:mainfrom
fancyboi999:fix/hermes-bridge-process-leak-1910
Open

fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910)#1911
fancyboi999 wants to merge 4 commits into
MemTensor:mainfrom
fancyboi999:fix/hermes-bridge-process-leak-1910

Conversation

@fancyboi999

Copy link
Copy Markdown
Contributor

Summary

Fixes #1910 — every conversation turn could spawn a new bridge.cjs --agent=hermes --no-viewer process, with old processes never cleaned up (4+ accumulated, 70–340 MB RSS each).

Root cause

Each MemTensorProvider instance owned its own Node bridge subprocess, but the Hermes host gives no lifecycle guarantee: load_memory_provider() constructs a fresh provider instance on every call (the module is cached in sys.modules, the instance is not), and hosts before NousResearch/hermes-agent#27190 rebuilt a provider per background-review fork — i.e. per turn — without ever calling shutdown(). Each abandoned instance leaked one live bridge process, pinned forever by three reference legs (all reproduced deterministically with the unmodified adapter):

  1. bound-method hooks appended to the host's global plugin manager and never removed (mgr._hooks.setdefault(...).append(self._on_post_tool_call)) — also duplicating hook work N×;
  2. a per-instance keepalive thread that even resurrected the bridge after a manual kill -9 (verified: killed pid respawned within one 5 s cycle);
  3. the reader-thread ↔ stdin-pipe cycle blocking GC, so the child never received the stdin-EOF exit signal.

Repro: 3 abandoned provider instances → exactly 4 live --no-viewer processes, matching the ps output in the issue.

Fix — enforce "one bridge per host process" structurally

New bridge_supervisor.py moves bridge ownership off provider instances onto a process-wide supervisor:

  • One shared MemosBridgeClient per host process, refcounted via a WeakSet of holders with generation-aware replace/respawn. The JSON-RPC protocol is already multi-session (session.open carries sessionId; the --daemon viewer shares a core the same way), so sharing is semantically free.
  • One process-wide keepalive bound to the shared client: respawns it after a crash (holders > 0 guard so a zero-holder client can never be revived; exponential backoff when the factory keeps failing), recreates a client lost while holders remain, and stops with the last holder.
  • Host hooks installed once per process, dispatched to live providers through a weakref registry; host.llm.complete is registered via a WeakMethod wrapper that falls through to a surviving provider when the registrant is GC'd.
  • initialize() is idempotent (re-init no longer leaks the previous bridge); reader threads hold only a weakref and a weakref.finalize backstop closes the pipes if a client is dropped without close() — abandoned providers are now actually garbage-collectable.

Verification

Real-subprocess surface (real Popen/pipes/Node child, lifecycle-equivalent stub bridge):

Scenario Before After
3 providers abandoned without shutdown() 4 live processes 1
initialize() called twice on one provider +1 leaked process 1
kill -9 the bridge of an abandoned provider resurrected per orphan exactly one shared respawn; turn roundtrip works across it
all live providers shutdown() processes lived on 0 processes, keepalive stopped

Tests: 12 new lifecycle tests (tests/python/test_hermes_bridge_lifecycle.py) pin the invariants — shared-client, GC-collectability, hooks-once, respawn guards, stop-event race, WeakMethod fall-through. All 54 Python tests green (incl. -W error::ResourceWarning); ruff check clean. Two independent adversarial reviews (incl. 2000–3000-trial barrier-synchronized race repros on the supervisor) confirmed the "zero holders ⇒ no live client" invariant and lock ordering; every finding (3 major + keepalive stop-event race) is fixed with a dedicated regression test.

Notes

  • Python-only change inside adapters/hermes/; the TS core, OpenClaw adapter, and bridge protocol are untouched.
  • The single-process RSS growth mentioned in the issue (343 MB) is a separate concern (bridge-side memory), not addressed here — this PR eliminates the process-count leak and the resurrection behavior.
  • Host-side hardening (e.g. skip_memory=True for oneshot.py) is worth a follow-up PR to hermes-agent; this fix makes the adapter robust regardless of host behavior.

…t per host process (MemTensor#1910)

Every MemTensorProvider instance owned its own Node bridge subprocess,
but the Hermes host gives no lifecycle guarantee: load_memory_provider()
constructs a fresh provider per call, and pre-#27190 hosts rebuilt one
per background-review fork — every turn — without ever calling
shutdown(). Each abandoned instance leaked a live bridge.cjs process,
pinned forever by three reference legs:

  1. bound-method hooks appended to the host's global plugin manager
     and never removed (also duplicating hook work N times);
  2. a per-instance keepalive thread that even resurrected the bridge
     after a manual kill -9;
  3. the reader thread <-> stdin pipe cycle blocking GC, so the child
     never saw stdin EOF.

Fix: ownership moves off provider instances onto a process-wide
supervisor (new bridge_supervisor.py):

  - one shared MemosBridgeClient per host process, WeakSet-refcounted,
    generation-aware replace/respawn (the JSON-RPC protocol is already
    multi-session, so sharing is semantically free);
  - one process-wide keepalive bound to the shared client: respawns it
    after a crash (with holders>0 guard + backoff), stops with the last
    holder, never revives zero-holder clients;
  - host hooks installed once per process, dispatched to live providers
    through a weakref registry; host.llm.complete registered via a
    WeakMethod wrapper that falls through to a surviving provider;
  - initialize() is idempotent; reader threads hold only a weakref and
    a finalize() backstop closes the pipes if a client is dropped
    without close().

Verified against the real subprocess surface: 3 abandoned providers ->
1 live process (was 4+); kill -9 -> exactly one respawn; all providers
shutdown -> 0 processes, keepalive stopped. 12 new lifecycle tests pin
the invariants; existing 42 Python tests stay green.

Fixes MemTensor#1910
Copilot AI review requested due to automatic review settings June 12, 2026 03:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens the Hermes MemOS adapter against process leaks by centralizing bridge lifecycle management and ensuring provider instances don’t keep bridge subprocesses/threads alive unintentionally.

Changes:

  • Introduces a process-wide BridgeSupervisor to share a single bridge client across providers and manage keepalive/respawn.
  • Refactors MemosBridgeClient reader threads to avoid strong references (using weakrefs + finalizers) and closes pipes more deterministically.
  • Adds/updates Python tests to isolate process-wide state and validate lifecycle/GC/respawn behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
apps/memos-local-plugin/tests/python/test_hermes_provider_pipeline.py Resets shared bridge runtime between tests to avoid cross-test interference.
apps/memos-local-plugin/tests/python/test_hermes_bridge_lifecycle.py Adds comprehensive lifecycle tests for single-client sharing, hook dedupe, GC, and keepalive respawn.
apps/memos-local-plugin/tests/python/test_bridge_client.py Ensures process-wide bridge runtime is reset in setUp/tearDown for isolation.
apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_supervisor.py New supervisor module implementing shared client ownership, keepalive, respawn, and weak provider registry + hook dispatch.
apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_client.py Switches stdout/stderr pumps to module-level functions with weakrefs and adds GC finalizer + pipe cleanup.
apps/memos-local-plugin/adapters/hermes/memos_provider/init.py Integrates supervisor ownership into provider lifecycle; dedupes hooks and avoids strong handler pinning.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…tract

Found by real-bridge E2E for MemTensor#1910: the core contract
(agent-contract/memory-core.ts onTurnEnd) returns
{ traceId, episodeId }, but the adapter read the plural `traceIds`,
so `_last_trace_id` was always empty against the real core and
verifier-feedback never achieved trace-level binding (silently fell
back to episode scope). Pre-existing bug on main, independent of the
process-leak fix; the plural form is kept as a fallback.
@fancyboi999

Copy link
Copy Markdown
Contributor Author

Real-bridge E2E results (compiled dist/bridge.cjs: real MemoryCore + SQLite + migrations, isolated MEMOS_HOME, no stubs) — 14/14 checks pass:

  • 3 providers abandoned without shutdown() → exactly 1 live process throughout (was 4+ before the fix); all 3 abandoned instances garbage-collected
  • initialize() re-entry → still 1 process
  • Full turn roundtrip through the real core: session.openturn.start (ep_xrx3q1v7rfka) → turn.end (tr_hhdgb216wc9a persisted)
  • kill -9 the shared bridge → exactly one replacement spawned by the keepalive, no runaway loop; post-crash turn rebinds lazily and recall survives via SQLite (772-char injected context referencing the pre-crash fact)
  • All providers shut down → 0 processes, keepalive thread stopped

The E2E also surfaced a pre-existing bug on main, fixed in 3ecc419: the adapter read the plural traceIds from turn.end, but the core contract (memory-core.ts onTurnEnd) returns a singular { traceId, episodeId } — so _last_trace_id was always empty against the real core and verifier feedback silently lost trace-level binding. Independent of the leak fix, but it lives in the same adapter and is now covered by an assertion in the pipeline tests.

Full suite: 54 Python tests green (incl. -W error::ResourceWarning), ruff clean.

@fancyboi999

Copy link
Copy Markdown
Contributor Author

Real-host verification — installed into a real hermes-agent install (v0.8.0, which predates the host-side skip_memory fixes) with GitHub Copilot (copilot-acp) as the live LLM, memory.provider: memtensor pointing at this branch via the bundled-plugin path:

  • 419 process samples across all scenarios (0.5 s interval): --no-viewer bridge count max = 1, never 2
  • Two separate hermes chat -q sessions: bridge spawns once per host process, exits cleanly with it (0 stragglers)
  • Cross-session memory recall works end-to-end: session 2 answered What does HERMES_REAL_E2E_1910 mean? correctly with zero tool calls — i.e. the fact persisted via turn.end → SQLite in session 1 and was injected by turn.start prefetch in session 2
  • Interactive TUI session, 3 turns in one host process (the issue's per-turn scenario): bridge count stayed at exactly 1 after every turn
  • After exit: 0 stdio bridges; only the by-design singleton --daemon viewer remains

Caveat for completeness: this host version's background-review fork wasn't explicitly triggered during the short sessions (it's cadence-gated); that abandonment scenario is covered by the lifecycle unit tests and the real-bridge E2E above.

…hook install, deterministic test reset

- _handle_line: guard non-object JSON (null/list/string) before dict-key
  access — previously a stray line could raise TypeError and kill the
  reader thread (pre-existing in _read_loop, now covered by a test)
- install_host_hooks: replace the boolean short-circuit with an
  idempotent presence check against the *current* plugin manager, so a
  host that rebuilds the manager or clears _hooks on plugin reload gets
  the dispatchers re-installed instead of going silently unhooked
- reset_for_tests: capture the keepalive thread reference before
  _stop_keepalive_locked() clears it, so the join actually runs and
  test resets are deterministic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bridge process leak: every turn spawns new bridge.cjs, 4+ processes accumulated

2 participants