fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910)#1911
Conversation
…t per host process (MemTensor#1910) Every MemTensorProvider instance owned its own Node bridge subprocess, but the Hermes host gives no lifecycle guarantee: load_memory_provider() constructs a fresh provider per call, and pre-#27190 hosts rebuilt one per background-review fork — every turn — without ever calling shutdown(). Each abandoned instance leaked a live bridge.cjs process, pinned forever by three reference legs: 1. bound-method hooks appended to the host's global plugin manager and never removed (also duplicating hook work N times); 2. a per-instance keepalive thread that even resurrected the bridge after a manual kill -9; 3. the reader thread <-> stdin pipe cycle blocking GC, so the child never saw stdin EOF. Fix: ownership moves off provider instances onto a process-wide supervisor (new bridge_supervisor.py): - one shared MemosBridgeClient per host process, WeakSet-refcounted, generation-aware replace/respawn (the JSON-RPC protocol is already multi-session, so sharing is semantically free); - one process-wide keepalive bound to the shared client: respawns it after a crash (with holders>0 guard + backoff), stops with the last holder, never revives zero-holder clients; - host hooks installed once per process, dispatched to live providers through a weakref registry; host.llm.complete registered via a WeakMethod wrapper that falls through to a surviving provider; - initialize() is idempotent; reader threads hold only a weakref and a finalize() backstop closes the pipes if a client is dropped without close(). Verified against the real subprocess surface: 3 abandoned providers -> 1 live process (was 4+); kill -9 -> exactly one respawn; all providers shutdown -> 0 processes, keepalive stopped. 12 new lifecycle tests pin the invariants; existing 42 Python tests stay green. Fixes MemTensor#1910
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR hardens the Hermes MemOS adapter against process leaks by centralizing bridge lifecycle management and ensuring provider instances don’t keep bridge subprocesses/threads alive unintentionally.
Changes:
- Introduces a process-wide
BridgeSupervisorto share a single bridge client across providers and manage keepalive/respawn. - Refactors
MemosBridgeClientreader threads to avoid strong references (using weakrefs + finalizers) and closes pipes more deterministically. - Adds/updates Python tests to isolate process-wide state and validate lifecycle/GC/respawn behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/memos-local-plugin/tests/python/test_hermes_provider_pipeline.py | Resets shared bridge runtime between tests to avoid cross-test interference. |
| apps/memos-local-plugin/tests/python/test_hermes_bridge_lifecycle.py | Adds comprehensive lifecycle tests for single-client sharing, hook dedupe, GC, and keepalive respawn. |
| apps/memos-local-plugin/tests/python/test_bridge_client.py | Ensures process-wide bridge runtime is reset in setUp/tearDown for isolation. |
| apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_supervisor.py | New supervisor module implementing shared client ownership, keepalive, respawn, and weak provider registry + hook dispatch. |
| apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_client.py | Switches stdout/stderr pumps to module-level functions with weakrefs and adds GC finalizer + pipe cleanup. |
| apps/memos-local-plugin/adapters/hermes/memos_provider/init.py | Integrates supervisor ownership into provider lifecycle; dedupes hooks and avoids strong handler pinning. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tract Found by real-bridge E2E for MemTensor#1910: the core contract (agent-contract/memory-core.ts onTurnEnd) returns { traceId, episodeId }, but the adapter read the plural `traceIds`, so `_last_trace_id` was always empty against the real core and verifier-feedback never achieved trace-level binding (silently fell back to episode scope). Pre-existing bug on main, independent of the process-leak fix; the plural form is kept as a fallback.
|
Real-bridge E2E results (compiled
The E2E also surfaced a pre-existing bug on Full suite: 54 Python tests green (incl. |
|
Real-host verification — installed into a real
Caveat for completeness: this host version's background-review fork wasn't explicitly triggered during the short sessions (it's cadence-gated); that abandonment scenario is covered by the lifecycle unit tests and the real-bridge E2E above. |
…hook install, deterministic test reset - _handle_line: guard non-object JSON (null/list/string) before dict-key access — previously a stray line could raise TypeError and kill the reader thread (pre-existing in _read_loop, now covered by a test) - install_host_hooks: replace the boolean short-circuit with an idempotent presence check against the *current* plugin manager, so a host that rebuilds the manager or clears _hooks on plugin reload gets the dispatchers re-installed instead of going silently unhooked - reset_for_tests: capture the keepalive thread reference before _stop_keepalive_locked() clears it, so the join actually runs and test resets are deterministic
Summary
Fixes #1910 — every conversation turn could spawn a new
bridge.cjs --agent=hermes --no-viewerprocess, with old processes never cleaned up (4+ accumulated, 70–340 MB RSS each).Root cause
Each
MemTensorProviderinstance owned its own Node bridge subprocess, but the Hermes host gives no lifecycle guarantee:load_memory_provider()constructs a fresh provider instance on every call (the module is cached insys.modules, the instance is not), and hosts before NousResearch/hermes-agent#27190 rebuilt a provider per background-review fork — i.e. per turn — without ever callingshutdown(). Each abandoned instance leaked one live bridge process, pinned forever by three reference legs (all reproduced deterministically with the unmodified adapter):mgr._hooks.setdefault(...).append(self._on_post_tool_call)) — also duplicating hook work N×;kill -9(verified: killed pid respawned within one 5 s cycle);Repro: 3 abandoned provider instances → exactly 4 live
--no-viewerprocesses, matching thepsoutput in the issue.Fix — enforce "one bridge per host process" structurally
New
bridge_supervisor.pymoves bridge ownership off provider instances onto a process-wide supervisor:MemosBridgeClientper host process, refcounted via aWeakSetof holders with generation-awarereplace/respawn. The JSON-RPC protocol is already multi-session (session.opencarriessessionId; the--daemonviewer shares a core the same way), so sharing is semantically free.holders > 0guard so a zero-holder client can never be revived; exponential backoff when the factory keeps failing), recreates a client lost while holders remain, and stops with the last holder.host.llm.completeis registered via aWeakMethodwrapper that falls through to a surviving provider when the registrant is GC'd.initialize()is idempotent (re-init no longer leaks the previous bridge); reader threads hold only a weakref and aweakref.finalizebackstop closes the pipes if a client is dropped withoutclose()— abandoned providers are now actually garbage-collectable.Verification
Real-subprocess surface (real
Popen/pipes/Node child, lifecycle-equivalent stub bridge):shutdown()initialize()called twice on one providerkill -9the bridge of an abandoned providershutdown()Tests: 12 new lifecycle tests (
tests/python/test_hermes_bridge_lifecycle.py) pin the invariants — shared-client, GC-collectability, hooks-once, respawn guards, stop-event race, WeakMethod fall-through. All 54 Python tests green (incl.-W error::ResourceWarning);ruff checkclean. Two independent adversarial reviews (incl. 2000–3000-trial barrier-synchronized race repros on the supervisor) confirmed the "zero holders ⇒ no live client" invariant and lock ordering; every finding (3 major + keepalive stop-event race) is fixed with a dedicated regression test.Notes
adapters/hermes/; the TS core, OpenClaw adapter, and bridge protocol are untouched.skip_memory=Trueforoneshot.py) is worth a follow-up PR to hermes-agent; this fix makes the adapter robust regardless of host behavior.