fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910) by fancyboi999 · Pull Request #1911 · MemTensor/MemOS

fancyboi999 · 2026-06-12T03:26:08Z

Summary

Fixes #1910 — every conversation turn could spawn a new bridge.cjs --agent=hermes --no-viewer process, with old processes never cleaned up (4+ accumulated, 70–340 MB RSS each).

Root cause

Each MemTensorProvider instance owned its own Node bridge subprocess, but the Hermes host gives no lifecycle guarantee: load_memory_provider() constructs a fresh provider instance on every call (the module is cached in sys.modules, the instance is not), and hosts before NousResearch/hermes-agent#27190 rebuilt a provider per background-review fork — i.e. per turn — without ever calling shutdown(). Each abandoned instance leaked one live bridge process, pinned forever by three reference legs (all reproduced deterministically with the unmodified adapter):

bound-method hooks appended to the host's global plugin manager and never removed (mgr._hooks.setdefault(...).append(self._on_post_tool_call)) — also duplicating hook work N×;
a per-instance keepalive thread that even resurrected the bridge after a manual kill -9 (verified: killed pid respawned within one 5 s cycle);
the reader-thread ↔ stdin-pipe cycle blocking GC, so the child never received the stdin-EOF exit signal.

Repro: 3 abandoned provider instances → exactly 4 live --no-viewer processes, matching the ps output in the issue.

Fix — enforce "one bridge per host process" structurally

New bridge_supervisor.py moves bridge ownership off provider instances onto a process-wide supervisor:

One shared MemosBridgeClient per host process, refcounted via a WeakSet of holders with generation-aware replace/respawn. The JSON-RPC protocol is already multi-session (session.open carries sessionId; the --daemon viewer shares a core the same way), so sharing is semantically free.
One process-wide keepalive bound to the shared client: respawns it after a crash (holders > 0 guard so a zero-holder client can never be revived; exponential backoff when the factory keeps failing), recreates a client lost while holders remain, and stops with the last holder.
Host hooks installed once per process, dispatched to live providers through a weakref registry; host.llm.complete is registered via a WeakMethod wrapper that falls through to a surviving provider when the registrant is GC'd.
initialize() is idempotent (re-init no longer leaks the previous bridge); reader threads hold only a weakref and a weakref.finalize backstop closes the pipes if a client is dropped without close() — abandoned providers are now actually garbage-collectable.

Verification

Real-subprocess surface (real Popen/pipes/Node child, lifecycle-equivalent stub bridge):

Scenario	Before	After
3 providers abandoned without `shutdown()`	4 live processes	1
`initialize()` called twice on one provider	+1 leaked process	1
`kill -9` the bridge of an abandoned provider	resurrected per orphan	exactly one shared respawn; turn roundtrip works across it
all live providers `shutdown()`	processes lived on	0 processes, keepalive stopped

Tests: 12 new lifecycle tests (tests/python/test_hermes_bridge_lifecycle.py) pin the invariants — shared-client, GC-collectability, hooks-once, respawn guards, stop-event race, WeakMethod fall-through. All 54 Python tests green (incl. -W error::ResourceWarning); ruff check clean. Two independent adversarial reviews (incl. 2000–3000-trial barrier-synchronized race repros on the supervisor) confirmed the "zero holders ⇒ no live client" invariant and lock ordering; every finding (3 major + keepalive stop-event race) is fixed with a dedicated regression test.

Notes

Python-only change inside adapters/hermes/; the TS core, OpenClaw adapter, and bridge protocol are untouched.
The single-process RSS growth mentioned in the issue (343 MB) is a separate concern (bridge-side memory), not addressed here — this PR eliminates the process-count leak and the resurrection behavior.
Host-side hardening (e.g. skip_memory=True for oneshot.py) is worth a follow-up PR to hermes-agent; this fix makes the adapter robust regardless of host behavior.

…t per host process (MemTensor#1910) Every MemTensorProvider instance owned its own Node bridge subprocess, but the Hermes host gives no lifecycle guarantee: load_memory_provider() constructs a fresh provider per call, and pre-#27190 hosts rebuilt one per background-review fork — every turn — without ever calling shutdown(). Each abandoned instance leaked a live bridge.cjs process, pinned forever by three reference legs: 1. bound-method hooks appended to the host's global plugin manager and never removed (also duplicating hook work N times); 2. a per-instance keepalive thread that even resurrected the bridge after a manual kill -9; 3. the reader thread <-> stdin pipe cycle blocking GC, so the child never saw stdin EOF. Fix: ownership moves off provider instances onto a process-wide supervisor (new bridge_supervisor.py): - one shared MemosBridgeClient per host process, WeakSet-refcounted, generation-aware replace/respawn (the JSON-RPC protocol is already multi-session, so sharing is semantically free); - one process-wide keepalive bound to the shared client: respawns it after a crash (with holders>0 guard + backoff), stops with the last holder, never revives zero-holder clients; - host hooks installed once per process, dispatched to live providers through a weakref registry; host.llm.complete registered via a WeakMethod wrapper that falls through to a surviving provider; - initialize() is idempotent; reader threads hold only a weakref and a finalize() backstop closes the pipes if a client is dropped without close(). Verified against the real subprocess surface: 3 abandoned providers -> 1 live process (was 4+); kill -9 -> exactly one respawn; all providers shutdown -> 0 processes, keepalive stopped. 12 new lifecycle tests pin the invariants; existing 42 Python tests stay green. Fixes MemTensor#1910

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens the Hermes MemOS adapter against process leaks by centralizing bridge lifecycle management and ensuring provider instances don’t keep bridge subprocesses/threads alive unintentionally.

Changes:

Introduces a process-wide BridgeSupervisor to share a single bridge client across providers and manage keepalive/respawn.
Refactors MemosBridgeClient reader threads to avoid strong references (using weakrefs + finalizers) and closes pipes more deterministically.
Adds/updates Python tests to isolate process-wide state and validate lifecycle/GC/respawn behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
apps/memos-local-plugin/tests/python/test_hermes_provider_pipeline.py	Resets shared bridge runtime between tests to avoid cross-test interference.
apps/memos-local-plugin/tests/python/test_hermes_bridge_lifecycle.py	Adds comprehensive lifecycle tests for single-client sharing, hook dedupe, GC, and keepalive respawn.
apps/memos-local-plugin/tests/python/test_bridge_client.py	Ensures process-wide bridge runtime is reset in setUp/tearDown for isolation.
apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_supervisor.py	New supervisor module implementing shared client ownership, keepalive, respawn, and weak provider registry + hook dispatch.
apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_client.py	Switches stdout/stderr pumps to module-level functions with weakrefs and adds GC finalizer + pipe cleanup.
apps/memos-local-plugin/adapters/hermes/memos_provider/init.py	Integrates supervisor ownership into provider lifecycle; dedupes hooks and avoids strong handler pinning.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…tract Found by real-bridge E2E for MemTensor#1910: the core contract (agent-contract/memory-core.ts onTurnEnd) returns { traceId, episodeId }, but the adapter read the plural `traceIds`, so `_last_trace_id` was always empty against the real core and verifier-feedback never achieved trace-level binding (silently fell back to episode scope). Pre-existing bug on main, independent of the process-leak fix; the plural form is kept as a fallback.

fancyboi999 · 2026-06-12T03:30:45Z

Real-bridge E2E results (compiled dist/bridge.cjs: real MemoryCore + SQLite + migrations, isolated MEMOS_HOME, no stubs) — 14/14 checks pass:

3 providers abandoned without shutdown() → exactly 1 live process throughout (was 4+ before the fix); all 3 abandoned instances garbage-collected
initialize() re-entry → still 1 process
Full turn roundtrip through the real core: session.open → turn.start (ep_xrx3q1v7rfka) → turn.end (tr_hhdgb216wc9a persisted)
kill -9 the shared bridge → exactly one replacement spawned by the keepalive, no runaway loop; post-crash turn rebinds lazily and recall survives via SQLite (772-char injected context referencing the pre-crash fact)
All providers shut down → 0 processes, keepalive thread stopped

The E2E also surfaced a pre-existing bug on main, fixed in 3ecc419: the adapter read the plural traceIds from turn.end, but the core contract (memory-core.ts onTurnEnd) returns a singular { traceId, episodeId } — so _last_trace_id was always empty against the real core and verifier feedback silently lost trace-level binding. Independent of the leak fix, but it lives in the same adapter and is now covered by an assertion in the pipeline tests.

Full suite: 54 Python tests green (incl. -W error::ResourceWarning), ruff clean.

fancyboi999 · 2026-06-12T03:42:49Z

Real-host verification — installed into a real hermes-agent install (v0.8.0, which predates the host-side skip_memory fixes) with GitHub Copilot (copilot-acp) as the live LLM, memory.provider: memtensor pointing at this branch via the bundled-plugin path:

419 process samples across all scenarios (0.5 s interval): --no-viewer bridge count max = 1, never 2
Two separate hermes chat -q sessions: bridge spawns once per host process, exits cleanly with it (0 stragglers)
Cross-session memory recall works end-to-end: session 2 answered What does HERMES_REAL_E2E_1910 mean? correctly with zero tool calls — i.e. the fact persisted via turn.end → SQLite in session 1 and was injected by turn.start prefetch in session 2
Interactive TUI session, 3 turns in one host process (the issue's per-turn scenario): bridge count stayed at exactly 1 after every turn
After exit: 0 stdio bridges; only the by-design singleton --daemon viewer remains

Caveat for completeness: this host version's background-review fork wasn't explicitly triggered during the short sessions (it's cadence-gated); that abandonment scenario is covered by the lifecycle unit tests and the real-bridge E2E above.

…hook install, deterministic test reset - _handle_line: guard non-object JSON (null/list/string) before dict-key access — previously a stray line could raise TypeError and kill the reader thread (pre-existing in _read_loop, now covered by a test) - install_host_hooks: replace the boolean short-circuit with an idempotent presence check against the *current* plugin manager, so a host that rebuilds the manager or clears _hooks on plugin reload gets the dispatchers re-installed instead of going silently unhooked - reset_for_tests: capture the keepalive thread reference before _stop_keepalive_locked() clears it, so the join actually runs and test resets are deterministic

Copilot AI review requested due to automatic review settings June 12, 2026 03:26

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_supervisor.py

Comment thread apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_client.py

Comment thread apps/memos-local-plugin/adapters/hermes/memos_provider/bridge_supervisor.py

style: apply ruff format to lifecycle tests (CI ruff format --check)

270c9cc

fancyboi999 mentioned this pull request Jun 12, 2026

turn.start injection collapses to empty when top hits are dominated by near-duplicate question traces #1913

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910)#1911

fix(hermes-adapter): stop bridge subprocess leak — share one bridge client per host process (#1910)#1911
fancyboi999 wants to merge 4 commits into
MemTensor:mainfrom
fancyboi999:fix/hermes-bridge-process-leak-1910

fancyboi999 commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fancyboi999 commented Jun 12, 2026

Uh oh!

fancyboi999 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fancyboi999 commented Jun 12, 2026

Summary

Root cause

Fix — enforce "one bridge per host process" structurally

Verification

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fancyboi999 commented Jun 12, 2026

Uh oh!

fancyboi999 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants