Server: fix agent-loop cache misses, add cancellation, observability, and robustness fixes by elkaix · Pull Request #489 · antirez/ds4

elkaix · 2026-07-02T21:53:57Z

This series came out of running ds4-server as the daily backend for a
tool-calling coding agent (OpenAI chat completions dialect, 45K-90K token
conversations) on the project's reference M5 Max. In that workload every
agent turn was paying a full-state evict store (~650 MiB) plus a disk
snapshot restore and a 12-25K-token tail re-prefill (~80-100 s per turn),
because nothing bound a chat tool-call turn to the next request. The series
fixes that (turns now cost only their genuinely new tokens, 2-3 s), plus the
robustness and observability gaps found while diagnosing it.

No inference/backend code is touched: changes are confined to ds4_server.c,
ds4_kvstore.c, and ds4_help.c. CUDA and ROCm were not validated (no
backend-specific code changed); the CPU target was built as a compile check.

Commits

Cache efficiency (the agent-loop fix):

Capture sampled whitespace separator in raw DSML tool blocks — raw-DSML
replay was byte-exact only when the model sampled the canonical \n\n
separator; any other separator was silently dropped from both content and
raw block, so replayed history never byte-matched live KV. With this fix,
agent replays retokenize token-identically in practice and hit the exact
token-prefix path.
Remember a visible live checkpoint for chat/Anthropic tool-call turns —
mirrors the Responses API's predicted-visible-transcript mechanism for
chat/Anthropic tool turns (reusing the thinking_live machinery), so when
BPE spelling does drift, the next request still continues from live KV
instead of taking the disk round trip. Hits are labeled tool-visible.
The predicted text is prompt_text + build_tool_checkpoint_suffix(...),
whose byte-equality with the next rendered prompt is already covered by
test_tool_checkpoint_suffix_is_future_prompt_canonical.
Store continued KV snapshots on step thresholds, not exact multiples —
prefills resumed from unaligned cached positions never land on exact
multiples of the continued interval, so waypoint snapshots silently
stopped firing (observed 12-20K-token gaps). Threshold-crossing fires
from any resume position; the cadence restarts at the continuation point.
Give fresh KV snapshots an eviction grace over stale never-hit anchors —
under budget pressure the newest waypoints of the live conversation
(always hits=0, they were just written) were evicted first while stale
anchors survived on the unconditional 2.0x reason bonus. Observed in
production: the waypoint one position below a mid-history divergence was
evicted seconds before it would have been hit. Never-hit entries now get
a freshness grace (one hit decaying on the existing half-life) and the
anchor bonus scales with the same activity term.
Defer consumed KV snapshot deletion until the tail prefill succeeds —
consume-on-load unlinked the snapshot before the tail prefill ran, so a
transient sync failure (Metal OOM under memory pressure) or cancellation
lost both live state and snapshot, forcing a cold prefill.
Skip tool-map disk scan when all replay ids are already in RAM — every
tools request opened and parsed the trailer of every checkpoint file for
ids tool_memory already holds; now only missing ids trigger the walk
(normally only right after a restart).

Correctness:

Do not replay multi-invoke DSML blocks into partial messages — a client
that splits one turn's parallel tool calls across two assistant messages
(or drops one) got the full multi-invoke block attached to each message,
rendering duplicated ghost invokes into the prompt. Blocks now record
their invoke count and only attach to messages carrying exactly that many
calls. Regression test included.

Robustness / operations:

Cancel prefill and decode for disconnected clients — wires
ds4_session_set_cancel() (already used by ds4-agent) around all server
syncs, so prefill stops at the next chunk on shutdown or a dead socket
instead of burning up to ~10 minutes for a client that timed out and
retried; queued jobs from dead clients are skipped, and non-streaming
decode polls the socket. Interrupted syncs keep the valid session prefix
and do not discard the loaded disk snapshot.
Add /health, /stats, and a bounded request queue — GET /health and
GET /stats answer on the client thread (responsive mid-generation) with
uptime, queue depth, per-source cache hit counters, token totals, and
last prefill/decode t/s. Optional --max-queue N returns 429 when N jobs
already wait (default 0 keeps unbounded behavior). 429/503 get proper
HTTP reason strings.
Small HTTP and API fidelity fixes — answer Expect: 100-continue;
reject chunked transfer encoding with an explicit message instead of
parsing an empty body; TCP_NODELAY; Anthropic-shaped error envelopes on
/v1/messages; OpenAI error type follows status class (api_error for
5xx, rate_limit_error for 429); chat tool_choice "required"/forced now
rejected like the Responses endpoint already does; log once per request
when thinking mode overrides client sampling params.

Testing

Machine/backend: MacBook Pro M5 Max 128 GB, macOS (Metal). Model:
DeepSeek-V4-Flash-Layers37-42Q4KExperts-OtherExpertLayersIQ2XXSGateUp-Q2KDown-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-fixed.gguf
(q2-q4 imatrix mixed quant), iogpu.wired_limit_mb=118000.

Commands run:

make clean && make — zero warnings.
make clean && make cpu — compile check, builds (pre-existing ds4.c
warnings only, none in touched files).
git diff --check — clean.
make test — all suites pass (metal-tensor-equivalence worst_max_abs=0.139,
top1_mismatch=0; server OK). SSD-streaming and MTP-verify suites skipped by
the harness (require DS4_TEST_SSD_STREAMING / DS4_TEST_MTP).
./ds4_test --server — OK (includes new regression tests for the raw-DSML
separator, multi-invoke attach, continued-store thresholds, and eviction
freshness).
./ds4_test --tool-call-quality — OK (fast and exact paths).
./ds4_test --logprob-vectors — OK (long_memory_archive skipped by the
harness: API/official graph mismatch).
./ds4_agent_test, ./ds4-eval --self-test-extractors — OK.
End-to-end smoke against a live server (scratch KV dir, 128 MiB budget,
--max-queue 1): /health + /stats schema; OpenAI chat non-stream + SSE
(thinking on/off); Responses; Anthropic success and error envelope; OpenAI
error envelope; tool_choice=required 400; chunked rejection;
Expect: 100-continue; 429 on queue overflow; disconnect cancellation frees
the worker mid-decode (log shows finish=error error="client disconnected"
within seconds of the socket dying); repeat >512-token request reports
cached_tokens via disk second-hit; tiny budget triggers eviction and
store-after-evict keeps the newest entry; graceful shutdown persists KV.
18/18 checks pass.

Production validation (same machine, pi coding agent, 45K-90K token
sessions): agent turn overhead dropped from ~80-100 s (evict store + disk
restore + 12-25K-token re-prefill, live kv cache miss ... token-mismatch
every turn) to 2-3 s (exact-token or tool-visible live continuation);
decode 24.8 -> 25-27.9 t/s at 46K ctx; a server restart resumed a 87K-token
session from the shutdown snapshot prefetching only ~150 tokens.

Not validated: CUDA, ROCm, distributed (no backend or distributed code
touched; ds4_server.c/ds4_kvstore.c compile into those targets unchanged).
Speed: no inference-path changes; production decode/prefill numbers above
show no regression on Metal.

Prefills resumed from an unaligned cached position (disk hits, live text continuations) never land on exact multiples of the continued-store interval, so waypoint snapshots silently stopped firing and disk recovery could re-prefill 12-20K tokens instead of <=1 interval. Fire a continued store whenever a full interval has passed since the last store, and restart the cadence from the continuation point on a cache hit so the first chunk after a hit does not store a near duplicate of the snapshot just loaded.

Prefill previously always ran to completion: a streaming client that vanished was only detected on the next SSE write after sync returned, and a non-streaming client was never checked at all, so an agent timeout or retry could burn many minutes of GPU on a dead socket, with duplicate jobs queueing behind each other. Wire ds4_session_set_cancel() around every server-side sync so prefill stops at the next chunk boundary on shutdown, on a failed keepalive write, or on a closed client socket. Interrupted syncs keep the valid session prefix and do not discard the loaded disk snapshot. Also skip queued jobs whose client already disconnected and poll the socket during non-streaming decode.

Loading a text-prefix snapshot longer than cold_max_tokens unlinked the file immediately, before the tail prefill that extends it had run. A transient sync failure (Metal OOM under memory pressure) or a cancelled prefill then left neither live state nor the snapshot, turning the next attempt into a multi-minute cold prefill. Keep the file on disk until the request's sync completes and unlink it at that point; interrupted syncs already skip the failed-entry discard, so a retry can hit the same snapshot again.

The server exposed no machine-readable operational state: every metric (cache hit source, prefill/decode t/s, queue behavior) existed only as stderr log lines, and the job queue was unbounded and invisible, so an agent with a client-side timeout shorter than a turn could silently stack duplicate jobs that each run for minutes. Add GET /health and GET /stats, answered on the client thread so they respond even mid-generation. /stats reports uptime, queue depth, busy flag, live/ctx tokens, per-source cache hit counters, token totals, cancellation counters, and last prefill/decode t/s. Add --max-queue N to reject requests with 429 once N jobs wait (default 0 keeps the old unbounded behavior), log when a request queues behind others, and give 429/503 their proper HTTP reason strings.

kv_cache_restore_tool_memory_for_messages() opened and parsed the trailer of every checkpoint file in the cache directory on every tools request, even though tool_memory_remember() already holds the replayed DSML blocks in RAM at generation time. Filter the wanted ids against the in-memory map first and only walk the directory for ids that are actually missing, which normally only happens right after a restart.

tool_memory_attach_to_messages() attached a remembered DSML block when every id in the message resolved to the same block, but never checked that the message covers all of the block's invokes. A client that splits one turn's parallel tool calls across two assistant messages (or drops one call) got the full multi-invoke block attached to each message, rendering duplicate ghost invokes into the prompt. Count the invokes per remembered block and require the replaying message to carry exactly that many calls; partial messages fall back to canonical JSON rendering. Adds a regression test.

parse_generated_message_ex() anchored the raw block at the marker match, which includes the separator only when the model sampled the canonical "\n\n". Any other separator (a single newline, spaces) was trimmed from the content and excluded from raw_dsml, so those bytes existed in live KV but in no re-rendered prompt, silently breaking byte-exact tool replay and forcing a cache round trip. Anchor the raw block at the trimmed-content boundary instead, so content + raw block always reassemble the sampled bytes exactly. Adds a regression test.

Chat/completions and Anthropic tool-call turns had no live binding to the next request: the sampled tail (hidden reasoning, exact DSML spelling, BPE drift) never token-matches the client replay, and the byte-exact memory-text path rarely matches either, so every agent turn paid a full-state evict store (~650 MiB) plus a disk snapshot restore and tail re-prefill. The Responses API already solves this with a predicted visible transcript key. Reuse the thinking_live visible-key mechanism for tool turns: after finish=tool_calls, remember prompt_text plus the same canonical suffix that canonicalize_tool_checkpoint() builds (its byte-equality with the next rendered prompt is already covered by test_tool_checkpoint_suffix_is_future_prompt_canonical). The next request then continues from live KV, tokenizing only the new tool results. Hits are labeled tool-visible in logs and /stats.

- Answer Expect: 100-continue so curl-style clients do not stall a second before sending large POST bodies, and reject chunked transfer encoding explicitly instead of parsing an empty body and returning a misleading JSON error. - Set TCP_NODELAY on client sockets; per-token SSE writes are tiny and Nagle adds latency jitter off loopback. - Give /v1/messages clients Anthropic-shaped error envelopes ({"type":"error",...}) for parse, queue, continuation, and prefill failures, and classify OpenAI error types by status code (api_error for 5xx, rate_limit_error for 429). - Reject tool_choice "required" and forced function targets on chat completions like the Responses endpoint already does, instead of silently downgrading to auto. - Log once per request when thinking mode overrides client sampling parameters instead of ignoring them silently.

Eviction scored a never-hit entry as (0+1) * density from the moment it was written, while cold/evict/shutdown anchors kept an unconditional 2x reason bonus forever. Under budget pressure this evicted the newest waypoints of the live conversation first (they are always hits=0, they were just written) while stale anchors from long-dead conversations survived. Observed in production: the waypoint one position below a mid-history divergence was evicted seconds before it would have been hit, costing ~12K tokens of avoidable re-prefill. Score never-hit entries with a freshness grace equal to one hit that decays on the existing hit half-life, and scale the anchor bonus by the same activity term so a never-hit anchor older than the half-life competes on density alone. Adds a regression test; updates the fresh floor test to the new (1+1) * density value.

elkaix added 10 commits July 2, 2026 16:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Server: fix agent-loop cache misses, add cancellation, observability, and robustness fixes#489

Server: fix agent-loop cache misses, add cancellation, observability, and robustness fixes#489
elkaix wants to merge 10 commits into
antirez:mainfrom
elkaix:server-enhancements

elkaix commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

elkaix commented Jul 2, 2026

Commits

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant