Server: fix agent-loop cache misses, add cancellation, observability, and robustness fixes#489
Open
elkaix wants to merge 10 commits into
Open
Server: fix agent-loop cache misses, add cancellation, observability, and robustness fixes#489elkaix wants to merge 10 commits into
elkaix wants to merge 10 commits into
Conversation
Prefills resumed from an unaligned cached position (disk hits, live text continuations) never land on exact multiples of the continued-store interval, so waypoint snapshots silently stopped firing and disk recovery could re-prefill 12-20K tokens instead of <=1 interval. Fire a continued store whenever a full interval has passed since the last store, and restart the cadence from the continuation point on a cache hit so the first chunk after a hit does not store a near duplicate of the snapshot just loaded.
Prefill previously always ran to completion: a streaming client that vanished was only detected on the next SSE write after sync returned, and a non-streaming client was never checked at all, so an agent timeout or retry could burn many minutes of GPU on a dead socket, with duplicate jobs queueing behind each other. Wire ds4_session_set_cancel() around every server-side sync so prefill stops at the next chunk boundary on shutdown, on a failed keepalive write, or on a closed client socket. Interrupted syncs keep the valid session prefix and do not discard the loaded disk snapshot. Also skip queued jobs whose client already disconnected and poll the socket during non-streaming decode.
Loading a text-prefix snapshot longer than cold_max_tokens unlinked the file immediately, before the tail prefill that extends it had run. A transient sync failure (Metal OOM under memory pressure) or a cancelled prefill then left neither live state nor the snapshot, turning the next attempt into a multi-minute cold prefill. Keep the file on disk until the request's sync completes and unlink it at that point; interrupted syncs already skip the failed-entry discard, so a retry can hit the same snapshot again.
The server exposed no machine-readable operational state: every metric (cache hit source, prefill/decode t/s, queue behavior) existed only as stderr log lines, and the job queue was unbounded and invisible, so an agent with a client-side timeout shorter than a turn could silently stack duplicate jobs that each run for minutes. Add GET /health and GET /stats, answered on the client thread so they respond even mid-generation. /stats reports uptime, queue depth, busy flag, live/ctx tokens, per-source cache hit counters, token totals, cancellation counters, and last prefill/decode t/s. Add --max-queue N to reject requests with 429 once N jobs wait (default 0 keeps the old unbounded behavior), log when a request queues behind others, and give 429/503 their proper HTTP reason strings.
kv_cache_restore_tool_memory_for_messages() opened and parsed the trailer of every checkpoint file in the cache directory on every tools request, even though tool_memory_remember() already holds the replayed DSML blocks in RAM at generation time. Filter the wanted ids against the in-memory map first and only walk the directory for ids that are actually missing, which normally only happens right after a restart.
tool_memory_attach_to_messages() attached a remembered DSML block when every id in the message resolved to the same block, but never checked that the message covers all of the block's invokes. A client that splits one turn's parallel tool calls across two assistant messages (or drops one call) got the full multi-invoke block attached to each message, rendering duplicate ghost invokes into the prompt. Count the invokes per remembered block and require the replaying message to carry exactly that many calls; partial messages fall back to canonical JSON rendering. Adds a regression test.
parse_generated_message_ex() anchored the raw block at the marker match, which includes the separator only when the model sampled the canonical "\n\n". Any other separator (a single newline, spaces) was trimmed from the content and excluded from raw_dsml, so those bytes existed in live KV but in no re-rendered prompt, silently breaking byte-exact tool replay and forcing a cache round trip. Anchor the raw block at the trimmed-content boundary instead, so content + raw block always reassemble the sampled bytes exactly. Adds a regression test.
Chat/completions and Anthropic tool-call turns had no live binding to the next request: the sampled tail (hidden reasoning, exact DSML spelling, BPE drift) never token-matches the client replay, and the byte-exact memory-text path rarely matches either, so every agent turn paid a full-state evict store (~650 MiB) plus a disk snapshot restore and tail re-prefill. The Responses API already solves this with a predicted visible transcript key. Reuse the thinking_live visible-key mechanism for tool turns: after finish=tool_calls, remember prompt_text plus the same canonical suffix that canonicalize_tool_checkpoint() builds (its byte-equality with the next rendered prompt is already covered by test_tool_checkpoint_suffix_is_future_prompt_canonical). The next request then continues from live KV, tokenizing only the new tool results. Hits are labeled tool-visible in logs and /stats.
- Answer Expect: 100-continue so curl-style clients do not stall a
second before sending large POST bodies, and reject chunked
transfer encoding explicitly instead of parsing an empty body and
returning a misleading JSON error.
- Set TCP_NODELAY on client sockets; per-token SSE writes are tiny and
Nagle adds latency jitter off loopback.
- Give /v1/messages clients Anthropic-shaped error envelopes
({"type":"error",...}) for parse, queue, continuation, and
prefill failures, and classify OpenAI error types by status code
(api_error for 5xx, rate_limit_error for 429).
- Reject tool_choice "required" and forced function targets on chat
completions like the Responses endpoint already does, instead of
silently downgrading to auto.
- Log once per request when thinking mode overrides client sampling
parameters instead of ignoring them silently.
Eviction scored a never-hit entry as (0+1) * density from the moment it was written, while cold/evict/shutdown anchors kept an unconditional 2x reason bonus forever. Under budget pressure this evicted the newest waypoints of the live conversation first (they are always hits=0, they were just written) while stale anchors from long-dead conversations survived. Observed in production: the waypoint one position below a mid-history divergence was evicted seconds before it would have been hit, costing ~12K tokens of avoidable re-prefill. Score never-hit entries with a freshness grace equal to one hit that decays on the existing hit half-life, and scale the anchor bonus by the same activity term so a never-hit anchor older than the half-life competes on density alone. Adds a regression test; updates the fresh floor test to the new (1+1) * density value.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This series came out of running ds4-server as the daily backend for a
tool-calling coding agent (OpenAI chat completions dialect, 45K-90K token
conversations) on the project's reference M5 Max. In that workload every
agent turn was paying a full-state evict store (~650 MiB) plus a disk
snapshot restore and a 12-25K-token tail re-prefill (~80-100 s per turn),
because nothing bound a chat tool-call turn to the next request. The series
fixes that (turns now cost only their genuinely new tokens, 2-3 s), plus the
robustness and observability gaps found while diagnosing it.
No inference/backend code is touched: changes are confined to ds4_server.c,
ds4_kvstore.c, and ds4_help.c. CUDA and ROCm were not validated (no
backend-specific code changed); the CPU target was built as a compile check.
Commits
Cache efficiency (the agent-loop fix):
Capture sampled whitespace separator in raw DSML tool blocks— raw-DSMLreplay was byte-exact only when the model sampled the canonical
\n\nseparator; any other separator was silently dropped from both content and
raw block, so replayed history never byte-matched live KV. With this fix,
agent replays retokenize token-identically in practice and hit the exact
token-prefix path.
Remember a visible live checkpoint for chat/Anthropic tool-call turns—mirrors the Responses API's predicted-visible-transcript mechanism for
chat/Anthropic tool turns (reusing the thinking_live machinery), so when
BPE spelling does drift, the next request still continues from live KV
instead of taking the disk round trip. Hits are labeled
tool-visible.The predicted text is
prompt_text + build_tool_checkpoint_suffix(...),whose byte-equality with the next rendered prompt is already covered by
test_tool_checkpoint_suffix_is_future_prompt_canonical.Store continued KV snapshots on step thresholds, not exact multiples—prefills resumed from unaligned cached positions never land on exact
multiples of the continued interval, so waypoint snapshots silently
stopped firing (observed 12-20K-token gaps). Threshold-crossing fires
from any resume position; the cadence restarts at the continuation point.
Give fresh KV snapshots an eviction grace over stale never-hit anchors—under budget pressure the newest waypoints of the live conversation
(always hits=0, they were just written) were evicted first while stale
anchors survived on the unconditional 2.0x reason bonus. Observed in
production: the waypoint one position below a mid-history divergence was
evicted seconds before it would have been hit. Never-hit entries now get
a freshness grace (one hit decaying on the existing half-life) and the
anchor bonus scales with the same activity term.
Defer consumed KV snapshot deletion until the tail prefill succeeds—consume-on-load unlinked the snapshot before the tail prefill ran, so a
transient sync failure (Metal OOM under memory pressure) or cancellation
lost both live state and snapshot, forcing a cold prefill.
Skip tool-map disk scan when all replay ids are already in RAM— everytools request opened and parsed the trailer of every checkpoint file for
ids tool_memory already holds; now only missing ids trigger the walk
(normally only right after a restart).
Correctness:
Do not replay multi-invoke DSML blocks into partial messages— a clientthat splits one turn's parallel tool calls across two assistant messages
(or drops one) got the full multi-invoke block attached to each message,
rendering duplicated ghost invokes into the prompt. Blocks now record
their invoke count and only attach to messages carrying exactly that many
calls. Regression test included.
Robustness / operations:
Cancel prefill and decode for disconnected clients— wiresds4_session_set_cancel() (already used by ds4-agent) around all server
syncs, so prefill stops at the next chunk on shutdown or a dead socket
instead of burning up to ~10 minutes for a client that timed out and
retried; queued jobs from dead clients are skipped, and non-streaming
decode polls the socket. Interrupted syncs keep the valid session prefix
and do not discard the loaded disk snapshot.
Add /health, /stats, and a bounded request queue— GET /health andGET /stats answer on the client thread (responsive mid-generation) with
uptime, queue depth, per-source cache hit counters, token totals, and
last prefill/decode t/s. Optional --max-queue N returns 429 when N jobs
already wait (default 0 keeps unbounded behavior). 429/503 get proper
HTTP reason strings.
Small HTTP and API fidelity fixes— answerExpect: 100-continue;reject chunked transfer encoding with an explicit message instead of
parsing an empty body; TCP_NODELAY; Anthropic-shaped error envelopes on
/v1/messages; OpenAI error
typefollows status class (api_error for5xx, rate_limit_error for 429); chat tool_choice "required"/forced now
rejected like the Responses endpoint already does; log once per request
when thinking mode overrides client sampling params.
Testing
Machine/backend: MacBook Pro M5 Max 128 GB, macOS (Metal). Model:
DeepSeek-V4-Flash-Layers37-42Q4KExperts-OtherExpertLayersIQ2XXSGateUp-Q2KDown-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-fixed.gguf(q2-q4 imatrix mixed quant), iogpu.wired_limit_mb=118000.
Commands run:
make clean && make— zero warnings.make clean && make cpu— compile check, builds (pre-existing ds4.cwarnings only, none in touched files).
git diff --check— clean.make test— all suites pass (metal-tensor-equivalence worst_max_abs=0.139,top1_mismatch=0; server OK). SSD-streaming and MTP-verify suites skipped by
the harness (require DS4_TEST_SSD_STREAMING / DS4_TEST_MTP).
./ds4_test --server— OK (includes new regression tests for the raw-DSMLseparator, multi-invoke attach, continued-store thresholds, and eviction
freshness).
./ds4_test --tool-call-quality— OK (fast and exact paths)../ds4_test --logprob-vectors— OK (long_memory_archive skipped by theharness: API/official graph mismatch).
./ds4_agent_test,./ds4-eval --self-test-extractors— OK.--max-queue 1): /health + /stats schema; OpenAI chat non-stream + SSE
(thinking on/off); Responses; Anthropic success and error envelope; OpenAI
error envelope; tool_choice=required 400; chunked rejection;
Expect: 100-continue; 429 on queue overflow; disconnect cancellation frees
the worker mid-decode (log shows
finish=error error="client disconnected"within seconds of the socket dying); repeat >512-token request reports
cached_tokens via disk second-hit; tiny budget triggers eviction and
store-after-evict keeps the newest entry; graceful shutdown persists KV.
18/18 checks pass.
Production validation (same machine, pi coding agent, 45K-90K token
sessions): agent turn overhead dropped from ~80-100 s (evict store + disk
restore + 12-25K-token re-prefill,
live kv cache miss ... token-mismatchevery turn) to 2-3 s (exact-token or tool-visible live continuation);
decode 24.8 -> 25-27.9 t/s at 46K ctx; a server restart resumed a 87K-token
session from the shutdown snapshot prefetching only ~150 tokens.
Not validated: CUDA, ROCm, distributed (no backend or distributed code
touched; ds4_server.c/ds4_kvstore.c compile into those targets unchanged).
Speed: no inference-path changes; production decode/prefill numbers above
show no regression on Metal.