Problem
First-party GLM-5.2 sessions (glm-5.2[1m] → /data/models/hf/zai-org__GLM-5.2-FP8, see src/utils/model/ncodeModels.ts:26-27) return zero cache_creation_input_tokens and zero cache_read_input_tokens in every response usage object, even on long stable-prefix sessions where the engine's prefix cache must be active.
Verified from a real project session JSONL on this machine:
    "model": "/data/models/hf/zai-org__GLM-5.2-FP8"
    "usage": {
      "input_tokens": 26744,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 0,
      "output_tokens": 49,
      ...
    }
Across 87 assistant turns in one session, input_tokens totals land in the hundreds of millions because every turn re-bills the full system prompt + tool schemas as fresh input. The cache_creation_input_tokens and cache_read_input_tokens fields exist in every response — they are just never populated.
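One way to reproduce these totals is to sum the usage fields across the session file. The sketch below assumes each JSONL line is a JSON object that carries the usage object either at the top level or under a "message" key; that layout and the name sum_usage are assumptions for illustration, not the documented NCode session schema.

```python
import json
from collections import Counter
from typing import Iterable

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_usage(lines: Iterable[str]) -> Counter:
    """Sum Anthropic-shape usage fields over JSONL lines (assumed layout)."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        if not isinstance(record, dict):
            continue
        # Assumption: usage sits at the top level or under "message".
        message = record.get("message")
        usage = record.get("usage")
        if usage is None and isinstance(message, dict):
            usage = message.get("usage")
        if not isinstance(usage, dict):
            continue
        totals["turns"] += 1
        for field in USAGE_FIELDS:
            value = usage.get(field, 0)
            if isinstance(value, int):
                totals[field] += value
    return totals
```

Calling sum_usage(open(path)) on a session file gives per-field totals; a healthy stable-prefix session should show cache_read_input_tokens well above zero.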
Root cause
NCode already emits cache_control: { type: "ephemeral" } breakpoints on cached prefixes:
- src/services/api/claude.ts:630,642,675,690: message-level cache_control markers
- src/services/api/claude.ts:3194-3210: cache_reference placement for tool-result blocks within the cached prefix
- src/services/api/claude.ts:337-363: getPromptCachingEnabled() defaults to true; only env kill switches (DISABLE_PROMPT_CACHING, DISABLE_PROMPT_CACHING_HAIKU, etc.) turn it off
- src/services/api/claude.ts:365-381: getCacheControl() produces the { type: "ephemeral", ttl?: "1h", scope?: "global" } shape
And the client already keys off the response fields:
- src/services/api/promptCacheBreakDetection.ts:437-466: checkResponseForCacheBreak(querySource, cacheReadTokens, cacheCreationTokens, …) compares previous-vs-current cache_read_input_tokens and fires ncode_prompt_cache_break analytics when the drop exceeds 5% / 2k tokens
So the marker plumbing and the consumption layer are both correct on the client. The zeros come from the GLM serving gateway, which accepts cache_control silently and returns the Anthropic-shape usage fields unfilled. There is no translation from the engine's real prefix-cache activity into the usage object.
Architectural direction — decouple mechanism from protocol
Anthropic's cache_control is a billing and breakpoint protocol layered on top of a mechanism that already exists in every major OSS inference engine:
- vLLM: --enable-prefix-caching (automatic prefix caching, hash-based block-level KV reuse; not default-on in every version)
- SGLang: RadixAttention (default-on)
- TensorRT-LLM: KV-cache reuse
- DeepSeek's serving stack: longest-prefix auto-cache
The mechanism is token-level and protocol-agnostic. It does not need client opt-in, breakpoint markers, or 1024-token minimums — those are Anthropic billing constraints. Reusing the mechanism and surfacing it through the existing wire-compat boundary is strictly easier and strictly more independent than reimplementing Anthropic's accounting from scratch.
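To make "token-level and protocol-agnostic" concrete, here is a minimal sketch of the hash-chain idea behind block-level prefix caching. It is an illustration only, not any engine's actual code: BLOCK_SIZE, block_hashes, and serve are hypothetical names, and real engines use larger blocks and store KV tensors rather than a set of hashes.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # illustrative; real engines use larger blocks


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def serve(token_ids: List[int], cache: Set[str], block_size: int = BLOCK_SIZE) -> int:
    """Return how many prompt tokens were reused, then cache this prompt's blocks."""
    hashes = block_hashes(token_ids, block_size)
    reused_blocks = 0
    for h in hashes:
        if h not in cache:
            break  # reuse stops at the first block that was never seen
        reused_blocks += 1
    cache.update(hashes)
    return reused_blocks * block_size
```

Nothing in this path looks at cache_control, breakpoints, or minimum lengths: a second request that shares leading tokens with an earlier one gets reuse automatically.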
What to keep vs drop from the Anthropic shape
| Anthropic shape | Keep? | Why |
| --- | --- | --- |
| Wire format of cache_control requests | Accept silently (don't crash) | Keeps every Anthropic-shape client working with zero changes |
| cache_read_input_tokens / cache_creation_input_tokens in usage | Keep, populated from engine reality | NCode's promptCacheBreakDetection.ts, /insights-context Token Economics, and any Anthropic-shape dashboard key off these. Honest population unlocks the existing observability layer for free. |
| 4-breakpoint-per-request cap, 1024-token minimum | Drop | Anthropic billing constraints; irrelevant under auto-caching |
| ttl: "5m" / ttl: "1h" | Replace | Evict on the cadence the GLM deployment actually supports, not Anthropic's spec |
| anthropic_beta feature flags | Drop (use native flags) | One fewer protocol dependency |
| scope: "global" | Drop | Anthropic org/global cache routing; not meaningful for a single-tenant deployment |
The wire request/response shape stays Anthropic-compatible so NCode needs no client-side change. The protocol underneath becomes Noumena-native, with the Anthropic fields as a translation of engine reality — not a mirror of what the client requested.
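"Accept silently" can be as small as removing the markers before the request reaches the engine. The sketch below assumes cache_control appears on system blocks, tool definitions, message content blocks, and content nested inside tool_result blocks; strip_cache_control is a hypothetical helper name, not an existing gateway function.

```python
import copy
from typing import Any


def _drop_markers(blocks: Any) -> None:
    """Remove cache_control from a list of Anthropic-shape blocks, in place."""
    if not isinstance(blocks, list):
        return  # plain-string content carries no markers
    for block in blocks:
        if isinstance(block, dict):
            block.pop("cache_control", None)
            # tool_result blocks can nest further content blocks
            if block.get("type") == "tool_result":
                _drop_markers(block.get("content"))


def strip_cache_control(request: dict) -> dict:
    """Return a copy of the request with cache_control markers removed."""
    cleaned = copy.deepcopy(request)
    _drop_markers(cleaned.get("system"))
    _drop_markers(cleaned.get("tools"))
    for message in cleaned.get("messages", []):
        if isinstance(message, dict):
            _drop_markers(message.get("content"))
    return cleaned
```

The helper only touches block-level keys, so a tool whose input schema happens to define a parameter named cache_control is left alone.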
Implementation
There are four pieces. The first is engine deployment config; the other three are gateway-side. The inference engine itself needs only --enable-prefix-caching confirmed on (or its equivalent for SGLang/TRT-LLM). No engine patches required.
1. Engine prefix caching — verify it is on
This is pure deployment config, not code. For a vLLM-backed GLM-5.2-FP8 deployment, the launcher line must include:
    vllm serve /data/models/hf/zai-org__GLM-5.2-FP8 \
      --enable-prefix-caching \
      --kv-cache-dtype fp8 \
      --max-model-len 1048576
Without prefix caching enabled, the engine recomputes the full prefix on every turn, regardless of what the gateway reports. This is the single highest-leverage change and costs nothing.
For SGLang the radix cache is on by default; check that the launcher does not pass --disable-radix-cache. For TensorRT-LLM, KV-cache reuse must be enabled in the build and runtime config; confirm the exact option names against the deployed release.
To verify that prefix caching is actually doing something: vLLM exposes aggregate prefix-cache counters on the Prometheus /metrics endpoint of its OpenAI-compatible server, and a per-request cached-token count on the request output of the internal API. Field names and locations vary by version, so check the deployed release:
    # vLLM internal API: RequestOutput (exact location of these fields varies by version)
    # Look at: num_cached_tokens, num_computed_tokens
    # num_cached_tokens > 0 means KV was reused for that part of the prompt.
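For the Prometheus side, a small parser over the /metrics text is enough to see whether the counters move between turns. This is a sketch under assumptions: the metric names passed to hit_rate in the usage note are placeholders that differ between vLLM versions, and both function names are hypothetical.

```python
from typing import Dict, Optional


def prefix_cache_counters(metrics_text: str, needle: str = "prefix_cache") -> Dict[str, float]:
    """Sum every sample whose metric name contains `needle`, keyed by metric name."""
    totals: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # Sample lines look like: name{labels} value [timestamp]
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            parts = line.split()
            name, fields = parts[0], parts[1:]
        if needle not in name or not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


def hit_rate(totals: Dict[str, float], hits_name: str, queries_name: str) -> Optional[float]:
    """Return hits / queries, or None when no queries were recorded."""
    queries = totals.get(queries_name, 0.0)
    if queries <= 0:
        return None
    return totals.get(hits_name, 0.0) / queries
```

Scrape the endpoint before and after a multi-turn session, call prefix_cache_counters on each scrape, and pass the deployment's actual hit and query counter names to hit_rate. A rate that stays at zero on a stable-prefix session means the engine is not reusing KV.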
2. Tokenizer-stable prefix key
This is the load-bearing detail. The prefix hash must be computed with the engine's tokenizer (GLM's BPE), not Claude's, because the same text splits into different tokens under different tokenizers, so a key built with the wrong tokenizer never matches what the engine cached. Most "engine thinks it hit cache, gateway reports miss" disagreements come from drift here.
    # gateway/cache/prefix_key.py
    import hashlib
    from typing import List

    from transformers import AutoTokenizer

    # Load GLM's tokenizer from the same path the engine serves.
    _TOKENIZER = AutoTokenizer.from_pretrained(
        "/data/models/hf/zai-org__GLM-5.2-FP8",
        trust_remote_code=True,
    )

    # Sentinel appended after each message. It is only hashed, never fed to
    # the model, so it just has to differ from every real token ID.
    _MESSAGE_SEPARATOR = -1


    def prefix_key(messages: List[dict]) -> str:
        """
        Produce a stable cache key for a list of messages.

        The key is the SHA-256 of the token IDs of every message passed in.
        To key a cached prefix, pass only the messages that make up that
        prefix, not the newest turn; the full list hashes differently on
        every turn.
        Token IDs (not text) avoid BPE-normalization drift between sessions.

        messages: Anthropic-shape [{'role': 'system'|'user'|'assistant', 'content': ...}]
        """
        # Tokenize at the message-boundary granularity so partial-prefix reuse
        # still hashes the same way across requests.
        token_ids: List[int] = []
        for msg in messages:
            content = msg.get("content", "")
            if isinstance(content, list):
                # Anthropic-shape content blocks: extract text only
                content = "".join(
                    block.get("text", "") for block in content
                    if isinstance(block, dict) and block.get("type") == "text"
                )
            ids = _TOKENIZER.encode(content, add_special_tokens=False)
            token_ids.extend(ids)
            # Separator between messages: use a stable sentinel, not the
            # tokenizer's conversation template (that changes between model revs).
            token_ids.append(_MESSAGE_SEPARATOR)
        # bytes() only accepts values 0-255, so pack each ID into 4 bytes.
        packed = b"".join((t & 0xFFFFFFFF).to_bytes(4, "little") for t in token_ids)
        return hashlib.sha256(packed).hexdigest()
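One consequence of hashing at message boundaries is that a gateway can derive one key per boundary and look for the longest boundary it has seen before, which is what lets a longer follow-up request match an earlier, shorter one. The sketch below illustrates that idea with a stand-in tokenizer (encode_stub, one integer per UTF-8 byte) so it runs without model files; in the gateway, GLM's tokenizer would take its place. boundary_keys and longest_known_prefix are hypothetical names.

```python
import hashlib
from typing import Callable, List, Optional, Set, Tuple


def encode_stub(text: str) -> List[int]:
    """Stand-in tokenizer for illustration only: one 'token' per UTF-8 byte."""
    return list(text.encode("utf-8"))


def boundary_keys(
    messages: List[dict],
    encode: Callable[[str], List[int]] = encode_stub,
) -> List[Tuple[str, int]]:
    """Return (key, token_count) for the prefix ending at each message boundary."""
    hasher = hashlib.sha256()
    count = 0
    keys: List[Tuple[str, int]] = []
    for msg in messages:
        content = msg.get("content", "")
        if not isinstance(content, str):
            content = "".join(
                block.get("text", "") for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            )
        ids = encode(content)
        for token_id in ids:
            hasher.update(token_id.to_bytes(4, "little"))
        hasher.update(b"\xff\xff\xff\xff")  # message separator sentinel
        count += len(ids)
        keys.append((hasher.hexdigest(), count))  # running digest up to this boundary
    return keys


def longest_known_prefix(
    keys: List[Tuple[str, int]], known: Set[str]
) -> Optional[Tuple[str, int]]:
    """Return the longest boundary whose key was seen before, if any."""
    for key, count in reversed(keys):
        if key in known:
            return key, count
    return None
```

On each request the gateway would add every boundary key to its known set after the engine responds, so the next turn finds the previous turn's final boundary as its longest known prefix.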
3. Gateway middleware — translate engine metrics to Anthropic-shape usage
This is the piece that turns the existing wire-compat boundary into a translation, not a silent drop.
    # gateway/cache/middleware.py
    import time
    from collections import OrderedDict
    from typing import Optional

    from .prefix_key import prefix_key


    class CachedPrefix:
        """One resident prefix entry in the gateway-side cache ledger."""

        __slots__ = ("key", "token_count", "last_seen_ts", "expiration_ts")

        def __init__(self, key: str, token_count: int, ttl_seconds: int):
            self.key = key
            self.token_count = token_count
            self.last_seen_ts = time.time()
            self.expiration_ts = self.last_seen_ts + ttl_seconds


    class PrefixCacheLedger:
        """
        Tracks which prefixes the engine has cached KV for, so we can populate
        cache_read_input_tokens / cache_creation_input_tokens honestly.

        TTL is the deployment's choice, not Anthropic's 5m/1h. Default 30 minutes
        fits a typical GLM-5.2 coding session; tune to available KV memory.
        """

        def __init__(self, ttl_seconds: int = 1800, max_entries: int = 2048):
            self._ttl = ttl_seconds
            self._max = max_entries
            self._entries: "OrderedDict[str, CachedPrefix]" = OrderedDict()

        def _evict_expired(self) -> None:
            now = time.time()
            expired = [k for k, v in self._entries.items() if v.expiration_ts <= now]
            for k in expired:
                self._entries.pop(k, None)

        def lookup(self, key: str) -> Optional[CachedPrefix]:
            self._evict_expired()
            entry = self._entries.get(key)
            if entry is None:
                return None
            # LRU bump
            self._entries.move_to_end(key)
            entry.last_seen_ts = time.time()
            entry.expiration_ts = entry.last_seen_ts + self._ttl
            return entry

        def record_write(self, key: str, token_count: int) -> None:
            self._evict_expired()
            if key in self._entries:
                self._entries.move_to_end(key)
            self._entries[key] = CachedPrefix(key, token_count, self._ttl)
            # Drop least-recently-used entries once over capacity.
            while len(self._entries) > self._max:
                self._entries.popitem(last=False)


    # Singleton: one ledger per worker process. For multi-process gateways,
    # move to Redis with the same key shape.
    LEDGER = PrefixCacheLedger()


    async def translate_usage(request: dict, engine_response) -> dict:
        """
        Translate the engine's real prefix-cache activity into the Anthropic-shape
        usage object. Called after the engine returns its response.

        request: the parsed Anthropic-shape Messages request
        engine_response: the vLLM/SGLang response object (has .metrics or .outputs)
        """
        # Pull the prompt-token count and the cached-prefix count from the engine.
        # The attribute names here are placeholders and must be matched to what
        # the deployed engine actually exposes.
        metrics = getattr(engine_response, "metrics", None)
        num_prompt_tokens = getattr(metrics, "num_prompt_tokens", 0)
        # num_cached_tokens = tokens served from KV cache, not recomputed.
        num_cached_tokens = getattr(metrics, "num_cached_tokens", 0)

        key = prefix_key(request.get("messages", []))
        is_new_prefix = False
        if num_prompt_tokens <= 0 or num_cached_tokens <= 0:
            # Either the metrics shape isn't wired up, or there was genuinely
            # nothing cached for this request. Fall back to a ledger lookup so
            # we still report honestly if the engine forgets to tell us.
            ledger_entry = LEDGER.lookup(key)
            if ledger_entry is not None:
                num_prompt_tokens = max(num_prompt_tokens, ledger_entry.token_count)
                num_cached_tokens = ledger_entry.token_count
            elif num_prompt_tokens > 0:
                # Ledger miss: first sighting of this prefix. Record it so the
                # next request that sends the same prefix is a ledger hit.
                is_new_prefix = True
                LEDGER.record_write(key, num_prompt_tokens)
        else:
            # Engine told us the truth. Record it for the next request that
            # sends the same prefix; this handles engines that report hit/miss
            # asymmetrically across requests.
            LEDGER.record_write(key, num_cached_tokens)

        # Anthropic-shape usage contract. Auto-caching engines do not distinguish
        # "wrote this turn" from "served from cache", so cache_creation_input_tokens
        # is 0 in the steady state and only populated the first time a prefix
        # appears, flagged by the ledger miss above.
        cache_read = max(num_cached_tokens, 0)
        cache_creation = num_prompt_tokens if is_new_prefix else 0

        # Noumena-native billing rates go here, independent of Anthropic pricing.
        # The shape is the only thing inherited; the numbers are ours.
        # The three input counters are disjoint: together they sum to the prompt
        # length, so cached and newly written tokens are not billed twice.
        return {
            "input_tokens": max(num_prompt_tokens - cache_read - cache_creation, 0),  # billed at base rate
            "output_tokens": getattr(metrics, "num_output_tokens", 0),
            "cache_read_input_tokens": cache_read,  # billed at cache-hit rate
            "cache_creation_input_tokens": cache_creation,  # billed at write rate
        }
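The accounting rule itself can be checked in isolation from the ledger and the engine. In Anthropic's usage shape the three input counters are disjoint, so a translation should keep input_tokens + cache_read_input_tokens + cache_creation_input_tokens equal to the prompt length. The helper below is a hypothetical sketch of that invariant (split_usage and its first_sighting flag are illustrative names), not part of the gateway.

```python
from typing import Dict


def split_usage(num_prompt_tokens: int, num_cached_tokens: int, first_sighting: bool) -> Dict[str, int]:
    """Split a prompt into disjoint base / cache-read / cache-write token counts."""
    prompt = max(num_prompt_tokens, 0)
    # Cached tokens can never exceed the prompt itself.
    cache_read = min(max(num_cached_tokens, 0), prompt)
    uncached = prompt - cache_read
    # On the first sighting of a prefix, the uncached part is what gets written.
    cache_creation = uncached if first_sighting else 0
    return {
        "input_tokens": uncached - cache_creation,
        "cache_read_input_tokens": cache_read,
        "cache_creation_input_tokens": cache_creation,
    }
```

With the sample turn above, a first sighting of 26744 prompt tokens reports all of them as cache creation, and a later turn that reuses most of the prefix reports only the new tail as input_tokens.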
4. Hooking it into the request path
The translation runs after the engine returns, before the gateway serializes the response. In an ASGI gateway (Starlette/FastAPI shape that most OpenAI/Anthropic-compat proxies use):
    # gateway/routes/messages.py
    from fastapi import APIRouter, Request

    # Two dots: this file lives in gateway/routes/, the middleware in gateway/cache/.
    from ..cache.middleware import translate_usage

    router = APIRouter()


    @router.post("/v1/messages")
    async def create_message(request: Request):
        body = await request.json()
        # call_glm_engine is assumed to be the gateway's engine client,
        # defined elsewhere.
        engine_response = await call_glm_engine(body)
        usage = await translate_usage(body, engine_response)
        # Merge usage into the Anthropic-shape response body.
        engine_response_dict = engine_response.to_dict()
        engine_response_dict["usage"] = {
            **engine_response_dict.get("usage", {}),
            **usage,
        }
        return engine_response_dict
Client-side evidence the above is sufficient
Once the gateway populates cache_read_input_tokens and cache_creation_input_tokens honestly, no NCode change is required to see the benefit:
- Cache-break detector (src/services/api/promptCacheBreakDetection.ts:437-466): already consumes cacheReadTokens / cacheCreationTokens and fires ncode_prompt_cache_break analytics on legitimate drops. It will start producing real signal instead of always-zero inputs.
- /insights-context Token Economics: the recently-updated scanner (~/.ncode/commands/insights-context-scripts/scan.py:139-147) already sums these fields; the current zero output is purely downstream of this missing translation.
- getCacheControl markers (src/services/api/claude.ts:365-381): NCode keeps sending these; under auto-caching they no longer drive cache behavior, but they remain harmless to accept, so wire-compat is preserved while the engine runs the show.
Acceptance
- cache_read_input_tokens > 0 on the second+ turn of a glm-5.2[1m] session with a stable system prompt + tool schemas.
- cache_creation_input_tokens > 0 only on the first turn with a new prefix; subsequent turns with the same prefix show cache_read_input_tokens close to the length of the shared prefix (tokens appended since the previous turn are not cached yet).
- /insights-context Token Economics card shows a non-zero cache-read tier on real sessions, with no client-side code change.
- Wall-clock latency per turn drops measurably on stable-prefix sessions (engine KV reuse).
- promptCacheBreakDetection.ts legitimately fires or stays silent based on real engine cache state, not always-silent-because-zero.
- cache_control requests continue to be accepted without error (back-compat with all Anthropic-shape clients, not just NCode).
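The usage-field criteria can be checked mechanically against a recorded session. The sketch below assumes the per-turn usage objects have already been pulled out of the session log in order and that the session keeps a stable prefix throughout; check_acceptance is a hypothetical helper name.

```python
from typing import Dict, List


def check_acceptance(turn_usages: List[Dict[str, int]]) -> List[str]:
    """Return human-readable failures; an empty list means the session passes."""
    if not turn_usages:
        return ["no turns recorded"]
    failures: List[str] = []
    # First turn: the new prefix should be reported as a cache write.
    if turn_usages[0].get("cache_creation_input_tokens", 0) <= 0:
        failures.append("turn 1: expected cache_creation_input_tokens > 0")
    # Later turns: the stable prefix should be read from cache, not rewritten.
    for turn, usage in enumerate(turn_usages[1:], start=2):
        if usage.get("cache_read_input_tokens", 0) <= 0:
            failures.append(f"turn {turn}: expected cache_read_input_tokens > 0")
        if usage.get("cache_creation_input_tokens", 0) != 0:
            failures.append(f"turn {turn}: expected cache_creation_input_tokens == 0")
    return failures
```

Run against today's all-zero sessions, every turn should be reported as a failure; once the gateway translation is in place the list should come back empty.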
Non-goals
- Reimplementing Anthropic's breakpoint accounting or billing tiers verbatim.
- Routing NCode to Anthropic's API. The goal is operational independence from Anthropic, with the Anthropic wire shape as a thin translation boundary only.
- Changing NCode's request shape on the client side.
- Adding client-side feature flags or Settings toggles. The client is correct as-shipped; the work is server-side.
Open questions for the serving team
- Which inference engine backs /data/models/hf/zai-org__GLM-5.2-FP8 (vLLM / SGLang / TRT-LLM / custom)? The metric field names in translate_usage need to match.
- Is prefix caching currently enabled in the deployment? If not, that alone is the highest-leverage change.
- What KV-cache budget is available? This affects the ledger TTL: too short and prefixes expire from the ledger mid-session; too long and the ledger keeps reporting hits for KV the engine has already evicted.
- Is the gateway single-process or multi-process? Single-process ledger works for the former; multi-process needs Redis or shared memory for the same key shape.
Authored by GLM 5.2 [1m] via NCode