Skip to content

GLM-5.2 serving endpoint: surface prefix cache hits as Anthropic-shape usage fields (decouple from Anthropic) #50

Description

@RasputinKaiser

Problem

First-party GLM-5.2 sessions (glm-5.2[1m]/data/models/hf/zai-org__GLM-5.2-FP8, see src/utils/model/ncodeModels.ts:26-27) return zero cache_creation_input_tokens and zero cache_read_input_tokens in every response usage object, even on long stable-prefix sessions where the engine's prefix cache must be active.

Verified from a real project session JSONL on this machine:

"model": "/data/models/hf/zai-org__GLM-5.2-FP8"
"usage": {
  "input_tokens": 26744,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 0,
  "output_tokens": 49,
  ...
}

Across 87 assistant turns in one session, input_tokens totals land in the hundreds of millions because every turn re-bills the full system prompt + tool schemas as fresh input. The cache_creation_input_tokens and cache_read_input_tokens fields exist in every response — they are just never populated.

Root cause

NCode already emits cache_control: { type: "ephemeral" } breakpoints on cached prefixes:

  • src/services/api/claude.ts:630,642,675,690 — message-level cache_control markers
  • src/services/api/claude.ts:3194-3210cache_reference placement for tool-result blocks within the cached prefix
  • src/services/api/claude.ts:337-363getPromptCachingEnabled() defaults to true; only env kill switches (DISABLE_PROMPT_CACHING, DISABLE_PROMPT_CACHING_HAIKU, etc.) turn it off
  • src/services/api/claude.ts:365-381getCacheControl() produces the { type: "ephemeral", ttl?: "1h", scope?: "global" } shape

And the client already keys off the response fields:

  • src/services/api/promptCacheBreakDetection.ts:437-466checkResponseForCacheBreak(querySource, cacheReadTokens, cacheCreationTokens, …) compares previous-vs-current cache_read_input_tokens, fires ncode_prompt_cache_break analytics when the drop exceeds 5% / 2k tokens

So the marker plumbing and the consumption layer are both correct on the client. The zeros come from the GLM serving gateway, which accepts cache_control silently and returns the Anthropic-shape usage fields unfilled. There is no translation from the engine's real prefix-cache activity into the usage object.

Architectural direction — decouple mechanism from protocol

Anthropic's cache_control is a billing and breakpoint protocol layered on top of a mechanism that already exists in every major OSS inference engine:

  • vLLM: --enable-prefix-caching (RADIX-like KV reuse; not always default-on)
  • SGLang: RADIX attention (default-on)
  • TensorRT-LLM: iTex / KV-cache reuse
  • DeepSeek's serving stack: longest-prefix auto-cache

The mechanism is token-level and protocol-agnostic. It does not need client opt-in, breakpoint markers, or 1024-token minimums — those are Anthropic billing constraints. Reusing the mechanism and surfacing it through the existing wire-compat boundary is strictly easier and strictly more independent than reimplementing Anthropic's accounting from scratch.

What to keep vs drop from the Anthropic shape

Anthropic shape Keep? Why
Wire format of cache_control requests Accept silently (don't crash) Keeps every Anthropic-shape client working with zero changes
cache_read_input_tokens / cache_creation_input_tokens in usage Keep — populate from engine reality NCode's promptCacheBreakDetection.ts, /insights-context Token Economics, and any Anthropic-shape dashboard key off these. Honest population unlocks the existing observability layer for free.
4-breakpoint-per-request cap, 1024-token minimum Drop Engine-centric Anthropic constraints; irrelevant under auto-caching
ttl: "5m" / ttl: "1h" Replace Evict on the cadence the GLM deployment actually supports, not Anthropic's spec
anthropic_beta feature flags Drop (use native flags) One fewer protocol dependency
scope: "global" Drop Anthropic org/global cache routing — not meaningful for a single-tenant deployment

The wire request/response shape stays Anthropic-compatible so NCode needs no client-side change. The protocol underneath becomes Noumena-native, with the Anthropic fields as a translation of engine reality — not a mirror of what the client requested.

Implementation

There are three pieces. All three are gateway-side; the inference engine itself needs only --enable-prefix-caching confirmed on (or its equivalent for SGLang/TRT-LLM). No engine patches required.

1. Engine prefix caching — verify it is on

This is pure deployment config, not code. For a vLLM-backed GLM-5.2-FP8 deployment, the launcher line must include:

vllm serve /data/models/hf/zai-org__GLM-5.2-FP8 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-model-len 1048576

Without --enable-prefix-caching, every prefix recompute happens regardless of what the gateway reports. This is the single highest-leverage change and costs nothing.

For SGLang the equivalent is --enable-radix-cache (usually default-on). For TensorRT-LLM, the iTex plugin must be enabled in the build config.

Verification that prefix caching is actually doing something — vLLM exposes hit counts per request via the OpenAI-compat /metrics endpoint (Prometheus) and via request.metrics on the internal API:

# vllm/engine/llm_engine.py — RequestOutput.outputs[].metrics
# Look at: num_cached_tokens, num_computed_tokens
# num_cached_tokens > 0 means KV was reused for that part of the prompt.

2. Tokenizer-stable prefix key

This is the load-bearing detail. The prefix hash must be computed from the engine's tokenizer (GLM's BPE), not Claude's, because the same text breaks into different tokens across tokenizers and breaks the cache key. Most "engine thinks it hit cache, gateway reports miss" disagreements come from drift here.

# gateway/cache/prefix_key.py
import hashlib
from typing import List
from transformers import AutoTokenizer

# Load GLM's tokenizer from the same path the engine serves.
_TOKENIZER = AutoTokenizer.from_pretrained(
    "/data/models/hf/zai-org__GLM-5.2-FP8",
    trust_remote_code=True,
)

def prefix_key(messages: List[dict]) -> str:
    """
    Produce a stable cache key for the longest cached prefix of a request.

    The key is the SHA-256 of the token IDs for the system prompt + tools
    + the prefix of message history that the engine will reuse KV for.
    Token IDs (not text) avoid BPE-normalization drift between sessions.

    messages: Anthropic-shape [{'role': 'system'|'user'|'assistant', 'content': ...}]
    """
    # Concatenate text content the way the engine will, then tokenize ONCE.
    # Tokenize at the message-boundary granularity so partial-prefix reuse
    # still hashes the same way across requests.
    token_ids: List[int] = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # Anthropic-shape content blocks: extract text only
            content = "".join(
                block.get("text", "") for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            )
        ids = _TOKENIZER.encode(content, add_special_tokens=False)
        token_ids.extend(ids)
        # Separator token between messages — use a stable sentinel, not the
        # tokenizer's conversation template (that changes between model revs).
        token_ids.append(-1)

    return hashlib.sha256(bytes((t & 0xFFFFFFFF) for t in token_ids)).hexdigest()

3. Gateway middleware — translate engine metrics to Anthropic-shape usage

This is the piece that turns the existing wire-compat boundary into a translation, not a silent drop.

# gateway/cache/middleware.py
import time
from collections import OrderedDict
from typing import Optional

from .prefix_key import prefix_key

class CachedPrefix:
    """One resident prefix entry in the gateway-side cache ledger."""
    __slots__ = ("key", "token_count", "last_seen_ts", "expiration_ts")
    def __init__(self, key: str, token_count: int, ttl_seconds: int):
        self.key = key
        self.token_count = token_count
        self.last_seen_ts = time.time()
        self.expiration_ts = self.last_seen_ts + ttl_seconds

class PrefixCacheLedger:
    """
    Tracks which prefixes the engine has cached KV for, so we can populate
    cache_read_input_tokens / cache_creation_input_tokens honestly.

    TTL is the deployment's choice — not Anthropic's 5m/1h. Default 30 minutes
    fits a typical GLM-5.2 coding session; tune to available KV memory.
    """
    def __init__(self, ttl_seconds: int = 1800, max_entries: int = 2048):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._entries: "OrderedDict[str, CachedPrefix]" = OrderedDict()

    def _evict_expired(self) -> None:
        now = time.time()
        expired = [k for k, v in self._entries.items() if v.expiration_ts <= now]
        for k in expired:
            self._entries.pop(k, None)

    def lookup(self, key: str) -> Optional[CachedPrefix]:
        self._evict_expired()
        entry = self._entries.get(key)
        if entry is None:
            return None
        # LRU bump
        self._entries.move_to_end(key)
        entry.last_seen_ts = time.time()
        entry.expiration_ts = entry.last_seen_ts + self._ttl
        return entry

    def record_write(self, key: str, token_count: int) -> None:
        self._evict_expired()
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = CachedPrefix(key, token_count, self._ttl)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)


# Singleton — one ledger per worker process. For multi-process gateways,
# move to Redis with the same key shape.
LEDGER = PrefixCacheLedger()


async def translate_usage(request: dict, engine_response) -> dict:
    """
    Translate the engine's real prefix-cache activity into the Anthropic-shape
    usage object. Called after the engine returns its response.

    request:        the parsed Anthropic-shape Messages request
    engine_response: the vLLM/SGLang Response object (has .metrics or .outputs)
    """
    # Pull the prompt-token count and the cached-prefix count from the engine.
    # vLLM exposes this on RequestOutput; SGLang on its sampling output metrics.
    metrics = getattr(engine_response, "metrics", None)
    num_prompt_tokens = getattr(metrics, "num_prompt_tokens", 0)
    # num_cached_tokens = tokens served from KV cache, not recomputed.
    num_cached_tokens = getattr(metrics, "num_cached_tokens", 0)

    if num_prompt_tokens <= 0 or num_cached_tokens <= 0:
        # Either the metrics shape isn't wired up, or there was genuinely
        # nothing cached for this request. Fall back to a ledger lookup so
        # we still report honestly if the engine forgets to tell us.
        key = prefix_key(request.get("messages", []))
        ledger_entry = LEDGER.lookup(key)
        if ledger_entry is not None:
            num_prompt_tokens = max(num_prompt_tokens, ledger_entry.token_count)
            num_cached_tokens = ledger_entry.token_count
    else:
        # Engine told us the truth. Record it for the next request that
        # sends the same prefix — handles engines that report hit/miss
        # asymmetrically across requests.
        key = prefix_key(request.get("messages", []))
        LEDGER.record_write(key, num_cached_tokens)

    # Anthropic-shape usage contract. cache_creation_input_tokens is only
    # set on a deliberate write (new prefix not previously seen). Auto-caching
    # engines do not distinguish "wrote this turn" from "served from cache",
    # so we set it to 0 in the steady state and only populate it the first
    # time a prefix appears — flagged by the ledger miss.
    is_new_prefix = num_cached_tokens < num_prompt_tokens and num_cached_tokens == 0
    cache_creation = num_prompt_tokens if is_new_prefix else 0
    cache_read = num_cached_tokens

    # Noumena-native billing rates go here — independent of Anthropic pricing.
    # The shape is the only thing inherited; the numbers are ours.

    return {
        "input_tokens": num_prompt_tokens - cache_read,    # billed at base rate
        "output_tokens": getattr(metrics, "num_output_tokens", 0),
        "cache_read_input_tokens": cache_read,             # billed at cache-hit rate
        "cache_creation_input_tokens": cache_creation,     # billed at write rate
    }

4. Hooking it into the request path

The translation runs after the engine returns, before the gateway serializes the response. In an ASGI gateway (Starlette/FastAPI shape that most OpenAI/Anthropic-compat proxies use):

# gateway/routes/messages.py
from fastapi import APIRouter, Request
from .cache.middleware import translate_usage

router = APIRouter()

@router.post("/v1/messages")
async def create_message(request: Request):
    body = await request.json()
    engine_response = await call_glm_engine(body)
    usage = await translate_usage(body, engine_response)
    # Merge usage into the Anthropic-shape response body.
    engine_response_dict = engine_response.to_dict()
    engine_response_dict["usage"] = {
        **engine_response_dict.get("usage", {}),
        **usage,
    }
    return engine_response_dict

Client-side evidence the above is sufficient

Once the gateway populates cache_read_input_tokens and cache_creation_input_tokens honestly, no NCode change is required to see the benefit:

  1. Cache-break detector (src/services/api/promptCacheBreakDetection.ts:437-466) — already consumes cacheReadTokens / cacheCreationTokens and fires ncode_prompt_cache_break analytics on legitimate drops. Will start producing real signal instead of always-zero inputs.
  2. /insights-context Token Economics — the recently-updated scanner (~/.ncode/commands/insights-context-scripts/scan.py:139-147) already sums these fields; the current zero output is purely downstream of this missing translation.
  3. getCacheControl markers (src/services/api/claude.ts:365-381) — NCode stops sending these usefully under auto-caching, but they remain harmless to accept, so wire-compat is preserved while the engine runs the show.

Acceptance

  • cache_read_input_tokens > 0 on the second+ turn of a glm-5.2[1m] session with a stable system prompt + tool schemas.
  • cache_creation_input_tokens > 0 only on the first turn with a new prefix; subsequent turns with the same prefix show cache_read_input_tokens equal to the prompt length.
  • /insights-context Token Economics card shows a non-zero cache-read tier on real sessions, with no client-side code change.
  • Wall-clock latency per turn drops measurably on stable-prefix sessions (engine KV reuse).
  • promptCacheBreakDetection.ts legitimately fires or stays silent based on real engine cache state, not always-silent-because-zero.
  • cache_control requests continue to be accepted without error (back-compat with all Anthropic-shape clients, not just NCode).

Non-goals

  • Reimplementing Anthropic's breakpoint accounting or billing tiers verbatim.
  • Routing NCode to Anthropic's API. The goal is operational independence from Anthropic, with the Anthropic wire shape as a thin translation boundary only.
  • Changing NCode's request shape on the client side.
  • Adding client-side feature flags or Settings toggles. The client is correct as-shipped; the work is server-side.

Open questions for the serving team

  1. Which inference engine backs /data/models/hf/zai-org__GLM-5.2-FP8 (vLLM / SGLang / TRT-LLM / custom)? The metric field names in translate_usage need to match.
  2. Is prefix caching currently enabled in the deployment? If not, that alone is the highest-leverage change.
  3. What KV-cache budget is available? Affects the ledger TTL — too aggressive and prefixes churn mid-session; too generous and they thrash.
  4. Is the gateway single-process or multi-process? Single-process ledger works for the former; multi-process needs Redis or shared memory for the same key shape.

Authored by GLM 5.2 [1m] via NCode

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions