GLM-5.2 serving endpoint: surface prefix cache hits as Anthropic-shape usage fields (decouple from Anthropic)

## Problem

First-party GLM-5.2 sessions (`glm-5.2[1m]` → `/data/models/hf/zai-org__GLM-5.2-FP8`, see `src/utils/model/ncodeModels.ts:26-27`) return zero `cache_creation_input_tokens` and zero `cache_read_input_tokens` in every response `usage` object, even on long stable-prefix sessions where the engine's prefix cache must be active.

Verified from a real project session JSONL on this machine:

```
"model": "/data/models/hf/zai-org__GLM-5.2-FP8"
"usage": {
  "input_tokens": 26744,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 0,
  "output_tokens": 49,
  ...
}
```

Across 87 assistant turns in one session, `input_tokens` totals land in the hundreds of millions because every turn re-bills the full system prompt + tool schemas as fresh input. The `cache_creation_input_tokens` and `cache_read_input_tokens` fields exist in every response — they are just never populated.
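
The per-session totals can be reproduced from the session JSONL with a short script. This is a minimal sketch: the excerpt above only shows that each assistant turn carries a `usage` object, so the record nesting is an assumption (the helper searches for the first `usage` dict at any depth), and `find_usage` / `sum_usage` are hypothetical names, not existing NCode functions.

```python
import json
from collections import Counter
from typing import Any, Iterable, Optional

USAGE_FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def find_usage(record: Any) -> Optional[dict]:
    """Return the first dict stored under a 'usage' key anywhere in record."""
    if isinstance(record, dict):
        usage = record.get("usage")
        if isinstance(usage, dict):
            return usage
        children = list(record.values())
    elif isinstance(record, list):
        children = record
    else:
        return None
    for child in children:
        found = find_usage(child)
        if found is not None:
            return found
    return None


def sum_usage(lines: Iterable[str]) -> Counter:
    """Sum the usage counters over every JSONL line that carries a usage object."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        usage = find_usage(record)
        if usage is None:
            continue
        totals["turns"] += 1
        for field in USAGE_FIELDS:
            value = usage.get(field, 0)
            if isinstance(value, int):
                totals[field] += value
    return totals
```

Calling `sum_usage(open(path, encoding="utf-8"))` on a session file gives the turn count next to the four counters, which makes the always-zero cache fields easy to spot.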

## Root cause

NCode already emits `cache_control: { type: "ephemeral" }` breakpoints on cached prefixes:

- `src/services/api/claude.ts:630,642,675,690` — message-level `cache_control` markers
- `src/services/api/claude.ts:3194-3210` — `cache_reference` placement for tool-result blocks within the cached prefix
- `src/services/api/claude.ts:337-363` — `getPromptCachingEnabled()` defaults to `true`; only env kill switches (`DISABLE_PROMPT_CACHING`, `DISABLE_PROMPT_CACHING_HAIKU`, etc.) turn it off
- `src/services/api/claude.ts:365-381` — `getCacheControl()` produces the `{ type: "ephemeral", ttl?: "1h", scope?: "global" }` shape

And the client already keys off the response fields:

- `src/services/api/promptCacheBreakDetection.ts:437-466` — `checkResponseForCacheBreak(querySource, cacheReadTokens, cacheCreationTokens, …)` compares previous-vs-current `cache_read_input_tokens`, fires `ncode_prompt_cache_break` analytics when the drop exceeds 5% / 2k tokens

So the marker plumbing and the consumption layer are both correct on the client. The zeros come from the GLM serving gateway, which accepts `cache_control` silently and returns the Anthropic-shape usage fields unfilled. There is no translation from the engine's real prefix-cache activity into the usage object.

## Architectural direction — decouple mechanism from protocol

Anthropic's `cache_control` is a **billing and breakpoint protocol** layered on top of a **mechanism** that already exists in every major OSS inference engine:

- vLLM: automatic prefix caching (`--enable-prefix-caching`; hash-based KV block reuse; whether it is on by default depends on the engine version)
- SGLang: RadixAttention (radix-tree KV reuse, default-on)
- TensorRT-LLM: KV-cache reuse (block reuse)
- DeepSeek's serving stack: longest-prefix auto-cache

The mechanism is token-level and protocol-agnostic. It does not need client opt-in, breakpoint markers, or 1024-token minimums — those are Anthropic billing constraints. Reusing the mechanism and surfacing it through the existing wire-compat boundary is strictly easier and strictly more independent than reimplementing Anthropic's accounting from scratch.
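
To make "token-level and protocol-agnostic" concrete, here is a toy model of block-granular prefix reuse. It is a sketch in the spirit of hash-based block caching, not the implementation of any engine listed above: the block size, the chained hashing scheme, and the `ToyPrefixCache` class are all illustrative assumptions.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # deliberately tiny for illustration; real engines use larger blocks


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block, chained with the hash of the block before it.

    Chaining means a block hash identifies the entire prefix up to that block,
    not just the tokens inside the block.
    """
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Remembers which prefix blocks are resident; no client markers involved."""

    def __init__(self) -> None:
        self._resident: Set[str] = set()

    def serve(self, token_ids: List[int]) -> int:
        """Return how many leading tokens were reused, then cache this prompt."""
        hashes = block_hashes(token_ids)
        reused_blocks = 0
        for digest in hashes:
            if digest not in self._resident:
                break
            reused_blocks += 1
        self._resident.update(hashes)
        return reused_blocks * BLOCK_SIZE
```

The request carries no breakpoint, TTL, or minimum length: reuse falls out of the token sequence alone, and a single changed token at the front invalidates everything after it.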

### What to keep vs drop from the Anthropic shape

| Anthropic shape | Keep? | Why |
|---|---|---|
| Wire format of `cache_control` requests | Accept silently (don't crash) | Keeps every Anthropic-shape client working with zero changes |
| `cache_read_input_tokens` / `cache_creation_input_tokens` in `usage` | **Keep** — populate from engine reality | NCode's `promptCacheBreakDetection.ts`, `/insights-context` Token Economics, and any Anthropic-shape dashboard key off these. Honest population unlocks the existing observability layer for free. |
| 4-breakpoint-per-request cap, 1024-token minimum | Drop | Constraints of Anthropic's own serving and billing; irrelevant under auto-caching |
| `ttl: "5m"` / `ttl: "1h"` | Replace | Evict on the cadence the GLM deployment actually supports, not Anthropic's spec |
| `anthropic_beta` feature flags | Drop (use native flags) | One fewer protocol dependency |
| `scope: "global"` | Drop | Anthropic org/global cache routing — not meaningful for a single-tenant deployment |

The wire request/response shape stays Anthropic-compatible so NCode needs no client-side change. The protocol underneath becomes Noumena-native, with the Anthropic fields as a translation of engine reality — not a mirror of what the client requested.
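
One way to "accept silently" is to strip the markers before the request reaches the engine adapter, so nothing downstream has to understand them. A minimal sketch follows; `strip_cache_control` is a hypothetical helper name, and removing the key at every depth is an assumption about where clients place it.

```python
import copy
from typing import Any


def strip_cache_control(node: Any) -> Any:
    """Return a deep copy of node with every 'cache_control' key removed.

    Works on the whole Anthropic-shape request body: markers can sit on
    system blocks, message content blocks, and tool definitions.
    """
    if isinstance(node, dict):
        return {
            key: strip_cache_control(value)
            for key, value in node.items()
            if key != "cache_control"
        }
    if isinstance(node, list):
        return [strip_cache_control(item) for item in node]
    return copy.deepcopy(node)
```

The original request is left untouched, so the gateway can still log what the client asked for. One caveat of stripping at every depth: a tool input schema that defines a property literally named `cache_control` would lose it, so a stricter version would only visit the known block positions.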

## Implementation

There are four pieces. The first is engine deployment config; the other three are gateway-side. The inference engine itself needs only prefix caching confirmed on (`--enable-prefix-caching` for vLLM, or the equivalent for SGLang/TRT-LLM). No engine patches required.

### 1. Engine prefix caching — verify it is on

This is pure deployment config, not code. For a vLLM-backed GLM-5.2-FP8 deployment, the launcher line must include:

```bash
vllm serve /data/models/hf/zai-org__GLM-5.2-FP8 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-model-len 1048576
```

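If prefix caching is off, the engine recomputes the full prefix on every turn, regardless of what the gateway reports. This is the single highest-leverage change and costs nothing. Recent vLLM releases may already enable it by default; passing the flag explicitly is harmless and makes the intent visible.

For SGLang, the radix cache is on by default and is turned off with `--disable-radix-cache`, so the check is that this flag is absent from the launcher line. For TensorRT-LLM, KV-cache reuse (block reuse) must be enabled in the deployment's KV-cache configuration; the exact switch depends on the TensorRT-LLM version.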

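To verify that prefix caching is actually doing something: vLLM exposes aggregate prefix-cache counters on the server's Prometheus `/metrics` endpoint, and a per-request cached-token count on the `RequestOutput` returned by the engine API. Exact metric and attribute names vary between vLLM versions, so check them against the deployed release.

```python
# Per-request check against vLLM's engine API. Attribute names vary by version;
# recent releases expose num_cached_tokens directly on RequestOutput.
def cached_prompt_tokens(request_output) -> int:
    """Prompt tokens served from reused KV; > 0 means part of the prompt hit the cache."""
    return getattr(request_output, "num_cached_tokens", None) or 0
```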
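
For the aggregate view, the `/metrics` text can be reduced to a hit rate with a few lines. This is a sketch under assumptions: it handles plain Prometheus text-exposition sample lines without trailing timestamps, and the metric names passed to `hit_rate` are placeholders that must be replaced with whatever counters the deployed engine version actually publishes.

```python
from typing import Dict


def sum_samples(exposition_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # Sample lines look like: name{label="x"} 12.0   or   name 12.0
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return totals


def hit_rate(totals: Dict[str, float], hits_name: str, queries_name: str) -> float:
    """Fraction of queried tokens served from cache; 0.0 when nothing was queried."""
    queries = totals.get(queries_name, 0.0)
    return totals.get(hits_name, 0.0) / queries if queries else 0.0
```

Scraping twice and comparing the two results shows whether the hit counter moves during a stable-prefix session, which is the signal that matters here.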

### 2. Tokenizer-stable prefix key

This is the load-bearing detail. The prefix hash must be computed with the engine's tokenizer (GLM's), not Claude's: the same text splits into different tokens under different tokenizers, so a key built with the wrong tokenizer never matches what the engine cached. Most "engine thinks it hit cache, gateway reports miss" disagreements come from drift here.

```python
# gateway/cache/prefix_key.py
import hashlib
import struct
from typing import List

from transformers import AutoTokenizer

# Load GLM's tokenizer from the same path the engine serves.
_TOKENIZER = AutoTokenizer.from_pretrained(
    "/data/models/hf/zai-org__GLM-5.2-FP8",
    trust_remote_code=True,
)

# Message-boundary marker. 0xFFFFFFFF is outside any real vocabulary, so it
# cannot collide with a token ID, and it does not depend on the tokenizer's
# conversation template (that changes between model revs).
_SEPARATOR = 0xFFFFFFFF


def prefix_key(messages: List[dict]) -> str:
    """
    Produce a stable cache key for the message list of a request.

    The key is the SHA-256 of the token IDs of every message's text content,
    with a separator after each message. Token IDs (not text) avoid
    normalization drift between sessions.

    Limitations: only text blocks in `messages` are hashed. The top-level
    `system` and `tools` fields of the request are not included, and the key
    covers the whole list, so it changes whenever a message is appended.

    messages: Anthropic-shape [{'role': 'user'|'assistant', 'content': ...}]
    """
    # Tokenize message by message so boundaries hash the same way across requests.
    token_ids: List[int] = []
    for msg in messages:
        content = msg.get("content") or ""
        if isinstance(content, list):
            # Anthropic-shape content blocks: extract text only
            content = "".join(
                block.get("text", "") for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            )
        token_ids.extend(_TOKENIZER.encode(content, add_special_tokens=False))
        token_ids.append(_SEPARATOR)

    # Pack each ID as an unsigned 32-bit little-endian integer. Passing the raw
    # IDs to bytes() would raise ValueError for any value above 255.
    packed = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(packed).hexdigest()
```
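
A single hash over the whole message list changes every time a turn is appended, so it can only match an identical request. One possible refinement is to emit one key per message boundary, so turn N+1 can be matched against the prefix recorded at turn N. The sketch below is an assumption-laden illustration, not part of the proposal above: `boundary_keys` and `longest_known_prefix` are hypothetical names, messages are reduced to plain strings, and the tokenizer is injected as an `encode` callable so the example runs without the GLM tokenizer.

```python
import hashlib
import struct
from typing import Callable, Dict, List, Optional


def boundary_keys(messages: List[str], encode: Callable[[str], List[int]]) -> List[str]:
    """Return one key per message; key i covers messages[0..i] inclusive.

    A rolling SHA-256 makes every key depend on all earlier messages. Each
    message is length-prefixed so boundaries cannot shift without changing
    the hash.
    """
    hasher = hashlib.sha256()
    keys: List[str] = []
    for text in messages:
        ids = encode(text)
        hasher.update(struct.pack("<I", len(ids)))         # boundary marker
        hasher.update(struct.pack(f"<{len(ids)}I", *ids))  # token IDs as uint32
        keys.append(hasher.copy().hexdigest())
    return keys


def longest_known_prefix(keys: List[str], known: Dict[str, int]) -> Optional[int]:
    """Index of the longest prefix whose key was recorded earlier, else None."""
    for index in range(len(keys) - 1, -1, -1):
        if keys[index] in known:
            return index
    return None
```

With this shape, the gateway would record the last key of each request and look up the keys of the next request from longest to shortest. The cost is one extra digest per message, which is small next to tokenization itself.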

### 3. Gateway middleware — translate engine metrics to Anthropic-shape usage

This is the piece that turns the existing wire-compat boundary into a translation, not a silent drop.

```python
# gateway/cache/middleware.py
import time
from collections import OrderedDict
from typing import Optional

from .prefix_key import prefix_key

class CachedPrefix:
    """One resident prefix entry in the gateway-side cache ledger."""
    __slots__ = ("key", "token_count", "last_seen_ts", "expiration_ts")
    def __init__(self, key: str, token_count: int, ttl_seconds: int):
        self.key = key
        self.token_count = token_count
        self.last_seen_ts = time.time()
        self.expiration_ts = self.last_seen_ts + ttl_seconds

class PrefixCacheLedger:
    """
    Tracks which prefixes the engine has cached KV for, so we can populate
    cache_read_input_tokens / cache_creation_input_tokens honestly.

    TTL is the deployment's choice — not Anthropic's 5m/1h. Default 30 minutes
    fits a typical GLM-5.2 coding session; tune to available KV memory.
    """
    def __init__(self, ttl_seconds: int = 1800, max_entries: int = 2048):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._entries: "OrderedDict[str, CachedPrefix]" = OrderedDict()

    def _evict_expired(self) -> None:
        now = time.time()
        expired = [k for k, v in self._entries.items() if v.expiration_ts <= now]
        for k in expired:
            self._entries.pop(k, None)

    def lookup(self, key: str) -> Optional[CachedPrefix]:
        self._evict_expired()
        entry = self._entries.get(key)
        if entry is None:
            return None
        # LRU bump
        self._entries.move_to_end(key)
        entry.last_seen_ts = time.time()
        entry.expiration_ts = entry.last_seen_ts + self._ttl
        return entry

    def record_write(self, key: str, token_count: int) -> None:
        self._evict_expired()
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = CachedPrefix(key, token_count, self._ttl)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)


# Singleton — one ledger per worker process. For multi-process gateways,
# move to Redis with the same key shape.
LEDGER = PrefixCacheLedger()


async def translate_usage(request: dict, engine_response) -> dict:
    """
    Translate the engine's real prefix-cache activity into the Anthropic-shape
    usage object. Called after the engine returns its response.

    request:        the parsed Anthropic-shape Messages request
    engine_response: the vLLM/SGLang Response object (has .metrics or .outputs)
    """
    # Pull the prompt-token count and the cached-prefix count from the engine.
    # The attribute names below are placeholders: each engine (and version)
    # exposes these under different names, so adapt them to the deployed engine.
    metrics = getattr(engine_response, "metrics", None)
    num_prompt_tokens = getattr(metrics, "num_prompt_tokens", 0) or 0
    # num_cached_tokens = tokens served from KV cache, not recomputed.
    num_cached_tokens = getattr(metrics, "num_cached_tokens", 0) or 0

    if num_prompt_tokens <= 0 or num_cached_tokens <= 0:
        # Either the metrics shape isn't wired up, or there was genuinely
        # nothing cached for this request. Fall back to a ledger lookup so
        # we still report honestly if the engine forgets to tell us.
        key = prefix_key(request.get("messages", []))
        ledger_entry = LEDGER.lookup(key)
        if ledger_entry is not None:
            num_prompt_tokens = max(num_prompt_tokens, ledger_entry.token_count)
            num_cached_tokens = ledger_entry.token_count
    else:
        # Engine told us the truth. Record it for the next request that
        # sends the same prefix — handles engines that report hit/miss
        # asymmetrically across requests.
        key = prefix_key(request.get("messages", []))
        LEDGER.record_write(key, num_cached_tokens)

    # Anthropic-shape usage contract: input_tokens, cache_read_input_tokens and
    # cache_creation_input_tokens are disjoint, and together they sum to the
    # prompt length. Auto-caching engines do not distinguish "wrote this turn"
    # from "served from cache", so cache_creation_input_tokens is 0 in the
    # steady state and is populated only when nothing was served from cache,
    # i.e. the first time a prefix appears.
    cache_read = min(num_cached_tokens, num_prompt_tokens)
    is_new_prefix = num_prompt_tokens > 0 and cache_read == 0
    cache_creation = num_prompt_tokens if is_new_prefix else 0

    # Noumena-native billing rates go here, independent of Anthropic pricing.
    # The shape is the only thing inherited; the numbers are ours.

    return {
        # Subtract both cache counters so no prompt token is counted twice.
        "input_tokens": num_prompt_tokens - cache_read - cache_creation,  # billed at base rate
        "output_tokens": getattr(metrics, "num_output_tokens", 0) or 0,
        "cache_read_input_tokens": cache_read,             # billed at cache-hit rate
        "cache_creation_input_tokens": cache_creation,     # billed at write rate
    }
```

### 4. Hooking it into the request path

The translation runs after the engine returns, before the gateway serializes the response. In an ASGI gateway (Starlette/FastAPI shape that most OpenAI/Anthropic-compat proxies use):

```python
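# gateway/routes/messages.py
from fastapi import APIRouter, Request

# Two dots: this module lives in gateway/routes/, the cache package in gateway/cache/.
from ..cache.middleware import translate_usage
# call_glm_engine below stands for the gateway's existing engine client (not shown).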

router = APIRouter()

@router.post("/v1/messages")
async def create_message(request: Request):
    body = await request.json()
    engine_response = await call_glm_engine(body)
    usage = await translate_usage(body, engine_response)
    # Merge usage into the Anthropic-shape response body.
    engine_response_dict = engine_response.to_dict()
    engine_response_dict["usage"] = {
        **engine_response_dict.get("usage", {}),
        **usage,
    }
    return engine_response_dict
```

## Client-side evidence the above is sufficient

Once the gateway populates `cache_read_input_tokens` and `cache_creation_input_tokens` honestly, no NCode change is required to see the benefit:

1. **Cache-break detector** (`src/services/api/promptCacheBreakDetection.ts:437-466`) — already consumes `cacheReadTokens` / `cacheCreationTokens` and fires `ncode_prompt_cache_break` analytics on legitimate drops. Will start producing real signal instead of always-zero inputs.
2. **`/insights-context` Token Economics** — the recently-updated scanner (`~/.ncode/commands/insights-context-scripts/scan.py:139-147`) already sums these fields; the current zero output is purely downstream of this missing translation.
3. **`getCacheControl` markers** (`src/services/api/claude.ts:365-381`) — NCode keeps sending these, but under auto-caching they no longer decide what gets cached. They remain harmless to accept, so wire-compat is preserved while the engine runs the show.

## Acceptance

- `cache_read_input_tokens` > 0 on the second+ turn of a `glm-5.2[1m]` session with a stable system prompt + tool schemas.
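- `cache_creation_input_tokens` > 0 only on the first turn with a new prefix; subsequent turns with the same prefix show `cache_read_input_tokens` close to the length of the shared prefix (engines cache in whole blocks, so it can fall slightly short), not the full prompt length.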
- `/insights-context` Token Economics card shows a non-zero cache-read tier on real sessions, with no client-side code change.
- Time to first token per turn drops measurably on stable-prefix sessions, since engine KV reuse skips prefill for the cached prefix.
- `promptCacheBreakDetection.ts` legitimately fires or stays silent based on real engine cache state, not always-silent-because-zero.
- `cache_control` requests continue to be accepted without error (back-compat with all Anthropic-shape clients, not just NCode).
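
The usage-field criteria above can be checked mechanically against the per-turn `usage` objects of a test session. The following is a minimal sketch, assuming the session keeps one stable prefix throughout and that the list holds one usage dict per assistant turn in order; `check_session` is a hypothetical helper, not an existing NCode or gateway function.

```python
from typing import Dict, List


def check_session(usages: List[Dict[str, int]]) -> List[str]:
    """Return a list of acceptance violations; an empty list means the session passes."""
    if len(usages) < 2:
        return ["need at least two turns to observe a cache read"]
    problems: List[str] = []
    # First turn: the prefix is new, so it should be reported as a cache write.
    if usages[0].get("cache_creation_input_tokens", 0) <= 0:
        problems.append("turn 1: cache_creation_input_tokens is not positive")
    # Later turns: the stable prefix should be read from cache, never rewritten.
    for turn, usage in enumerate(usages[1:], start=2):
        if usage.get("cache_read_input_tokens", 0) <= 0:
            problems.append(f"turn {turn}: cache_read_input_tokens is not positive")
        if usage.get("cache_creation_input_tokens", 0) > 0:
            problems.append(f"turn {turn}: unexpected cache_creation_input_tokens")
    return problems
```

Running it on a session captured before the gateway change should report the all-zero cache fields as violations, which gives a simple before/after comparison.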

## Non-goals

- Reimplementing Anthropic's breakpoint accounting or billing tiers verbatim.
- Routing NCode to Anthropic's API. The goal is operational independence from Anthropic, with the Anthropic wire shape as a thin translation boundary only.
- Changing NCode's request shape on the client side.
- Adding client-side feature flags or `Settings` toggles. The client is correct as-shipped; the work is server-side.

## Open questions for the serving team

1. Which inference engine backs `/data/models/hf/zai-org__GLM-5.2-FP8` (vLLM / SGLang / TRT-LLM / custom)? The metric field names in `translate_usage` need to match.
2. Is prefix caching currently enabled in the deployment? If not, that alone is the highest-leverage change.
3. What KV-cache budget is available? Affects the ledger TTL: too short and prefixes expire from the ledger mid-session; too long and the ledger keeps reporting hits for prefixes the engine has already evicted.
4. Is the gateway single-process or multi-process? Single-process ledger works for the former; multi-process needs Redis or shared memory for the same key shape.

Authored by GLM 5.2 [1m] via NCode
