cuda: reserve space for quantize kv-cache at startup by am17an · Pull Request #23907 · ggml-org/llama.cpp

am17an · 2026-05-30T10:25:34Z

Overview

Ref #23646 (comment). A quantized kv-cache can lead to OOM even when using --fit, since --fit does not know about these backend allocations. There are some other quantization buffers in FA and MMQ which should also be removed, but this one seems to take the most space, as it scales with ctx size.
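To illustrate why such a buffer scales with ctx size, here is a minimal sketch. It assumes the backend allocation in question is an FP16 scratch copy of a quantized K or V tensor; the struct and function names (`kv_tensor_desc`, `f16_scratch_bytes`, `reserve_bytes`) are hypothetical and are not taken from this PR or from ggml.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one layer's K (or V) cache tensor.
struct kv_tensor_desc {
    int64_t n_ctx;      // number of cached token positions
    int64_t n_embd_kv;  // elements per position (e.g. head_dim * n_kv_heads)
    bool    quantized;  // true if stored in a quantized type
};

// Bytes of FP16 scratch space needed to hold a dequantized copy of the tensor.
// Assumption: an FP16 tensor needs no scratch copy at all.
static size_t f16_scratch_bytes(const kv_tensor_desc & t) {
    assert(t.n_ctx >= 0 && t.n_embd_kv >= 0);
    if (!t.quantized) {
        return 0;
    }
    return (size_t) t.n_ctx * (size_t) t.n_embd_kv * 2; // 2 bytes per FP16 element
}

// Amount to reserve up front so the scratch copies of K and V do not show up
// later as an allocation the memory fitting step never accounted for.
static size_t reserve_bytes(const kv_tensor_desc & k, const kv_tensor_desc & v) {
    return f16_scratch_bytes(k) + f16_scratch_bytes(v);
}
```

Under these assumptions, a 32768-token context with 1024 elements per position needs 64 MiB per tensor, so 128 MiB for K and V together, and the figure doubles whenever the context doubles.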

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, codex wrote this on my direction. I tested it on a few devices

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

ggerganov · 2026-05-31T16:05:03Z

Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

JohannesGaessler · 2026-05-31T16:09:22Z

I was going to say that -sm tensor is implicitly being tested via test-llama-archs but there only the FP16/FP16 configuration is being tested. More generally: do we already have automated tests for different KV cache types? If not it may make sense to add some in test-llama-archs.

am17an · 2026-05-31T17:01:19Z

I have not checked yet. We can wait for #23792 to get merged and check again, because currently it will throw.

AbdulrahmanHashem · 2026-05-31T17:23:29Z

Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

I have, after also merging #23792, and as far as I can see there is no more memory creep on my system (5060 Ti + 2060 Super).

In all of the following cases there is a 125 MB allocation during the first prompt processing:

without kv quant
without kv quant + MTP
without kv quant + ngram-mod
with kv quant
with kv quant + MTP
with kv quant + ngram-mod

With ngram-mod (with or without kv quant), there is an additional, variable allocation of under 150 MB during tg.

On a separate issue with ngram-mod (with or without kv quant): I tried ngram-mod + MTP, and it crashes with no error during tg after thinking is done. It lags the llama UI very badly and tg just stops, with no logs before the crash.

coder543 · 2026-05-31T17:36:03Z

A couple of other things that I believe the fit algorithm is not reserving space for: cache-ram and ctx-checkpoints. On unified memory systems like the DGX Spark, this makes it hard to rely on the fit algorithm without specifying an arbitrarily large fit-target.
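The concern above can be sketched as a simple budget check. This is only an illustration under stated assumptions: `reservation`, `required_bytes` and `fits` are hypothetical names, the margin parameter merely plays the role the comment ascribes to fit-target, and none of this is the actual fit algorithm in llama.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one runtime allocation the fit step should know about,
// e.g. KV scratch buffers, cache-ram, or ctx-checkpoints.
struct reservation {
    const char * name;
    size_t       bytes;
};

// Total memory needed: model weights, every known reservation, and a safety margin.
static size_t required_bytes(size_t model_bytes,
                             const std::vector<reservation> & extra,
                             size_t margin_bytes) {
    size_t need = model_bytes + margin_bytes;
    for (const reservation & r : extra) {
        need += r.bytes;
    }
    return need;
}

// True if everything fits into the memory that is currently free.
static bool fits(size_t free_bytes,
                 size_t model_bytes,
                 const std::vector<reservation> & extra,
                 size_t margin_bytes) {
    return required_bytes(model_bytes, extra, margin_bytes) <= free_bytes;
}
```

Any allocation missing from `extra` has to be covered by the margin instead, which is why an unaccounted buffer forces the user to pick a larger margin by hand.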

cuda: reserve space for quantize kv-cache at startup

a4273ef

am17an requested a review from a team as a code owner May 30, 2026 10:25

github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels May 30, 2026

JohannesGaessler reviewed May 30, 2026

Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated

Comment thread ggml/src/ggml-cuda/fattn.cu

Comment thread ggml/src/ggml-cuda/fattn.cu Outdated

address review comments

9f584d3

JohannesGaessler approved these changes May 31, 2026

Comment thread ggml/src/ggml-cuda/common.cuh Outdated

remove forward decl

32e6898

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

JohannesGaessler reviewed May 31, 2026

Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated

remove assert in ggml-cuda.cu

b324987

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

JohannesGaessler approved these changes May 31, 2026

am17an wants to merge 4 commits into ggml-org:master from am17an:fattn-static-kv-cache

5 participants
