cuda: reserve space for quantize kv-cache at startup #23907

Open
am17an wants to merge 4 commits into ggml-org:master from am17an:fattn-static-kv-cache

Conversation

@am17an
Contributor

@am17an am17an commented May 30, 2026

Overview

ref #23646 (comment). A quantized kv-cache can lead to OOM even when using --fit, since --fit does not know about these backend allocations. There are some other quantization buffers in FA and MMQ which should also be removed, but this one seems to take the most space, as it scales with ctx size.
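
To make the scaling concrete, here is a rough back-of-the-envelope sketch; it is not the code in this PR. It assumes the FA path converts the whole quantized K and V tensors of a layer into FP16 scratch buffers, and the names `kv_shape`, `q8_0_bytes` and `fp16_scratch_bytes` are hypothetical, as are the model dimensions used in the usage note. The q8_0 layout (blocks of 32 int8 values plus one FP16 scale, 34 bytes per block) is the standard ggml one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one layer's K (or V) cache; not a ggml type.
struct kv_shape {
    int64_t n_ctx;     // context size in tokens
    int64_t n_head_kv; // number of KV heads
    int64_t head_dim;  // elements per head
};

// Bytes needed to store the tensor as q8_0:
// every block of 32 elements takes 34 bytes (32 x int8 + one FP16 scale).
static size_t q8_0_bytes(const kv_shape & s) {
    const int64_t n_elem = s.n_ctx * s.n_head_kv * s.head_dim;
    assert(n_elem % 32 == 0); // q8_0 only stores whole blocks
    return (size_t) (n_elem / 32) * 34;
}

// Bytes for an FP16 scratch copy of the same tensor.
// Assumption: the whole tensor is converted at once, 2 bytes per element.
static size_t fp16_scratch_bytes(const kv_shape & s) {
    const int64_t n_elem = s.n_ctx * s.n_head_kv * s.head_dim;
    return (size_t) n_elem * 2;
}
```

With made-up dimensions of 8 KV heads and a head size of 128, a context of 8192 tokens gives `q8_0_bytes` = 8912896 bytes and `fp16_scratch_bytes` = 16777216 bytes (16 MiB) for K alone, and the scratch size doubles each time the context doubles. Under these assumptions the scratch copy is larger than the cache it is derived from, which is why an allocation that --fit cannot see can tip a tightly fitted setup into OOM.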

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, codex wrote this on my direction. I tested it on a few devices

@am17an am17an requested a review from a team as a code owner May 30, 2026 10:25
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 30, 2026
Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated
Comment thread ggml/src/ggml-cuda/fattn.cu
Comment thread ggml/src/ggml-cuda/fattn.cu Outdated
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@ggerganov
Member

Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

@JohannesGaessler
Contributor

I was going to say that -sm tensor is implicitly being tested via test-llama-archs, but only the FP16/FP16 configuration is tested there. More generally: do we already have automated tests for different KV cache types? If not, it may make sense to add some in test-llama-archs.
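
As a minimal sketch of what such coverage could iterate over (this is not how test-llama-archs is actually structured, and `kv_type_pairs` is a hypothetical helper), one option is to enumerate every combination of K and V cache type and run the existing checks once per pair:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build every (type_k, type_v) combination from a list
// of cache type names, so that a test can loop over all of them.
static std::vector<std::pair<std::string, std::string>> kv_type_pairs(
        const std::vector<std::string> & types) {
    std::vector<std::pair<std::string, std::string>> pairs;
    pairs.reserve(types.size() * types.size());
    for (const auto & type_k : types) {
        for (const auto & type_v : types) {
            pairs.emplace_back(type_k, type_v);
        }
    }
    return pairs;
}
```

Calling `kv_type_pairs({"f16", "q8_0", "q4_0"})` yields nine pairs, starting with the FP16/FP16 configuration that is already covered. The full cross product grows quadratically, so a real test would probably restrict it to the combinations the backends actually support.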

@am17an
Contributor Author

am17an commented May 31, 2026

I have not checked yet. We can wait for #23792 to get merged and check again, because currently it will throw.

@AbdulrahmanHashem

> Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

I have, after also merging #23792, and as far as I can see there is no more memory creep on my system (5060 Ti + 2060 Super).

In all of the following cases there is a 125 MB allocation during the first prompt processing:

without kv quant
without kv quant + MTP
without kv quant + ngram-mod
with kv quant
with kv quant + MTP
with kv quant + ngram-mod

With ngram-mod, with or without kv quant, there is an additional variable allocation of under 150 MB during tg.

On a separate issue with ngram-mod, with or without kv quant: I tried ngram-mod + MTP, and it causes a crash with no error during tg after thinking is done. It lags the llama UI very badly and just stops tg, with no logs before it crashes.

@coder543

A couple of other things that I believe the fit algorithm is not reserving space for: cache-ram and ctx-checkpoints. On unified memory systems like the DGX Spark, this makes it hard to rely on the fit algorithm without specifying an arbitrarily large fit-target.

5 participants