cuda: reserve space for quantize kv-cache at startup #23907
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Did you test |
I was going to say that |
I did not check yet. We can wait for #23792 to get merged and check again, because currently it will throw.
I have, after also merging #23792, and as far as I can see there is no more memory creep. On my system (5060 Ti + 2060 Super), in all the following cases there is a 125 mp allocation during first prompt processing: without kv quant; with or without kv quant + ngram-mod; on a different issue with ngram-mod, with or without kv quant.
A couple of other things that I believe the fit algorithm is not reserving space for:
Overview
ref #23646 (comment). Quantized kv-cache can lead to OOM even when using --fit, since it does not know about these backend allocations. There are some other quantization buffers in FA and MMQ which should also be removed, but this one seems to take the most space, as it scales with ctx size.
Additional information
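To make the "scales with ctx size" point concrete, here is a rough sketch, not code from this PR. It assumes, purely for illustration, that the backend needs a temporary F16 copy of one layer's K and V tensors when the cache is quantized. The helper name `f16_scratch_bytes` and the shape parameters are hypothetical and are not part of the llama.cpp API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): estimates the bytes needed for a
// temporary F16 copy of one layer's K and V tensors.
// Assumption: the scratch buffer holds every element of K and of V as F16.
static size_t f16_scratch_bytes(int64_t n_ctx, int64_t n_head_kv, int64_t head_dim) {
    const size_t f16_size = 2; // bytes per F16 element
    const size_t n_elems  = (size_t) n_ctx * (size_t) n_head_kv * (size_t) head_dim;
    return 2 * n_elems * f16_size; // one copy for K plus one for V
}
```

Under these assumptions the estimate grows linearly with context length: with 8 KV heads of dimension 128, a context of 8192 tokens gives 32 MiB and a context of 65536 tokens gives 256 MiB. An allocation of this kind that only happens at first use is invisible to a fit step that runs at startup, which is why reserving it up front matters.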
Requirements