cuda: enable streaming auto cache (implement recommended_working_set_size) #488
Open
riccardo-galbani wants to merge 1 commit into
This worked for me on a 7800 XT but I also needed some stuff from #461 like these flags: It's not exactly fast, ~33 t/s PP, 5-6 on TG out of the gate. I just wanted to see if I could. :)
ds4_backend_supports_streaming_auto_cache() only allowed the SSD
streaming auto cache planner to run under DS4_BACKEND_METAL, or
under DS4_BACKEND_CUDA when built with DS4_ROCM_BUILD, which a
plain make cuda-generic (nvcc) build never defines. As a result,
CUDA users always had to pass --ssd-streaming-cache-experts
explicitly, and ds4_gpu_recommended_working_set_size() in
ds4_cuda.cu was an unimplemented stub returning 0.
This implements the CUDA working set size using cudaMemGetInfo's
total device memory (the closest analogue to Metal's
recommendedMaxWorkingSetSize), and extends the guard so CUDA can use
the same auto cache planner Metal already has.
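For reviewers, a minimal sketch of the idea, not the code in this PR. The helper name recommended_working_set_size and the mem_info_fn type are hypothetical. The query is injected so the logic can be exercised without a GPU. In a CUDA build the query would wrap cudaMemGetInfo(size_t *free, size_t *total), which returns cudaSuccess (0) on success.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same parameter shape as cudaMemGetInfo(size_t *free, size_t *total).
// Returns 0 on success, mirroring cudaSuccess. (Hypothetical type.)
using mem_info_fn = int (*)(std::size_t *free_bytes, std::size_t *total_bytes);

// Hypothetical helper: report total device memory as the recommended
// working set. On a failed query it returns 0, which is what the old
// stub always returned.
inline std::uint64_t recommended_working_set_size(mem_info_fn query) {
    assert(query != nullptr);
    std::size_t free_bytes = 0;
    std::size_t total_bytes = 0;
    if (query(&free_bytes, &total_bytes) != 0) {
        return 0;
    }
    // Total, not free: free memory varies with whatever else is resident.
    return static_cast<std::uint64_t>(total_bytes);
}
```

Returning 0 on failure is an assumption made here so that callers which treated the old stub's 0 as "unknown" would keep working.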
Tested on a CUDA GPU with 8 GB VRAM with DeepSeek V4 Flash (the
project's reference model):
ds4: SSD streaming auto cache budget
ds4: cuda recommends 7.62 GiB working set
ds4: using 80% total for model + cached experts: 6.10 GiB
ds4: non-routed weights: 8.20 GiB
ds4: routed expert size: 6.75 MiB
ds4: cached expert count: 1 (0.01 GiB)
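
The arithmetic behind the budget log above can be sketched as follows. This is an illustration, not the planner's actual code: the function name cached_expert_count is hypothetical, the 80% fraction is taken from the log, and the floor of one expert is an assumption inferred from the log reporting a count of 1 even though the non-routed weights (8.20 GiB) exceed the budget (6.10 GiB).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical planner arithmetic:
//   budget    = 80% of the recommended working set
//   remaining = budget - non-routed weights
//   count     = remaining / routed expert size, floored at 1 (assumed)
inline std::uint64_t cached_expert_count(std::uint64_t working_set_bytes,
                                         std::uint64_t non_routed_bytes,
                                         std::uint64_t expert_bytes) {
    assert(expert_bytes > 0);
    // Divide before multiplying so the product cannot overflow.
    const std::uint64_t budget = working_set_bytes / 5 * 4;
    if (budget <= non_routed_bytes) {
        return 1;  // nothing left after the non-routed weights
    }
    const std::uint64_t count = (budget - non_routed_bytes) / expert_bytes;
    return count > 0 ? count : 1;
}
```

With the figures from the log, the budget is smaller than the non-routed weights, so this sketch takes the early return, which matches the logged count.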