
cuda: fall back to pinned host memory when the model arena runs out of VRAM #487

Open
riccardo-galbani wants to merge 1 commit into antirez:main from riccardo-galbani:cuda-arena-pinned-oom-fallback

Previously, cuda_model_arena_alloc() failed hard (returned NULL and set
g_model_cache_full = 1) the moment cudaMalloc failed for a model
weight chunk. On GPUs where the non-routed (always-resident) weights
of a model exceed available VRAM, this made --ssd-streaming
unusable even though the model's routed experts would otherwise fit
comfortably via the existing streaming expert cache.

This adds a fallback: when the device allocation fails, allocate
pinned host memory (cudaHostAlloc with cudaHostAllocMapped) and
expose it to the GPU via cudaHostGetDevicePointer (zero-copy). This
keeps such models loadable and correct, at the cost of PCIe latency
on every access to the affected chunk. Set
DS4_CUDA_NO_PINNED_ARENA_FALLBACK to restore the previous
fail-fast behavior.
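
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the `arena_backend` callbacks, `arena_chunk`, `arena_chunk_alloc`, and the fake backend are hypothetical names introduced here so the logic can run without a GPU. In a real build the callbacks would wrap cudaMalloc, cudaHostAlloc with cudaHostAllocMapped, cudaHostGetDevicePointer, and cudaFreeHost. Treating any set value of DS4_CUDA_NO_PINNED_ARENA_FALLBACK as "disabled" is also an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical backend interface standing in for the CUDA runtime.
 * Every callback returns 0 on success and non-zero on failure. */
typedef struct {
    int  (*device_alloc)(void **dev_ptr, size_t bytes);   /* real build: cudaMalloc */
    int  (*pinned_alloc)(void **host_ptr, size_t bytes);  /* real build: cudaHostAlloc(..., cudaHostAllocMapped) */
    int  (*map_to_device)(void **dev_ptr, void *host_ptr);/* real build: cudaHostGetDevicePointer */
    void (*pinned_free)(void *host_ptr);                  /* real build: cudaFreeHost */
} arena_backend;

typedef struct {
    void *dev_ptr;   /* address the GPU kernels would dereference */
    void *host_ptr;  /* non-NULL only when the chunk lives in pinned host RAM */
    int   is_pinned; /* 1 if the zero-copy fallback was used */
} arena_chunk;

/* Assumption: the opt-out counts as active whenever the variable is set. */
int pinned_fallback_disabled(void) {
    return getenv("DS4_CUDA_NO_PINNED_ARENA_FALLBACK") != NULL;
}

/* Try device memory first; on failure, optionally fall back to mapped
 * pinned host memory. Returns 0 on success, -1 on failure. */
int arena_chunk_alloc(const arena_backend *be, size_t bytes,
                      int no_fallback, arena_chunk *out) {
    out->dev_ptr = NULL;
    out->host_ptr = NULL;
    out->is_pinned = 0;

    if (be->device_alloc(&out->dev_ptr, bytes) == 0)
        return 0;                 /* fast path: chunk is resident in VRAM */
    out->dev_ptr = NULL;

    if (no_fallback)
        return -1;                /* previous fail-fast behavior */

    void *host = NULL;
    if (be->pinned_alloc(&host, bytes) != 0)
        return -1;                /* host RAM could not be pinned either */

    void *dev = NULL;
    if (be->map_to_device(&dev, host) != 0) {
        be->pinned_free(host);    /* do not leak the pinned chunk */
        return -1;
    }

    out->dev_ptr = dev;
    out->host_ptr = host;
    out->is_pinned = 1;
    return 0;
}

/* Fake backend for demonstration only: the "device" is always out of
 * memory, and "pinned" memory is plain malloc() with an identity mapping. */
static int fake_device_oom(void **p, size_t n) { (void)n; *p = NULL; return -1; }
static int fake_pinned_alloc(void **p, size_t n) { *p = malloc(n); return *p ? 0 : -1; }
static int fake_map(void **d, void *h) { *d = h; return 0; }
static void fake_pinned_free(void *h) { free(h); }

static const arena_backend fake_oom_backend = {
    fake_device_oom, fake_pinned_alloc, fake_map, fake_pinned_free
};
```

A caller would pass `pinned_fallback_disabled()` as the `no_fallback` argument. With `fake_oom_backend`, the allocation succeeds through the pinned path when `no_fallback` is 0 and fails when it is 1.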

Tested on a low-memory consumer laptop CUDA GPU with a model whose
non-routed weights (8.20 GiB) exceed available VRAM. Previously
this crashed with an illegal memory access / OOM during prefill.
With this patch it loads and generates correctly, though slowly due
to PCIe round-trips on the pinned chunks:

ds4: CUDA model arena using pinned host RAM fallback for q8_0 (1792.00 MiB chunk, zero-copy device access)
ds4: CUDA model arena using pinned host RAM fallback for q8_0_pair0 (1792.00 MiB chunk, zero-copy device access)
ds4: CUDA model arena using pinned host RAM fallback for q8_hc_expand (1792.00 MiB chunk, zero-copy device access)
...
ds4: prefill: 0.52 t/s, generation: 0.58 t/s

On GPUs with enough VRAM for a model's non-routed weights, this
fallback never triggers and behavior is unchanged.
