cuda: fall back to pinned host memory when the model arena runs out of VRAM by riccardo-galbani · Pull Request #487 · antirez/ds4

riccardo-galbani · 2026-07-02T21:46:55Z

cuda_model_arena_alloc() would fail hard (return NULL,
g_model_cache_full = 1) the moment cudaMalloc failed for a model
weight chunk. On GPUs where the non-routed (always-resident) weights
of a model exceed available VRAM, this made --ssd-streaming
unusable even though the model's routed experts would otherwise fit
comfortably via the existing streaming expert cache.

This adds a fallback: when the device allocation fails, allocate
pinned host memory (cudaHostAlloc with cudaHostAllocMapped) and
expose it to the GPU via cudaHostGetDevicePointer (zero-copy). This
keeps such models loadable and correct, at the cost of PCIe latency
on every access to the affected chunk. Set
DS4_CUDA_NO_PINNED_ARENA_FALLBACK to restore the previous
fail-fast behavior.
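
The control flow described above can be sketched roughly as follows. This
is not the code from the patch: it is a minimal illustration that runs
without a GPU, so the CUDA runtime calls are replaced by hypothetical
stand-ins (fake_device_alloc for cudaMalloc, fake_pinned_alloc for
cudaHostAlloc with cudaHostAllocMapped followed by
cudaHostGetDevicePointer). The arena_chunk type, the arena_alloc name and
the rule that any value of the environment variable disables the fallback
are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result type: where the chunk ended up. */
typedef enum { ARENA_NONE, ARENA_DEVICE, ARENA_PINNED_HOST } arena_kind;

typedef struct {
    void *dev_ptr;   /* pointer the kernels would dereference */
    void *host_ptr;  /* non-NULL only for the pinned host fallback */
    arena_kind kind;
} arena_chunk;

/* Simulated free VRAM, in bytes, so the sketch runs without a GPU. */
static size_t fake_vram_left = 0;

/* Stand-in for cudaMalloc: fails once the simulated VRAM is exhausted. */
static int fake_device_alloc(void **p, size_t n) {
    if (n > fake_vram_left) return -1;  /* simulated out-of-memory */
    *p = malloc(n);
    if (*p == NULL) return -1;
    fake_vram_left -= n;
    return 0;
}

/* Stand-in for cudaHostAlloc(..., cudaHostAllocMapped) followed by
 * cudaHostGetDevicePointer: the device pointer aliases host memory. */
static int fake_pinned_alloc(void **host, void **dev, size_t n) {
    *host = malloc(n);
    if (*host == NULL) return -1;
    *dev = *host;  /* zero-copy: no separate device-side buffer */
    return 0;
}

/* Try device memory first, then fall back to pinned host memory unless
 * the opt-out variable is set (assumed here: any value disables it). */
static arena_chunk arena_alloc(size_t n, const char *label) {
    arena_chunk c = { NULL, NULL, ARENA_NONE };
    assert(n > 0);

    if (fake_device_alloc(&c.dev_ptr, n) == 0) {
        c.kind = ARENA_DEVICE;
        return c;
    }
    if (getenv("DS4_CUDA_NO_PINNED_ARENA_FALLBACK") != NULL) {
        return c;  /* previous fail-fast behavior: caller sees ARENA_NONE */
    }
    if (fake_pinned_alloc(&c.host_ptr, &c.dev_ptr, n) == 0) {
        c.kind = ARENA_PINNED_HOST;
        fprintf(stderr, "arena: pinned host fallback for %s (%.2f MiB)\n",
                label, (double)n / (1024.0 * 1024.0));
    }
    return c;
}
```

A caller would check c.kind: ARENA_NONE corresponds to the old hard
failure, while ARENA_PINNED_HOST means every kernel access to that chunk
crosses PCIe, which is the slowdown reported in the test run below.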

Tested on a consumer laptop with a low-memory CUDA GPU, using a model
whose non-routed weights (8.20 GiB) exceed available VRAM. Previously
this crashed with an illegal memory access / OOM during prefill.
With this patch it loads and generates correctly, though slowly due
to PCIe round-trips on the pinned chunks:

ds4: CUDA model arena using pinned host RAM fallback for q8_0 (1792.00 MiB chunk, zero-copy device access)
ds4: CUDA model arena using pinned host RAM fallback for q8_0_pair0 (1792.00 MiB chunk, zero-copy device access)
ds4: CUDA model arena using pinned host RAM fallback for q8_hc_expand (1792.00 MiB chunk, zero-copy device access)
...
ds4: prefill: 0.52 t/s, generation: 0.58 t/s

On GPUs with enough VRAM for a model's non-routed weights, this
fallback never triggers and behavior is unchanged.

cuda: add pinned host RAM fallback when model arena hits OOM

eeefe0e

riccardo-galbani wants to merge 1 commit into antirez:main from riccardo-galbani:cuda-arena-pinned-oom-fallback
