Remove redundant CUDA copies after gated_delta_net. by gaugarg-nv · Pull Request #23940 · ggml-org/llama.cpp

gaugarg-nv · 2026-05-31T14:09:11Z

Currently, gated_delta_net (GDN) writes recurrent state snapshots into its output tail, and the graph then immediately copies those snapshots into ssm_states_all. With an MTP draft length of 3, the target decode uses K=4, which adds 4 extra ggml_cuda_cpy calls.

This change detects the gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshots directly into the recurrent cache, skipping the intermediate tail writes and the copy kernels when it is safe to do so.
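To make the pattern concrete, here is a minimal sketch of matching gated_delta_net -> view -> cpy chains in a toy graph. It is not the PR's CUDA code: the `Node` class, `build_decode_graph`, and `find_fusable_copies` are hypothetical names invented for illustration, and the sketch does not model the safety checks the PR applies before fusing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    """Hypothetical graph node: an op name plus its source nodes."""
    op: str
    name: str
    srcs: List["Node"] = field(default_factory=list)


def build_decode_graph(k: int) -> List[Node]:
    """Toy graph: one GDN op followed by k snapshot views, each copied to a cache slot."""
    gdn = Node("gated_delta_net", "gdn_out")
    graph = [gdn]
    for i in range(k):
        snapshot = Node("view", f"state_snapshot_{i}", [gdn])   # slice of the output tail
        slot = Node("view", f"ssm_states_all_slot_{i}")          # slice of the cache
        graph += [snapshot, slot, Node("cpy", f"cpy_{i}", [snapshot, slot])]
    return graph


def find_fusable_copies(graph: List[Node]) -> List[Node]:
    """Return cpy nodes whose first source is a view of a gated_delta_net output."""
    fusable = []
    for node in graph:
        if node.op != "cpy" or not node.srcs:
            continue
        view = node.srcs[0]
        if view.op == "view" and view.srcs and view.srcs[0].op == "gated_delta_net":
            fusable.append(node)
    return fusable


graph = build_decode_graph(k=4)   # K=4, as in the MTP draft-length-3 case
fused = find_fusable_copies(graph)
print(f"{len(fused)} copy nodes could be folded into the GDN kernel")
```

In this toy model every matched `cpy` is a candidate for the kernel to absorb by writing straight to the destination slot; with K=4 that is the 4 copies mentioned above.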

Performance on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf:

MTP OFF: 3% gain in the decode phase.
MTP ON: 4% average gain.

Detailed results on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf:

MTP OFF:

model	test	master t/s	PR t/s	speed-up
qwen35moe 35B.A3B Q4_K - Medium	pp512	2295.70 ± 17.03	2296.19 ± 10.92	1.00x
qwen35moe 35B.A3B Q4_K - Medium	tg128	68.50 ± 0.21	70.50 ± 0.21	1.03x

MTP ON (command: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp):

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=91.4
  code_cpp           pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=79.6
  explain_concept    pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=84.4
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=97.7
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=95.4
  translation        pred= 192 draft= 205 acc= 121 rate=0.590 tok/s=79.9
  creative_short     pred= 192 draft= 195 acc= 125 rate=0.641 tok/s=85.0
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=96.3
  long_code_review   pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=82.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1688,
  "total_draft_accepted": 1148,
  "aggregate_accept_rate": 0.6801,
  "wall_s_total": 22.06
}

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=95.4
  code_cpp           pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=82.8
  explain_concept    pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=87.5
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=101.3
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=99.2
  translation        pred= 192 draft= 205 acc= 121 rate=0.590 tok/s=83.0
  creative_short     pred= 192 draft= 195 acc= 125 rate=0.641 tok/s=88.5
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=99.5
  long_code_review   pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=85.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1688,
  "total_draft_accepted": 1148,
  "aggregate_accept_rate": 0.6801,
  "wall_s_total": 20.79
}
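As a quick cross-check of the reported average gain, the per-prompt tok/s figures from the two runs above can be compared directly. This is a small sketch using only the numbers printed in this description; the dictionary and variable names are made up for illustration.

```python
from statistics import mean

# tok/s per prompt, copied from the two mtp-bench.py runs above.
master = {
    "code_python": 91.4, "code_cpp": 79.6, "explain_concept": 84.4,
    "summarize": 97.7, "qa_factual": 95.4, "translation": 79.9,
    "creative_short": 85.0, "stepwise_math": 96.3, "long_code_review": 82.6,
}
pr = {
    "code_python": 95.4, "code_cpp": 82.8, "explain_concept": 87.5,
    "summarize": 101.3, "qa_factual": 99.2, "translation": 83.0,
    "creative_short": 88.5, "stepwise_math": 99.5, "long_code_review": 85.7,
}

# Per-prompt speed-up of the PR build over master.
speedups = {name: pr[name] / master[name] for name in master}
avg_speedup = mean(speedups.values())

for name, s in speedups.items():
    print(f"{name:18s} {s:.3f}x")
print(f"mean speed-up: {avg_speedup:.3f}x")
```

Every prompt lands between roughly 3% and 4.5% faster, and the mean comes out close to the 4% quoted for MTP ON. Draft and acceptance counts are identical in both runs, so the difference is in throughput only.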

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. Paired with Codex on this.

am17an · 2026-05-31T14:18:12Z

This seems like it should be solved at the graph level instead of fusion at the CUDA level. Is there a reason not to do that?

gaugarg-nv · 2026-05-31T14:32:26Z

> This seems like it should be solved at the graph level instead of fusion at the CUDA level. Is there a reason not to do that?

Yes, solving this at the graph level makes sense to me, as it would help all the backends. However, it would require modifying the ggml API for gated_delta_net and updating all the backends. I can create a POC if that looks like a reasonable approach.

am17an · 2026-05-31T14:40:03Z

Yes, I think that would be better, since the extra copy is not specific to CUDA. However, I'm not sure whether the copy exists for a reason or can be safely removed.

gaugarg-nv · 2026-05-31T16:05:50Z

@ggerganov I would love to know your thoughts on this, as it will require updating the GGML API and changing the op to directly write into the persistent cache.

ggerganov · 2026-05-31T16:13:18Z

I don't think we can avoid the copy at the graph level? @am17an Do you have something in mind?

In general, the ggml pattern for caching results of partial tensors is:

result = ggml_some_op(...);
ggml_cpy(result, ggml_view(cache));

am17an · 2026-05-31T16:41:14Z

Can we not pass a view in the op itself?
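The two options under discussion can be contrasted in a toy model: the compute-then-copy pattern ggerganov describes, and an op that is handed the destination view and writes into it directly, as am17an suggests. This is only a sketch in plain Python, not ggml code; `CacheView`, `some_op`, `op_then_copy`, and `op_into_view` are hypothetical names, and a real op would need API and backend changes, as noted earlier in the thread.

```python
from typing import List


class CacheView:
    """Hypothetical stand-in for a ggml_view into a persistent cache buffer."""

    def __init__(self, cache: List[float], offset: int, size: int) -> None:
        self.cache, self.offset, self.size = cache, offset, size

    def write(self, values: List[float]) -> None:
        assert len(values) == self.size
        self.cache[self.offset:self.offset + self.size] = values


def some_op(x: List[float]) -> List[float]:
    """Toy op standing in for ggml_some_op: returns a freshly allocated result."""
    return [2.0 * v for v in x]


def op_then_copy(x: List[float], view: CacheView) -> int:
    """Current pattern: compute into a temporary, then copy it into the view."""
    result = some_op(x)   # result = ggml_some_op(...)
    view.write(result)    # ggml_cpy(result, ggml_view(cache))
    return 1              # one extra copy was issued


def op_into_view(x: List[float], view: CacheView) -> int:
    """Alternative: the op receives the view and writes each element in place."""
    assert len(x) == view.size
    for i, v in enumerate(x):
        view.cache[view.offset + i] = 2.0 * v
    return 0              # no separate copy step


x = [1.0, 2.0, 3.0]
cache_a, cache_b = [0.0] * 8, [0.0] * 8
copies_a = op_then_copy(x, CacheView(cache_a, offset=2, size=3))
copies_b = op_into_view(x, CacheView(cache_b, offset=2, size=3))
print(cache_a == cache_b, copies_a, copies_b)
```

Both paths leave the cache in the same state; the second simply avoids the temporary and the copy. Whether the copy can be dropped safely in the real graph is the open question raised above.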

gaugarg-nv requested a review from a team as a code owner May 31, 2026 14:09

github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on May 31, 2026

gaugarg-nv wants to merge 1 commit into ggml-org:master from gaugarg-nv:fuse_gated_delta_net_with_copy
