
Remove redundant CUDA copies after gated_delta_net. #23940

Open
gaugarg-nv wants to merge 1 commit into ggml-org:master from gaugarg-nv:fuse_gated_delta_net_with_copy

@gaugarg-nv
Contributor

Currently, GDN writes recurrent state snapshots into its output tail, then the graph immediately copies those snapshots into ssm_states_all. With MTP draft length 3, target decode uses K=4, so that becomes 4 extra ggml_cuda_cpy calls.

This change detects the gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshots directly into the recurrent cache, skipping the intermediate tail writes and the copy kernels when it is safe to do so.
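
To make the before/after concrete, here is a minimal CPU-side sketch in C++ of the two data flows. Everything in it is hypothetical: `RunResult`, `step`, `run_unfused`, and `run_fused` are illustrative names, the state update is a placeholder rather than the real gated delta rule, and the actual change lives inside the CUDA kernel and the graph pattern matcher, not in code like this.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Result of processing K tokens: K state snapshots of S floats each,
// plus the number of separate copy passes issued after the op.
struct RunResult {
    std::vector<float> cache;
    std::size_t copy_passes;
};

// Placeholder recurrent update (NOT the real gated delta rule).
static void step(std::vector<float>& state, float x) {
    for (float& s : state) s = 0.5f * s + x;
}

// Unfused flow: the op writes each snapshot into a scratch "tail",
// then one copy pass per snapshot moves it into the persistent cache.
RunResult run_unfused(std::vector<float> state, const std::vector<float>& inputs) {
    const std::size_t S = state.size(), K = inputs.size();
    std::vector<float> tail(K * S);
    for (std::size_t k = 0; k < K; ++k) {
        step(state, inputs[k]);
        std::copy(state.begin(), state.end(), tail.data() + k * S);
    }
    RunResult r{std::vector<float>(K * S), 0};
    for (std::size_t k = 0; k < K; ++k) {
        std::copy_n(tail.data() + k * S, S, r.cache.data() + k * S);
        ++r.copy_passes;  // stands in for one ggml_cuda_cpy call
    }
    return r;
}

// Fused flow: the op writes each snapshot straight into its cache slot,
// so no tail buffer and no copy passes are needed.
RunResult run_fused(std::vector<float> state, const std::vector<float>& inputs) {
    const std::size_t S = state.size(), K = inputs.size();
    RunResult r{std::vector<float>(K * S), 0};
    for (std::size_t k = 0; k < K; ++k) {
        step(state, inputs[k]);
        std::copy(state.begin(), state.end(), r.cache.data() + k * S);
    }
    return r;
}
```

With four inputs (matching K=4), both functions fill the cache with identical contents, but the unfused version counts four copy passes and the fused version counts none.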

Performance on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf:

MTP off: about 3% faster in the decode phase (tg128).
MTP on: about 4% faster on average across the benchmark prompts.

More perf details on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

MTP off:

| model | test | Master (t/s) | PR (t/s) | Speed-up |
| --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | pp512 | 2295.70 ± 17.03 | 2296.19 ± 10.92 | 1.0x |
| qwen35moe 35B.A3B Q4_K - Medium | tg128 | 68.50 ± 0.21 | 70.50 ± 0.21 | 1.03x |

MTP on. Command: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=91.4
  code_cpp           pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=79.6
  explain_concept    pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=84.4
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=97.7
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=95.4
  translation        pred= 192 draft= 205 acc= 121 rate=0.590 tok/s=79.9
  creative_short     pred= 192 draft= 195 acc= 125 rate=0.641 tok/s=85.0
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=96.3
  long_code_review   pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=82.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1688,
  "total_draft_accepted": 1148,
  "aggregate_accept_rate": 0.6801,
  "wall_s_total": 22.06
}

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=95.4
  code_cpp           pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=82.8
  explain_concept    pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=87.5
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=101.3
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=99.2
  translation        pred= 192 draft= 205 acc= 121 rate=0.590 tok/s=83.0
  creative_short     pred= 192 draft= 195 acc= 125 rate=0.641 tok/s=88.5
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=99.5
  long_code_review   pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=85.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1688,
  "total_draft_accepted": 1148,
  "aggregate_accept_rate": 0.6801,
  "wall_s_total": 20.79
}
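
As a rough cross-check of the 4% figure, the per-prompt tok/s values from the two runs above can be compared directly. This is only a sketch: the helper `mean_speedup` and the constant names are made up for illustration, and the numbers are copied from the two listings in prompt order.

```cpp
#include <cstddef>
#include <vector>

// tok/s per prompt, in the order printed by mtp-bench.py above.
const std::vector<double> kMasterTokS = {91.4, 79.6, 84.4, 97.7, 95.4, 79.9, 85.0, 96.3, 82.6};
const std::vector<double> kPrTokS     = {95.4, 82.8, 87.5, 101.3, 99.2, 83.0, 88.5, 99.5, 85.7};

// Arithmetic mean of the per-prompt ratios pr[i] / base[i].
// Assumes both vectors are non-empty and have the same length.
double mean_speedup(const std::vector<double>& base, const std::vector<double>& pr) {
    double sum = 0.0;
    for (std::size_t i = 0; i < base.size(); ++i) {
        sum += pr[i] / base[i];
    }
    return sum / static_cast<double>(base.size());
}
```

Calling `mean_speedup(kMasterTokS, kPrTokS)` gives a ratio of roughly 1.04, in line with the average gain quoted for the MTP-on case.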

gaugarg-nv requested a review from a team as a code owner on May 31, 2026 14:09

github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on May 31, 2026

@am17an
Contributor

am17an commented May 31, 2026

This seems like it should be solved at the graph level instead of fusion at the CUDA level. Is there a reason not to do that?

@gaugarg-nv
Contributor Author

> This seems like it should be solved at the graph level instead of fusion at the CUDA level. Is there a reason not to do that?

Yes, solving this at the graph level makes sense to me, as it would help all the backends. But it would require modifying the ggml API for gated_delta_net and updating every backend. I can create a POC if that looks like a reasonable approach.

@am17an
Contributor

am17an commented May 31, 2026

Yes, that would be better, I think, since the extra copy is not specific to CUDA. But I'm not sure whether it exists for a reason or can be safely removed.

@gaugarg-nv
Contributor Author

@ggerganov I would love to know your thoughts on this, as it will require updating the GGML API and changing the op to directly write into the persistent cache.

@ggerganov
Member

I don't think we can avoid the copy at the graph level? @am17an Do you have something in mind?

In general, the ggml pattern for caching results of partial tensors is:

result = ggml_some_op(...);
ggml_cpy(result, ggml_view(cache));

@am17an
Contributor

am17an commented May 31, 2026

Can we not pass a view in the op itself?
