Stateless broadcast optimization by cavusmustafa · Pull Request #190 · ravi9/llama.cpp

cavusmustafa · 2026-05-26T23:31:48Z

Updates the stateless translation so the GPU plugin can capture ov::op::internal::RoPE.
Temporary performance workaround for stateless GPU: bypass the v13 SDPA op entirely and express attention manually, so that MatMul's NUMPY broadcast handles the GQA expansion at the kernel level. K and V stay at the n_heads_kv shape; the GEMM kernel reads them once and broadcasts them via a stride trick. This can be reverted once we can use the internal SDPA kernel, which supports GQA broadcasting.
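To illustrate the idea outside of OpenVINO, here is a minimal plain C++ sketch of attention in which each query head reads its shared K/V head directly, instead of K and V first being copied `factor` times. It is not the PR's translation code: `Tensor3`, `gqa_attention`, and `repeat_kv` are hypothetical names, the batch dimension and the attention mask are omitted, and the head mapping `h / factor` is an assumption that matches viewing Q as `[n_heads_kv, factor, S, head_size]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major tensor of shape [heads, seq, dim].
struct Tensor3 {
    std::size_t heads, seq, dim;
    std::vector<float> data;
    Tensor3(std::size_t h, std::size_t s, std::size_t d)
        : heads(h), seq(s), dim(d), data(h * s * d, 0.0f) {}
    float& at(std::size_t h, std::size_t s, std::size_t d) {
        return data[(h * seq + s) * dim + d];
    }
    float at(std::size_t h, std::size_t s, std::size_t d) const {
        return data[(h * seq + s) * dim + d];
    }
};

// softmax(Q K^T / sqrt(dim)) V without a mask. K and V keep n_heads_kv heads;
// query head h reads K/V head h / factor, so nothing is materialized.
Tensor3 gqa_attention(const Tensor3& q, const Tensor3& k, const Tensor3& v) {
    assert(k.heads > 0 && q.heads % k.heads == 0);
    assert(k.heads == v.heads && k.seq == v.seq && k.seq > 0 && q.dim == k.dim);
    const std::size_t factor = q.heads / k.heads;
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.dim));
    Tensor3 out(q.heads, q.seq, v.dim);
    std::vector<float> scores(k.seq);
    for (std::size_t h = 0; h < q.heads; ++h) {
        const std::size_t kv = h / factor;  // the "broadcast": shared K/V head
        for (std::size_t i = 0; i < q.seq; ++i) {
            for (std::size_t j = 0; j < k.seq; ++j) {
                float dot = 0.0f;
                for (std::size_t d = 0; d < q.dim; ++d) {
                    dot += q.at(h, i, d) * k.at(kv, j, d);
                }
                scores[j] = dot * scale;
            }
            // Numerically stable softmax over the key positions.
            const float max_score = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) {
                s = std::exp(s - max_score);
                sum += s;
            }
            for (std::size_t d = 0; d < v.dim; ++d) {
                float acc = 0.0f;
                for (std::size_t j = 0; j < k.seq; ++j) {
                    acc += scores[j] * v.at(kv, j, d);
                }
                out.at(h, i, d) = acc / sum;
            }
        }
    }
    return out;
}

// Reference path: materialize the expansion that the broadcast avoids.
Tensor3 repeat_kv(const Tensor3& t, std::size_t factor) {
    Tensor3 out(t.heads * factor, t.seq, t.dim);
    for (std::size_t h = 0; h < out.heads; ++h) {
        for (std::size_t s = 0; s < t.seq; ++s) {
            for (std::size_t d = 0; d < t.dim; ++d) {
                out.at(h, s, d) = t.at(h / factor, s, d);
            }
        }
    }
    return out;
}
```

Calling `gqa_attention(q, k, v)` with 4 query heads and 2 K/V heads should give the same result as calling it on `repeat_kv(k, 2)` and `repeat_kv(v, 2)`, while touching only half as much K/V data.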

wine99 · 2026-06-01T06:01:05Z

+        auto q_5d_shape = ov::op::v0::Constant::create(
+            ov::element::i64, {5},
+            std::vector<int64_t>{1, num_heads_kv, factor, -1, head_size});
+


I believe Q, K, and V arrive as [B, n_heads, S, head_size], where B is the extra input n_seq_active, so this code does not work correctly with llama-perplexity or with llama-server -np > 1.

If the ov pattern supports multiple sequences, i.e. B != 1, we can change the shape to {0, num_heads_kv, 1, -1, head_size} and set special_zero = true in Reshape. If the ov pattern does not support multiple sequences, we can set use_manual_gqa_attention to false when n_seq > 1, or manually run perplexity with GGML_OPENVINO_MANUAL_GQA_ATTN=0. Otherwise, LGTM.
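As a rough illustration of why the leading `1` matters, the sketch below resolves a Reshape target pattern the way OpenVINO's Reshape treats it: `-1` is inferred from the remaining element count, and `0` copies the input dimension at the same index when special_zero is true. `resolve_reshape` is a hypothetical helper written for this example, not an OpenVINO API, and the example patterns keep `factor` in the third position as in the diff above, which is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: compute the output shape for a reshape pattern.
//   -1 : inferred so that the total element count is preserved
//    0 : copies in[i] when special_zero is true, otherwise a literal 0
std::vector<int64_t> resolve_reshape(const std::vector<int64_t>& in,
                                     const std::vector<int64_t>& pattern,
                                     bool special_zero) {
    int64_t total = 1;
    for (int64_t d : in) {
        total *= d;
    }
    std::vector<int64_t> out(pattern.size(), 0);
    int64_t known = 1;
    bool has_infer = false;
    std::size_t infer_at = 0;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == -1) {
            assert(!has_infer);  // at most one inferred dimension
            has_infer = true;
            infer_at = i;
            continue;
        }
        if (pattern[i] == 0 && special_zero) {
            assert(i < in.size());
            out[i] = in[i];  // copy the input dimension at this index
        } else {
            out[i] = pattern[i];
        }
        known *= out[i];
    }
    if (has_infer) {
        assert(known != 0 && total % known == 0);
        out[infer_at] = total / known;
    } else {
        assert(known == total);
    }
    return out;
}
```

For an input of shape {2, 8, 5, 64} (B = 2, n_heads = 8, S = 5, head_size = 64) with num_heads_kv = 2 and factor = 4, the pattern {0, 2, 4, -1, 64} with special_zero = true resolves to {2, 2, 4, 5, 64}. The hardcoded pattern {1, 2, 4, -1, 64} resolves to {1, 2, 4, 10, 64} instead: the reshape still succeeds, but the batch is silently folded into the inferred dimension.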

FYI, to run llama-server -np > 1 or llama-perplexity, you need to include the commit from #199.

github-actions Bot added OpenVINO ggml labels May 26, 2026

cavusmustafa added 2 commits May 29, 2026 14:29

stateless broadcast and rope optimizations

952f603

Enable manual gqa attn by default for stateless gpu

115a310

cavusmustafa force-pushed the stateless_broadcast_optimization branch from a56fb28 to 115a310 May 29, 2026 22:36

cavusmustafa marked this pull request as ready for review May 29, 2026 22:49

cavusmustafa requested a review from wine99 as a code owner May 29, 2026 22:49

wine99 approved these changes Jun 1, 2026

cavusmustafa marked this pull request as draft June 1, 2026 17:45

manual gqa: fixed static batch

c41aa02

cavusmustafa wants to merge 3 commits into ravi9:dev_backend_openvino from cavusmustafa:stateless_broadcast_optimization
