
Stateless broadcast optimization #190

Draft
cavusmustafa wants to merge 3 commits into ravi9:dev_backend_openvino from cavusmustafa:stateless_broadcast_optimization

@cavusmustafa
Collaborator

  • Updated the stateless translation so that the GPU plugin can capture ov::op::internal::RoPE.
  • Temporary performance workaround for stateless GPU: bypass the v13 SDPA op entirely and express attention manually, so that MatMul's NUMPY broadcast handles the GQA expansion at the kernel level. K and V stay at the n_heads_kv shape; the GEMM kernel reads them once and broadcasts them via a stride trick. We can revert this once we can use the internal SDPA kernel, which supports GQA broadcasting.
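To illustrate why broadcasting over the group axis can stand in for an explicit GQA expansion, here is a minimal sketch in plain C++. It is not the PR's code and does not use the OpenVINO API: the function names and the flat row-major layouts are assumptions made for illustration, and only the Q·Kᵀ score step is shown (scaling, masking, softmax, and the multiplication by V are omitted).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the PR: attention scores Q*K^T for GQA where
// each K head is shared by `factor` query heads without being copied.
// q is [n_kv, factor, S, D] and k is [n_kv, T, D], both row-major and flat.
// Returns scores laid out as [n_kv, factor, S, T].
std::vector<float> gqa_scores_broadcast(const std::vector<float>& q,
                                        const std::vector<float>& k,
                                        std::size_t n_kv, std::size_t factor,
                                        std::size_t S, std::size_t T,
                                        std::size_t D) {
    assert(q.size() == n_kv * factor * S * D);
    assert(k.size() == n_kv * T * D);
    std::vector<float> out(n_kv * factor * S * T, 0.0f);
    for (std::size_t g = 0; g < n_kv; ++g)
        for (std::size_t f = 0; f < factor; ++f)
            for (std::size_t s = 0; s < S; ++s)
                for (std::size_t t = 0; t < T; ++t) {
                    const float* q_row = &q[((g * factor + f) * S + s) * D];
                    // The K offset ignores f: this is the broadcast over the group axis.
                    const float* k_row = &k[(g * T + t) * D];
                    float acc = 0.0f;
                    for (std::size_t d = 0; d < D; ++d) acc += q_row[d] * k_row[d];
                    out[((g * factor + f) * S + s) * T + t] = acc;
                }
    return out;
}

// Reference path: materialize K as [n_kv * factor, T, D] by repeating each KV
// head `factor` times, then run an ordinary per-head Q*K^T.
std::vector<float> gqa_scores_expanded(const std::vector<float>& q,
                                       const std::vector<float>& k,
                                       std::size_t n_kv, std::size_t factor,
                                       std::size_t S, std::size_t T,
                                       std::size_t D) {
    const std::size_t n_heads = n_kv * factor;
    std::vector<float> k_rep(n_heads * T * D);
    for (std::size_t h = 0; h < n_heads; ++h)
        for (std::size_t i = 0; i < T * D; ++i)
            k_rep[h * T * D + i] = k[(h / factor) * T * D + i];
    std::vector<float> out(n_heads * S * T, 0.0f);
    for (std::size_t h = 0; h < n_heads; ++h)
        for (std::size_t s = 0; s < S; ++s)
            for (std::size_t t = 0; t < T; ++t) {
                float acc = 0.0f;
                for (std::size_t d = 0; d < D; ++d)
                    acc += q[(h * S + s) * D + d] * k_rep[(h * T + t) * D + d];
                out[(h * S + s) * T + t] = acc;
            }
    return out;
}
```

Both functions produce the same scores; the broadcast version simply never allocates the repeated K tensor, which is the memory traffic the workaround is meant to avoid.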

@cavusmustafa cavusmustafa force-pushed the stateless_broadcast_optimization branch from a56fb28 to 115a310 Compare May 29, 2026 22:36
@cavusmustafa cavusmustafa marked this pull request as ready for review May 29, 2026 22:49
@cavusmustafa cavusmustafa requested a review from wine99 as a code owner May 29, 2026 22:49
auto q_5d_shape = ov::op::v0::Constant::create(
    ov::element::i64, {5},
    std::vector<int64_t>{1, num_heads_kv, factor, -1, head_size});

Collaborator

I believe Q, K, and V arrive as [B, n_heads, S, head_size], where B is the extra input n_seq_active, so this code does not work correctly with llama-perplexity or with llama-server -np > 1.

If the ov pattern supports multiple sequences, i.e. B != 1, we can change the shape to {0, num_heads_kv, 1, -1, head_size} and set special_zero = true in Reshape. If the ov pattern does not support multiple sequences, we can set use_manual_gqa_attention to false when n_seq > 1, or manually run perplexity with GGML_OPENVINO_MANUAL_GQA_ATTN=0. Otherwise LGTM.
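The shape arithmetic behind this comment can be checked with a small standalone sketch. It models the Reshape rules as I understand them (with special_zero set, a 0 in the pattern copies the input dimension at the same index, and a single -1 is inferred from the remaining element count); `resolve_reshape` is a hypothetical helper written for illustration, not an OpenVINO function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: resolve a reshape pattern against an input shape.
// Assumed rules: 0 copies in[i] when special_zero is true; at most one -1,
// which is inferred so that the element count is preserved.
std::vector<int64_t> resolve_reshape(const std::vector<int64_t>& in,
                                     const std::vector<int64_t>& pattern,
                                     bool special_zero) {
    int64_t total = 1;
    for (int64_t d : in) total *= d;

    std::vector<int64_t> out(pattern);
    int64_t known = 1;          // product of all explicitly known output dims
    std::ptrdiff_t infer_at = -1;
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] == -1) {
            if (infer_at != -1) throw std::invalid_argument("more than one -1");
            infer_at = static_cast<std::ptrdiff_t>(i);
            continue;
        }
        if (out[i] == 0 && special_zero) {
            if (i >= in.size()) throw std::invalid_argument("0 has no input dim");
            out[i] = in[i];     // copy the input dimension at the same index
        }
        known *= out[i];
    }
    if (infer_at >= 0) {
        if (known == 0 || total % known != 0)
            throw std::invalid_argument("cannot infer -1");
        out[static_cast<std::size_t>(infer_at)] = total / known;
    } else if (known != total) {
        throw std::invalid_argument("element count mismatch");
    }
    return out;
}
```

With an input of [2, 8, 5, 4] (B = 2, n_heads = 8, S = 5, head_size = 4), the hard-coded pattern {1, 2, 4, -1, 4} resolves to [1, 2, 4, 10, 4], so the second sequence is silently folded into the sequence axis. A pattern with a leading 0 and special_zero = true, such as {0, 2, 4, -1, 4}, resolves to [2, 2, 4, 5, 4] and keeps B intact. The example keeps the factor dimension from the snippet under review purely to make the effect of the leading dimension visible.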

Collaborator

FYI, to run llama-server -np > 1 or llama-perplexity, you need to include the commit from #199.

@cavusmustafa cavusmustafa marked this pull request as draft June 1, 2026 17:45

2 participants