Gemma4 initial npu support by cavusmustafa · Pull Request #195 · ravi9/llama.cpp

cavusmustafa · 2026-05-28T17:03:11Z

Gemma4 Initial NPU Support

Adds initial Intel NPU support for Gemma4 E2B via the OpenVINO backend, and fixes a regression that broke Gemma4 on CPU/GPU.

Changes

NPU static path support (view.cpp, utils.cpp, node_context.h, ggml-decoder.h, ggml-decoder.cpp):

view.cpp (translate_view): In static mode, produce explicit Slice ops for views so that NPUW's FOLD can handle per-layer embeddings without introducing dynamic shapes
utils.cpp (process_view_input_new): In static mode, add Split-based routing for single-dimension views (FOLD-friendly), plus a shape-match early-out for already-resolved views
node_context.h: Add view resolution check in get_input() — return Slice result when translate_view produced one
ggml-decoder.h: Expose get_static_n_tokens() for static shape substitution
ggml-decoder.cpp: Use get_static_n_tokens() in get_view_input_ov_shape / get_view_input_src_ov_shape for static mode
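
To illustrate the static-mode idea above, here is a minimal sketch of deriving fixed slice bounds for a view that differs from its source along a single dimension. It is not the code from this PR: `ViewDesc`, `SliceSpec`, and `single_axis_slice` are hypothetical names, the layout (4 dims, innermost first, byte strides) is an assumption modeled on ggml-style tensors, and the step that would turn the result into an OpenVINO Slice op is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for a ggml-style view: 4 dims,
// innermost first, with byte strides. Not the backend's real types.
struct ViewDesc {
    int64_t     src_ne[4];   // source extents, in elements
    int64_t     view_ne[4];  // view extents, in elements
    std::size_t src_nb[4];   // source strides, in bytes
    std::size_t offset;      // byte offset of the view into the source
};

// Static slice bounds along one axis: [begin, end).
struct SliceSpec {
    int     axis;
    int64_t begin;
    int64_t end;
};

// Returns true and fills `out` when the view differs from its source along
// exactly one dimension and the offset is a whole number of steps along it.
// Returns false when the shapes already match (nothing to slice) or when the
// view cannot be expressed as a single-axis slice.
bool single_axis_slice(const ViewDesc& v, SliceSpec& out) {
    int axis = -1;
    for (int d = 0; d < 4; ++d) {
        if (v.view_ne[d] != v.src_ne[d]) {
            if (axis != -1) return false;  // more than one axis differs
            axis = d;
        }
    }
    if (axis == -1) return false;          // shape match: early-out

    const std::size_t stride = v.src_nb[axis];
    assert(stride > 0);
    if (v.offset % stride != 0) return false;

    const int64_t begin = static_cast<int64_t>(v.offset / stride);
    const int64_t end   = begin + v.view_ne[axis];
    if (end > v.src_ne[axis]) return false;  // offset reaches into outer dims

    out = SliceSpec{axis, begin, end};
    return true;
}
```

Because every bound is computed from known extents and strides, the resulting slice has a fully static shape, which is the property the static path needs.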

NPU accuracy fix (glu_geglu.cpp):

Clamp GEGLU input to [-10, 10] in static mode to avoid fp16 overflow on NPU
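
A rough scalar sketch of why the clamp helps, under stated assumptions: the description does not say which GEGLU operand is clamped, so this version clamps both the gate and the up projection, and it uses the tanh approximation of GELU. The helper names are hypothetical and this is not the translation code in glu_geglu.cpp. With both operands bounded by 10, the product is bounded by roughly 100, far below the largest finite fp16 value (65504).

```cpp
#include <algorithm>
#include <cmath>

constexpr float kClampLimit = 10.0f;     // bound quoted in the PR description
constexpr float kFp16Max    = 65504.0f;  // largest finite fp16 value

// tanh approximation of GELU (assumed variant for this sketch).
inline float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Symmetric clamp to [-limit, limit].
inline float clamp_sym(float x, float limit) {
    return std::max(-limit, std::min(x, limit));
}

// GEGLU on scalars: gelu(gate) * up.
// Assumption: in static mode both operands are clamped before use.
inline float geglu(float gate, float up, bool static_mode) {
    if (static_mode) {
        gate = clamp_sym(gate, kClampLimit);
        up   = clamp_sym(up, kClampLimit);
    }
    return gelu_tanh(gate) * up;
}
```

For example, `geglu(1000.0f, 1000.0f, false)` exceeds `kFp16Max`, while the static-mode call with the same arguments stays within about 100 in magnitude. Values already inside the clamp range are unaffected.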

Fix: Gemma4 FLASH_ATTN_EXT regression (ggml-openvino.cpp):

is_gemma3n_flash_attn_pattern() was falsely matching Gemma4 (and any model with scale=1.0 + KV cache), which forced FLASH_ATTN_EXT to fall back to CPU and broke inference
Fix: remove the overly broad is_kv_cache condition and keep only the gemma3n-specific direct attention pattern (q=ROPE, k=ROPE, v=RMS_NORM)
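
The difference between the broad and the narrowed match can be sketched as two predicates. This is an illustration only: the exact boolean structure of the real is_gemma3n_flash_attn_pattern() is not shown in this description, so the "before" form, the retained scale check, and the `Op` and `FlashAttnInputs` types are all assumptions made for the example.

```cpp
// Hypothetical op tags for the nodes that produce q, k and v.
enum class Op { ROPE, RMS_NORM, OTHER };

// Hypothetical summary of a FLASH_ATTN_EXT node's inputs.
struct FlashAttnInputs {
    Op    q;
    Op    k;
    Op    v;
    float scale;
    bool  is_kv_cache;  // true when k/v are read from the KV cache
};

// The gemma3n-specific direct attention pattern named in the description.
inline bool is_direct_pattern(const FlashAttnInputs& n) {
    return n.q == Op::ROPE && n.k == Op::ROPE && n.v == Op::RMS_NORM;
}

// Assumed shape of the old check: any scale=1.0 node with a KV cache matches.
inline bool matches_before_fix(const FlashAttnInputs& n) {
    return n.scale == 1.0f && (is_direct_pattern(n) || n.is_kv_cache);
}

// Assumed shape of the new check: only the direct pattern matches.
inline bool matches_after_fix(const FlashAttnInputs& n) {
    return n.scale == 1.0f && is_direct_pattern(n);
}
```

Under these assumptions, a node with scale=1.0 whose k and v come from the KV cache matches the old predicate but not the new one, so it is no longer pushed onto the fallback path.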

Fix: test-llama-archs (utils.cpp):

Fix the early-return shape check in process_view_input_new so that it no longer skips view processing when dynamic dimensions mask a shape difference (fixes qwen3next crash)
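
One way a dynamic dimension can mask a shape difference is a comparison that treats "dynamic" as a wildcard. The sketch below contrasts that permissive check with a strict one that only reports equality when every dimension is static. It is a guess at the class of bug, not the actual change in utils.cpp; `kDynamic`, `Shape`, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kDynamic = -1;  // hypothetical marker for a dynamic dimension
using Shape = std::vector<int64_t>;

// Permissive comparison: a dynamic dimension matches anything, so a real
// difference in that dimension goes unnoticed.
bool shapes_compatible(const Shape& a, const Shape& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == kDynamic || b[i] == kDynamic) continue;
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Strict comparison: equal only when every dimension is static and identical.
// An early return guarded by this check never skips a view whose true extent
// is hidden behind a dynamic dimension.
bool shapes_provably_equal(const Shape& a, const Shape& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == kDynamic || b[i] == kDynamic) return false;
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

With `{kDynamic, 64}` against `{32, 64}`, the permissive check reports a match while the strict one does not, so view processing still runs for that input.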

Known Limitation

NPU compilation for Gemma4 triggers a dynamic shape issue in NPUW FOLD: the per-layer embedding tensor (inp_per_layer) gets parameterized during folding, which the NPU compiler cannot handle. The current workaround relies on specific NPUW config settings.

cavusmustafa · 2026-05-28T23:04:02Z

@zhaixuejun1993 this PR modifies the gemma3 flash attention fallback, since it was also triggering for gemma4. Can you verify the changes?

cavusmustafa added 5 commits May 28, 2026 10:03

Initiall gemma4 npu support

a036626

temp. fix for gemma4 accuracy bug on npu

3f26dd8

Remove hardcoded names for npu-fold handling

ab2dc43

revert static n tokens for cont translation as it is not needed

c505918

removed unused variable

8b77fab

cavusmustafa force-pushed the gemma4_initial_npu_support branch from 59b8969 to 8b77fab May 28, 2026 17:09

cavusmustafa added 2 commits May 28, 2026 12:17

test-llama-archs fix

8ed4aca

Fix gemma4 flash_attn fallback

1c7fe37

cavusmustafa marked this pull request as ready for review May 28, 2026 22:55

cavusmustafa requested a review from wine99 as a code owner May 28, 2026 22:55

cavusmustafa requested a review from Copilot May 28, 2026 23:01

Copilot started reviewing on behalf of cavusmustafa May 28, 2026 23:01

cavusmustafa requested review from zhaixuejun1993 and removed request for Copilot May 28, 2026 23:01

ravi9 closed this May 29, 2026

ravi9 reopened this May 29, 2026

cavusmustafa merged commit af9fb52 into ravi9:dev_backend_openvino May 29, 2026
6 of 14 checks passed

cavusmustafa merged 7 commits into ravi9:dev_backend_openvino from cavusmustafa:gemma4_initial_npu_support
