
Gemma4 initial npu support #195

Merged
cavusmustafa merged 7 commits into ravi9:dev_backend_openvino from cavusmustafa:gemma4_initial_npu_support
May 29, 2026


@cavusmustafa (Collaborator) commented May 28, 2026

Gemma4 Initial NPU Support

Adds initial Intel NPU support for Gemma4 E2B via the OpenVINO backend, and fixes a regression that broke Gemma4 on CPU/GPU.

Changes

NPU static path support (view.cpp, utils.cpp, node_context.h, ggml-decoder.h, ggml-decoder.cpp):

  • translate_view: In static mode, produce explicit Slice ops for views so NPUW's FOLD can handle per-layer embeddings without introducing dynamic shapes
  • process_view_input_new in utils.cpp: Add Split-based routing for single-dimension views in static mode (FOLD-friendly), and shape-match early-out for
    already-resolved views
  • node_context.h: Add view resolution check in get_input() — return Slice result when translate_view produced one
  • ggml-decoder.h: Expose get_static_n_tokens() for static shape substitution
  • ggml-decoder.cpp: Use get_static_n_tokens() in get_view_input_ov_shape / get_view_input_src_ov_shape for static mode
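
For illustration only, here is a minimal sketch of how a view that selects one layer's rows from a stacked per-layer tensor can be rewritten as a fixed slice range, which keeps every shape static. It is not the code from this PR: `SliceRange` and `view_to_slice` are hypothetical names, and the sketch assumes a contiguous row-major source with dimensions listed outermost first and the view offset given in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: a view expressed as a slice [begin, end) along one axis.
struct SliceRange {
    std::size_t axis;
    int64_t begin;
    int64_t end;
};

// Maps a view onto a single-axis slice of a contiguous row-major source.
// Returns std::nullopt when the view differs from the source on zero or
// several axes, or when the offset is not aligned to the stride of that axis.
std::optional<SliceRange> view_to_slice(const std::vector<int64_t>& src_shape,
                                        const std::vector<int64_t>& view_shape,
                                        int64_t offset_elems) {
    if (src_shape.size() != view_shape.size()) return std::nullopt;

    // Find the one axis on which the view is narrower than the source.
    std::optional<std::size_t> axis;
    for (std::size_t i = 0; i < src_shape.size(); ++i) {
        if (src_shape[i] != view_shape[i]) {
            if (axis.has_value()) return std::nullopt;  // more than one axis differs
            axis = i;
        }
    }
    if (!axis.has_value()) return std::nullopt;  // nothing to slice

    // Stride of the sliced axis = product of all inner dimensions.
    int64_t stride = 1;
    for (std::size_t i = *axis + 1; i < src_shape.size(); ++i) stride *= src_shape[i];
    if (stride <= 0 || offset_elems < 0 || offset_elems % stride != 0) return std::nullopt;

    const int64_t begin = offset_elems / stride;
    const int64_t end = begin + view_shape[*axis];
    if (end > src_shape[*axis]) return std::nullopt;  // view runs past the source
    return SliceRange{*axis, begin, end};
}
```

With a source of shape {8, 1, 256} and a view of shape {1, 1, 256} at element offset 3 * 256, the sketch yields axis 0 with range [3, 4), all of which are compile-time constants.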

NPU accuracy fix (glu_geglu.cpp):

  • Clamp GEGLU input to [-10, 10] in static mode to avoid fp16 overflow on NPU
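
A rough numerical sketch of why such a clamp can help, written in plain C++ rather than as OpenVINO graph ops. Several things here are assumptions and not taken from the PR: that the clamp applies to the gate input, that GELU uses the tanh approximation, and that the cubic term is where fp16 overflows. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr float kFp16Max = 65504.0f;  // largest finite IEEE half-precision value
constexpr float kClampLo = -10.0f;    // clamp range quoted in the PR description
constexpr float kClampHi = 10.0f;

// tanh approximation of GELU; note the x^3 intermediate.
inline float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// GEGLU = gelu(gate) * up. In static mode the gate is clamped first
// (assumption: the PR may clamp a different input).
inline float geglu(float gate, float up, bool static_mode) {
    if (static_mode) gate = std::clamp(gate, kClampLo, kClampHi);
    return gelu_tanh(gate) * up;
}

// True when the x^3 intermediate would not fit in fp16.
inline bool cube_overflows_fp16(float x) {
    return std::fabs(x * x * x) > kFp16Max;
}
```

Under these assumptions the clamp costs almost no accuracy: GELU is already close to the identity at 10 and close to zero at -10, while the cubic term stays at or below 1000, far under the fp16 limit.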

Fix: Gemma4 FLASH_ATTN_EXT regression (ggml-openvino.cpp):

  • is_gemma3n_flash_attn_pattern() was falsely matching Gemma4 (and any model with scale=1.0 + KV cache), causing FLASH_ATTN_EXT to fall back to CPU and
    breaking inference
  • Fix: remove the overly broad is_kv_cache condition, keep only the gemma3n-specific direct attention pattern (q=ROPE, k=ROPE, v=RMS_NORM)
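
The shape of the change can be sketched as two predicates. This is a simplified model, not the actual is_gemma3n_flash_attn_pattern(): the `Op` enum, the `FlashAttnInputs` struct, and the assumption that the scale check stays in place after the fix are all hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ops that produce the q/k/v inputs.
enum class Op { ROPE, RMS_NORM, OTHER };

// Hypothetical summary of one FLASH_ATTN_EXT node.
struct FlashAttnInputs {
    Op q_src;
    Op k_src;
    Op v_src;
    float scale;
    bool is_kv_cache;
};

inline bool is_direct_pattern(const FlashAttnInputs& in) {
    return in.q_src == Op::ROPE && in.k_src == Op::ROPE && in.v_src == Op::RMS_NORM;
}

// Before: any scale == 1.0 node reading from the KV cache also matched.
inline bool matches_before_fix(const FlashAttnInputs& in) {
    return in.scale == 1.0f && (is_direct_pattern(in) || in.is_kv_cache);
}

// After: only the gemma3n-specific direct attention pattern matches.
inline bool matches_after_fix(const FlashAttnInputs& in) {
    return in.scale == 1.0f && is_direct_pattern(in);
}
```

In this model a Gemma4-like node (scale 1.0, KV cache, no ROPE/ROPE/RMS_NORM producers) matches the old predicate and triggers the CPU fallback, but no longer matches the new one.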

Fix: test-llama-archs (utils.cpp):

  • Fix the early-return shape check in process_view_input_new so that it no longer skips view processing when dynamic dimensions mask a shape difference (fixes the qwen3next crash)
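
To illustrate how a dynamic dimension can mask a real shape difference, here is a small sketch contrasting a permissive comparison with a strict one. The names and the -1 sentinel for a dynamic dimension are assumptions made for this example; the real code works on OpenVINO shape types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kDynamic = -1;  // assumed sentinel for an unknown dimension

// Permissive check: a dynamic dimension is treated as matching anything.
// Using this for an early return can hide a real difference.
inline bool shapes_compatible(const std::vector<int64_t>& a, const std::vector<int64_t>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == kDynamic || b[i] == kDynamic) continue;
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Strict check: equal only when every dimension is static and identical.
// An early return guarded by this never skips a view that still needs work.
inline bool shapes_provably_equal(const std::vector<int64_t>& a, const std::vector<int64_t>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == kDynamic || b[i] == kDynamic) return false;
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

For a source of shape {-1, 32, 128} and a view of shape {1, 32, 128}, the permissive check reports a match while the strict one does not, so only the strict check lets view processing continue.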

Known Limitation

NPU compilation for Gemma4 still hits a dynamic-shape issue in NPUW FOLD: the per-layer embedding tensor (inp_per_layer) gets parameterized during folding,
which the NPU compiler cannot handle. The current workaround relies on specific NPUW config settings.

@cavusmustafa force-pushed the gemma4_initial_npu_support branch from 59b8969 to 8b77fab May 28, 2026 17:09
@cavusmustafa marked this pull request as ready for review May 28, 2026 22:55
@cavusmustafa requested a review from wine99 as a code owner May 28, 2026 22:55
@cavusmustafa requested a review from Copilot May 28, 2026 23:01
@cavusmustafa requested review from zhaixuejun1993 and removed request for Copilot May 28, 2026 23:01
@cavusmustafa (Collaborator, Author) commented

@zhaixuejun1993 this PR modifies the gemma3n flash attention fallback, since it was also triggering for Gemma4. Can you verify the changes?

@ravi9 closed this May 29, 2026
@ravi9 reopened this May 29, 2026
@cavusmustafa merged commit af9fb52 into ravi9:dev_backend_openvino May 29, 2026
6 of 14 checks passed
