vocab: add normalizer.lowercase support to WPM by o7si · Pull Request #23899 · ggml-org/llama.cpp

o7si · 2026-05-30T07:40:16Z

Follow-up to #18756: the WPM tokenizer now honors tokenizer.ggml.normalizer.lowercase instead of always lowercasing.

While implementing this I noticed strip_accents (another BertNormalizer option) isn't honored by WPM either, so accented words can still differ.

Tests (German_Semantic_V3):

| Input | transformers | llama.cpp |
| --- | --- | --- |
| `Hallo Welt` | `[102, 4485, 866, 103]` | `[102, 4485, 866, 103]` |
| `Berlin` | `[102, 1270, 103]` | `[102, 1270, 103]` |
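
To illustrate the behavior change described above, here is a minimal sketch, not the llama.cpp implementation. It is an ASCII-only, whitespace-splitting stand-in for the WPM pre-processing step that takes the new flag as a parameter. The name `preprocess_sketch` is hypothetical, and the real `preprocess` works on Unicode code points and does more than this.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, ASCII-only stand-in for the WPM pre-processing step.
// Splits on whitespace and lowercases only when `lowercase` is true.
static std::vector<std::string> preprocess_sketch(const std::string & text, bool lowercase) {
    std::vector<std::string> words;
    std::string cur;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // whitespace ends the current word, if any
            if (!cur.empty()) {
                words.push_back(cur);
                cur.clear();
            }
            continue;
        }
        cur += static_cast<char>(lowercase ? std::tolower(c) : c);
    }
    if (!cur.empty()) {
        words.push_back(cur);
    }
    return words;
}
```

With `lowercase = false`, `Hallo Welt` is kept as `Hallo` and `Welt`, so a cased vocabulary can match its capitalized entries. With `lowercase = true`, the previous always-lowercase behavior is reproduced.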

o7si · 2026-05-30T08:05:25Z


    // TODO: reduce string copies by using cpts_offs array
-    static std::vector<std::string> preprocess(const std::string & text)  {
+    static std::vector<std::string> preprocess(const std::string & text, bool lowercase)  {


Only lowercase is needed here for now, so I kept this as a single bool.
Should I make this an options struct up front to allow for future BertNormalizer flags like strip_accents, or keep it minimal?

I think options make sense; add a TODO for the other ones.
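
For illustration only, the options-struct variant discussed here might look roughly like the sketch below. The names `wpm_normalizer_opts` and `normalize_word` are assumptions, not taken from the PR, and the lowercasing is ASCII-only to keep the example short.

```cpp
#include <cctype>
#include <string>

// Hypothetical options struct; the name and fields are assumptions for illustration.
struct wpm_normalizer_opts {
    bool lowercase = true; // mirrors tokenizer.ggml.normalizer.lowercase
    // TODO: strip_accents and other BertNormalizer flags
};

// ASCII-only normalization of a single word according to the options.
static std::string normalize_word(const std::string & word, const wpm_normalizer_opts & opts) {
    std::string out = word;
    if (opts.lowercase) {
        for (char & ch : out) {
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        }
    }
    return out;
}
```

Passing a struct keeps the function signature stable when further flags are added later, at the cost of a little indirection while only one flag exists.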

CISC

Rebase so we can proceed.

CISC · 2026-05-31T11:03:56Z

-        } else if (tokenizer_model == "gpt2" || tokenizer_model == "hybriddna") {
+
+            // BERT lowercases by default (used when the metadata flag is absent, e.g. legacy GGUFs)
+            normalizer_lowercase = true;


Since I flipped the default to true, we should probably just set it to false for the whitespace pre-tokenizer.

Hi @CISC, I've resolved the conflict.

As for the whitespace branch, it's new with no legacy GGUFs and the converter always writes lowercase, so the default is never consulted there and false would be a no-op.

I'd slightly lean towards leaving it out, but I may not have thought this through fully. Happy to add the line if you'd prefer.

The converter only writes it if something sets it; in the case of Whitespace, it only does so if the normalizer type is Lowercase.
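
The default handling debated in this thread can be sketched as follows, assuming C++17 for `std::optional`. This is one reading of the suggestion and not the merged code: `pre_tokenizer_kind` and `resolve_lowercase` are hypothetical names, and the optional stands in for whether the metadata key was present in the GGUF.

```cpp
#include <optional>
#include <string>

// Hypothetical enum; only the two cases discussed in the thread.
enum class pre_tokenizer_kind { bert, whitespace };

// Resolve the effective lowercase setting.
// `metadata_flag` is empty when the key is absent from the model file.
static bool resolve_lowercase(pre_tokenizer_kind kind, std::optional<bool> metadata_flag) {
    if (metadata_flag.has_value()) {
        return *metadata_flag; // an explicit value always wins
    }
    // Key absent: BERT keeps its historical lowercase default (legacy GGUFs),
    // while whitespace falls back to false, since the converter would only
    // have written the key if a Lowercase normalizer was configured.
    return kind == pre_tokenizer_kind::bert;
}
```

Under this reading, a whitespace-tokenizer model converted without a Lowercase normalizer has no key and therefore resolves to `false`, which is the case the reviewer's last comment points at.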

…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#	gguf-py/gguf/vocab.py
#	src/llama-vocab.cpp

o7si added 2 commits May 30, 2026 03:50

vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

0c1c9d3

vocab : add normalizer.lowercase support to WPM

f7a7610

github-actions Bot added the python python script changes label May 30, 2026

o7si commented May 30, 2026


o7si mentioned this pull request May 30, 2026

vocab: add tokenizer support for jina-embeddings-v2-base-zh #18756

Merged

o7si marked this pull request as ready for review May 30, 2026 08:10

o7si requested a review from CISC as a code owner May 30, 2026 08:10

CISC reviewed May 31, 2026


o7si wants to merge 3 commits into ggml-org:master from o7si:wpm-normalizer-lowercase
