Skip to content

vocab: add normalizer.lowercase support to WPM#23899

Open
o7si wants to merge 3 commits into
ggml-org:masterfrom
o7si:wpm-normalizer-lowercase
Open

vocab: add normalizer.lowercase support to WPM#23899
o7si wants to merge 3 commits into
ggml-org:masterfrom
o7si:wpm-normalizer-lowercase

Conversation

@o7si
Copy link
Copy Markdown
Contributor

@o7si o7si commented May 30, 2026

Follow-up to #18756: the WPM tokenizer now honors tokenizer.ggml.normalizer.lowercase instead of always lowercasing.

While implementing this I noticed strip_accents (another BertNormalizer option) isn't honored by WPM either, so accented words can still differ.

Tests (German_Semantic_V3):

Input transformers llama.cpp
Hallo Welt [102, 4485, 866, 103] [102, 4485, 866, 103]
Berlin [102, 1270, 103] [102, 1270, 103]

@github-actions github-actions Bot added the python python script changes label May 30, 2026
Comment thread src/llama-vocab.cpp

// TODO: reduce string copies by using cpts_offs array
static std::vector<std::string> preprocess(const std::string & text) {
static std::vector<std::string> preprocess(const std::string & text, bool lowercase) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only lowercase is needed here for now, so I kept this as a single bool.
Should I make this an options struct up front to allow for future BertNormalizer flags like strip_accents, or keep it minimal?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think options make sense, add a TODO for the other ones.

@o7si o7si marked this pull request as ready for review May 30, 2026 08:10
@o7si o7si requested a review from CISC as a code owner May 30, 2026 08:10
Copy link
Copy Markdown
Member

@CISC CISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase so we can proceed.

Comment thread src/llama-vocab.cpp Outdated
} else if (tokenizer_model == "gpt2" || tokenizer_model == "hybriddna") {

// BERT lowercases by default (used when the metadata flag is absent, e.g. legacy GGUFs)
normalizer_lowercase = true;
Copy link
Copy Markdown
Member

@CISC CISC May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I flipped default to true we should probably just set it to false for whitespace pre-tokenizer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @CISC, I've resolved the conflict.

As for the whitespace branch, it's new with no legacy GGUFs and the converter always writes lowercase, so the default is never consulted there and false would be a no-op.

I'd slightly lean towards leaving it out, but I may not have thought this through fully. Happy to add the line if you'd prefer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The converter only writes it if something sets it, in the case of Whitespace it will only do that if normalizer type is Lowercase.

…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#	gguf-py/gguf/vocab.py
#	src/llama-vocab.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python python script changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants