vocab: add normalizer.lowercase support to WPM#23899
Conversation
|
|
||
| // TODO: reduce string copies by using cpts_offs array | ||
| static std::vector<std::string> preprocess(const std::string & text) { | ||
| static std::vector<std::string> preprocess(const std::string & text, bool lowercase) { |
There was a problem hiding this comment.
Only lowercase is needed here for now, so I kept this as a single bool.
Should I make this an options struct up front to allow for future BertNormalizer flags like strip_accents, or keep it minimal?
There was a problem hiding this comment.
I think options make sense, add a TODO for the other ones.
| } else if (tokenizer_model == "gpt2" || tokenizer_model == "hybriddna") { | ||
|
|
||
| // BERT lowercases by default (used when the metadata flag is absent, e.g. legacy GGUFs) | ||
| normalizer_lowercase = true; |
There was a problem hiding this comment.
Since I flipped default to true we should probably just set it to false for whitespace pre-tokenizer.
There was a problem hiding this comment.
Hi @CISC, I've resolved the conflict.
As for the whitespace branch, it's new with no legacy GGUFs and the converter always writes lowercase, so the default is never consulted there and false would be a no-op.
I'd slightly lean towards leaving it out, but I may not have thought this through fully. Happy to add the line if you'd prefer.
There was a problem hiding this comment.
The converter only writes it if something sets it, in the case of Whitespace it will only do that if normalizer type is Lowercase.
…wercase * upstream/master: (27 commits) vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756) ui: fix ETag truncation with MSVC compiler (ggml-org#23917) docs : update ZenDNN docs for Q8 support (ggml-org#23791) llama: only use one iGPU device by default (ggml-org#23897) webui: add custom CSS injection via config (ggml-org#23904) Support `-fa auto` in llama-bench (ggml-org#23714) opencl: support bf16 by converting to f16 (ggml-org#23839) ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910) TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843) metal : restore im2col implementation for large kernels (ggml-org#23901) test: (test-llama-archs) log the config name first (ggml-org#23885) ci : update ios-xcode release job to macos-26 (ggml-org#23906) ggml : add some lsx support (ggml-org#23798) vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420) ci : fix s390x release job (ggml-org#23898) ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895) llama : do not skip iGPU when only RPC devices are present (ggml-org#23868) server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884) ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879) ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760) ... # Conflicts: # gguf-py/gguf/vocab.py # src/llama-vocab.cpp
Follow-up to #18756: the WPM tokenizer now honors
tokenizer.ggml.normalizer.lowercaseinstead of always lowercasing.While implementing this I noticed
strip_accents(another BertNormalizer option) isn't honored by WPM either, so accented words can still differ.Tests (
German_Semantic_V3):Hallo Welt[102, 4485, 866, 103][102, 4485, 866, 103]Berlin[102, 1270, 103][102, 1270, 103]