Search before asking
Description
The Doris inverted index currently has no Japanese-aware tokenizer. Japanese text has no whitespace word boundaries, so the existing english/unicode/standard parsers either index whole runs of text or split on individual characters. Both give poor MATCH / MATCH_PHRASE recall and precision for Japanese content.
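To make the failure mode concrete, here is a minimal C++ sketch of the two naive strategies described above. It is not Doris parser code: `split_on_spaces`, `split_per_char`, and `demo_naive_splitting` are hypothetical names, and the input is handled as UTF-32 purely to keep the example short.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Tokens = std::vector<std::u32string>;

// Stand-in for a whitespace-driven tokenizer: split on ASCII spaces only.
Tokens split_on_spaces(const std::u32string& text) {
    Tokens out;
    std::u32string current;
    for (char32_t c : text) {
        if (c == U' ') {
            if (!current.empty()) out.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) out.push_back(current);
    return out;
}

// Stand-in for character-level splitting: one token per code point.
Tokens split_per_char(const std::u32string& text) {
    Tokens out;
    for (char32_t c : text) {
        if (c != U' ') out.push_back(std::u32string(1, c));
    }
    return out;
}

// Hypothetical usage: the six-character phrase 東京都に住む ("live in Tokyo"),
// written with escapes so the source file stays ASCII.
void demo_naive_splitting() {
    const std::u32string text = U"\u6771\u4EAC\u90FD\u306B\u4F4F\u3080";
    assert(split_on_spaces(text).size() == 1);  // the whole run becomes one term
    assert(split_per_char(text).size() == 6);   // every character becomes a term
}
```

Neither result contains a term for 東京 (Tokyo) on its own, which is the gap a morphological analyzer is meant to close.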
OpenSearch and Elasticsearch solve this with the Lucene kuromoji morphological analyzer. Doris already ships a comparable analyzer for Chinese, the IK analyzer (be/src/storage/index/inverted/analyzer/ik/), but there is no equivalent for Japanese.
This enhancement proposes a built-in kuromoji parser, selectable per inverted-indexed column via DDL, that segments Japanese text into morphemes at index and query time:
INDEX content_idx (`content`) USING INVERTED
PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
Once indexed, MATCH, MATCH_PHRASE, and TOKENIZE() operate over the segmented Japanese terms.
Motivation
- Enables accurate full-text search over Japanese columns, on par with OpenSearch/Lucene kuromoji.
- Fills the obvious gap next to the existing Chinese (IK) analyzer.
- Implemented natively in C++ with no JVM on the indexing hot path, and Apache-license-clean (engine is Apache-2.0; the IPADIC dictionary is NAIST-2003, the same permissive lexicon Apache Lucene already bundles).
Solution
Add a native C++ port of the Lucene kuromoji analyzer, following the proven IK pattern (native C++ analyzer + tokenizer, dictionary as runtime data files).
- An offline converter that compiles raw IPADIC into a C++-native runtime format, rather than reimplementing Lucene's FST byte format.
- parser_mode support: search (default, with SEARCH-mode decompounding), normal, and extended.
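As a rough illustration of how search and normal modes could differ, the sketch below runs a minimum-cost segmentation over a toy lexicon and, in search mode, adds a length penalty to long all-kanji words so that compounds split into their parts. Everything here is an assumption made for illustration: the names (`Lexicon`, `segment`, `search_penalty`), the word costs, and the penalty constants, which are modeled on the values I believe Lucene's JapaneseTokenizer uses. Connection costs between adjacent words, real unknown-word handling, extended mode, and whether the original compound is also emitted are all left out.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy lexicon: surface form -> word cost (lower is better).
using Lexicon = std::map<std::u32string, int>;

// Assumption: treat the CJK Unified Ideographs block as "kanji".
static bool is_all_kanji(const std::u32string& s) {
    for (char32_t c : s) {
        if (c < 0x4E00 || c > 0x9FFF) return false;
    }
    return true;
}

// Extra cost for long words, applied in search mode only (assumed constants).
static int search_penalty(const std::u32string& word) {
    const int kKanjiLength = 2, kKanjiPenalty = 3000;
    const int kOtherLength = 7, kOtherPenalty = 1700;
    const int len = static_cast<int>(word.size());
    if (len > kKanjiLength && is_all_kanji(word)) return (len - kKanjiLength) * kKanjiPenalty;
    if (len > kOtherLength) return (len - kOtherLength) * kOtherPenalty;
    return 0;
}

// Minimum-cost segmentation by dynamic programming over end positions.
std::vector<std::u32string> segment(const std::u32string& text, const Lexicon& lex,
                                    bool search_mode) {
    assert(!text.empty());
    const int kInf = std::numeric_limits<int>::max();
    const int kUnknownCost = 10000;  // fallback: any single character as a token
    const std::size_t n = text.size();
    std::vector<int> best(n + 1, kInf);
    std::vector<std::size_t> prev(n + 1, 0);
    best[0] = 0;
    for (std::size_t start = 0; start < n; ++start) {
        if (best[start] == kInf) continue;
        for (const auto& entry : lex) {
            const std::u32string& word = entry.first;
            if (word.empty() || text.compare(start, word.size(), word) != 0) continue;
            const int cost = best[start] + entry.second + (search_mode ? search_penalty(word) : 0);
            const std::size_t end = start + word.size();
            if (cost < best[end]) { best[end] = cost; prev[end] = start; }
        }
        const int unknown = best[start] + kUnknownCost;
        if (unknown < best[start + 1]) { best[start + 1] = unknown; prev[start + 1] = start; }
    }
    // Walk the back-pointers from the end, then restore left-to-right order.
    std::vector<std::u32string> reversed;
    for (std::size_t pos = n; pos > 0; pos = prev[pos]) {
        reversed.push_back(text.substr(prev[pos], pos - prev[pos]));
    }
    return std::vector<std::u32string>(reversed.rbegin(), reversed.rend());
}

// Hypothetical usage with 関西国際空港 (Kansai International Airport).
void demo_search_mode() {
    const std::u32string kansai = U"\u95A2\u897F";   // 関西
    const std::u32string kokusai = U"\u56FD\u969B";  // 国際
    const std::u32string kuukou = U"\u7A7A\u6E2F";   // 空港
    const std::u32string compound = kansai + kokusai + kuukou;
    const Lexicon lex = {{compound, 7000}, {kansai, 3000}, {kokusai, 3000}, {kuukou, 3000}};
    assert(segment(compound, lex, false).size() == 1);  // normal: compound kept whole
    assert(segment(compound, lex, true).size() == 3);   // search: split into three parts
}
```

With these made-up costs, the compound (7000) beats its three parts (9000) in normal mode, while the search-mode penalty of (6 - 2) * 3000 raises the compound to 19000 and the split wins. A query for 空港 would then match documents containing the full airport name.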
Are you willing to submit PR?
Code of Conduct