[Enhancement] Support Japanese (Kuromoji) morphological analyzer for inverted index #64646

@nishant94

Search before asking

  • I have searched the issues and found no similar issues.

Description

The Doris inverted index currently has no Japanese-aware tokenizer. Japanese text has no whitespace word boundaries, so the existing english/unicode/standard parsers either index whole runs of text as single terms or split on individual characters. Both give poor MATCH / MATCH_PHRASE recall and precision for Japanese content.
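
To make the two failure modes concrete, here is a minimal C++ sketch. The helper names `split_on_spaces` and `split_per_code_point` are hypothetical and are not Doris APIs; they only imitate a whitespace-driven tokenizer and a character-level fallback, and the second assumes valid UTF-8 input.

```cpp
#include <string>
#include <vector>

// Imitates a whitespace-driven tokenizer: splits on ASCII spaces only.
std::vector<std::string> split_on_spaces(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (c == ' ') {
            if (!current.empty()) tokens.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Imitates a character-level fallback: one token per UTF-8 code point.
// Assumes the input is valid UTF-8.
std::vector<std::string> split_per_code_point(const std::string& text) {
    std::vector<std::string> tokens;
    for (unsigned char c : text) {
        // Bytes of the form 10xxxxxx continue the previous code point.
        const bool continuation = (c & 0xC0) == 0x80;
        if (continuation && !tokens.empty()) {
            tokens.back().push_back(static_cast<char>(c));
        } else {
            tokens.push_back(std::string(1, static_cast<char>(c)));
        }
    }
    return tokens;
}
```

For the input 東京都に住む, `split_on_spaces` returns the whole string as one token, while `split_per_code_point` returns six single-character tokens. Neither corresponds to the words a user would search for.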

OpenSearch and Elasticsearch solve this with the Lucene kuromoji morphological analyzer. Doris already ships a comparable analyzer for Chinese, the IK analyzer (be/src/storage/index/inverted/analyzer/ik/), but there is no equivalent for Japanese.

This enhancement proposes a built-in kuromoji parser, selectable per inverted-indexed column via DDL, that segments Japanese text into morphemes at index and query time:

  INDEX content_idx (`content`) USING INVERTED
  PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")

Once indexed, MATCH, MATCH_PHRASE, and TOKENIZE() operate over the segmented Japanese terms.
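
For readers unfamiliar with morphological segmentation, the sketch below shows the core idea in C++: pick the lowest-cost way to cover the text with dictionary entries. It is a deliberately simplified illustration, not the proposed implementation. It uses per-word costs only and omits connection costs between adjacent morphemes, unknown-word handling, and search-mode decompounding. The `Lexicon` type, the `segment` function, and all cost values are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Toy lexicon: surface form -> word cost (lower is preferred).
// Hypothetical; a real dictionary such as IPADIC carries far more data.
using Lexicon = std::unordered_map<std::string, int>;

// Returns the lowest-total-cost segmentation of `text` into lexicon entries.
// Returns an empty vector if the text is empty or cannot be fully segmented.
std::vector<std::string> segment(const std::string& text, const Lexicon& lexicon) {
    const int kInf = std::numeric_limits<int>::max();
    const std::size_t n = text.size();
    std::vector<int> best(n + 1, kInf);       // best[i]: min cost to cover text[0, i)
    std::vector<std::size_t> prev(n + 1, 0);  // prev[i]: start of the last word ending at i
    best[0] = 0;

    for (std::size_t start = 0; start < n; ++start) {
        if (best[start] == kInf) continue;  // no path reaches this offset
        for (std::size_t end = start + 1; end <= n; ++end) {
            auto it = lexicon.find(text.substr(start, end - start));
            if (it == lexicon.end()) continue;
            const int cost = best[start] + it->second;
            if (cost < best[end]) {
                best[end] = cost;
                prev[end] = start;
            }
        }
    }

    std::vector<std::string> words;
    if (best[n] == kInf) return words;
    // Walk back from the end of the text to recover the chosen words.
    for (std::size_t end = n; end > 0; end = prev[end]) {
        words.insert(words.begin(), text.substr(prev[end], end - prev[end]));
    }
    return words;
}
```

With a toy lexicon of 東京 (cost 3), 都 (cost 2), 東 (cost 5), and 京都 (cost 3), the input 東京都 has two candidate paths: 東京 + 都 at total cost 5 and 東 + 京都 at total cost 8. The function returns the cheaper one.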

Motivation

  • Enables accurate full-text search over Japanese columns, on par with OpenSearch/Lucene kuromoji.
  • Fills the obvious gap next to the existing Chinese (IK) analyzer.
  • Implemented natively in C++ with no JVM on the indexing hot path, and compatible with Apache licensing: the engine is Apache-2.0, and the IPADIC dictionary is under the NAIST-2003 license, the same permissive lexicon Apache Lucene already bundles.

Solution

Add a native C++ port of the Lucene kuromoji analyzer, following the proven IK pattern (native C++ analyzer + tokenizer, dictionary as runtime data files). The work includes:

  • An offline converter that compiles raw IPADIC into a C++-native runtime format, rather than reimplementing Lucene's FST byte format.
  • parser_mode support: search (default, with SEARCH-mode decompounding), normal, and extended.
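
As a rough sketch of how the `parser_mode` property could map to an analyzer mode, consider the C++ below. The enum `KuromojiMode` and the function `parse_kuromoji_mode` are hypothetical names, not existing Doris code. Treating an empty value as the search default and matching case-sensitively are assumptions.

```cpp
#include <string>

// Hypothetical enum for the three modes listed above.
enum class KuromojiMode { kSearch, kNormal, kExtended };

// Maps a "parser_mode" property value to a mode.
// An empty value selects the default (search); this is an assumption.
// Returns false for unrecognized values and leaves *mode unchanged.
bool parse_kuromoji_mode(const std::string& value, KuromojiMode* mode) {
    if (value.empty() || value == "search") {
        *mode = KuromojiMode::kSearch;
        return true;
    }
    if (value == "normal") {
        *mode = KuromojiMode::kNormal;
        return true;
    }
    if (value == "extended") {
        *mode = KuromojiMode::kExtended;
        return true;
    }
    return false;
}
```

Rejecting unknown values at DDL time, rather than silently falling back to search, would surface typos in `PROPERTIES` early.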

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
