[Enhancement] Support Japanese (Kuromoji) morphological analyzer for inverted index

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.


### Description

The Doris inverted index currently has no Japanese-aware tokenizer. Japanese text has no whitespace word boundaries, so the existing english/unicode/standard parsers either index whole runs of text as single terms or split on individual characters. Both give poor MATCH / MATCH_PHRASE recall and precision for Japanese content.
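To make the problem concrete, here is a minimal sketch of a whitespace-only splitter. It is a stand-in for illustration, not the code of any existing Doris parser, and `split_on_whitespace` is a hypothetical name. It works for English but returns a whole Japanese sentence as one term, so a query for a single word inside that sentence cannot match.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a whitespace-based tokenizer (not Doris code).
// Splits on ASCII whitespace only; UTF-8 bytes of Japanese characters are
// never treated as separators, so an unspaced sentence stays in one piece.
inline std::vector<std::string> split_on_whitespace(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling it on `"full text search"` yields three terms, while `"東京都に住んでいます"` comes back as a single term.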

OpenSearch and Elasticsearch solve this with the Lucene kuromoji morphological analyzer. Doris already ships a comparable analyzer for Chinese, the IK analyzer (`be/src/storage/index/inverted/analyzer/ik/`), but there is no equivalent for Japanese.

This enhancement proposes a built-in kuromoji parser, selectable per inverted-indexed column via DDL, that segments Japanese text into morphemes at index and query time:

```
  INDEX content_idx (`content`) USING INVERTED
  PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
```
 
Once indexed, MATCH, MATCH_PHRASE, and TOKENIZE() operate over the segmented Japanese terms.

### Motivation
  - Enables accurate full-text search over Japanese columns, on par with OpenSearch/Lucene kuromoji.
  - Fills the obvious gap next to the existing Chinese (IK) analyzer.
  - Implemented natively in C++ with no JVM on the indexing hot path, and Apache-license-clean (engine is Apache-2.0; the IPADIC dictionary is NAIST-2003, the same permissive lexicon Apache Lucene already bundles).

### Solution

Add a native C++ port of the Lucene kuromoji analyzer, following the proven IK pattern (native C++ analyzer + tokenizer, dictionary as runtime data files).

- An offline converter that compiles raw IPADIC into a C++-native runtime format, rather than reimplementing Lucene's FST byte format.
- `parser_mode` support: `search` (default, with search-mode decompounding of long compounds), `normal`, and `extended`.
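
As a rough illustration of what search-mode decompounding means, the sketch below runs a minimum-cost (Viterbi-style) segmentation over a toy lexicon. Everything in it is an assumption made for the example: the names `ToyLexicon`, `toy_lexicon`, and `segment`, the lexicon entries and their costs, and the penalty values. It also ignores connection costs and unknown-word handling, which a real kuromoji port needs. The only idea carried over is that search mode penalizes long tokens so a compound splits into its parts.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Toy lexicon: surface form -> word cost (lower is preferred).
// Entries and costs are invented for illustration, not taken from IPADIC.
using ToyLexicon = std::map<std::string, int>;

inline ToyLexicon toy_lexicon() {
    return {{"関西国際空港", 1000}, {"関西", 2000}, {"国際", 2000}, {"空港", 2000}};
}

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
inline std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Minimum-cost segmentation over byte offsets of `text`.
// Returns an empty vector if the lexicon cannot cover the whole text.
inline std::vector<std::string> segment(const std::string& text,
                                        const ToyLexicon& lexicon,
                                        bool search_mode) {
    const int kInf = std::numeric_limits<int>::max();
    const int kPenaltyPerExtraChar = 3000;        // assumed value for this sketch
    const std::size_t kMaxUnpenalizedChars = 2;   // assumed value for this sketch

    std::vector<int> best(text.size() + 1, kInf);       // best cost to reach offset
    std::vector<std::size_t> back(text.size() + 1, 0);  // start of the last word
    best[0] = 0;

    for (std::size_t start = 0; start < text.size(); ++start) {
        if (best[start] == kInf) continue;  // no path reaches this offset
        for (const auto& entry : lexicon) {
            const std::string& word = entry.first;
            if (word.empty() || text.compare(start, word.size(), word) != 0) continue;
            int cost = best[start] + entry.second;
            const std::size_t chars = utf8_length(word);
            if (search_mode && chars > kMaxUnpenalizedChars) {
                // Long words pay a per-character penalty in search mode.
                cost += static_cast<int>(chars - kMaxUnpenalizedChars) * kPenaltyPerExtraChar;
            }
            const std::size_t end = start + word.size();
            if (cost < best[end]) {
                best[end] = cost;
                back[end] = start;
            }
        }
    }

    std::vector<std::string> tokens;
    if (best[text.size()] == kInf) return tokens;
    // Walk the back-pointers from the end, then restore reading order.
    for (std::size_t pos = text.size(); pos > 0; pos = back[pos]) {
        tokens.push_back(text.substr(back[pos], pos - back[pos]));
    }
    std::reverse(tokens.begin(), tokens.end());
    return tokens;
}
```

With this toy lexicon, `segment("関西国際空港", toy_lexicon(), false)` keeps the compound as one token, while passing `true` for `search_mode` splits it into 関西, 国際, and 空港, so a query for 空港 alone can match the document.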

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)


[Enhancement] Support Japanese (Kuromoji) morphological analyzer for inverted index #64646
