feat(llm): make graph extraction split configurable by nw9663644-eng · Pull Request #359 · apache/hugegraph-ai

nw9663644-eng · 2026-06-04T07:23:25Z

Purpose

Closes #343.

This PR makes the graph extraction split type configurable instead of always forcing document.

Design

The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.

Split modes

Supported graph extraction split modes (document remains the default):

document: keeps each uploaded or raw document as one chunk.
paragraph: uses the existing paragraph chunking strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
sentence: uses punctuation-based sentence-boundary splitting for ., ?, !, 。, ？, ！, ；, and ;.
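
The three modes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (split_sentences, split_paragraph, split_texts) are hypothetical, and paragraph mode is simplified to a plain fixed-size window with overlap, so it does not reproduce the language-aware separators of the existing strategy.

```python
import re

# Sentence terminators listed in the description (ASCII and full-width).
_END = re.escape(".?!。？！；;")
# A run of non-terminators followed by any trailing terminators.
_SENTENCE_RE = re.compile(rf"[^{_END}]+[{_END}]*")


def split_sentences(text):
    """Split on the listed punctuation and drop empty pieces."""
    pieces = (m.group(0).strip() for m in _SENTENCE_RE.finditer(text))
    return [p for p in pieces if p]


def split_paragraph(text, chunk_size=500, chunk_overlap=30):
    """Simplified stand-in: fixed-size windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - chunk_overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def split_texts(texts, split_type="document"):
    """Dispatch on split_type; 'document' keeps each input as one chunk."""
    if split_type == "document":
        return list(texts)
    if split_type == "paragraph":
        return [c for t in texts for c in split_paragraph(t)]
    if split_type == "sentence":
        return [s for t in texts for s in split_sentences(t)]
    raise ValueError(f"unsupported split_type: {split_type!r}")
```

A purely punctuation-based splitter like this one also breaks inside decimals and abbreviations (for example, "3.14" becomes two pieces), which is worth keeping in mind when choosing sentence mode.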

The selected split mode is passed from the demo UI to extract_graph(), then forwarded to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally used by GraphExtractFlow.prepare() / build_flow().

The selected split mode is also persisted through the prompt config and restored into the demo dropdown after reload.

Invalid split types fail early with a clear error message listing the supported values.
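
As an illustration of the fail-early behavior and the forwarding chain, the sketch below validates the value once at the entry point and passes it on as a keyword argument. The names validate_split_type, schedule_flow_stub, and extract_graph_sketch are hypothetical stand-ins; the real extract_graph() and SchedulerSingleton.schedule_flow() signatures are not shown on this page and may differ.

```python
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")


def validate_split_type(split_type=None):
    """Return a supported split type, defaulting to 'document'."""
    if split_type is None:
        return "document"
    if split_type not in SUPPORTED_SPLIT_TYPES:
        supported = ", ".join(SUPPORTED_SPLIT_TYPES)
        raise ValueError(
            f"Unsupported split_type {split_type!r}; supported values: {supported}"
        )
    return split_type


def schedule_flow_stub(flow_name, *args, split_type):
    """Hypothetical stand-in for the scheduler; records what it received."""
    return {"flow": flow_name, "args": args, "split_type": split_type}


def extract_graph_sketch(texts, schema, example_prompt, split_type=None):
    """Validate first, then forward the split type to the flow layer."""
    split_type = validate_split_type(split_type)
    return schedule_flow_stub(
        "graph_extract", texts, schema, example_prompt, split_type=split_type
    )
```

Validating before scheduling means a typo in the split type surfaces as a ValueError at the call site rather than deep inside the flow.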

The existing vertices / edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
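
One way to keep the returned payload unchanged while still exposing the chunk count is to log it rather than serialize it. The sketch below assumes a hypothetical build_response helper and logger name; only the vertices / edges keys come from the description above.

```python
import json
import logging

log = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def build_response(vertices, edges, chunks):
    """Return only the vertices/edges payload; log chunk_count for debugging."""
    log.debug("chunk_count=%d", len(chunks))
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)
```

Consumers that parse the JSON see exactly two top-level keys regardless of which split mode produced the chunks.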

For PDF compatibility, this PR treats extracted PDF text the same as other text input and adds test coverage that uses representative PDF-like extracted text.

Changes

Added configurable graph extraction split type support.
Added document, paragraph, and sentence split mode options in the demo UI.
Forwarded the selected split mode through extract_graph() and the graph extraction flow.
Persisted the selected split mode through the prompt config path.
Added punctuation-based sentence-boundary splitting for sentence mode.
Added tests for default behavior, non-default split modes, invalid split type handling, helper forwarding, prompt config persistence, sentence splitting, long document splitting, and representative PDF-like extracted text.

Tests

uv run ruff format --check .
uv run pytest src/tests/document/test_graph_extract_configurable_split.py
uv run pytest src/tests/document

nw9663644-eng · 2026-06-04T07:51:04Z

I added an additional flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow.

Updated local checks:

uv run ruff format --check .
uv run pytest src/tests/document/test_graph_extract_configurable_split.py
uv run pytest src/tests/document

imbajin

Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.

imbajin · 2026-06-05T04:03:52Z

                    graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")

            vector_import_bt = gr.Button("Import into Vector", variant="primary")
+            graph_split_type = gr.Dropdown(


⚠️ Persist the selected split type

Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.

Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.

Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.

nw9663644-eng

Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent the selected paragraph or sentence value from silently falling back to document after reload.
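
A round-trip check of the kind requested in this thread might look like the sketch below. It is an assumption-laden illustration: PromptConfigSketch is a hypothetical stand-in for the real prompt config class, and JSON from the standard library replaces the YAML file so the example runs without extra dependencies. Only the field name graph_extract_split_type is taken from the reply above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class PromptConfigSketch:
    """Hypothetical stand-in for the prompt config; not the real class."""
    doc: str = ""
    schema: str = ""
    example_prompt: str = ""
    graph_extract_split_type: str = "document"

    def save(self, path):
        Path(path).write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        known = {f.name for f in fields(cls)}
        # Files written before the field existed fall back to the default.
        return cls(**{k: v for k, v in data.items() if k in known})


def round_trip(config):
    """Save the config to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "prompt_config.json"
        config.save(path)
        return PromptConfigSketch.load(path)
```

Loading a file that predates the new field yields document, which matches the default behavior described in the PR.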

imbajin · 2026-06-05T04:03:53Z


 from hugegraph_llm.flows.common import BaseFlow
 from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
+from hugegraph_llm.operators.document_op.chunk_split import (


🧹 Sort the import block

Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.

Impact: the PR will fail the repository lint gate even though the targeted tests pass.

Requested fix: run Ruff import sorting on this file and commit the formatted import order.

nw9663644-eng

Thanks. I ran Ruff import sorting and formatting on the touched files; ruff check --fix reordered the import block.

imbajin

+1, the current head looks merge-safe to me. Non-blocking: please keep the PR description updated with the short note about the three graph extraction split modes.

feat: make graph extraction split configurable

c009344

dosubot Bot added the size:M and enhancement labels Jun 4, 2026

github-actions Bot added the llm label Jun 4, 2026

chore: cover graph split flow forwarding

c2890de

imbajin reviewed Jun 5, 2026

fix: address graph split review comments

574a637

dosubot Bot added the size:L label and removed the size:M label Jun 5, 2026

nw9663644-eng requested a review from imbajin June 5, 2026 14:11

imbajin approved these changes Jun 8, 2026

dosubot Bot added the lgtm label Jun 8, 2026

imbajin changed the title ~~feat: make graph extraction split configurable~~ feat(llm): make graph extraction split configurable Jun 8, 2026

imbajin merged commit 876d67b into apache:main Jun 8, 2026
18 checks passed

imbajin merged 3 commits into apache:main from nw9663644-eng:feat-configurable-graph-split


3 participants
