feat(llm): make graph extraction split configurable#359

Merged: imbajin merged 3 commits into apache:main from nw9663644-eng:feat-configurable-graph-split on Jun 8, 2026

@nw9663644-eng commented Jun 4, 2026

Contributor

Purpose

Closes #343.

This PR makes the graph extraction split type configurable instead of always forcing document.

Design

The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.

Split modes

This PR keeps document as the default graph extraction split mode to preserve the existing behavior.

Supported graph extraction split modes:

  • document: keeps each uploaded or raw document as one chunk.
  • paragraph: uses the existing paragraph chunking strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
  • sentence: uses punctuation-based sentence-boundary splitting on sentence-ending punctuation such as ., ?, !, and ;.
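
To make the difference between the document and sentence modes concrete, here is a minimal sketch. The function names and the regular expression are hypothetical and are not taken from the PR; the sketch assumes a sentence boundary is one of ., ?, !, or ; followed by whitespace, which may be narrower than the punctuation set the PR actually handles. Paragraph mode is left out because it reuses the existing chunking strategy.

```python
import re

# Assumption: a boundary is one of . ? ! ; followed by whitespace.
# The PR's real splitter may recognise more punctuation marks.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!;])\s+")


def split_document(text: str) -> list[str]:
    """Document mode: the whole input stays one chunk."""
    return [text]


def split_sentences(text: str) -> list[str]:
    """Sentence mode: cut after sentence-ending punctuation, drop empty parts."""
    parts = _SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


sample = "Alice knows Bob. Does Bob know Carol? Yes!"
document_chunks = split_document(sample)
sentence_chunks = split_sentences(sample)
```

With this sample, document mode yields one chunk and sentence mode yields three, so the extraction prompt would run once or three times respectively.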

The selected split mode is passed from the demo UI to extract_graph(), then forwarded to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally used by GraphExtractFlow.prepare() / build_flow().
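
The forwarding chain above can be sketched as follows. This is an illustration only: FakeScheduler is a stand-in for SchedulerSingleton, the flow name "graph_extract" and the positional arguments are assumptions, and the real extract_graph() signature may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScheduler:
    """Hypothetical stand-in for the scheduler; records each scheduled flow."""

    calls: list = field(default_factory=list)

    def schedule_flow(self, flow_name, *args, split_type="document"):
        self.calls.append((flow_name, split_type))
        # Same top-level keys as the vertices / edges contract.
        return {"vertices": [], "edges": []}


def extract_graph(scheduler, texts, schema, example_prompt, split_type="document"):
    """Forward the UI selection unchanged; "document" stays the default."""
    return scheduler.schedule_flow(
        "graph_extract", texts, schema, example_prompt, split_type=split_type
    )


scheduler = FakeScheduler()
extract_graph(scheduler, ["some text"], "{}", "example", split_type="sentence")
extract_graph(scheduler, ["some text"], "{}", "example")
```

Keeping split_type as a keyword argument with a default means callers that never pass it keep the previous document behavior.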

The selected split mode is also persisted through the prompt config and restored into the demo dropdown after reload.

Invalid split types fail early with a clear error message listing the supported values.
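
A minimal sketch of the early validation, assuming a ValueError is raised; the constant name, function name, and message wording are hypothetical and may not match the PR.

```python
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")


def validate_split_type(split_type: str) -> str:
    """Return split_type unchanged, or fail before any chunking starts."""
    if split_type not in SUPPORTED_SPLIT_TYPES:
        supported = ", ".join(SUPPORTED_SPLIT_TYPES)
        raise ValueError(
            f"Unsupported split_type {split_type!r}; supported values: {supported}"
        )
    return split_type


try:
    validate_split_type("word")
except ValueError as exc:
    error_message = str(exc)
```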

The existing vertices / edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
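
As an illustration of keeping the returned JSON unchanged while still exposing chunk_count, here is a small sketch. The function name, logger name, and debug log level are assumptions; only the vertices / edges keys come from the description above.

```python
import json
import logging

log = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def build_result(vertices: list, edges: list, chunks: list) -> str:
    """Log the chunk count, but return only the vertices / edges payload."""
    log.debug("chunk_count=%d", len(chunks))
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)


payload = json.loads(build_result([{"id": "1:Alice"}], [], ["chunk a", "chunk b"]))
```

Because the payload keys do not change, consumers such as "Load into GraphDB" need no changes when a different split mode is selected.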

For PDF compatibility, this PR treats extracted PDF text the same as other text input and includes representative PDF-like extracted text coverage in tests.

Changes

  • Added configurable graph extraction split type support.
  • Added document, paragraph, and sentence split mode options in the demo UI.
  • Forwarded the selected split mode through extract_graph() and the graph extraction flow.
  • Persisted the selected split mode through the prompt config path.
  • Added punctuation-based sentence-boundary splitting for sentence mode.
  • Added tests for default behavior, non-default split modes, invalid split type handling, helper forwarding, prompt config persistence, sentence splitting, long document splitting, and representative PDF-like extracted text.

Tests

  • uv run ruff format --check .
  • uv run pytest src/tests/document/test_graph_extract_configurable_split.py
  • uv run pytest src/tests/document

@dosubot (bot) added the size:M (30-99 changed lines, ignoring generated files) and enhancement labels Jun 4, 2026
@github-actions (bot) added the llm label Jun 4, 2026
@nw9663644-eng (Contributor, Author) commented

I added an additional flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow.
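
A flow-level check of this kind might look like the sketch below. GraphExtractFlowSketch and its prepare() signature are hypothetical stand-ins, not the real GraphExtractFlow; the point is only that the test inspects the workflow input rather than the final result.

```python
from types import SimpleNamespace


class GraphExtractFlowSketch:
    """Hypothetical stand-in: prepare() copies arguments onto the workflow input."""

    def prepare(self, workflow_input, texts, split_type="document"):
        workflow_input.texts = texts
        workflow_input.split_type = split_type
        return workflow_input


def test_non_default_split_type_reaches_workflow_input():
    workflow_input = SimpleNamespace()
    GraphExtractFlowSketch().prepare(workflow_input, ["text"], split_type="paragraph")
    assert workflow_input.split_type == "paragraph"


def test_default_split_type_is_document():
    workflow_input = SimpleNamespace()
    GraphExtractFlowSketch().prepare(workflow_input, ["text"])
    assert workflow_input.split_type == "document"


test_non_default_split_type_reaches_workflow_input()
test_default_split_type_is_document()
```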

Updated local checks:

  • uv run ruff format --check .
  • uv run pytest src/tests/document/test_graph_extract_configurable_split.py
  • uv run pytest src/tests/document

@imbajin left a comment

Member

Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.

Comment thread hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py Outdated
graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")

vector_import_bt = gr.Button("Import into Vector", variant="primary")
graph_split_type = gr.Dropdown(

Member

⚠️ Persist the selected split type

Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.

Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.

Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.

Contributor Author

Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent the selected paragraph or sentence value from silently falling back to document after reload.
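
The round trip requested in this thread can be sketched as follows. PromptConfigSketch is a hypothetical stand-in for the real prompt config class, and JSON is swapped in for YAML so the example needs only the standard library; only the graph_extract_split_type field name and the doc, schema, and example_prompt fields come from the discussion above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class PromptConfigSketch:
    """Hypothetical config holding the persisted demo fields."""

    doc: str = ""
    schema: str = ""
    example_prompt: str = ""
    graph_extract_split_type: str = "document"

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "PromptConfigSketch":
        return cls(**json.loads(path.read_text(encoding="utf-8")))


def round_trip(config: PromptConfigSketch) -> PromptConfigSketch:
    """Save to a temporary file and load it back, as a reload would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "prompt_config.json"
        config.save(path)
        return PromptConfigSketch.load(path)


restored = round_trip(PromptConfigSketch(graph_extract_split_type="sentence"))
```

If the split-type field were missing from the saved file, the reload would fall back to the default, which is the silent regression described in the review comment.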


from hugegraph_llm.flows.common import BaseFlow
from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
from hugegraph_llm.operators.document_op.chunk_split import (

Member

🧹 Sort the import block

Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.

Impact: the PR will fail the repository lint gate even though the targeted tests pass.

Requested fix: run Ruff import sorting on this file and commit the formatted import order.

Contributor Author

Thanks. I ran Ruff import sorting and formatting on the touched files, and the import block has been reordered by ruff check --fix.

@dosubot (bot) added the size:L (100-499 changed lines, ignoring generated files) label and removed the size:M label Jun 5, 2026
@nw9663644-eng requested a review from imbajin June 5, 2026 14:11

@imbajin left a comment

Member

+1, the current head looks merge-safe to me. Non-blocking: please keep the PR description updated with the short note about the three graph extraction split modes.

@dosubot (bot) added the lgtm (approved by a maintainer) label Jun 8, 2026
@imbajin changed the title from "feat: make graph extraction split configurable" to "feat(llm): make graph extraction split configurable" Jun 8, 2026
@imbajin merged commit 876d67b into apache:main Jun 8, 2026
18 checks passed

Labels

enhancement (New feature or request), lgtm (This PR has been approved by a maintainer), llm, size:L (This PR changes 100-499 lines, ignoring generated files)

Development

Successfully merging this pull request may close these issues.

[Feature] Make graph extraction use configurable chunk splitting

3 participants