feat(llm): make graph extraction split configurable #359
I added a flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow. Updated local checks:
imbajin left a comment
Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.
graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")
vector_import_bt = gr.Button("Import into Vector", variant="primary")
graph_split_type = gr.Dropdown(
Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.
Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.
Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.
Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent the selected paragraph or sentence value from silently falling back to document after reload.
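A minimal sketch of the kind of round-trip check requested above. DemoPromptConfig is a hypothetical stand-in, not the real BasePromptConfig, and it serializes to a plain dict rather than going through save_to_yaml(); only the field name graph_extract_split_type and the document default come from this thread.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class DemoPromptConfig:
    """Hypothetical stand-in for the prompt config (not the real BasePromptConfig)."""

    doc: str = ""
    schema: str = ""
    example_prompt: str = ""
    # Field name taken from the reply above; "document" is the stated default.
    graph_extract_split_type: str = "document"

    def to_dict(self) -> Dict[str, str]:
        # Every field, including the split type, is written out on save.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, str]) -> "DemoPromptConfig":
        # Configs saved before the field existed have no split-type key,
        # so a missing key falls back to the default.
        return cls(
            doc=data.get("doc", ""),
            schema=data.get("schema", ""),
            example_prompt=data.get("example_prompt", ""),
            graph_extract_split_type=data.get("graph_extract_split_type", "document"),
        )


# Round trip: the selected mode survives save and reload.
saved = DemoPromptConfig(doc="some text", graph_extract_split_type="sentence").to_dict()
restored = DemoPromptConfig.from_dict(saved)
```

A test along these lines would fail if the save path dropped the field, which is the regression the review describes.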
from hugegraph_llm.flows.common import BaseFlow
from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
from hugegraph_llm.operators.document_op.chunk_split import (
🧹 Sort the import block
Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.
Impact: the PR will fail the repository lint gate even though the targeted tests pass.
Requested fix: run Ruff import sorting on this file and commit the formatted import order.
Thanks. I ran Ruff import sorting and formatting on the touched files, and the import block has been reordered by ruff check --fix.
imbajin left a comment
+1, the current head looks merge-safe to me. Non-blocking: please keep the PR description updated with the short note about the three graph extraction split modes.
Purpose
Closes #343.
This PR makes the graph extraction split type configurable instead of always forcing document.
Design
The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.
Split modes
This PR keeps document as the default graph extraction split mode to preserve the existing behavior.
Supported graph extraction split modes:
- document: keeps each uploaded or raw document as one chunk.
- paragraph: uses the existing paragraph chunking strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
- sentence: uses punctuation-based sentence-boundary splitting for ., ?, !, 。, ?, !, ;, and ;.
The selected split mode is passed from the demo UI to extract_graph(), then forwarded to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally used by GraphExtractFlow.prepare() / build_flow().
The selected split mode is also persisted through the prompt config and restored into the demo dropdown after reload.
Invalid split types fail early with a clear error message listing the supported values.
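As an illustration of the three modes and the early validation, here is a rough sketch. The names split_text and SUPPORTED_SPLIT_TYPES are hypothetical and this is not the PR's chunk_split implementation: paragraph mode is reduced to a blank-line split instead of the chunk_size=500 / chunk_overlap=30 strategy, and the inclusion of full-width ?, !, and ; in the terminator set is an assumption.

```python
import re
from typing import List

# Hypothetical constant; the three values are the modes named in this PR.
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")

# Sentence terminators. The full-width variants are an assumption.
_TERMINATORS = ".?!;。?!;"
# One sentence = a run of non-terminators followed by any trailing terminators.
_SENTENCE_RE = re.compile(
    "[^{0}]+[{0}]*".format(re.escape(_TERMINATORS))
)


def split_text(text: str, split_type: str = "document") -> List[str]:
    """Split text into chunks according to split_type (illustrative only)."""
    if split_type not in SUPPORTED_SPLIT_TYPES:
        # Fail early and list the supported values in the message.
        raise ValueError(
            "Unsupported split_type {!r}; expected one of: {}".format(
                split_type, ", ".join(SUPPORTED_SPLIT_TYPES)
            )
        )
    if split_type == "document":
        # The whole input stays as a single chunk.
        return [text]
    if split_type == "paragraph":
        # Simplified stand-in: split on blank lines only.
        parts = re.split(r"\n\s*\n", text)
    else:
        parts = _SENTENCE_RE.findall(text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```

Calling split_text("A b. C d? E", "sentence") gives three chunks, while an unknown value such as "page" raises ValueError before any splitting happens.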
The existing vertices/edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
For PDF compatibility, this PR treats extracted PDF text the same as other text input and includes representative PDF-like extracted text coverage in tests.
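The contract described above can be pictured roughly as follows. The function name build_extract_result and the logger name are hypothetical; the sketch only assumes what the description states, namely that the returned JSON carries vertices and edges and that the chunk count goes to the log.

```python
import json
import logging
from typing import Dict, List

log = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def build_extract_result(
    chunks: List[str], vertices: List[Dict], edges: List[Dict]
) -> str:
    """Return the vertices/edges JSON; log the chunk count separately."""
    # Diagnostic only: the count is logged, not added to the payload,
    # so consumers of the JSON see the same two keys as before.
    log.info("graph extraction used %d chunk(s)", len(chunks))
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)


payload = json.loads(
    build_extract_result(["chunk a", "chunk b"], [{"label": "person"}], [])
)
```

Keeping the count out of the payload means a consumer that expects exactly these two keys does not need to change.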
Changes
- document, paragraph, and sentence split mode options in the demo UI.
- extract_graph() and the graph extraction flow.
- sentence mode.
Tests
- uv run ruff format --check .
- uv run pytest src/tests/document/test_graph_extract_configurable_split.py
- uv run pytest src/tests/document