feat(llm): expand graph extraction service APIs #361

Open: LRriver wants to merge 15 commits into apache:main from LRriver:extract_api

LRriver commented Jun 9, 2026

Contributor

Summary

  • Extend the graph extraction API introduced by #351 (feat(llm): add /graph/extract API for programmatic graph extraction) with service-backed synchronous extraction, async jobs, graph import, and extract-and-import endpoints.
  • Add content_type/content support for raw text and pre-split chunks, request-bounded chunk parallelism, metadata, and structured error responses.
  • Harden schema validation, request-local HugeGraph client configuration, import result reporting, LLM config compatibility, and route registration.

Relation to #351

This builds on the initial synchronous POST /graph/extract endpoint from #351. The deprecated texts alias remains accepted: a string maps to content_type=text, and a list maps to pre-split chunks. Multi-document extraction remains caller-managed through multiple API requests instead of hidden batch semantics in the synchronous endpoint.
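
The alias mapping described above can be sketched roughly as follows. This is an illustration only, not the PR's code: the helper name `normalize_texts_alias` and the `"chunks"` label used for list input are assumptions, since the description only names `content_type=text` explicitly.

```python
from typing import List, Tuple, Union


def normalize_texts_alias(
    texts: Union[str, List[str]],
) -> Tuple[str, Union[str, List[str]]]:
    """Map the deprecated `texts` field to a (content_type, content) pair.

    Hypothetical helper; the "chunks" label is an assumed value.
    """
    if isinstance(texts, str):
        # A single string is treated as raw text that still needs splitting.
        return "text", texts
    if isinstance(texts, list) and all(isinstance(item, str) for item in texts):
        # A list of strings is treated as caller-provided, pre-split chunks.
        return "chunks", list(texts)
    raise TypeError("texts must be a string or a list of strings")
```

Callers with several documents would invoke the endpoint once per document rather than packing them into one list, which matches the caller-managed approach described above.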

Write API Safety

  • /graph/import and /graph/extract-and-import require write_to_graph=true.
  • Inline schema writes require client_config.graph so the target graph is explicit in the request and response metadata.
  • Property-graph import payloads are validated at the request boundary before reaching HugeGraph.
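
A minimal sketch of the first two guards, assuming they run before any HugeGraph call. The function name `check_write_request`, the `has_inline_schema` flag, and the error messages are hypothetical; the actual request models in the PR may enforce this differently.

```python
from typing import Any, Dict, Optional


def check_write_request(
    write_to_graph: bool,
    client_config: Optional[Dict[str, Any]] = None,
    has_inline_schema: bool = False,
) -> None:
    """Reject write requests that are not explicit about intent and target."""
    if write_to_graph is not True:
        # Write endpoints need an explicit opt-in flag.
        raise ValueError("write_to_graph=true is required for write endpoints")
    graph = (client_config or {}).get("graph")
    if has_inline_schema and not graph:
        # Inline schema writes must name the target graph in the request.
        raise ValueError("client_config.graph is required for inline schema writes")
```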

Job Endpoint Notes

  • /graph/extract/jobs uses an in-memory, process-local job store.
  • Jobs and results are lost on service restart and are not shared across multiple API worker processes.
  • Cancellation only applies before a queued job starts; it cannot interrupt an active LLM call.
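
The cancellation rule can be illustrated with a small process-local store. This is a sketch under assumptions, not the PR's job store: the class name and the status strings (`queued`, `running`, `cancelled`) are hypothetical, and TTL handling and worker threads are left out.

```python
import threading
import uuid
from typing import Dict


class InMemoryJobStore:
    """Process-local job registry; its contents vanish when the process exits."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: Dict[str, str] = {}

    def submit(self) -> str:
        job_id = uuid.uuid4().hex
        with self._lock:
            self._status[job_id] = "queued"
        return job_id

    def start(self, job_id: str) -> bool:
        """Called by a worker; returns False if the job is no longer queued."""
        with self._lock:
            if self._status.get(job_id) != "queued":
                return False
            self._status[job_id] = "running"
            return True

    def cancel(self, job_id: str) -> bool:
        """Cancel only while queued; a running job is left untouched."""
        with self._lock:
            if self._status.get(job_id) != "queued":
                return False
            self._status[job_id] = "cancelled"
            return True

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._status[job_id]
```

Because the dictionary lives in one process, a second API worker process would see none of these jobs, which is the limitation the notes above call out.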

Tests

  • uv run ruff format --check .
  • uv run ruff check .
  • uv run pytest hugegraph-llm/src/tests/api/test_graph_extract_api.py hugegraph-llm/src/tests/api/test_graph_import_api.py hugegraph-llm/src/tests/api/test_graph_extract_jobs.py -q
  • SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/api -v --tb=short
  • SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/config/ hugegraph-llm/src/tests/document/ hugegraph-llm/src/tests/operators/ hugegraph-llm/src/tests/models/ hugegraph-llm/src/tests/indices/ hugegraph-llm/src/tests/test_utils.py -v --tb=short

Review

Copilot AI review requested due to automatic review settings June 9, 2026 08:04
dosubot (bot) added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels Jun 9, 2026

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new graph extraction/import service layer and expands the FastAPI surface area with job-based extraction and import endpoints. It also makes extraction behavior more configurable (chunk handling, split types, and parallel chunk processing) and improves robustness around malformed LLM output and import result reporting.

Changes:

  • Added GraphExtractService/GraphImportService plus request/response model updates, including redaction and schema normalization.
  • Added async-style job endpoints (/graph/extract/jobs/*) with an in-memory job store, plus new /graph/import and /graph/extract-and-import routes.
  • Updated extraction and import flows/operators to support configurable split types, pre-split chunks, parallel chunk extraction, and structured import stats.

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 7 comments.

File Description
hugegraph-llm/src/tests/utils/test_graph_index_utils.py Adds regression coverage for extract_graph() scheduler call shape.
hugegraph-llm/src/tests/operators/llm_op/test_property_graph_extract.py Adds coverage for parent edgelabel schemas, parallel chunk ordering, serial fallback, and malformed JSON handling.
hugegraph-llm/src/tests/operators/llm_op/test_info_extract.py Adds coverage for regex extraction with schema “shape” normalization.
hugegraph-llm/src/tests/operators/hugegraph_op/test_commit_to_hugegraph_load_into_graph.py Updates expectations to “continue + report import_result” rather than raise on create failures.
hugegraph-llm/src/tests/operators/hugegraph_op/test_commit_to_hugegraph.py Adds broader import behavior coverage including counts, id mapping, and normalized extraction inputs.
hugegraph-llm/src/tests/operators/document_op/test_chunk_split.py Adds paragraph boundary behavior tests for short paragraphs.
hugegraph-llm/src/tests/nodes/test_request_graph_config.py Tests request-scoped graph config propagation into nodes/operators.
hugegraph-llm/src/tests/nodes/test_extract_node.py Ensures ExtractNode uses extract-LLM config and wires max-parallel-chunks.
hugegraph-llm/src/tests/nodes/test_base_node.py Adds coverage that unexpected operator exceptions become error statuses.
hugegraph-llm/src/tests/models/llms/test_init_llm.py Adds coverage for extract-LLM config fallback behavior across providers.
hugegraph-llm/src/tests/flows/test_graph_extract_flow.py Tests split/content-type defaults and state reset semantics.
hugegraph-llm/src/tests/document/test_graph_extract_configurable_split.py Extends flow post-deal expectations to include max_parallel_chunks in output.
hugegraph-llm/src/tests/api/test_graph_import_api.py Adds coverage for import + extract-and-import endpoints, client_config behavior, and embedding updates.
hugegraph-llm/src/tests/api/test_graph_extract_jobs.py Adds end-to-end coverage for job creation, execution, cancellation, expiry, and concurrency.
hugegraph-llm/src/tests/api/test_graph_extract_api.py Refactors API tests around service layer, adds concurrency isolation and structured error expectations.
hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py Adds request-scoped graph_config support to client creation.
hugegraph-llm/src/hugegraph_llm/state/ai_state.py Extends workflow input/state with content_type, max_parallel_chunks, and graph_config.
hugegraph-llm/src/hugegraph_llm/services/graph_extract_service.py Introduces extraction/import services, schema normalization, redaction, and flow JSON validation.
hugegraph-llm/src/hugegraph_llm/services/graph_extract_jobs.py Adds an in-memory job store with TTL, queueing, worker threads, and status transitions.
hugegraph-llm/src/hugegraph_llm/services/__init__.py Adds services package marker.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py Adds parallel chunk extraction and explicit malformed-JSON failure path.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py Adds schema shape normalization helper for regex extraction.
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py Makes semantic index building respect request-scoped graph config.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py Adds graph_config support while retaining full “connection unit” behavior.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py Adds request-scoped graph config and structured import_result counts/errors; continues on create failures.
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py Adds paragraph-boundary splitting that preserves explicit paragraph breaks.
hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py Switches to extract-LLM configuration and wires max-parallel-chunks into operator.
hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_semantic_index.py Passes request-scoped graph config into semantic index operator.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/schema.py Plumbs request graph_config into SchemaManager when not using full connection dict.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/fetch_graph_data.py Uses request-scoped graph_config for HugeGraph client creation.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/commit_to_hugegraph.py Creates Commit2Graph with request-scoped graph_config.
hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py Skips splitting when content is already chunks; sets context["chunks"] directly.
hugegraph-llm/src/hugegraph_llm/nodes/base_node.py Broadens exception handling to convert unexpected operator exceptions into error statuses.
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py Adds extract-LLM fallback rules (e.g., reuse chat config when extract config missing).
hugegraph-llm/src/hugegraph_llm/flows/update_vid_embeddings.py Adds graph_config plumbing into the flow input.
hugegraph-llm/src/hugegraph_llm/flows/import_graph_data.py Adds graph_config plumbing into the import flow input.
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py Adds content_type/max_parallel_chunks parameters and enriches post-deal output.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py Adjusts route registration order with new graph endpoints.
hugegraph-llm/src/hugegraph_llm/config/llm_config.py Adds global defaults/limits for graph-extract parallel chunk calls.
hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_responses.py Adds typed error/job/import response models.
hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_requests.py Redesigns request contract around content_type/content, adds parallelism validation and import request models.
hugegraph-llm/src/hugegraph_llm/api/graph_extract_api.py Adds job endpoints, import endpoints, structured error semantics, and request validation wrapping.


Comment on lines +101 to +114
try:
    max_parallel_chunks = max(1, int(context.get("max_parallel_chunks") or self.max_parallel_chunks))
except (TypeError, ValueError):
    max_parallel_chunks = max(1, self.max_parallel_chunks)
chunk_count = len(chunks)
worker_count = min(max_parallel_chunks, chunk_count)
context["max_parallel_chunks"] = worker_count
if worker_count <= 1:
    proceeded_chunks = [self.extract_property_graph_by_llm(schema, chunk) for chunk in chunks]
else:
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        proceeded_chunks = list(
            executor.map(lambda chunk: self.extract_property_graph_by_llm(schema, chunk), chunks)
        )

Contributor Author

Fixed in 7850be7. Empty chunks now return without LLM calls and keep max_parallel_chunks metadata positive, with regression coverage.
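
One way the empty-chunk guard could look is sketched below. It is a standalone approximation rather than the code from 7850be7: `extract_chunks`, `extract_one`, and the choice to report the configured limit for empty input are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def extract_chunks(
    chunks: List[str],
    extract_one: Callable[[str], dict],
    requested_parallelism: object = None,
    default_parallelism: int = 1,
) -> Tuple[List[dict], int]:
    """Return (results, reported_parallelism) for a list of chunks."""
    try:
        limit = max(1, int(requested_parallelism or default_parallelism))
    except (TypeError, ValueError):
        limit = max(1, default_parallelism)
    if not chunks:
        # No work: skip the extractor and still report a positive value.
        return [], limit
    worker_count = min(limit, len(chunks))
    if worker_count <= 1:
        return [extract_one(chunk) for chunk in chunks], worker_count
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        # executor.map yields results in input order, not completion order.
        return list(executor.map(extract_one, chunks)), worker_count
```

Without the early return, `min(limit, 0)` would be 0, so the reported metadata would be 0 rather than a positive worker count.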

Comment on lines +111 to +114
with ThreadPoolExecutor(max_workers=worker_count) as executor:
    proceeded_chunks = list(
        executor.map(lambda chunk: self.extract_property_graph_by_llm(schema, chunk), chunks)
    )

Contributor Author

Not changed in this pass. The current LLM wrappers used here do not maintain per-request mutable response buffers in PropertyGraphExtract; serializing extract_property_graph_by_llm behind one lock would effectively disable the chunk-level parallelism this API is adding. The API also bounds per-request parallelism by config and request. If a future provider wrapper proves non-thread-safe, the better fix would be provider-local isolation rather than a global lock in the extraction operator.
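
For illustration, provider-local isolation could take the form of one client instance per worker thread instead of a lock around every call. The sketch below uses `threading.local`; `ThreadLocalClient` and `FakeClient` are hypothetical names, and nothing here is taken from the existing LLM wrappers.

```python
import threading
from typing import Callable


class ThreadLocalClient:
    """Hand each thread its own client so mutable client state is never shared."""

    def __init__(self, factory: Callable[[], object]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> object:
        client = getattr(self._local, "client", None)
        if client is None:
            # First use on this thread: build a private instance.
            client = self._factory()
            self._local.client = client
        return client


class FakeClient:
    """Stand-in for a provider wrapper that keeps a mutable response buffer."""

    def __init__(self) -> None:
        self.buffer = []
```

With this shape, chunk-level parallelism is preserved because threads never contend on a shared lock; the cost is one client instance per worker thread.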

Comment on lines +220 to +221
@router.post("/graph/import", status_code=status.HTTP_200_OK)
def graph_import_api(req: GraphImportRequest) -> GraphImportResponse:

Contributor Author

Fixed in 7850be7. Added response_model declarations for job, import, and extract-and-import endpoints, with route registration coverage.

Comment on lines +245 to +246
@router.post("/graph/extract-and-import", status_code=status.HTTP_200_OK)
def graph_extract_and_import_api(req: GraphExtractAndImportRequest) -> GraphExtractAndImportResponse:

Contributor Author

Fixed in 7850be7. Added response_model declarations for job, import, and extract-and-import endpoints, with route registration coverage.

Comment on lines 60 to 62
if not vertices and not edges and not triples:
    log.critical("(Loading) Both vertices and edges are empty. Please check the input data again.")
    raise ValueError("Both vertices and edges input are empty.")

Contributor Author

Fixed in 7850be7. Empty input messages now mention vertices, edges, and triples. Schema-free mode rejects vertices or edges, and schema mode rejects triples so mixed inputs are not silently dropped. Added regression coverage.
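
The rejection rules described in this reply might look roughly like the sketch below. The function name, the `schema_mode` flag, and the message wording are assumptions, not the code from 7850be7.

```python
from typing import List, Optional


def check_import_inputs(
    schema_mode: bool,
    vertices: Optional[List[dict]] = None,
    edges: Optional[List[dict]] = None,
    triples: Optional[List[dict]] = None,
) -> None:
    """Fail fast on empty or mixed inputs instead of silently dropping data."""
    if not vertices and not edges and not triples:
        raise ValueError("vertices, edges, and triples are all empty")
    if schema_mode and triples:
        # Schema mode loads vertices and edges; triples would be ignored.
        raise ValueError("schema mode does not accept triples")
    if not schema_mode and (vertices or edges):
        # Schema-free mode loads triples; vertices and edges would be ignored.
        raise ValueError("schema-free mode does not accept vertices or edges")
```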

Comment on lines +69 to +71
if not vertices and not edges:
    log.critical("(Loading) Both vertices and edges are empty. Please check the input data again.")
    raise ValueError("Both vertices and edges input are empty.")

Contributor Author

Fixed in 7850be7. Empty input messages now mention vertices, edges, and triples. Schema-free mode rejects vertices or edges, and schema mode rejects triples so mixed inputs are not silently dropped. Added regression coverage.

Comment on lines +75 to +85
except RequestValidationError as exc:
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        content={
            "detail": _error(
                "GRAPH_EXTRACT_VALIDATION_ERROR",
                str(exc),
                "request",
            )
        },
    )

Contributor Author

Fixed in 7850be7. Validation responses now use sanitized loc, msg, and type summaries from exc.errors() and omit raw input values. Added coverage for password and URL not being echoed.
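
As a rough illustration of that sanitizing step, the sketch below reduces error dictionaries shaped like the entries returned by `exc.errors()` to the three safe fields. The helper name `summarize_validation_errors` is hypothetical, and the sample values in the usage note are made up.

```python
from typing import Any, Dict, Iterable, List


def summarize_validation_errors(
    errors: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep only loc, msg, and type; drop input, ctx, and anything else."""
    summary = []
    for err in errors:
        summary.append(
            {
                "loc": [str(part) for part in err.get("loc", ())],
                "msg": str(err.get("msg", "")),
                "type": str(err.get("type", "")),
            }
        )
    return summary
```

Given an entry such as `{"loc": ("body", "client_config", "password"), "msg": "Field required", "type": "missing", "input": {...}}`, the result keeps the field path but not the submitted values, so a password or URL in `input` is not echoed back.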

github-actions (bot) added the llm label Jun 9, 2026