feat(llm): property embedding#240
Conversation
Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25 doc: update repo & url link for inner repo Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884 doc: update repo & url link for inner repo (GraphPlatform-4190) doc: update the link info
…ph/hugegraph-ai into master-icode
sync Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f
Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e
Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8
Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e
Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263
Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0
Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6
Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121
Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2
Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd
Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6
Change-Id: I799844222f194d54990b27d71770543ad6c707ca
There was a problem hiding this comment.
Pull Request Overview
This PR introduces property embedding support and extends existing vector indexing and query functionality. Key changes include:
- Adding property vector indexing and cleaning in the vector and graph index utilities.
- Updating embedding retrieval and query generation to support property keywords and property embeddings.
- Integrating property-related logic into semantic queries, index building, API endpoints, and prompt configuration.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Added property vector index loading and updated index info output. |
| hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py | Incorporated cleaning and update logging for property vector indexes. |
| hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py | Passed an additional 'properties' parameter to LLM generation. |
| hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py | Integrated exact and fuzzy matching for property keys and values. |
| hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py | Added diffing and update logic for property embeddings with limit checks. |
| hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py | Updated gremlin queries to use property name/value filtering in fallback scenarios. |
| hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py | Extended graph summary with extracted index label properties. |
| hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py | Passed property data to the gremlin generate operator. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py | Registered and imported the new vector API endpoint. |
| hugegraph-llm/src/hugegraph_llm/config/prompt_config.py | Updated prompt content to include property references for query generation. |
| hugegraph-llm/src/hugegraph_llm/api/vector_api.py | Introduced new API endpoint for updating vector embeddings with rate limits. |
Comments suppressed due to low confidence (1)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:349
- The variable 'use_id_to_match' is used without a visible definition in this context. Please ensure it is properly declared and initialized before use.
node_str = f"{item['id']}{{{props_str}}}" if use_id_to_match else f"{item['props']}{{{props_str}}}"
| add_propsets.append(propset) | ||
| add_prop_values.append(prop_value) | ||
| if add_prop_values: | ||
| if len(add_prop_values) > 100000: |
There was a problem hiding this comment.
[nitpick] Consider refactoring the hardcoded property update limit (100000) into a configurable parameter to improve maintainability and flexibility.
| for keyword in keywords: | ||
| keyword_vector = self.embedding.get_text_embedding(keyword) | ||
| results = self.vector_index.search(keyword_vector, top_k=self.topk_per_keyword, | ||
| keyword_vector = self.embedding.get_texts_embeddings([keyword]) |
There was a problem hiding this comment.
Ensure that 'get_texts_embeddings' returns a non-empty list before accessing its first element to avoid potential IndexError.
| keyword_vector = self.embedding.get_texts_embeddings([keyword]) | |
| keyword_vector = self.embedding.get_texts_embeddings([keyword]) | |
| if not keyword_vector: # Ensure the list is non-empty | |
| log.warning("No embeddings found for keyword: %s", keyword) | |
| continue |
Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f
Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405
Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1
Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08
Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9
Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352
Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0
…arhugegraph/hugegraph-ai into property_embedding
imbajin
left a comment
There was a problem hiding this comment.
Blocking: yes. Summary: Property embedding and GraphRAG paths have correctness regressions, and the current head fails local lint/format checks. Evidence: current head a0e460c; git diff --check refs/remotes/apache/main...refs/remotes/pr/240 fails, uv run ruff check <changed files> reports F401, and uv run ruff format --check <changed files> would reformat 9 files.
|
|
||
| INDEX_PROPERTY_GREMLIN = """ | ||
| g.V().hasLabel('{label}') | ||
| .limit(100000) |
There was a problem hiding this comment.
Evidence: this query only reads the first 100000 vertices for each indexed label, but _update_property_index() later treats every previously indexed property value missing from this truncated snapshot as deleted and removes it from graph_props. On graphs with more than 100000 matching vertices, incremental rebuilds can silently delete valid property vectors and degrade property recall. Please page through all matching vertices or avoid deleting existing property vectors when the present snapshot is known to be capped.
| add_propsets.append(propset) | ||
| add_prop_values.append(prop_value) | ||
| if add_prop_values: | ||
| if len(add_prop_values) > 100000: |
There was a problem hiding this comment.
Evidence: to_remove is already removed and persisted at lines 184-187 before this > 100000 guard returns early. That leaves the on-disk property index half-updated: deletions are committed, but additions/updates are skipped. Please compute and validate the full mutation set before any remove() / to_index_file() call, or stage the new index and commit it atomically only after the limit check passes.
| query = context["query"] | ||
| query_vector = self.embedding.get_text_embedding(query) | ||
| query_vector = self.embedding.get_texts_embeddings([query]) | ||
| results = self.vector_index.search(query_vector, top_k=self.topk_per_query) |
There was a problem hiding this comment.
Evidence: get_texts_embeddings([query]) returns a list of embeddings, but VectorIndex.search() expects one List[float] and checks len(query_vector) == index.d. With an existing index this path passes a 2-D value and raises a dimension mismatch, breaking SemanticIdQuery(by="query"). Please use the first embedding, as the keyword/property branches already do, and add a regression test for the whole-query path.
| results = self.vector_index.search(query_vector, top_k=self.topk_per_query) | |
| results = self.vector_index.search(query_vector[0], top_k=self.topk_per_query) |
| self._max_items = context["max_items"] | ||
| if isinstance(context.get("prop_to_match"), str): | ||
| self._prop_to_match = context["prop_to_match"] | ||
| if isinstance(context.get("match_props"), list): |
There was a problem hiding this comment.
Evidence: whenever match_props is a list, this sets _prop_to_match, so use_id_to_match becomes false even if match_vids also contains exact vertex hits. A query that matches both a vertex id and an indexed property now skips the more precise VID-neighbor query and is forced into the property fallback path. Please prefer VID matching when matched_vids is non-empty, and use the property path only when no VID recall is available.
| graph_chain_knowledge, vertex_degree_list, knowledge_with_degree = self._format_graph_query_result( | ||
| query_paths=paths | ||
| ) | ||
| paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"] |
There was a problem hiding this comment.
Evidence: this traversal starts from vertices with current_prop_name, but then walks arbitrary neighbors through bothE().otherV(). _process_vertex() and _process_edge() later read item["props"][self._prop_to_match]; if a neighbor label does not have the same property key, a normal heterogeneous graph raises KeyError instead of returning partial graph context. Please format property fallback paths using ids or safe .get() values for non-matching neighbors.
| def graph_config_api(req: GraphConfigRequest): | ||
| # Accept status code | ||
| res = apply_graph_conf(req.url, req.name, req.user, req.pwd, req.gs, origin_call="http") | ||
| res = apply_graph_conf(req.url, req.graph, req.user, req.pwd, req.gs, origin_call="http") |
There was a problem hiding this comment.
Evidence: GraphConfigRequest now accepts token, and per-request client_config copies it into huge_settings, but /config/graph still calls apply_graph_conf() without req.token. Token-only users can submit the field successfully, yet the saved graph configuration still validates and persists only basic-auth credentials. Please pass and persist token here, and make the connectivity check use bearer auth when a token is provided.
| def vector_http_api(router: APIRouter, update_embedding_func): | ||
| @router.post("/vector/embedding", status_code=status.HTTP_200_OK) | ||
| def update_embedding_api( | ||
| daily_limit: int = 50, |
There was a problem hiding this comment.
Evidence: the endpoint compares call_count >= daily_limit, but daily_limit is a request parameter with a permissive default. Any authenticated caller can bypass the protection by calling /vector/embedding?daily_limit=999999, so the expensive embedding rebuild is not actually rate-limited. Please move the limit to server-side config or clamp user input to a configured maximum.
| name: str = Query('hugegraph', description="hugegraph client name.") | ||
| user: str = Query('', description="hugegraph client user.") | ||
| pwd: str = Query('', description="hugegraph client pwd.") | ||
| graph: str = Query('hugegraph', description="hugegraph client name.") |
There was a problem hiding this comment.
name field compatible
Evidence: the public graph config model changed name to graph and gives graph a default of "hugegraph". Existing clients that still send {"name": "custom_graph"} will not get a validation error; the value is silently ignored and requests fall back to the default graph. Please add an alias/backward-compatible field mapping, or make old payloads fail loudly instead of targeting the wrong graph.
|
|
||
| from hugegraph_llm.config import prompt | ||
|
|
||
| from hugegraph_llm.config import huge_settings |
There was a problem hiding this comment.
Evidence: uv run ruff check on the changed Python files fails with F401 huge_settings imported but unused on this line. Since the PR currently only has the triage check on GitHub, this would be caught by the repository's normal Python quality gate later. Please remove the import and run uv run ruff check plus uv run ruff format --check for the touched Python files.
| @@ -0,0 +1,242 @@ | |||
| # Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Evidence: this added test.py lives inside the production package and duplicates operator code instead of adding tests under hugegraph-llm/src/tests/..., which is where this module's CI-oriented tests live. The new property-embedding, token config, property fallback, and vector API behavior therefore has no runnable regression coverage. Please move this into focused tests under src/tests/api/ and src/tests/operators/index_op/ rather than shipping it as production package code.
VGalaxies
left a comment
There was a problem hiding this comment.
Review summary
- Blocking: yes
- Summary: The remaining non-duplicate issues are API/runtime regressions and credential/query-safety problems introduced by the PR.
- Evidence:
- static review of
git diff origin/main...HEAD, withrg,nl -ba, and targeted diff inspection
- static review of
| pwd: str, | ||
| user: Optional[str] = None, | ||
| pwd: Optional[str] = None, | ||
| token: Optional[str] = None, |
There was a problem hiding this comment.
High: Preserve positional graphspace compatibility
hugegraph-python-client/src/pyhugegraph/client.py:56
Evidence
tokenwas inserted beforegraphspaceinPyHugeClient.__init__, while the previous fifth positional argument wasgraphspace; an existing call still usesPyHugeClient(url, graph, user, pwd, gs)athugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:262.
Impact
- Existing positional callers silently bind graphspace as a bearer token, disabling graphspace routing and changing auth behavior.
Requested fix
- Keep
graphspacein its old positional slot and maketokenkeyword-only, or appendtokenafter the existing positional parameters.
| "template_execution_result", | ||
| "raw_execution_result", | ||
| ] | ||
| original_results = gremlin_generate(inp, example_num, schema_input, gremlin_prompt_input) |
There was a problem hiding this comment.
High: Do not execute Gremlin before filtering requested outputs
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py:237
Evidence
gremlin_generate_selective()callsgremlin_generate()unconditionally before checkingrequested_outputs;gremlin_generate()executes both generated queries throughrun_gremlin_query()at lines 98 and 102, even when the request only asks fortemplate_gremlin.
Impact
/text2gremlincan execute LLM-generated Gremlin against HugeGraph when the caller requested generation output only.
Requested fix
- Split generation from execution and call
run_gremlin_query()only whentemplate_execution_resultorraw_execution_resultis explicitly requested.
| unmatched_keywords = set(keywords) | ||
| for key in property_keys: | ||
| for keyword in list(unmatched_keywords): | ||
| gremlin_query = f"g.V().has('{key.name}', '{keyword}').limit(1)" |
There was a problem hiding this comment.
High: Bind or escape property Gremlin values
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py:96
Evidence
_exact_match_properties()interpolateskey.nameand user-derivedkeyworddirectly intog.V().has('{key.name}', '{keyword}'); the property fallback also formatsprop_nameandprop_valueinto Gremlin atgraph_rag_query.py:219.
Impact
- A keyword or indexed property value containing quotes or Gremlin syntax can alter the query sent to HugeGraph.
Requested fix
- Use Gremlin bindings or a shared safe serialization/escaping helper for property keys and values before execution.
| """ | ||
| Resolve the path to a full URL. | ||
| """ | ||
| url = f"http://{self._cfg.ip}:{self._cfg.port}/" |
There was a problem hiding this comment.
Medium: Avoid sending Vermeer tokens only over plaintext HTTP
vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py:72
Evidence
VermeerSessionrequires a token and sets it in theAuthorizationheader at line 42, butresolve()always buildshttp://{ip}:{port}/and the client exposes onlyip,port, andtoken.
Impact
- Bearer tokens are exposed on the network for non-local Vermeer deployments.
Requested fix
- Accept a full base URL or scheme option, support HTTPS, and avoid defaulting credentialed requests to plaintext transport.
link to #228, V1 version
update, get and delete property embeddings
update_vid_embedding(embedding will only be performed for indexed properties).clean_all_graph_indexkeywords match
Match properties according to keywords (exact + fuzzy)
text2gremlin
Add properties in the prompt
subgraph_query
Give priority to executing the vid subgraph query. If it fails and properties have been matched based on keywords previously, then execute the property subgraph query.