Skip to content

feat(llm): property embedding#240

Open
MrJs133 wants to merge 65 commits into
apache:mainfrom
MrJs133:property_embedding
Open

feat(llm): property embedding#240
MrJs133 wants to merge 65 commits into
apache:mainfrom
MrJs133:property_embedding

Conversation

@MrJs133

@MrJs133 MrJs133 commented May 19, 2025

Copy link
Copy Markdown
Contributor

link to #228, V1 version

update, get and delete property embeddings

  • Added property embedding updates to update_vid_embedding (embedding will only be performed for indexed properties).
    • Adopt incremental updates (support adding and deleting prop_value, as well as adding and deleting prop_key). If the number of properties to be updated exceeds 100,000, a warning will be issued, and the update will be rejected.
  • Added property embedding removes to clean_all_graph_index

keywords match

Match properties according to keywords (exact + fuzzy)

text2gremlin

Add properties in the prompt

subgraph_query

Give priority to executing the vid subgraph query. If it fails and properties have been matched based on keywords previously, then execute the property subgraph query.

Jin and others added 25 commits January 2, 2025 19:24
Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25

doc: update repo & url link for inner repo

Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884

doc: update repo & url link for inner repo (GraphPlatform-4190)

doc: update the link info
sync

Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f
Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e
Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8
Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e
Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263
Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0
Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6
Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121
Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2
Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd
Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6
Change-Id: I799844222f194d54990b27d71770543ad6c707ca
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 19, 2025
@github-actions github-actions Bot added the llm label May 19, 2025
@dosubot dosubot Bot added the enhancement New feature or request label May 19, 2025
@dosubot dosubot Bot removed the size:M This PR changes 30-99 lines, ignoring generated files. label May 20, 2025
@imbajin imbajin requested a review from Copilot May 26, 2025 11:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces property embedding support and extends existing vector indexing and query functionality. Key changes include:

  • Adding property vector indexing and cleaning in the vector and graph index utilities.
  • Updating embedding retrieval and query generation to support property keywords and property embeddings.
  • Integrating property-related logic into semantic queries, index building, API endpoints, and prompt configuration.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py Added property vector index loading and updated index info output.
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py Incorporated cleaning and update logging for property vector indexes.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py Passed an additional 'properties' parameter to LLM generation.
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py Integrated exact and fuzzy matching for property keys and values.
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py Added diffing and update logic for property embeddings with limit checks.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py Updated gremlin queries to use property name/value filtering in fallback scenarios.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py Extended graph summary with extracted index label properties.
hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py Passed property data to the gremlin generate operator.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py Registered and imported the new vector API endpoint.
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py Updated prompt content to include property references for query generation.
hugegraph-llm/src/hugegraph_llm/api/vector_api.py Introduced new API endpoint for updating vector embeddings with rate limits.
Comments suppressed due to low confidence (1)

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:349

  • The variable 'use_id_to_match' is used without a visible definition in this context. Please ensure it is properly declared and initialized before use.
node_str = f"{item['id']}{{{props_str}}}" if use_id_to_match else f"{item['props']}{{{props_str}}}"

add_propsets.append(propset)
add_prop_values.append(prop_value)
if add_prop_values:
if len(add_prop_values) > 100000:

Copilot AI May 26, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider refactoring the hardcoded property update limit (100000) into a configurable parameter to improve maintainability and flexibility.

Copilot uses AI. Check for mistakes.
for keyword in keywords:
keyword_vector = self.embedding.get_text_embedding(keyword)
results = self.vector_index.search(keyword_vector, top_k=self.topk_per_keyword,
keyword_vector = self.embedding.get_texts_embeddings([keyword])

Copilot AI May 26, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that 'get_texts_embeddings' returns a non-empty list before accessing its first element to avoid potential IndexError.

Suggested change
keyword_vector = self.embedding.get_texts_embeddings([keyword])
keyword_vector = self.embedding.get_texts_embeddings([keyword])
if not keyword_vector: # Ensure the list is non-empty
log.warning("No embeddings found for keyword: %s", keyword)
continue

Copilot uses AI. Check for mistakes.
imbajin and others added 12 commits May 26, 2025 19:46
Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f
Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405
Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1
Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08
Change-Id: Idb8c1476ebda521e195022f14f6733a33288f552
Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9
Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352
Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels May 28, 2025

@imbajin imbajin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: yes. Summary: Property embedding and GraphRAG paths have correctness regressions, and the current head fails local lint/format checks. Evidence: current head a0e460c; git diff --check refs/remotes/apache/main...refs/remotes/pr/240 fails, uv run ruff check <changed files> reports F401, and uv run ruff format --check <changed files> would reformat 9 files.


INDEX_PROPERTY_GREMLIN = """
g.V().hasLabel('{label}')
.limit(100000)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Avoid truncating the property snapshot before diffing

Evidence: this query only reads the first 100000 vertices for each indexed label, but _update_property_index() later treats every previously indexed property value missing from this truncated snapshot as deleted and removes it from graph_props. On graphs with more than 100000 matching vertices, incremental rebuilds can silently delete valid property vectors and degrade property recall. Please page through all matching vertices or avoid deleting existing property vectors when the present snapshot is known to be capped.

add_propsets.append(propset)
add_prop_values.append(prop_value)
if add_prop_values:
if len(add_prop_values) > 100000:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Check the update limit before mutating the index

Evidence: to_remove is already removed and persisted at lines 184-187 before this > 100000 guard returns early. That leaves the on-disk property index half-updated: deletions are committed, but additions/updates are skipped. Please compute and validate the full mutation set before any remove() / to_index_file() call, or stage the new index and commit it atomically only after the limit check passes.

query = context["query"]
query_vector = self.embedding.get_text_embedding(query)
query_vector = self.embedding.get_texts_embeddings([query])
results = self.vector_index.search(query_vector, top_k=self.topk_per_query)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Pass a single query embedding to vector search

Evidence: get_texts_embeddings([query]) returns a list of embeddings, but VectorIndex.search() expects one List[float] and checks len(query_vector) == index.d. With an existing index this path passes a 2-D value and raises a dimension mismatch, breaking SemanticIdQuery(by="query"). Please use the first embedding, as the keyword/property branches already do, and add a regression test for the whole-query path.

Suggested change
results = self.vector_index.search(query_vector, top_k=self.topk_per_query)
results = self.vector_index.search(query_vector[0], top_k=self.topk_per_query)

self._max_items = context["max_items"]
if isinstance(context.get("prop_to_match"), str):
self._prop_to_match = context["prop_to_match"]
if isinstance(context.get("match_props"), list):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Do not force property fallback when VID matches exist

Evidence: whenever match_props is a list, this sets _prop_to_match, so use_id_to_match becomes false even if match_vids also contains exact vertex hits. A query that matches both a vertex id and an indexed property now skips the more precise VID-neighbor query and is forced into the property fallback path. Please prefer VID matching when matched_vids is non-empty, and use the property path only when no VID recall is available.

graph_chain_knowledge, vertex_degree_list, knowledge_with_degree = self._format_graph_query_result(
query_paths=paths
)
paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Handle heterogeneous neighbors in property fallback

Evidence: this traversal starts from vertices with current_prop_name, but then walks arbitrary neighbors through bothE().otherV(). _process_vertex() and _process_edge() later read item["props"][self._prop_to_match]; if a neighbor label does not have the same property key, a normal heterogeneous graph raises KeyError instead of returning partial graph context. Please format property fallback paths using ids or safe .get() values for non-matching neighbors.

def graph_config_api(req: GraphConfigRequest):
# Accept status code
res = apply_graph_conf(req.url, req.name, req.user, req.pwd, req.gs, origin_call="http")
res = apply_graph_conf(req.url, req.graph, req.user, req.pwd, req.gs, origin_call="http")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Thread token auth through the persistent graph config API

Evidence: GraphConfigRequest now accepts token, and per-request client_config copies it into huge_settings, but /config/graph still calls apply_graph_conf() without req.token. Token-only users can submit the field successfully, yet the saved graph configuration still validates and persists only basic-auth credentials. Please pass and persist token here, and make the connectivity check use bearer auth when a token is provided.

def vector_http_api(router: APIRouter, update_embedding_func):
@router.post("/vector/embedding", status_code=status.HTTP_200_OK)
def update_embedding_api(
daily_limit: int = 50,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Do not let callers choose their own rate limit

Evidence: the endpoint compares call_count >= daily_limit, but daily_limit is a request parameter with a permissive default. Any authenticated caller can bypass the protection by calling /vector/embedding?daily_limit=999999, so the expensive embedding rebuild is not actually rate-limited. Please move the limit to server-side config or clamp user input to a configured maximum.

name: str = Query('hugegraph', description="hugegraph client name.")
user: str = Query('', description="hugegraph client user.")
pwd: str = Query('', description="hugegraph client pwd.")
graph: str = Query('hugegraph', description="hugegraph client name.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Keep the old name field compatible

Evidence: the public graph config model changed name to graph and gives graph a default of "hugegraph". Existing clients that still send {"name": "custom_graph"} will not get a validation error; the value is silently ignored and requests fall back to the default graph. Please add an alias/backward-compatible field mapping, or make old payloads fail loudly instead of targeting the wrong graph.


from hugegraph_llm.config import prompt

from hugegraph_llm.config import huge_settings

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Remove the unused import before merging

Evidence: uv run ruff check on the changed Python files fails with F401 huge_settings imported but unused on this line. Since the PR currently only has the triage check on GitHub, this would be caught by the repository's normal Python quality gate later. Please remove the import and run uv run ruff check plus uv run ruff format --check for the touched Python files.

@@ -0,0 +1,242 @@
# Licensed to the Apache Software Foundation (ASF) under one

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Put regression coverage under the real test tree

Evidence: this added test.py lives inside the production package and duplicates operator code instead of adding tests under hugegraph-llm/src/tests/..., which is where this module's CI-oriented tests live. The new property-embedding, token config, property fallback, and vector API behavior therefore has no runnable regression coverage. Please move this into focused tests under src/tests/api/ and src/tests/operators/index_op/ rather than shipping it as production package code.

@VGalaxies VGalaxies left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

  • Blocking: yes
  • Summary: The remaining non-duplicate issues are API/runtime regressions and credential/query-safety problems introduced by the PR.
  • Evidence:
    • static review of git diff origin/main...HEAD, with rg, nl -ba, and targeted diff inspection

pwd: str,
user: Optional[str] = None,
pwd: Optional[str] = None,
token: Optional[str] = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: Preserve positional graphspace compatibility

hugegraph-python-client/src/pyhugegraph/client.py:56

Evidence

  • token was inserted before graphspace in PyHugeClient.__init__, while the previous fifth positional argument was graphspace; an existing call still uses PyHugeClient(url, graph, user, pwd, gs) at hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:262.

Impact

  • Existing positional callers silently bind graphspace as a bearer token, disabling graphspace routing and changing auth behavior.

Requested fix

  • Keep graphspace in its old positional slot and make token keyword-only, or append token after the existing positional parameters.

"template_execution_result",
"raw_execution_result",
]
original_results = gremlin_generate(inp, example_num, schema_input, gremlin_prompt_input)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: Do not execute Gremlin before filtering requested outputs

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py:237

Evidence

  • gremlin_generate_selective() calls gremlin_generate() unconditionally before checking requested_outputs; gremlin_generate() executes both generated queries through run_gremlin_query() at lines 98 and 102, even when the request only asks for template_gremlin.

Impact

  • /text2gremlin can execute LLM-generated Gremlin against HugeGraph when the caller requested generation output only.

Requested fix

  • Split generation from execution and call run_gremlin_query() only when template_execution_result or raw_execution_result is explicitly requested.

unmatched_keywords = set(keywords)
for key in property_keys:
for keyword in list(unmatched_keywords):
gremlin_query = f"g.V().has('{key.name}', '{keyword}').limit(1)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: Bind or escape property Gremlin values

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py:96

Evidence

  • _exact_match_properties() interpolates key.name and user-derived keyword directly into g.V().has('{key.name}', '{keyword}'); the property fallback also formats prop_name and prop_value into Gremlin at graph_rag_query.py:219.

Impact

  • A keyword or indexed property value containing quotes or Gremlin syntax can alter the query sent to HugeGraph.

Requested fix

  • Use Gremlin bindings or a shared safe serialization/escaping helper for property keys and values before execution.

"""
Resolve the path to a full URL.
"""
url = f"http://{self._cfg.ip}:{self._cfg.port}/"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Avoid sending Vermeer tokens only over plaintext HTTP

vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py:72

Evidence

  • VermeerSession requires a token and sets it in the Authorization header at line 42, but resolve() always builds http://{ip}:{port}/ and the client exposes only ip, port, and token.

Impact

  • Bearer tokens are exposed on the network for non-local Vermeer deployments.

Requested fix

  • Accept a full base URL or scheme option, support HTTPS, and avoid defaulting credentialed requests to plaintext transport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request llm python-client size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants