feat(llm): property embedding by MrJs133 · Pull Request #240 · apache/hugegraph-ai

MrJs133 · 2025-05-19T08:06:31Z

link to #228, V1 version

update, get and delete property embeddings

Added property embedding updates to update_vid_embedding (embedding will only be performed for indexed properties).
- Adopt incremental updates (support adding and deleting prop_value, as well as adding and deleting prop_key). If the number of properties to be updated exceeds 100,000, a warning will be issued, and the update will be rejected.
Added property embedding removes to clean_all_graph_index

keywords match

Match properties according to keywords (exact + fuzzy)

text2gremlin

Add properties in the prompt

subgraph_query

Give priority to executing the vid subgraph query. If it fails and properties have been matched based on keywords previously, then execute the property subgraph query.

Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25 doc: update repo & url link for inner repo Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884 doc: update repo & url link for inner repo (GraphPlatform-4190) doc: update the link info

…ph/hugegraph-ai into master-icode

sync Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f

Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e

Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8

Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e

Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263

Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0

Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6

Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121

Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2

Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd

Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6

Change-Id: I799844222f194d54990b27d71770543ad6c707ca

Change-Id: Ica35d454bf7a1fa12d3cfb14332530565cd08f94

Copilot

Pull Request Overview

This PR introduces property embedding support and extends existing vector indexing and query functionality. Key changes include:

Adding property vector indexing and cleaning in the vector and graph index utilities.
Updating embedding retrieval and query generation to support property keywords and property embeddings.
Integrating property-related logic into semantic queries, index building, API endpoints, and prompt configuration.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py	Added property vector index loading and updated index info output.
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py	Incorporated cleaning and update logging for property vector indexes.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py	Passed an additional 'properties' parameter to LLM generation.
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py	Integrated exact and fuzzy matching for property keys and values.
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py	Added diffing and update logic for property embeddings with limit checks.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py	Updated gremlin queries to use property name/value filtering in fallback scenarios.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py	Extended graph summary with extracted index label properties.
hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py	Passed property data to the gremlin generate operator.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py	Registered and imported the new vector API endpoint.
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py	Updated prompt content to include property references for query generation.
hugegraph-llm/src/hugegraph_llm/api/vector_api.py	Introduced new API endpoint for updating vector embeddings with rate limits.

Comments suppressed due to low confidence (1)

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:349

The variable 'use_id_to_match' is used without a visible definition in this context. Please ensure it is properly declared and initialized before use.

node_str = f"{item['id']}{{{props_str}}}" if use_id_to_match else f"{item['props']}{{{props_str}}}"

Copilot · 2025-05-26T11:16:51Z

+                add_propsets.append(propset)
+                add_prop_values.append(prop_value)
+            if add_prop_values:
+                if len(add_prop_values) > 100000:


[nitpick] Consider refactoring the hardcoded property update limit (100000) into a configurable parameter to improve maintainability and flexibility.

Copilot · 2025-05-26T11:16:52Z

        for keyword in keywords:
-            keyword_vector = self.embedding.get_text_embedding(keyword)
-            results = self.vector_index.search(keyword_vector, top_k=self.topk_per_keyword,
+            keyword_vector = self.embedding.get_texts_embeddings([keyword])


Ensure that 'get_texts_embeddings' returns a non-empty list before accessing its first element to avoid potential IndexError.

Suggested change

keyword_vector = self.embedding.get_texts_embeddings([keyword])

keyword_vector = self.embedding.get_texts_embeddings([keyword])

if not keyword_vector: # Ensure the list is non-empty

log.warning("No embeddings found for keyword: %s", keyword)

continue

Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f

Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405

Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1

Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08

Change-Id: Idb8c1476ebda521e195022f14f6733a33288f552

Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9

Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352

Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0

…arhugegraph/hugegraph-ai into property_embedding

imbajin

Blocking: yes. Summary: Property embedding and GraphRAG paths have correctness regressions, and the current head fails local lint/format checks. Evidence: current head a0e460c; git diff --check refs/remotes/apache/main...refs/remotes/pr/240 fails, uv run ruff check <changed files> reports F401, and uv run ruff format --check <changed files> would reformat 9 files.

imbajin · 2026-06-10T01:15:12Z


+INDEX_PROPERTY_GREMLIN = """
+g.V().hasLabel('{label}')
+ .limit(100000)


‼️ Avoid truncating the property snapshot before diffing

Evidence: this query only reads the first 100000 vertices for each indexed label, but _update_property_index() later treats every previously indexed property value missing from this truncated snapshot as deleted and removes it from graph_props. On graphs with more than 100000 matching vertices, incremental rebuilds can silently delete valid property vectors and degrade property recall. Please page through all matching vertices or avoid deleting existing property vectors when the present snapshot is known to be capped.

imbajin · 2026-06-10T01:15:12Z

+            add_propsets.append(propset)
+            add_prop_values.append(prop_value)
+        if add_prop_values:
+            if len(add_prop_values) > 100000:


‼️ Check the update limit before mutating the index

Evidence: to_remove is already removed and persisted at lines 184-187 before this > 100000 guard returns early. That leaves the on-disk property index half-updated: deletions are committed, but additions/updates are skipped. Please compute and validate the full mutation set before any remove() / to_index_file() call, or stage the new index and commit it atomically only after the limit check passes.

imbajin · 2026-06-10T01:15:12Z

            query = context["query"]
-            query_vector = self.embedding.get_text_embedding(query)
+            query_vector = self.embedding.get_texts_embeddings([query])
            results = self.vector_index.search(query_vector, top_k=self.topk_per_query)


‼️ Pass a single query embedding to vector search

Evidence: get_texts_embeddings([query]) returns a list of embeddings, but VectorIndex.search() expects one List[float] and checks len(query_vector) == index.d. With an existing index this path passes a 2-D value and raises a dimension mismatch, breaking SemanticIdQuery(by="query"). Please use the first embedding, as the keyword/property branches already do, and add a regression test for the whole-query path.

Suggested change

results = self.vector_index.search(query_vector, top_k=self.topk_per_query)

results = self.vector_index.search(query_vector[0], top_k=self.topk_per_query)

imbajin · 2026-06-10T01:15:12Z

            self._max_items = context["max_items"]
-        if isinstance(context.get("prop_to_match"), str):
-            self._prop_to_match = context["prop_to_match"]
+        if isinstance(context.get("match_props"), list):


‼️ Do not force property fallback when VID matches exist

Evidence: whenever match_props is a list, this sets _prop_to_match, so use_id_to_match becomes false even if match_vids also contains exact vertex hits. A query that matches both a vertex id and an indexed property now skips the more precise VID-neighbor query and is forced into the property fallback path. Please prefer VID matching when matched_vids is non-empty, and use the property path only when no VID recall is available.

imbajin · 2026-06-10T01:15:12Z

-            graph_chain_knowledge, vertex_degree_list, knowledge_with_degree = self._format_graph_query_result(
-                query_paths=paths
-            )
+                paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]


‼️ Handle heterogeneous neighbors in property fallback

Evidence: this traversal starts from vertices with current_prop_name, but then walks arbitrary neighbors through bothE().otherV(). _process_vertex() and _process_edge() later read item["props"][self._prop_to_match]; if a neighbor label does not have the same property key, a normal heterogeneous graph raises KeyError instead of returning partial graph context. Please format property fallback paths using ids or safe .get() values for non-matching neighbors.

imbajin · 2026-06-10T01:15:12Z

    def graph_config_api(req: GraphConfigRequest):
        # Accept status code
-        res = apply_graph_conf(req.url, req.name, req.user, req.pwd, req.gs, origin_call="http")
+        res = apply_graph_conf(req.url, req.graph, req.user, req.pwd, req.gs, origin_call="http")


‼️ Thread token auth through the persistent graph config API

Evidence: GraphConfigRequest now accepts token, and per-request client_config copies it into huge_settings, but /config/graph still calls apply_graph_conf() without req.token. Token-only users can submit the field successfully, yet the saved graph configuration still validates and persists only basic-auth credentials. Please pass and persist token here, and make the connectivity check use bearer auth when a token is provided.

imbajin · 2026-06-10T01:15:12Z

+def vector_http_api(router: APIRouter, update_embedding_func):
+    @router.post("/vector/embedding", status_code=status.HTTP_200_OK)
+    def update_embedding_api(
+            daily_limit: int = 50,


⚠️ Do not let callers choose their own rate limit

Evidence: the endpoint compares call_count >= daily_limit, but daily_limit is a request parameter with a permissive default. Any authenticated caller can bypass the protection by calling /vector/embedding?daily_limit=999999, so the expensive embedding rebuild is not actually rate-limited. Please move the limit to server-side config or clamp user input to a configured maximum.

imbajin · 2026-06-10T01:15:12Z

-    name: str = Query('hugegraph', description="hugegraph client name.")
-    user: str = Query('', description="hugegraph client user.")
-    pwd: str = Query('', description="hugegraph client pwd.")
+    graph: str = Query('hugegraph', description="hugegraph client name.")


⚠️ Keep the old name field compatible

Evidence: the public graph config model changed name to graph and gives graph a default of "hugegraph". Existing clients that still send {"name": "custom_graph"} will not get a validation error; the value is silently ignored and requests fall back to the default graph. Please add an alias/backward-compatible field mapping, or make old payloads fail loudly instead of targeting the wrong graph.

imbajin · 2026-06-10T01:15:12Z


 from hugegraph_llm.config import prompt
-
+from hugegraph_llm.config import huge_settings


⚠️ Remove the unused import before merging

Evidence: uv run ruff check on the changed Python files fails with F401 huge_settings imported but unused on this line. Since the PR currently only has the triage check on GitHub, this would be caught by the repository's normal Python quality gate later. Please remove the import and run uv run ruff check plus uv run ruff format --check for the touched Python files.

imbajin · 2026-06-10T01:15:12Z

@@ -0,0 +1,242 @@
+# Licensed to the Apache Software Foundation (ASF) under one


⚠️ Put regression coverage under the real test tree

Evidence: this added test.py lives inside the production package and duplicates operator code instead of adding tests under hugegraph-llm/src/tests/..., which is where this module's CI-oriented tests live. The new property-embedding, token config, property fallback, and vector API behavior therefore has no runnable regression coverage. Please move this into focused tests under src/tests/api/ and src/tests/operators/index_op/ rather than shipping it as production package code.

VGalaxies

Review summary

Blocking: yes
Summary: The remaining non-duplicate issues are API/runtime regressions and credential/query-safety problems introduced by the PR.
Evidence:
- static review of git diff origin/main...HEAD, with rg, nl -ba, and targeted diff inspection

VGalaxies · 2026-06-12T05:35:29Z

-        pwd: str,
+        user: Optional[str] = None,
+        pwd: Optional[str] = None,
+        token: Optional[str] = None,


High: Preserve positional graphspace compatibility

hugegraph-python-client/src/pyhugegraph/client.py:56

Evidence

token was inserted before graphspace in PyHugeClient.__init__, while the previous fifth positional argument was graphspace; an existing call still uses PyHugeClient(url, graph, user, pwd, gs) at hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py:262.

Impact

Existing positional callers silently bind graphspace as a bearer token, disabling graphspace routing and changing auth behavior.

Requested fix

Keep graphspace in its old positional slot and make token keyword-only, or append token after the existing positional parameters.

VGalaxies · 2026-06-12T05:35:29Z

+        "template_execution_result",
+        "raw_execution_result",
+    ]
+    original_results = gremlin_generate(inp, example_num, schema_input, gremlin_prompt_input)


High: Do not execute Gremlin before filtering requested outputs

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py:237

Evidence

gremlin_generate_selective() calls gremlin_generate() unconditionally before checking requested_outputs; gremlin_generate() executes both generated queries through run_gremlin_query() at lines 98 and 102, even when the request only asks for template_gremlin.

Impact

/text2gremlin can execute LLM-generated Gremlin against HugeGraph when the caller requested generation output only.

Requested fix

Split generation from execution and call run_gremlin_query() only when template_execution_result or raw_execution_result is explicitly requested.

VGalaxies · 2026-06-12T05:35:29Z

+        unmatched_keywords = set(keywords)
+        for key in property_keys:
+            for keyword in list(unmatched_keywords):
+                gremlin_query = f"g.V().has('{key.name}', '{keyword}').limit(1)"


High: Bind or escape property Gremlin values

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py:96

Evidence

_exact_match_properties() interpolates key.name and user-derived keyword directly into g.V().has('{key.name}', '{keyword}'); the property fallback also formats prop_name and prop_value into Gremlin at graph_rag_query.py:219.

Impact

A keyword or indexed property value containing quotes or Gremlin syntax can alter the query sent to HugeGraph.

Requested fix

Use Gremlin bindings or a shared safe serialization/escaping helper for property keys and values before execution.

VGalaxies · 2026-06-12T05:35:29Z

+        """
+        Resolve the path to a full URL.
+        """
+        url = f"http://{self._cfg.ip}:{self._cfg.port}/"


Medium: Avoid sending Vermeer tokens only over plaintext HTTP

vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py:72

Evidence

VermeerSession requires a token and sets it in the Authorization header at line 42, but resolve() always builds http://{ip}:{port}/ and the client exposes only ip, port, and token.

Impact

Bearer tokens are exposed on the network for non-local Vermeer deployments.

Requested fix

Accept a full base URL or scheme option, support HTTPS, and avoid defaulting credentialed requests to plaintext transport.

Jin and others added 25 commits January 2, 2025 19:24

doc: update for inner repo (GraphPlatform-4190)

74cf3c5

Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25 doc: update repo & url link for inner repo Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884 doc: update repo & url link for inner repo (GraphPlatform-4190) doc: update the link info

Merge branch 'master' of ssh://icode.baidu.com:8235/baidu/starhugegra…

e9941ad

…ph/hugegraph-ai into master-icode

Merge branch 'main' into master-icode

d2e4e2e

Merge branch 'main' into master-icode

c5ec3ea

docs(client): update README.md

12221c9

Merge remote-tracking branch 'github/main'

8343d3c

sync Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f

Merge branch 'main'

0036b2c

Merge branch 'main' into master-icode

173b9b2

fix(llm): enable concurrency config in rag answer

c9ef9ed

Merge branch 'main' into master-icode

c933c58

Merge branch 'main' into master-icode

263758b

Merge branch 'main' into master-icode

8a0e6cd

GraphPlatform-4765 [Tech Task] vermeer-client框架开发

a60887d

Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e

update llm settings

43cdcae

Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8

set num_gremlin_generate_example 0

b4f1a51

Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e

disable text2gql by default

142a44e

Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263

remove log info

9f245a0

Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0

temperature 0.01

87cb1cf

Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6

merge community

961f582

Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121

modify prompt to only output gql

4cbb47a

Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2

test_api_connection

36116e6

Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd

merge main to master

5db4e0c

Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6

empty chunk

bf19a84

Change-Id: I799844222f194d54990b27d71770543ad6c707ca

build, get and remove property embeddings

026c65c

limit the number of props to be updated

f2ee374

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 19, 2025

github-actions Bot added the llm label May 19, 2025

dosubot Bot added the enhancement New feature or request label May 19, 2025

disable pylint and modify limit logic

0e98879

dosubot Bot removed the size:M This PR changes 30-99 lines, ignoring generated files. label May 20, 2025

jinsong04 and others added 7 commits May 26, 2025 09:35

pylint

923502a

limit 2 times one day

739479a

pylint

9acaa96

inner

86e6098

Merge branch 'main' into master-icode

b27d925

Merge branch 'master-icode' into property_embedding

765e93f

format

b5f31ff

Change-Id: Ica35d454bf7a1fa12d3cfb14332530565cd08f94

imbajin requested a review from Copilot May 26, 2025 11:15

Copilot AI reviewed May 26, 2025

View reviewed changes

imbajin and others added 12 commits May 26, 2025 19:46

fix lint & comment

50ec338

text2gremlin api

7d9d67c

Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f

change params

c918e73

Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405

change params

7b7260a

Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1

text to json

3754211

Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08

detail

7a2cf2b

Change-Id: Idb8c1476ebda521e195022f14f6733a33288f552

add graph_space in text2gremlin api

7b3e5e2

Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9

add graph_space in text2gremlin api

cb31b35

Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352

change default in text2gremlin api

9778c37

Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0

split build_semantic_index.run()

cf6d2e4

Merge branch 'property_embedding' of https://icode.baidu.com/baidu/st…

eb5e9f1

…arhugegraph/hugegraph-ai into property_embedding

conflict

e75f68f

dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels May 28, 2025

github-actions Bot added the python-client label May 28, 2025

jinsong04 added 4 commits May 29, 2025 10:33

change daily limit

4aa1a4d

create pyhugegraph client by token

7b7c6d2

change param

31bf971

name -> graph

a0e460c

imbajin reviewed Jun 10, 2026

View reviewed changes

VGalaxies reviewed Jun 12, 2026

View reviewed changes

-            keyword_vector = self.embedding.get_texts_embeddings([keyword])
+            keyword_vector = self.embedding.get_texts_embeddings([keyword])
+            if not keyword_vector:  # Ensure the list is non-empty
+                log.warning("No embeddings found for keyword: %s", keyword)
+                continue

	results = self.vector_index.search(query_vector, top_k=self.topk_per_query)
	results = self.vector_index.search(query_vector[0], top_k=self.topk_per_query)


		from hugegraph_llm.config import prompt

		from hugegraph_llm.config import huge_settings

		@@ -0,0 +1,242 @@
		# Licensed to the Apache Software Foundation (ASF) under one

Conversation

MrJs133 commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

update, get and delete property embeddings

keywords match

text2gremlin

subgraph_query

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 26, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 26, 2025

Choose a reason for hiding this comment

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VGalaxies left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

MrJs133 commented May 19, 2025 •

edited

Loading