Support DataFusion lateral vector search joins #452

Merged
JingsongLi merged 1 commit into apache:main from JingsongLi:codex/datafusion-lateral-vector-search
Jul 5, 2026

Conversation

@JingsongLi

Contributor

Summary

Adds DataFusion support for CROSS JOIN LATERAL vector_search(...) when the query vector comes from a column of the left input. The join is planned as a batched vector search, run once per input batch. The existing literal JSON vector_search path remains unchanged.

Changes

  • Registers a custom optimizer rule and query planner in SQLContext to rewrite supported lateral joins into LateralVectorSearchExec.
  • Extends the vector_search UDTF with a lateral marker provider for column query vector arguments.
  • Executes batch vector search per left record batch, reads target rows by _ROW_ID, and emits joined left/right rows.
  • Adds DataFusion coverage with per-row query vectors that produce different top-k matches.
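
To make the execution step above concrete, here is a minimal, dependency-free sketch of the idea: one top-k search per left row in a batch, with each match emitted next to the left row that produced it. It is not the code in this PR. The names `LeftRow`, `TargetRow`, `JoinedRow`, and `lateral_search_batch` are hypothetical, and a brute-force squared-L2 scan stands in for the real vector index lookup and the `_ROW_ID` read.

```rust
/// One row from the left input: an id plus its query vector (hypothetical type).
#[derive(Debug, Clone)]
pub struct LeftRow {
    pub id: u32,
    pub query: Vec<f32>,
}

/// One target row, keyed by a row id that stands in for `_ROW_ID`.
#[derive(Debug, Clone)]
pub struct TargetRow {
    pub row_id: u64,
    pub embedding: Vec<f32>,
}

/// A joined output row: the left key followed by the matched right columns.
#[derive(Debug, Clone, PartialEq)]
pub struct JoinedRow {
    pub left_id: u32,
    pub row_id: u64,
    pub distance: f32,
}

/// Squared Euclidean distance between two vectors of equal length.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Runs one search per query vector in the batch and emits joined rows,
/// keeping every result attached to the left row that produced it.
/// Assumption: brute force replaces the index; targets whose dimension
/// differs from the query are skipped.
pub fn lateral_search_batch(
    left: &[LeftRow],
    targets: &[TargetRow],
    k: usize,
) -> Vec<JoinedRow> {
    let mut out = Vec::new();
    for row in left {
        let mut scored: Vec<(f32, u64)> = targets
            .iter()
            .filter(|t| t.embedding.len() == row.query.len())
            .map(|t| (squared_l2(&row.query, &t.embedding), t.row_id))
            .collect();
        // Nearest first; ties are broken by row id so the output is deterministic.
        scored.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));
        out.extend(scored.into_iter().take(k).map(|(distance, row_id)| JoinedRow {
            left_id: row.id,
            row_id,
            distance,
        }));
    }
    out
}
```

Because each left row carries its own query vector, two left rows can produce different top-k matches, which is the behavior the new lateral join test exercises.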

Testing

  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_lateral_join_uses_query_vectors -- --nocapture
  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_java_vindex_table -- --nocapture
  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_without_matching_index_returns_empty -- --nocapture
  • cargo test -p paimon table::vector_search_builder -- --nocapture
  • cargo fmt --all --check
  • git diff --check
  • cargo clippy -p paimon-datafusion --test read_tables -- -D warnings

Notes

  • The first version supports CROSS JOIN LATERAL vector_search(...) and inner joins with no join predicate. Unsupported shapes continue to fail through the marker provider scan path.
  • The lateral query vector currently must be a column expression yielding List<Float32> or FixedSizeList<Float32>.
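
The type restriction in the last note can be illustrated with a small sketch. The `ColumnType` enum below is a hypothetical, simplified stand-in for Arrow's data types, and `is_supported_query_vector` is an illustrative helper rather than a function from this PR. It only shows the shape of the check: a list or fixed-size list whose items are Float32.

```rust
/// Hypothetical, simplified stand-in for a subset of Arrow data types.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnType {
    Float32,
    Float64,
    Utf8,
    List(Box<ColumnType>),
    FixedSizeList(Box<ColumnType>, usize),
}

/// Returns true only for List<Float32> and FixedSizeList<Float32>,
/// the two query vector types the note above lists as supported.
pub fn is_supported_query_vector(ty: &ColumnType) -> bool {
    match ty {
        ColumnType::List(item) | ColumnType::FixedSizeList(item, _) => {
            matches!(**item, ColumnType::Float32)
        }
        _ => false,
    }
}
```

Under this sketch a `List<Float64>` column would be rejected rather than cast implicitly; whether the real planner casts or rejects such a column is not stated in this PR.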

@leaves12138 leaves12138 left a comment

Reviewed the lateral vector_search implementation. The existing literal JSON vector_search path remains intact, and the new lateral path is cleanly routed through a logical rewrite plus a custom physical extension planner. The execution path batches query vectors from the left input, preserves the per-query result mapping, and reads target rows by _ROW_ID without falling back to a full table scan.

Validation passed locally:

  • git diff --cached --check
  • cargo fmt --all -- --check
  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_lateral_join_uses_query_vectors -- --nocapture
  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_java_vindex_table -- --nocapture
  • cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_without_matching_index_returns_empty -- --nocapture
  • cargo test -p paimon table::vector_search_builder -- --nocapture
  • cargo clippy -p paimon-datafusion --test read_tables -- -D warnings

No blocking issues found.

@QuakeWang QuakeWang left a comment

Member

LGTM. The change is well-scoped, reuses the existing batch vector search/read path cleanly, and the added lateral join test covers the main behavior. I only left one non-blocking nit around the extension node Eq/Hash contract.

}))
}

fn dyn_hash(&self, mut state: &mut dyn Hasher) {

Member

nit: Should input also participate in dyn_hash / dyn_eq? The node's semantics depend on the child plan too, and DataFusion extension nodes generally expect Eq/Hash to represent the full logical node. Non-blocking, but including it would make the contract tighter.

Contributor Author

Fixed by including the child logical plan in both dyn_hash and dyn_eq. Thanks for the catch.
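
For readers following the thread, the contract under discussion can be sketched with plain standard-library traits. This is an illustration and not the code from this PR: `ChildPlan`, `LateralSearchNode`, its fields, and `hash_of` are hypothetical names, and the real node implements DataFusion's `dyn_hash` / `dyn_eq` rather than `Hash` / `PartialEq` directly. The point is only that the child plan takes part in both equality and hashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the child logical plan.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ChildPlan {
    pub description: String,
}

/// Hypothetical extension node: a child plan plus search settings.
#[derive(Debug, Clone)]
pub struct LateralSearchNode {
    pub input: ChildPlan,
    pub table: String,
    pub query_column: String,
    pub limit: usize,
}

impl PartialEq for LateralSearchNode {
    fn eq(&self, other: &Self) -> bool {
        // The child plan is compared along with the node's own settings.
        self.input == other.input
            && self.table == other.table
            && self.query_column == other.query_column
            && self.limit == other.limit
    }
}

impl Eq for LateralSearchNode {}

impl Hash for LateralSearchNode {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly the fields that `eq` compares, child plan included,
        // so that equal nodes always produce equal hashes.
        self.input.hash(state);
        self.table.hash(state);
        self.query_column.hash(state);
        self.limit.hash(state);
    }
}

/// Small helper that hashes any value with the standard default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

If `input` were left out of `eq`, two nodes with identical search settings over different child plans would compare equal, and any plan comparison or deduplication keyed on the node could conflate them.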

@JingsongLi JingsongLi force-pushed the codex/datafusion-lateral-vector-search branch from 4011abc to 4425fce on July 5, 2026 00:29
@JingsongLi JingsongLi merged commit b44ffea into apache:main Jul 5, 2026
8 checks passed