Support DataFusion lateral vector search joins #452
leaves12138
left a comment
Reviewed the lateral vector_search implementation. The existing literal JSON vector_search path remains intact, and the new lateral path is cleanly routed through a logical rewrite plus a custom physical extension planner. The execution path batches query vectors from the left input, preserves the per-query result mapping, and reads target rows by _ROW_ID without falling back to a full table scan.
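To make the "preserves the per-query result mapping" point concrete, here is a minimal sketch of the idea, not the code in this PR. The names `Hit`, `brute_force_search`, and `flatten_hits` are hypothetical, and the brute-force distance scan only stands in for the real index-backed batch search. Each left row contributes one query vector, and the flattened output keeps a left-row index next to every target row id so the joined rows can be assembled afterwards.

```rust
/// One search hit: the id of the target row and its distance to the query.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub row_id: u64,
    pub distance: f32,
}

/// Hypothetical stand-in for the batched search: returns one hit list per
/// query vector, in the same order as `queries`.
pub fn brute_force_search(
    queries: &[Vec<f32>],
    targets: &[(u64, Vec<f32>)],
    k: usize,
) -> Vec<Vec<Hit>> {
    queries
        .iter()
        .map(|q| {
            let mut hits: Vec<Hit> = targets
                .iter()
                .map(|(id, v)| {
                    // Squared Euclidean distance between query and target.
                    let distance: f32 =
                        q.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
                    Hit { row_id: *id, distance }
                })
                .collect();
            hits.sort_by(|a, b| a.distance.total_cmp(&b.distance));
            hits.truncate(k);
            hits
        })
        .collect()
}

/// Flatten per-query hits into two parallel vectors: the index of the left
/// row that produced each hit, and the target row id to read.
pub fn flatten_hits(per_query: &[Vec<Hit>]) -> (Vec<usize>, Vec<u64>) {
    let mut left_indices = Vec::new();
    let mut row_ids = Vec::new();
    for (left_idx, hits) in per_query.iter().enumerate() {
        for hit in hits {
            left_indices.push(left_idx);
            row_ids.push(hit.row_id);
        }
    }
    (left_indices, row_ids)
}
```

In this sketch the `left_indices` vector plays the role of a "take" index over the left batch, while `row_ids` is what a row-id lookup would consume instead of a full table scan.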
Validation passed locally:
git diff --cached --check
cargo fmt --all -- --check
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_lateral_join_uses_query_vectors -- --nocapture
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_java_vindex_table -- --nocapture
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_without_matching_index_returns_empty -- --nocapture
cargo test -p paimon table::vector_search_builder -- --nocapture
cargo clippy -p paimon-datafusion --test read_tables -- -D warnings
No blocking issues found.
QuakeWang
left a comment
LGTM. The change is well-scoped, reuses the existing batch vector search/read path cleanly, and the added lateral join test covers the main behavior. I only left one non-blocking nit around the extension node Eq/Hash contract.
}))
}

fn dyn_hash(&self, mut state: &mut dyn Hasher) {
nit: Should input also participate in dyn_hash / dyn_eq? The node's semantics depend on the child plan too, and DataFusion extension nodes generally expect Eq/Hash to represent the full logical node. Non-blocking, but including it would make the contract tighter.
Fixed by including the child logical plan in both dyn_hash and dyn_eq. Thanks for the catch.
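For readers following the Eq/Hash nit, the contract can be illustrated with a small standalone sketch that uses only the standard library. `Plan` and `LateralSearchNode` are hypothetical stand-ins, not the types in this PR or in DataFusion; the point is only that the child plan takes part in both equality and hashing, so two nodes that differ only in their input are not treated as the same node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical stand-in for a child logical plan.
#[derive(Debug, PartialEq, Eq, Hash)]
pub enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Arc<Plan> },
}

/// Hypothetical extension node: a child plan plus the node's own settings.
#[derive(Debug)]
pub struct LateralSearchNode {
    pub input: Arc<Plan>,
    pub target_table: String,
    pub limit: usize,
}

impl PartialEq for LateralSearchNode {
    fn eq(&self, other: &Self) -> bool {
        // The child plan is compared together with the node's own fields.
        self.input == other.input
            && self.target_table == other.target_table
            && self.limit == other.limit
    }
}

impl Eq for LateralSearchNode {}

impl Hash for LateralSearchNode {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly the fields used by `eq`, including the child plan,
        // so that equal nodes always produce equal hashes.
        self.input.hash(state);
        self.target_table.hash(state);
        self.limit.hash(state);
    }
}

/// Helper that hashes any value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Keeping `eq` and `hash` over the same field set is what makes the node safe to use as a key in hash-based collections.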
4011abc to 4425fce
Summary
Adds DataFusion support for CROSS JOIN LATERAL vector_search(...) when the query vector comes from the left input column, planning it as batched vector search per input batch. The existing literal JSON vector_search path remains unchanged.

Changes
- SQLContext to rewrite supported lateral joins into LateralVectorSearchExec.
- vector_search UDTF with a lateral marker provider for column query vector arguments.
- _ROW_ID, and emits joined left/right rows.

Testing
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_lateral_join_uses_query_vectors -- --nocapture
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_java_vindex_table -- --nocapture
cargo test -p paimon-datafusion --test read_tables vector_search_tests::test_vector_search_without_matching_index_returns_empty -- --nocapture
cargo test -p paimon table::vector_search_builder -- --nocapture
cargo fmt --all --check
git diff --check
cargo clippy -p paimon-datafusion --test read_tables -- -D warnings

Notes
- CROSS JOIN LATERAL vector_search(...) / inner joins with no join predicate. Unsupported shapes continue to fail through the marker provider scan path.
- List<Float32> or FixedSizeList<Float32>.
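The constraints in the notes can be read as a simple shape check. The sketch below assumes they mean that only cross joins and predicate-free inner joins are rewritten, and that the query vector column must be a list or fixed-size list of Float32. All enums and the function `is_supported_lateral_shape` are hypothetical models written for illustration; the PR itself presumably works against DataFusion and Arrow types instead.

```rust
/// Hypothetical model of the join kinds a planner might see.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinKind {
    Cross,
    Inner,
    Left,
    Right,
    Full,
}

/// Hypothetical model of list element types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElemType {
    Float32,
    Float64,
    Int32,
}

/// Hypothetical model of the query vector column's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    List(ElemType),
    FixedSizeList(ElemType, usize),
    Other,
}

/// Returns true when the join shape matches the assumed supported cases:
/// a cross join or an inner join with no join predicate, whose query vector
/// column is List<Float32> or FixedSizeList<Float32>.
pub fn is_supported_lateral_shape(
    kind: JoinKind,
    has_join_predicate: bool,
    query_column: ColumnType,
) -> bool {
    let join_ok =
        matches!(kind, JoinKind::Cross | JoinKind::Inner) && !has_join_predicate;
    let vector_ok = matches!(
        query_column,
        ColumnType::List(ElemType::Float32)
            | ColumnType::FixedSizeList(ElemType::Float32, _)
    );
    join_ok && vector_ok
}
```

A check like this only decides whether the rewrite applies; per the notes, shapes that do not match are left alone and fail later through the marker provider scan path.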