[python] Add Split.to_dict() to expose planned split contents #455
XiaoHongbo-Hope wants to merge 1 commit into
Let a non-Rust reader (e.g. pypaimon) rebuild its own DataSplit from a Rust-planned split and read the files directly, without re-running its own scan planning. to_dict() exposes bucket / bucket_path / total_buckets / partition / raw_convertible and, per data file, the fully-resolved file_path plus scalar metadata (schema_id, level, sequence numbers, first_row_id, write_cols, creation_time, ...) and aligned per-file deletion files. partition is the serialized BinaryRow, byte-identical to a manifest _PARTITION. Planning-only statistics (key/value stats, min/max key) are omitted since planning already happened. Tested via test_split_to_dict_exposes_fields and test_split_to_dict_partition_and_reads.
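As a rough illustration of the consumer side, the sketch below rebuilds a reader-side split object from such a dict. It is only a sketch under stated assumptions: the scalar field names come from the description above, but the container keys `data_files` and `deletion_files`, the shape of a deletion-file entry, the `PlannedFile` / `RebuiltSplit` / `rebuild_split` names, and every sample value are hypothetical and may not match what `to_dict()` actually returns.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PlannedFile:
    """One data file of a planned split, paired with its deletion file (if any)."""
    file_path: str
    schema_id: int
    level: int
    first_row_id: Optional[int]
    deletion_file: Optional[dict]


@dataclass(frozen=True)
class RebuiltSplit:
    """Hypothetical reader-side split; not pypaimon's real DataSplit class."""
    bucket: int
    bucket_path: str
    total_buckets: int
    partition: bytes  # serialized BinaryRow, kept opaque here
    raw_convertible: bool
    files: list


def rebuild_split(d: dict[str, Any]) -> RebuiltSplit:
    data_files = d["data_files"]  # hypothetical key name
    deletion_files = d.get("deletion_files")  # hypothetical key name
    if deletion_files is None:
        deletion_files = [None] * len(data_files)
    # "Aligned" is read here as: one entry per data file, None when absent.
    if len(deletion_files) != len(data_files):
        raise ValueError("deletion files must align one-to-one with data files")
    files = [
        PlannedFile(
            file_path=f["file_path"],  # already fully resolved, no path joining
            schema_id=f["schema_id"],
            level=f["level"],
            first_row_id=f.get("first_row_id"),
            deletion_file=dv,
        )
        for f, dv in zip(data_files, deletion_files)
    ]
    return RebuiltSplit(
        bucket=d["bucket"],
        bucket_path=d["bucket_path"],
        total_buckets=d["total_buckets"],
        partition=d["partition"],
        raw_convertible=d["raw_convertible"],
        files=files,
    )


# Made-up sample input; the values are placeholders, not real table contents.
sample = {
    "bucket": 0,
    "bucket_path": "/warehouse/db.db/t/bucket-0",
    "total_buckets": 4,
    "partition": b"\x00\x01",  # placeholder bytes, not a valid BinaryRow
    "raw_convertible": True,
    "data_files": [
        {"file_path": "/warehouse/db.db/t/bucket-0/data-0.parquet",
         "schema_id": 0, "level": 0, "first_row_id": None},
        {"file_path": "/warehouse/db.db/t/bucket-0/data-1.parquet",
         "schema_id": 0, "level": 0, "first_row_id": None},
    ],
    "deletion_files": [
        None,
        {"path": "/warehouse/db.db/t/index/index-0", "offset": 0, "length": 16},
    ],
}

split = rebuild_split(sample)
for planned in split.files:
    print(planned.file_path, "has deletion file:", planned.deletion_file is not None)
```

Keeping `partition` as opaque bytes mirrors the statement that it is byte-identical to a manifest _PARTITION, so the reader can decode it with whatever BinaryRow logic it already has.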
Thanks @XiaoHongbo-Hope! A question on necessity first. The direction of the read effort (#413) is to let pypaimon run its DataFrame read on the Rust core — initially as a basic, opt-in path behind a config flag, not a wholesale replacement, so it can mature alongside the pure-Python path. In that model Rust both plans and reads: PR3 already exposes
So: what's the use case for exposing split internals? If it's for a Rust-plans / Python-reads path, I think we should align on that direction first; otherwise it risks pulling us away from the opt-in Rust read path we're building toward.
Purpose
Add Split.to_dict() to the Python bindings so a non-Rust reader (e.g. pypaimon) can rebuild its own split from a Rust-planned split and read the files directly — planning runs in Rust (the serial, driver-side bottleneck), reading stays in the existing reader, no re-planning.
Brief change log
PySplit::to_dict() exposes a planned split as a plain dict — bucket / paths / partition plus per-file metadata (file_path, schema_id, first_row_id, write_cols, …) and deletion files. Planning-only stats omitted.
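One consequence of returning a plain dict is that the result can be passed to code that never touches the Rust object, for example by serializing it. The sketch below is an assumption-laden illustration, not part of this PR: it assumes `partition` arrives as Python `bytes` and that every other value is JSON-compatible, and the helper names `split_to_json` / `split_from_json` and the sample dict are hypothetical.

```python
import base64
import json


def split_to_json(split_dict: dict) -> str:
    """Encode a split dict as JSON, base64-encoding the partition bytes."""
    out = dict(split_dict)  # shallow copy so the caller's dict is untouched
    out["partition"] = base64.b64encode(split_dict["partition"]).decode("ascii")
    return json.dumps(out, sort_keys=True)


def split_from_json(payload: str) -> dict:
    """Inverse of split_to_json: restore the partition to raw bytes."""
    d = json.loads(payload)
    d["partition"] = base64.b64decode(d["partition"])
    return d


# Made-up minimal dict; real output carries more fields.
example = {
    "bucket": 3,
    "total_buckets": 8,
    "raw_convertible": True,
    "partition": b"\x00\x01\x02",  # placeholder bytes, not a valid BinaryRow
}

restored = split_from_json(split_to_json(example))
print(restored == example)
```

The base64 step exists only because JSON has no bytes type; a transport that handles bytes natively (such as `pickle`) would not need it.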
Tests
test_split_to_dict_exposes_fields, test_split_to_dict_partition_and_reads.
API and Format
Additive — new Split.to_dict() Python method. No storage-format change.
Documentation
Covered by the method docstring.