[python] Add Split.to_dict() to expose planned split contents #455

Open
XiaoHongbo-Hope wants to merge 1 commit into apache:main from XiaoHongbo-Hope:split_optimize


@XiaoHongbo-Hope XiaoHongbo-Hope commented Jul 5, 2026

Contributor

Purpose

Add Split.to_dict() to the Python bindings so that a non-Rust reader (e.g. pypaimon) can rebuild its own split from a Rust-planned split and read the files directly. Planning, which is the serial, driver-side bottleneck, runs in Rust; reading stays in the existing reader, with no re-planning.

Brief change log

PySplit::to_dict() exposes a planned split as a plain dict: bucket, paths, and partition, plus per-file metadata (file_path, schema_id, first_row_id, write_cols, …) and deletion files. Planning-only stats are omitted.
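As a rough illustration of the consumer side, a Python reader could turn such a dict into its own split object along these lines. This is only a sketch under stated assumptions: the top-level keys `data_files` and `deletion_files`, the shape of each deletion-file entry, and the `RebuiltSplit` / `RebuiltFile` / `rebuild_split` names are hypothetical and not taken from this PR; only `bucket`, `bucket_path`, `partition`, `file_path`, `schema_id`, `first_row_id`, and `write_cols` are field names mentioned in the description and commit message. The sample dict is hand-written, not real output.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RebuiltFile:
    """Reader-side view of one data file (hypothetical class)."""
    file_path: str                 # fully resolved path
    schema_id: int
    first_row_id: Optional[int]
    write_cols: Optional[list]
    deletion_file: Optional[dict]  # None when the file has no deletion file


@dataclass
class RebuiltSplit:
    """Reader-side split rebuilt from the dict (hypothetical class)."""
    bucket: int
    bucket_path: str
    partition: bytes               # serialized BinaryRow bytes
    files: list


def rebuild_split(d: dict) -> RebuiltSplit:
    data_files = d["data_files"]              # assumed key name
    deletion_files = d.get("deletion_files")  # assumed key name
    if deletion_files is None:
        deletion_files = [None] * len(data_files)
    # Deletion files are assumed to be aligned one-to-one with data files.
    if len(deletion_files) != len(data_files):
        raise ValueError("deletion_files must align with data_files")
    files = [
        RebuiltFile(
            file_path=f["file_path"],
            schema_id=f["schema_id"],
            first_row_id=f.get("first_row_id"),
            write_cols=f.get("write_cols"),
            deletion_file=dv,
        )
        for f, dv in zip(data_files, deletion_files)
    ]
    return RebuiltSplit(d["bucket"], d["bucket_path"], d["partition"], files)


# Hand-written sample standing in for a to_dict() result.
sample: dict[str, Any] = {
    "bucket": 0,
    "bucket_path": "/warehouse/db/t/bucket-0",
    "partition": b"\x00\x00\x00\x00",
    "data_files": [
        {"file_path": "/warehouse/db/t/bucket-0/data-0.parquet",
         "schema_id": 0, "first_row_id": 0, "write_cols": None},
        {"file_path": "/warehouse/db/t/bucket-0/data-1.parquet",
         "schema_id": 0, "first_row_id": 100, "write_cols": ["a", "b"]},
    ],
    "deletion_files": [
        {"path": "/warehouse/db/t/index/dv-0", "offset": 0, "length": 16},
        None,
    ],
}

split = rebuild_split(sample)
```

The point of the sketch is that the reader needs only paths and scalar metadata, so it can open the files without running its own scan planning.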

Tests

test_split_to_dict_exposes_fields, test_split_to_dict_partition_and_reads.

API and Format

Additive — new Split.to_dict() Python method. No storage-format change.

Documentation

Covered by the method docstring.

@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft July 5, 2026 03:11
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the split_optimize branch 8 times, most recently from a1c8b00 to becb6aa July 5, 2026 08:37
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review July 5, 2026 08:47
Let a non-Rust reader (e.g. pypaimon) rebuild its own DataSplit from a
Rust-planned split and read the files directly, without re-running its
own scan planning.

to_dict() exposes bucket / bucket_path / total_buckets / partition /
raw_convertible and, per data file, the fully-resolved file_path plus
scalar metadata (schema_id, level, sequence numbers, first_row_id,
write_cols, creation_time, ...) and aligned per-file deletion files.
partition is the serialized BinaryRow, byte-identical to a manifest
_PARTITION. Planning-only statistics (key/value stats, min/max key) are
omitted since planning already happened.

Tested via test_split_to_dict_exposes_fields and
test_split_to_dict_partition_and_reads.
@JunRuiLee

Contributor

Thanks @XiaoHongbo-Hope! A question on necessity first.

The direction of the read effort (#413) is to let pypaimon run its DataFrame read on the Rust core — initially as a basic, opt-in path behind a config flag, not a wholesale replacement, so it can mature alongside the pure-Python path. In that model Rust both plans and reads: PR3 already exposes new_read().read(splits) returning Arrow from the Rust TableRead, and splits stay opaque on the Python side — a transport token, nothing more — so pypaimon never needs to look inside one.

to_dict() exposes the full internal split contents, which is only needed if something rebuilds the split and reads the files outside Rust — i.e. reading in Python, which is the opposite of the direction #413 is moving in.

So: what's the use case for exposing split internals? If it's for a Rust-plans / Python-reads path, I think we should align on that direction first — otherwise it risks pulling us away from the opt-in Rust read path we're building toward.

