[python] Add Split.to_dict() to expose planned split contents by XiaoHongbo-Hope · Pull Request #455 · apache/paimon-rust

XiaoHongbo-Hope · 2026-07-05T03:11:04Z

Purpose

Add Split.to_dict() to the Python bindings so that a non-Rust reader (e.g. pypaimon) can rebuild its own split from a Rust-planned split and read the files directly. Planning, the serial driver-side bottleneck, runs in Rust; reading stays in the existing reader, with no re-planning.

Brief change log

PySplit::to_dict() exposes a planned split as a plain dict: bucket, paths, and partition, plus per-file metadata (file_path, schema_id, first_row_id, write_cols, …) and deletion files. Planning-only stats are omitted.

Tests

test_split_to_dict_exposes_fields, test_split_to_dict_partition_and_reads.

API and Format

Additive: adds a new Split.to_dict() Python method. No storage-format change.

Documentation

Covered by the method docstring.

Let a non-Rust reader (e.g. pypaimon) rebuild its own DataSplit from a Rust-planned split and read the files directly, without re-running its own scan planning. to_dict() exposes bucket / bucket_path / total_buckets / partition / raw_convertible and, per data file, the fully-resolved file_path plus scalar metadata (schema_id, level, sequence numbers, first_row_id, write_cols, creation_time, ...) and aligned per-file deletion files. partition is the serialized BinaryRow, byte-identical to a manifest _PARTITION. Planning-only statistics (key/value stats, min/max key) are omitted since planning already happened. Tested via test_split_to_dict_exposes_fields and test_split_to_dict_partition_and_reads.
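
As a rough illustration of how a Python-side reader might consume such a dict, here is a minimal sketch. It is not taken from the PR diff. The top-level and per-file field names follow the description above, but the `data_files` and `deletion_files` keys, the deletion-file fields (`path`, `offset`, `length`), the helper `pair_files_with_deletions`, and every value in `sample_split` are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hand-written stand-in for a dict that Split.to_dict() might return.
# ASSUMPTION: the "data_files" / "deletion_files" keys, the deletion-file
# fields, and all values below are hypothetical placeholders.
sample_split: Dict[str, Any] = {
    "bucket": 0,
    "bucket_path": "warehouse/db.db/t/dt=2026-07-05/bucket-0",
    "total_buckets": 4,
    "partition": b"\x00\x00\x00\x01",  # placeholder for serialized BinaryRow bytes
    "raw_convertible": True,
    "data_files": [
        {
            "file_path": "warehouse/db.db/t/dt=2026-07-05/bucket-0/data-0.parquet",
            "schema_id": 0,
            "level": 0,
            "first_row_id": None,
            "write_cols": None,
        },
        {
            "file_path": "warehouse/db.db/t/dt=2026-07-05/bucket-0/data-1.parquet",
            "schema_id": 0,
            "level": 0,
            "first_row_id": None,
            "write_cols": None,
        },
    ],
    # One entry per data file, aligned by position; None means no deletions.
    "deletion_files": [
        None,
        {"path": "warehouse/db.db/t/index/index-0", "offset": 0, "length": 32},
    ],
}


def pair_files_with_deletions(
    split: Dict[str, Any],
) -> List[Tuple[str, Optional[Dict[str, Any]]]]:
    """Pair each data file path with its positionally aligned deletion file."""
    data_files = split["data_files"]
    # A missing or empty deletion list is treated as "no deletions for any file".
    deletion_files = split.get("deletion_files") or [None] * len(data_files)
    if len(deletion_files) != len(data_files):
        raise ValueError("deletion_files must align one-to-one with data_files")
    return [(f["file_path"], d) for f, d in zip(data_files, deletion_files)]


pairs = pair_files_with_deletions(sample_split)
```

A reader would then open each `file_path` with its own file-format reader and apply the paired deletion file, if any. The length check matters because the alignment is purely positional: a mismatch would silently attach deletions to the wrong file.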

JunRuiLee · 2026-07-05T09:59:27Z

Thanks @XiaoHongbo-Hope! A question on necessity first.

The direction of the read effort (#413) is to let pypaimon run its DataFrame read on the Rust core — initially as a basic, opt-in path behind a config flag, not a wholesale replacement, so it can mature alongside the pure-Python path. In that model Rust both plans and reads: PR3 already exposes new_read().read(splits) returning Arrow from the Rust TableRead, and splits stay opaque on the Python side — a transport token, nothing more — so pypaimon never needs to look inside one.

to_dict() exposes the full internal split contents, which is only needed if something rebuilds the split and reads the files outside Rust — i.e. reading in Python, which is the opposite of the direction #413 is moving in.

So: what's the use case for exposing split internals? If it's for a Rust-plans / Python-reads path, I think we should align on that direction first — otherwise it risks pulling us away from the opt-in Rust read path we're building toward.

XiaoHongbo-Hope marked this pull request as draft July 5, 2026 03:11

XiaoHongbo-Hope force-pushed the split_optimize branch 8 times, most recently from a1c8b00 to becb6aa July 5, 2026 08:37

XiaoHongbo-Hope marked this pull request as ready for review July 5, 2026 08:47

XiaoHongbo-Hope force-pushed the split_optimize branch from becb6aa to 2b534b3 July 5, 2026 08:53

XiaoHongbo-Hope wants to merge 1 commit into apache:main from XiaoHongbo-Hope:split_optimize
