Expose DataFrame-style read API (ReadBuilder / Scan / Split / TableRead) to Python #413

Description

@JunRuiLee

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

PyPaimon has two read paths today:

  • SQL (SQLContext.sql) — already runs on the Rust DataFusion engine.
  • DataFrame (ReadBuilder → Split → TableRead.to_arrow/to_pandas/to_ray) —
    still pure Python, even though the Rust core already implements the same
    model in crates/paimon/src/table/read_builder.rs. It's just not exposed
    through bindings/python (PyTable only has identifier/location/schema).

Goal: expose the existing Rust read API to Python so the DataFrame read path can
optionally run on Rust. Initially this lands as a basic, opt-in path behind a
config flag, running alongside the pure-Python reader rather than replacing it,
so the Rust path can mature before it becomes a default. Write path is out of scope.

Scope (incremental PRs)

This can be implemented incrementally:

  • PR 1 — Expose scan planning:
    new_read_builder(), with_projection(), with_limit(), and
    new_scan().plan() returning serializable splits.

  • PR 2 — Expose filter pushdown:
    add with_filter() after the Python Predicate → Rust Predicate conversion
    layer is defined.

  • PR 3 — Expose split → Arrow read:
    new_read().read(splits) returning Arrow data backed by Rust TableRead.

  • PR 4 (in apache/paimon, [python]) — Wire PyPaimon's
    to_arrow / to_pandas / to_ray to the Rust reader as an opt-in path
    (config-gated), keeping the pure-Python reader as the default. Unsupported
    capabilities error out rather than silently falling back.

PR 1–3 land here; PR 4 lands in the main repo once bindings are released.
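
To make the PR 4 behaviour concrete, here is a minimal sketch of a config-gated reader choice that errors out instead of silently falling back. The option key `read.use-rust-reader`, the capability names, and the `choose_reader` helper are all hypothetical; the issue does not define the real flag or the supported feature set.

```python
# Hypothetical option key; the real config name is not defined in this issue.
RUST_READER_OPTION = "read.use-rust-reader"

# Hypothetical set of capabilities the Rust path supports at some release.
RUST_SUPPORTED = {"projection", "limit"}


class UnsupportedByRustReader(RuntimeError):
    """Raised when the opt-in Rust path cannot serve a request."""


def choose_reader(options: dict, requested: set) -> str:
    """Return 'python' (the default) or 'rust' (opt-in via the flag)."""
    enabled = str(options.get(RUST_READER_OPTION, "false")).lower() == "true"
    if not enabled:
        # Flag absent or false: the pure-Python reader stays the default.
        return "python"
    missing = sorted(requested - RUST_SUPPORTED)
    if missing:
        # Fail loudly rather than silently falling back to Python.
        raise UnsupportedByRustReader(
            "Rust reader does not support: " + ", ".join(missing)
        )
    return "rust"
```

With the flag unset, any request resolves to the Python reader. With the flag set, a request that needs an unsupported capability (for example `"filter"` before PR 2 lands) raises instead of quietly switching engines.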

Notes

with_filter() is separated from the initial scan-planning PR because it
requires a dedicated Python Predicate → Rust Predicate conversion layer. PR 1
focuses on establishing the Python binding shape and serializable splits.
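
As a rough illustration of the binding shape PR 1 is aiming for, the sketch below uses pure-Python stand-ins for the builder and scan objects. Only the method names (`with_projection`, `with_limit`, `new_scan`, `plan`) come from this issue. The class layout, the immutable-builder behaviour, and the fake split payloads are assumptions, and nothing here calls into Rust.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Split:
    """Stand-in for a serializable split; the real payload would come from Rust."""
    payload: bytes


@dataclass(frozen=True)
class ReadBuilder:
    """Stand-in builder. Returning a new builder per call is an assumption."""
    table: str
    projection: Optional[Tuple[str, ...]] = None
    limit: Optional[int] = None

    def with_projection(self, columns):
        return replace(self, projection=tuple(columns))

    def with_limit(self, limit):
        return replace(self, limit=limit)

    def new_scan(self):
        return Scan(self)


@dataclass(frozen=True)
class Scan:
    builder: ReadBuilder

    def plan(self) -> List[Split]:
        # Fake planning: always two splits. Projection and limit are only
        # recorded on the builder here; a real planner would act on them.
        return [
            Split(f"{self.builder.table}/bucket-{i}".encode())
            for i in range(2)
        ]


builder = ReadBuilder("db.orders").with_projection(["id", "amount"]).with_limit(10)
splits = builder.new_scan().plan()
```

The point of the sketch is the call chain, not the planning logic: Python configures the read and receives a list of splits whose contents it never needs to interpret.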

Design principle: in this model Rust both plans and reads.
new_read().read(splits) returns Arrow from the Rust TableRead, and splits
stay opaque on the Python side — a serializable transport token, not
something Python inspects or reads from. Exposing split internals would imply a
Rust-plans / Python-reads path, which is a different direction and out of scope
here.
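
One way to picture the "opaque transport token" idea is a wrapper that exposes only a bytes round trip and nothing else. This is a sketch under assumptions: `OpaqueSplit`, `to_bytes`, and `from_bytes` are hypothetical names, and the actual serialization format of a Rust split is not specified in this issue.

```python
class OpaqueSplit:
    """Hypothetical Python-side handle: carries bytes, exposes no internals."""

    __slots__ = ("_payload",)

    def __init__(self, payload: bytes):
        self._payload = bytes(payload)

    def to_bytes(self) -> bytes:
        # The only thing Python can do with a split: hand its bytes to a transport.
        return self._payload

    @classmethod
    def from_bytes(cls, data: bytes) -> "OpaqueSplit":
        return cls(data)

    def __eq__(self, other):
        return isinstance(other, OpaqueSplit) and self._payload == other._payload

    def __hash__(self):
        return hash(self._payload)

    def __repr__(self):
        # Deliberately shows the size only, never the contents.
        return f"OpaqueSplit(<{len(self._payload)} bytes>)"


def simulate_worker_hop(split: OpaqueSplit) -> OpaqueSplit:
    """Simulate sending a split to another process and rebuilding it there."""
    wire = split.to_bytes()
    return OpaqueSplit.from_bytes(wire)
```

Under this shape, a distributed runner only moves bytes between workers, and the rebuilt token is handed back to the Rust reader unchanged. There are no file paths, buckets, or offsets on the Python object to inspect, which keeps a Rust-plans / Python-reads path from emerging by accident.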

Solution

No response

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
