CASSANALYTICS-171: Avoid Spark 4 partitioning warnings during reads #213

Open
liucao-dd wants to merge 1 commit into apache:trunk from liucao-dd:liu/CASSANALYTICS-171-spark4-unknown-partitioning

Conversation

liucao-dd commented May 30, 2026

Summary

  • Report Spark 4 Cassandra scan partitioning as Spark's public UnknownPartitioning directly, instead of a custom Partitioning subclass.
  • Preserve the input partition count when Spark asks CassandraScanBuilder for output partitioning.
  • Add a Spark 4 unit test covering the reported partitioning contract.
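
A minimal sketch of the reporting change summarized above, written so it compiles without Spark on the classpath. The `Partitioning` and `UnknownPartitioning` types here are small stand-ins that mirror the shape of Spark's public types, and `TokenRangeScan` is a hypothetical placeholder rather than the actual `CassandraScanBuilder` code from this patch.

```java
import java.util.Arrays;
import java.util.List;

public class UnknownPartitioningSketch {

    /** Stand-in for Spark's connector Partitioning interface (not the real type). */
    public interface Partitioning {
        int numPartitions();
    }

    /** Stand-in for Spark's public UnknownPartitioning: it carries only a partition count. */
    public static final class UnknownPartitioning implements Partitioning {
        private final int numPartitions;

        public UnknownPartitioning(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        @Override
        public int numPartitions() {
            return numPartitions;
        }
    }

    /** Hypothetical scan that plans one input partition per token range. */
    public static final class TokenRangeScan {
        private final List<long[]> tokenRanges; // each entry is {startToken, endToken}

        public TokenRangeScan(List<long[]> tokenRanges) {
            this.tokenRanges = tokenRanges;
        }

        public int inputPartitionCount() {
            return tokenRanges.size();
        }

        /** Reports the built-in type directly; no connector-specific subclass is involved. */
        public Partitioning outputPartitioning() {
            return new UnknownPartitioning(inputPartitionCount());
        }
    }

    public static void main(String[] args) {
        TokenRangeScan scan = new TokenRangeScan(Arrays.asList(
                new long[] {-9000L, -3000L},
                new long[] {-3000L, 3000L},
                new long[] {3000L, 9000L}));
        Partitioning reported = scan.outputPartitioning();
        System.out.println(reported.getClass().getSimpleName()
                + " numPartitions=" + reported.numPartitions());
    }
}
```

The point of the sketch is that the reported partition count is derived from the planned input partitions, so it stays in step with what the scan actually produces.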

Why UnknownPartitioning

Spark 4's DataSource V2 partitioning contract expects connectors to report one of Spark's public partitioning types, such as UnknownPartitioning or KeyGroupedPartitioning, rather than implementing Partitioning directly. When Spark sees a custom Partitioning, V2ScanPartitioningAndOrdering ignores it and logs a warning.

Cassandra analytics input partitions are token ranges. Rows in one token range can contain many distinct Cassandra partition keys and many distinct token values, so they do not satisfy KeyGroupedPartitioning's contract that every row in a Spark partition evaluates to the same partition value. Reporting UnknownPartitioning preserves the correct partition count without claiming keyed grouping semantics.
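
To make the contract argument concrete, here is a small illustrative check. It is only a model of the rule described above, not Spark code: `satisfiesKeyGrouping` and the sample key values are hypothetical, and each row is reduced to the string value of its Cassandra partition key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class KeyGroupedContractCheck {

    /**
     * Returns true only if every row in the Spark partition has the same
     * partition value, which is the grouping that key-grouped partitioning promises.
     */
    public static boolean satisfiesKeyGrouping(List<String> partitionKeyPerRow) {
        return new HashSet<>(partitionKeyPerRow).size() <= 1;
    }

    public static void main(String[] args) {
        // Hypothetical rows read from one token range: several Cassandra partition keys.
        List<String> tokenRangeRows = Arrays.asList("user-17", "user-17", "user-42", "user-99");
        // Hypothetical rows that all share one partition key value.
        List<String> singleKeyRows = Arrays.asList("user-17", "user-17");

        System.out.println("token range grouped by key: " + satisfiesKeyGrouping(tokenRangeRows));
        System.out.println("single key grouped by key: " + satisfiesKeyGrouping(singleKeyRows));
    }
}
```

A token range that covers more than one partition key fails the check, which is why claiming keyed grouping for such a scan would be incorrect.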

Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-171

Contributor

frankgh left a comment

Good find! I just have one small question.

* Cassandra input partitions are token ranges, not groups of rows sharing one partition key value,
* so the connector reports Spark's unknown partitioning rather than keyed-group partitioning.
*/
class CassandraPartitioning extends UnknownPartitioning
Contributor

Do we even need CassandraPartitioning? Can we instantiate UnknownPartitioning directly in CassandraScanBuilder?

Author

liucao-dd commented May 30, 2026

Good point. I checked Spark's Partitioning contract and several maintained Spark DSV2 connectors (Iceberg, Paimon, Lance). The common pattern is to instantiate UnknownPartitioning directly when the scan cannot guarantee keyed grouping, and reserve KeyGroupedPartitioning for cases where the connector can prove rows are grouped by the reported key expressions. Updated the patch to use UnknownPartitioning directly.

Spark 4 ignores custom DataSource V2 Partitioning implementations and logs a warning. Cassandra scan partitions are token ranges rather than keyed groups, so report Spark's UnknownPartitioning directly while preserving the input partition count.

liucao-dd force-pushed the liu/CASSANALYTICS-171-spark4-unknown-partitioning branch from cc3b49c to d9c2995 on May 30, 2026 19:23
Contributor

frankgh left a comment

+1

Author

liucao-dd commented Jun 1, 2026

https://github.com/apache/cassandra-analytics/actions/runs/26692748197/job/78896333634?pr=213 This test timed out for some reason, even though the patch doesn't touch any Spark 3 code path. Maybe it needs a retry?
