CASSANALYTICS-171: Avoid Spark 4 partitioning warnings during reads by liucao-dd · Pull Request #213 · apache/cassandra-analytics

liucao-dd · 2026-05-30T07:46:57Z

Summary

- Report Spark 4 Cassandra scan partitioning as Spark's public UnknownPartitioning directly, instead of a custom Partitioning subclass.
- Preserve the input partition count when Spark asks CassandraScanBuilder for output partitioning.
- Add a Spark 4 unit test covering the reported partitioning contract.

Why `UnknownPartitioning`

Spark 4's DataSource V2 partitioning contract expects connectors to report one of Spark's public partitioning types, such as UnknownPartitioning or KeyGroupedPartitioning, rather than implementing Partitioning directly. When Spark sees a custom Partitioning, V2ScanPartitioningAndOrdering ignores it and logs a warning.

Cassandra analytics input partitions are token ranges. Rows in one token range can contain many distinct Cassandra partition keys and many distinct token values, so they do not satisfy KeyGroupedPartitioning's contract that every row in a Spark partition evaluates to the same partition value. Reporting UnknownPartitioning preserves the correct partition count without claiming keyed grouping semantics.
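To make the distinction concrete, here is a minimal, self-contained sketch in plain Java with no Spark or Cassandra dependencies. The `Row` class, the sample keys and token values, and the `isKeyGrouped` helper are all hypothetical illustrations rather than code from the patch, and the check is a simplified stand-in for the key-grouped requirement described above. The sketch also assumes token ranges are start-exclusive and end-inclusive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: none of these names come from the patch.
final class TokenRangeSketch {

    /** A row reduced to its Cassandra partition key and that key's token. */
    static final class Row {
        final String partitionKey;
        final long token;

        Row(String partitionKey, long token) {
            this.partitionKey = partitionKey;
            this.token = token;
        }
    }

    /** Rows whose token falls in the range (startExclusive, endInclusive]. */
    static List<Row> rowsInTokenRange(List<Row> rows, long startExclusive, long endInclusive) {
        List<Row> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.token > startExclusive && row.token <= endInclusive) {
                result.add(row);
            }
        }
        return result;
    }

    /** Simplified key-grouped check: every row shares one partition key. */
    static boolean isKeyGrouped(List<Row> sparkPartition) {
        Set<String> keys = new HashSet<>();
        for (Row row : sparkPartition) {
            keys.add(row.partitionKey);
        }
        return keys.size() <= 1;
    }

    /** Made-up sample data: three partition keys with arbitrary tokens. */
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("user-a", -50L));
        rows.add(new Row("user-a", -50L));
        rows.add(new Row("user-b", 10L));
        rows.add(new Row("user-c", 75L));
        return rows;
    }

    public static void main(String[] args) {
        // One token range becomes one Spark input partition.
        List<Row> oneRange = rowsInTokenRange(sampleRows(), -100L, 50L);
        System.out.println("rows in range: " + oneRange.size());
        System.out.println("key grouped:   " + isKeyGrouped(oneRange));
    }
}
```

In this sample the range (-100, 50] picks up rows for both `user-a` and `user-b`, so the simplified check fails for that partition, which is the situation the paragraph above describes.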

Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-171

frankgh

Good find! I just have one small question.

frankgh · 2026-05-30T12:16:56Z

+ * Cassandra input partitions are token ranges, not groups of rows sharing one partition key value,
+ * so the connector reports Spark's unknown partitioning rather than keyed-group partitioning.
+ */
+class CassandraPartitioning extends UnknownPartitioning


Do we even need CassandraPartitioning? Can we instantiate UnknownPartitioning directly from CassandraScanBuilder?

liucao-dd · 2026-05-30

Good point. I checked Spark's Partitioning contract and several maintained Spark DSV2 connectors (Iceberg, Paimon, Lance). The common pattern is to instantiate UnknownPartitioning directly when the scan cannot guarantee keyed grouping, and to reserve KeyGroupedPartitioning for cases where the connector can prove rows are grouped by the reported key expressions. Updated the patch to use UnknownPartitioning directly.
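The pattern described in this reply can be sketched roughly as follows. To keep the example compilable without Spark on the classpath, the nested `Partitioning` and `UnknownPartitioning` types are local stand-ins that mirror the shape of Spark's types of the same names in `org.apache.spark.sql.connector.read.partitioning`. `SketchScanBuilder` and its array of token-range labels are hypothetical and are not the actual `CassandraScanBuilder` code.

```java
// Hypothetical sketch: report unknown partitioning directly, keeping the count.
final class UnknownPartitioningSketch {

    /** Stand-in for Spark's Partitioning interface (assumed shape). */
    interface Partitioning {
        int numPartitions();
    }

    /** Stand-in for Spark's UnknownPartitioning: carries only a partition count. */
    static final class UnknownPartitioning implements Partitioning {
        private final int numPartitions;

        UnknownPartitioning(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        @Override
        public int numPartitions() {
            return numPartitions;
        }
    }

    /** Hypothetical scan builder whose input partitions are token ranges. */
    static final class SketchScanBuilder {
        private final String[] tokenRanges;

        SketchScanBuilder(String... tokenRanges) {
            this.tokenRanges = tokenRanges.clone();
        }

        int inputPartitionCount() {
            return tokenRanges.length;
        }

        /** No custom subclass: instantiate the public type with the input count. */
        Partitioning outputPartitioning() {
            return new UnknownPartitioning(inputPartitionCount());
        }
    }

    public static void main(String[] args) {
        SketchScanBuilder builder = new SketchScanBuilder("range-1", "range-2", "range-3");
        Partitioning reported = builder.outputPartitioning();
        System.out.println("reported partitions: " + reported.numPartitions());
    }
}
```

In real connector code the stand-ins would be replaced by Spark's own classes, and `outputPartitioning()` would be the method declared by Spark's `SupportsReportPartitioning` interface.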

Spark 4 ignores custom DataSource V2 Partitioning implementations and logs a warning. Cassandra scan partitions are token ranges rather than keyed groups, so report Spark's UnknownPartitioning directly while preserving the input partition count.

frankgh

+1

liucao-dd · 2026-06-01T19:58:01Z

https://github.com/apache/cassandra-analytics/actions/runs/26692748197/job/78896333634?pr=213 This test timed out for some reason. The patch doesn't touch any Spark 3 code paths, though, so maybe it just needs a retry?

frankgh reviewed May 30, 2026

liucao-dd force-pushed the liu/CASSANALYTICS-171-spark4-unknown-partitioning branch from cc3b49c to d9c2995 Compare May 30, 2026 19:23

frankgh approved these changes Jun 1, 2026

liucao-dd wants to merge 1 commit into apache:trunk from liucao-dd:liu/CASSANALYTICS-171-spark4-unknown-partitioning
