CASSANALYTICS-171: Avoid Spark 4 partitioning warnings during reads#213
CASSANALYTICS-171: Avoid Spark 4 partitioning warnings during reads#213liucao-dd wants to merge 1 commit into
Conversation
frankgh
left a comment
There was a problem hiding this comment.
Good find! I just have one small question
| * Cassandra input partitions are token ranges, not groups of rows sharing one partition key value, | ||
| * so the connector reports Spark's unknown partitioning rather than keyed-group partitioning. | ||
| */ | ||
| class CassandraPartitioning extends UnknownPartitioning |
There was a problem hiding this comment.
do we even need the CassandraPartitioning? Can we directly instantiate an UnknownPartitioning instance from CassandraScanBuilder?
There was a problem hiding this comment.
Good point. I checked Spark's Partitioning contract and several maintained Spark DSV2 connectors (Iceberg, Paimon, Lance). The common pattern is to instantiate UnknownPartitioning directly when the scan cannot guarantee keyed grouping, and reserve KeyGroupedPartitioning for cases where the connector can prove rows are grouped by the reported key expressions. Updated the patch to use UnknownPartitioning directly.
Spark 4 ignores custom DataSource V2 Partitioning implementations and logs a warning. Cassandra scan partitions are token ranges rather than keyed groups, so report Spark's UnknownPartitioning directly while preserving the input partition count.
cc3b49c to
d9c2995
Compare
|
https://github.com/apache/cassandra-analytics/actions/runs/26692748197/job/78896333634?pr=213 this test timed out somehow. The patch didn't touch spark 3 related path though. Maybe it needs a retry? |
Summary
UnknownPartitioningdirectly, instead of a customPartitioningsubclass.CassandraScanBuilderfor output partitioning.Why
UnknownPartitioningSpark 4's DataSource V2 partitioning contract expects connectors to report one of Spark's public partitioning types, such as
UnknownPartitioningorKeyGroupedPartitioning, rather than implementingPartitioningdirectly. When Spark sees a customPartitioning,V2ScanPartitioningAndOrderingignores it and logs a warning.Cassandra analytics input partitions are token ranges. Rows in one token range can contain many distinct Cassandra partition keys and many distinct token values, so they do not satisfy
KeyGroupedPartitioning's contract that every row in a Spark partition evaluates to the same partition value. ReportingUnknownPartitioningpreserves the correct partition count without claiming keyed grouping semantics.Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-171