[SPARK-55886][SQL][PYTHON] Add DataFrame.zip for merging column-projected DataFrames#54976
Draft
zhengruifeng wants to merge 3 commits into
Draft
[SPARK-55886][SQL][PYTHON] Add DataFrame.zip for merging column-projected DataFrames#54976zhengruifeng wants to merge 3 commits into
zhengruifeng wants to merge 3 commits into
Conversation
DataFrame.zip for merging column-projected DataFramesDataFrame.zip for merging column-projected DataFrames
4a909d5 to
eec3de6
Compare
158319b to
c2cc51f
Compare
DataFrame.zip for merging column-projected DataFramesDataFrame.zip for merging column-projected DataFrames
…cted DataFrames Generated-by: Claude Code
DataFrame.zip for merging column-projected DataFramesResolveZip previously stripped at most one Project layer on each side before comparing bases with `sameResult`. CheckAnalysis on the other hand stripped Project chains recursively, so plans like `df.select(...).select(...).zip(df.select(...))` were: - not rewritten by ResolveZip (one-level strip leaves mismatched bases), - silently accepted by CheckAnalysis (recursive strip matches), - and then crashed at execution because Zip stays unresolved. Make `extractProjectAndBase` recursive: walk down nested Projects while composing alias substitutions so each layer's output references are rewritten back to the original base attributes. Bare Attribute references to an inner alias are wrapped in a fresh Alias to preserve the outer name/exprId; other expressions get their inner-alias attributes inlined via `transform`. Tests: - ResolveZipSuite: longer chain, aliased chain (verifies the resolved Project sits directly on the shared base), stacked withColumn-style. - DataFrameZipSuite: withColumn on both sides, chained withColumn, longer chain, parent-with-chained-child, withColumnRenamed. - test_zip.py: parallel PySpark coverage of the same scenarios. Generated-by: Claude Code
The connect doctest harness in pyspark.sql.connect.dataframe._test runs doctest.testmod against pyspark.sql.dataframe under a connect session. The Examples block in DataFrame.zip calls .show(), which invokes the connect zip stub and raises NOT_IMPLEMENTED, failing the build. Match the existing pattern used for toJSON/rdd and drop the parent docstring before testmod runs. Generated-by: Claude Code
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Add a new
DataFrame.zip(other)API that combines the columns of two DataFrames side-by-side without a join key by reusing the shared base plan rather than emitting a relational join.Zip(left, right), always unresolved, with aZIPtree pattern.ResolveZiprule walks both sides, strips Project chains, and -- when the two bases produce the same result -- rewrites theZipinto a singleProjectover that shared base. The rule rejects non-scalar Python UDFs (e.g.GROUPED_MAP) that would break the 1:1 row mapping.ResolveZipcannot merge (different base plans), the survivingZiptriggers a newZIP_PLANS_NOT_MERGEABLEerror (sqlState42K03).Dataset.zipdeclared insql/apiand implemented in the classicDataset; Spark Connect throwsUnsupportedOperationException(planned for a follow-up).DataFrame.zipon the parent class, classic implementation delegating to the JVM via_jdf.zip, Connect raisesPySparkNotImplementedError. New entry in the API reference index.Why are the changes needed?
RDD.zipis a natural way to project two views of the same data and recombine them row-for-row. There has been no DataFrame equivalent: users porting that pattern have to fall back to a join on a synthetic row id, or recompute the source and select both column sets, which adds unnecessary work to the plan (a shuffle/join, or duplicated source evaluation) when the two sides are known to be row-aligned by construction.DataFrame.ziplifts the RDD pattern to the DataFrame API. Because the analyzer rewrites the operator into a single Project over the shared base, the operation is free at runtime: no join, no extra scan, no shuffle.Side-by-side:
Additional patterns this enables:
Does this PR introduce any user-facing change?
Yes. New public API:
Dataset.zip(other: Dataset[_]): DataFrame,@since 4.3.0.DataFrame.zip(other),versionadded:: 4.3.0.How was this patch tested?
ResolveZipSuite(catalyst) covering the analyzer rewrite: matching bases, mismatched bases, expression projections, partial Project on one side, unresolved children, longer chains of Projects, alias composition through chains, stacked withColumn-style projections, and theZIP_PLANS_NOT_MERGEABLECheckAnalysis path.DataFrameZipSuite(sql/core) covering end-to-end results, the "resolved plan is a single Project" invariant, withColumn/withColumnRenamed on both sides, longer chains, parent-with-chained-child, and the two main error cases (unrelated DataFrames,spark.rangesources).python/pyspark/sql/tests/test_zip.pymirroring the Scala suite plus scalar Python UDF and pandas UDF cases.Dataset.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code