[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect by longvu-db · Pull Request #55457 · apache/spark

longvu-db · 2026-04-21T20:30:58Z

What changes were proposed in this pull request?

Add DSv2 join test coverage for Spark Connect in a new DataSourceV2JoinConnectSuite, mirroring the classic DataSourceV2DataFrameSuite join scenarios from #55463.

The tests use an in-process Connect server (SparkConnectServerTest) so that a Connect client creates DataFrames, modifies the underlying table, and then executes joins. This verifies Connect's re-analysis behavior for incrementally constructed queries.

In Connect, Datasets are re-analyzed on every action, so operations that fail in classic mode (where each side pins its schema at analysis time) succeed here because both sides always get a fresh plan with the latest schema and data.

Scenario 1 (external + same-session insert): Both sides see all data (same as classic).
Scenario 2 (external + same-session ADD COLUMN): Both sides see 3-column schema. Classic preserves df1's 2-column schema because the plan is pinned at analysis time; Connect re-analyzes so both sides pick up the new schema.
Scenario 3 (external + same-session DROP COLUMN): Connect succeeds because both sides re-analyze and see the 1-column schema. Classic fails with COLUMNS_MISMATCH because df1's pinned plan still references the dropped column.
Scenario 4a (external drop/recreate, table with both table and column ID support): Connect succeeds because both sides re-analyze against the new table. Classic fails with TABLE_ID_MISMATCH because df1's pinned plan captured the original table ID.
Scenario 4b (external drop/recreate, table without table ID support, but with column ID support): Connect succeeds (re-analyzed). Classic fails with COLUMN_ID_MISMATCH because df1's pinned column IDs differ from the recreated table's.
Scenario 4c (external drop/recreate, table without table ID support and without column ID support): Succeeds in both classic and Connect since neither ID check fires.
Scenario 5a (external drop+re-add column, table without table ID support, but with column ID support): Connect succeeds (re-analyzed). Classic fails with COLUMN_ID_MISMATCH because the re-added column gets a fresh ID.
Scenario 5b (external drop+re-add column, table without table ID support and without column ID support): Succeeds in both classic and Connect since neither ID check fires.
Scenario 6 (external type change, table with both table and column ID support): Connect succeeds because both sides re-analyze and see the new STRING type. Classic fails with COLUMNS_MISMATCH because df1's pinned plan still expects INT.

The Connect and classic tests are aligned so the only differences are the APIs (Connect session vs classic sql()/spark.table(), assertRows vs checkAnswer, server-side serverCatalog access).

Shared test infrastructure changes

Same test catalogs as #55463: NullTableIdAndNullColumnIdInMemoryTableCatalog, TypeChangeResetsColIdTableCatalog, NullColumnIdInMemoryTableCatalog, MixedColumnIdTableCatalog, ComposedColumnIdTableCatalog, and InMemoryBaseTable/InMemoryTable accessor methods.

Why are the changes needed?

Connect has fundamentally different semantics from classic mode: every action re-analyzes the plan from scratch on the server, meaning there is no pinned QueryExecution and no stale schema detection. These tests document and verify that behavior across all catalog ID support variants.

This is the Spark Connect counterpart of #55463.

Does this PR introduce any user-facing change?

No. This PR only adds tests and test-only catalogs.

How was this patch tested?

New tests in DataSourceV2JoinConnectSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

andreaschat-db

Thanks @longvu-db. Left a comment.

Co-authored-by: Isaac

…rios Add 6 specific join tests with checkAnswer that mirror the classic DataSourceV2DataFrameSuite join scenarios from PR apache#55463. These tests verify Connect-specific behavior where both sides re-analyze on every action, so operations that fail in classic mode (DROP COLUMN, drop/recreate table, type change) succeed in Connect. Co-authored-by: Isaac

Co-authored-by: Isaac

These files belong in the classic test infrastructure PRs, not this Connect-only test PR. Co-authored-by: Isaac

Co-authored-by: Isaac

Replace the client-jvm test with a server-side test using SparkConnectServerTest and withSession pattern, matching the style of DataSourceV2CacheConnectSuite. Co-authored-by: Isaac

Co-authored-by: Isaac

Replace direct SQL statements (ALTER TABLE, DROP TABLE, INSERT) with catalog API calls to simulate truly external changes between df1 and df2 creation. This matches the pattern used in other DSv2 test sections. Co-authored-by: Isaac

Co-authored-by: Isaac

…alidation Remove external writer that adds rows after drop/recreate and drop/re-add column. Add schema assertions to verify the join result has the expected column names. Update expected results to reflect that Connect re-analyzes both sides against the new table state (empty after recreate, null salary after drop+re-add).

No longer needed after removing external writer from Scenario 6.

A single alterTable(ident, dropCol, addCol) applies both changes atomically, so the data migration sees salary in both old and new schemas and retains the old value. Splitting into two separate calls ensures the drop triggers data migration (nulling salary), matching the behavior of ALTER TABLE DROP/ADD COLUMN SQL statements.

Co-authored-by: Isaac

…eanup Co-authored-by: Isaac

Remove unnecessary ClassTag from serverCatalog, add externalAppend helper method, and use it to reduce boilerplate in test bodies. Co-authored-by: Isaac

Co-authored-by: Isaac

Scenario 5: use single alterTable call with both dropCol and addCol (matching classic test). Data is preserved since column name and type match. Add externalAppend to align test structure with classic. Scenario 6: keep two separate calls because InMemoryBaseTable preserves old INT data when names match in a single call, causing a type mismatch at read time. Document the difference in the test comment. Co-authored-by: Isaac

Rename existing tests to include "external" and add session-based variants that use SQL instead of catalog API for scenarios 4-6: - drop/recreate table - drop+re-add same-type column - drop+re-add different-type column Co-authored-by: Isaac

Rename existing tests to include "external" and add session-based variants using SQL for scenarios 1-3: - insert (session INSERT INTO) - ADD COLUMN (session ALTER TABLE ADD COLUMN) - DROP COLUMN (session ALTER TABLE DROP COLUMN) Co-authored-by: Isaac

Co-authored-by: Isaac

…support) Co-authored-by: Isaac

Co-authored-by: Isaac

…g ID suffixes Rewrite Connect join tests to mirror the classic DataSourceV2DataFrameSuite structure from PR apache#55463: - Add catalog ID support suffixes to all test names - Use external-only variants for scenarios 4 and 5 - Add nullidcat (4b) and nullbothidscat (4c, 5b) catalog variants - Scenario 6: use single alterTable call aligned with classic test - Remove same-session variants for scenarios 4, 5, 6 - Rename "session" to "same-session" for clarity Co-authored-by: Isaac

…mments Extract common CREATE TABLE + INSERT + cleanup into withTable helper. Add "Datasets are re-analyzed" explanation to each scenario comment, with details on why classic fails (pinned plan) but Connect succeeds. Co-authored-by: Isaac

Move scenario labels (S1-S6) into test names for discoverability. Remove externalAppend from scenarios 4a/4b/4c since the recreated table should be empty, and expect Seq.empty in join results. Co-authored-by: Isaac

Copy the test catalog from the classic PR (apache#55463) so the Connect tests for scenarios 4c and 5b can compile. Co-authored-by: Isaac

…sionFrom Add setVersionAndValidatedVersionFrom helper to InMemoryBaseTable to carry forward version metadata when wrapping one table into another. Without this, V2TableRefreshUtil cannot detect changes and refresh stale table references at execution time. Applied to all test catalogs that create new table instances via alterTableWithData: NullTableIdAndNullColumnIdInMemoryTableCatalog, NullColumnIdInMemoryTableCatalog, MixedColumnIdTableCatalog, ComposedColumnIdTableCatalog, TypeChangeResetsColIdTableCatalog, InMemoryRowLevelOperationTableCatalog, and PartialSchemaEvolutionCatalog. Also refactored InMemoryTable.copy() to use the same helper. Co-authored-by: Isaac

…olumn Apply schemaAfterDrops fix to BasicInMemoryTableCatalog.alterTable so that combined DROP + ADD column changes properly drop old data. The re-added column has null values for existing rows, not the old value. Update test expectations from [1, 100] to [1, null] in 3 tests. Co-authored-by: Isaac

Co-authored-by: Isaac

A single combined alterTable call preserves the old column ID via assignMissingIds name matching, so COLUMN_ID_MISMATCH would not fire. Two separate calls ensure the first deletes the column (and its ID), so the second assigns a fresh ID. Co-authored-by: Isaac

Remove the schemaAfterDrops mechanism from InMemoryTableCatalog.alterTable that was causing ClassCastException in InsertIntoSchemaEvolutionTests. Restore alterTableWithData(table.data, schema) to match master behavior. Revert checkAnswer assertions for null-ID catalogs back to Row(1, 100). Fix scalafmt formatting in DataSourceV2JoinConnectSuite. Co-authored-by: Isaac

…rtions 1. Restore schemaAfterDrops in InMemoryTableCatalog.alterTable so that drop+re-add column in a single alterTable call properly clears the dropped column's data from existing rows. 2. Fix null/mixed column ID test assertions: with version tracking now propagated, the table is refreshed on re-read after drop+re-add, so re-added salary column has null values (Row(1, null) not Row(1, 100)). Co-authored-by: Isaac

longvu-db marked this pull request as draft April 21, 2026 20:49

longvu-db changed the title ~~[SPARK-XXXXX][CONNECT][TESTS] Add Connect join refresh tests~~ [SPARK-56618][CONNECT][TESTS] Add Connect join refresh tests Apr 24, 2026

longvu-db changed the title ~~[SPARK-56618][CONNECT][TESTS] Add Connect join refresh tests~~ [SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in SPark Connect Apr 24, 2026

longvu-db marked this pull request as ready for review April 24, 2026 14:38

longvu-db changed the title ~~[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in SPark Connect~~ [SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect Apr 24, 2026

longvu-db force-pushed the dsv2-connect-pr3-joins branch from 263c8bf to 642b0bc Compare April 30, 2026 13:14

longvu-db force-pushed the dsv2-connect-pr3-joins branch from e058e6a to 6b7251d Compare May 8, 2026 19:58

andreaschat-db reviewed May 13, 2026

View reviewed changes

Comment thread ...onnect/server/src/test/scala/org/apache/spark/sql/connect/DataSourceV2JoinConnectSuite.scala Outdated

longvu-db requested a review from andreaschat-db May 18, 2026 20:38

longvu-db force-pushed the dsv2-connect-pr3-joins branch from 7081cff to dd57944 Compare May 19, 2026 09:13

longvu-db added 20 commits May 20, 2026 15:08

[SPARK-XXXXX][CONNECT][TESTS] Add Connect join refresh tests

9762cf5

Co-authored-by: Isaac

Fix scalafmt formatting in connect test files

c22ef80

Remove SPARK-54157 prefix from Connect join test names

b82b3be

Co-authored-by: Isaac

Remove non-Connect files from this PR

25bcad1

These files belong in the classic test infrastructure PRs, not this Connect-only test PR. Co-authored-by: Isaac

Remove accidentally committed build artifact

193ed79

Co-authored-by: Isaac

Remove DataSourceV2ConnectJoinSuite from this PR

7fe50ad

Co-authored-by: Isaac

Rewrite Connect join tests using SparkConnectServerTest

1cd25ef

Replace the client-jvm test with a server-side test using SparkConnectServerTest and withSession pattern, matching the style of DataSourceV2CacheConnectSuite. Co-authored-by: Isaac

Fix expected values and Row comparison for drop+re-add tests

6658965

Co-authored-by: Isaac

Fix scalafmt formatting in DataSourceV2JoinConnectSuite

305189d

Co-authored-by: Isaac

Trigger CI

4a630e0

Co-authored-by: Isaac

Retrigger CI after fork build passed

6ecb409

Co-authored-by: Isaac

Retrigger CI

c5cbbe8

Co-authored-by: Isaac

Rename session variable from s to session for clarity

3d3a697

Co-authored-by: Isaac

Fix scalafmt formatting in DataSourceV2JoinConnectSuite

23afadb

Co-authored-by: Isaac

Retrigger CI

76f099b

Remove unused UTF8String import

4f63516

No longer needed after removing external writer from Scenario 6.

longvu-db added 22 commits May 20, 2026 15:08

Empty commit to retrigger CI

ecf38dc

Co-authored-by: Isaac

Address review: fix assertRows, generic serverCatalog, try/finally cl…

132108e

…eanup Co-authored-by: Isaac

Address review: remove ClassTag, add externalAppend helper

4a498ca

Remove unnecessary ClassTag from serverCatalog, add externalAppend helper method, and use it to reduce boilerplate in test bodies. Co-authored-by: Isaac

Restore SharedInMemoryTableCatalog after rebase conflict

044c2f5

Co-authored-by: Isaac

Add (table with column ID support) to scenario 5 and 6 test names

3ccc145

Co-authored-by: Isaac

Add (table with column ID support) to drop/recreate table test names

205afb0

Co-authored-by: Isaac

Add drop/recreate test with null column IDs (table without column ID …

89f8480

…support) Co-authored-by: Isaac

Use withCleanup wrapper for nullcolidcat test instead of raw try/finally

d67d889

Co-authored-by: Isaac

Add scenario labels to test names, remove externalAppend from 4a/4b/4c

0b49613

Move scenario labels (S1-S6) into test names for discoverability. Remove externalAppend from scenarios 4a/4b/4c since the recreated table should be empty, and expect Seq.empty in join results. Co-authored-by: Isaac

Add NullTableIdAndNullColumnIdInMemoryTableCatalog for Connect tests

93bae82

Copy the test catalog from the classic PR (apache#55463) so the Connect tests for scenarios 4c and 5b can compile. Co-authored-by: Isaac

Fix scalafmt formatting in DataSourceV2JoinConnectSuite

55722d1

Co-authored-by: Isaac

Address review: combine alterTable calls in S5, add row= named arg

ee2dc27

Co-authored-by: Isaac

Fix schemaAfterDrops else branch: use table.schema instead of schema

cc10159

Co-authored-by: Isaac

longvu-db force-pushed the dsv2-connect-pr3-joins branch from 9d2f3df to fd70f15 Compare May 20, 2026 15:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect#55457

[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect#55457
longvu-db wants to merge 43 commits into
apache:masterfrom
longvu-db:dsv2-connect-pr3-joins

longvu-db commented Apr 21, 2026 •

edited

Loading

Uh oh!

andreaschat-db left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

longvu-db commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Shared test infrastructure changes

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

andreaschat-db left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

longvu-db commented Apr 21, 2026 •

edited

Loading