Skip to content

[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect#55457

Open
longvu-db wants to merge 43 commits into
apache:masterfrom
longvu-db:dsv2-connect-pr3-joins
Open

[SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect#55457
longvu-db wants to merge 43 commits into
apache:masterfrom
longvu-db:dsv2-connect-pr3-joins

Conversation

@longvu-db
Copy link
Copy Markdown
Contributor

@longvu-db longvu-db commented Apr 21, 2026

What changes were proposed in this pull request?

Add DSv2 join test coverage for Spark Connect in a new DataSourceV2JoinConnectSuite, mirroring the classic DataSourceV2DataFrameSuite join scenarios from #55463.

The tests use an in-process Connect server (SparkConnectServerTest) so that a Connect client creates DataFrames, modifies the underlying table, and then executes joins. This verifies Connect's re-analysis behavior for incrementally constructed queries.

In Connect, Datasets are re-analyzed on every action, so operations that fail in classic mode (where each side pins its schema at analysis time) succeed here because both sides always get a fresh plan with the latest schema and data.

  • Scenario 1 (external + same-session insert): Both sides see all data (same as classic).
  • Scenario 2 (external + same-session ADD COLUMN): Both sides see 3-column schema. Classic preserves df1's 2-column schema because the plan is pinned at analysis time; Connect re-analyzes so both sides pick up the new schema.
  • Scenario 3 (external + same-session DROP COLUMN): Connect succeeds because both sides re-analyze and see the 1-column schema. Classic fails with COLUMNS_MISMATCH because df1's pinned plan still references the dropped column.
  • Scenario 4a (external drop/recreate, table with both table and column ID support): Connect succeeds because both sides re-analyze against the new table. Classic fails with TABLE_ID_MISMATCH because df1's pinned plan captured the original table ID.
  • Scenario 4b (external drop/recreate, table without table ID support, but with column ID support): Connect succeeds (re-analyzed). Classic fails with COLUMN_ID_MISMATCH because df1's pinned column IDs differ from the recreated table's.
  • Scenario 4c (external drop/recreate, table without table ID support and without column ID support): Succeeds in both classic and Connect since neither ID check fires.
  • Scenario 5a (external drop+re-add column, table without table ID support, but with column ID support): Connect succeeds (re-analyzed). Classic fails with COLUMN_ID_MISMATCH because the re-added column gets a fresh ID.
  • Scenario 5b (external drop+re-add column, table without table ID support and without column ID support): Succeeds in both classic and Connect since neither ID check fires.
  • Scenario 6 (external type change, table with both table and column ID support): Connect succeeds because both sides re-analyze and see the new STRING type. Classic fails with COLUMNS_MISMATCH because df1's pinned plan still expects INT.

The Connect and classic tests are aligned so the only differences are the APIs (Connect session vs classic sql()/spark.table(), assertRows vs checkAnswer, server-side serverCatalog access).

Shared test infrastructure changes

Same test catalogs as #55463: NullTableIdAndNullColumnIdInMemoryTableCatalog, TypeChangeResetsColIdTableCatalog, NullColumnIdInMemoryTableCatalog, MixedColumnIdTableCatalog, ComposedColumnIdTableCatalog, and InMemoryBaseTable/InMemoryTable accessor methods.

Why are the changes needed?

Connect has fundamentally different semantics from classic mode: every action re-analyzes the plan from scratch on the server, meaning there is no pinned QueryExecution and no stale schema detection. These tests document and verify that behavior across all catalog ID support variants.

This is the Spark Connect counterpart of #55463.

Does this PR introduce any user-facing change?

No. This PR only adds tests and test-only catalogs.

How was this patch tested?

New tests in DataSourceV2JoinConnectSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

@longvu-db longvu-db marked this pull request as draft April 21, 2026 20:49
@longvu-db longvu-db changed the title [SPARK-XXXXX][CONNECT][TESTS] Add Connect join refresh tests [SPARK-56618][CONNECT][TESTS] Add Connect join refresh tests Apr 24, 2026
@longvu-db longvu-db changed the title [SPARK-56618][CONNECT][TESTS] Add Connect join refresh tests [SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in SPark Connect Apr 24, 2026
@longvu-db longvu-db marked this pull request as ready for review April 24, 2026 14:38
@longvu-db longvu-db changed the title [SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in SPark Connect [SPARK-56618][CONNECT][TESTS] Add DSv2 join refresh tests for incrementally constructed queries in Spark Connect Apr 24, 2026
@longvu-db longvu-db force-pushed the dsv2-connect-pr3-joins branch from 263c8bf to 642b0bc Compare April 30, 2026 13:14
@longvu-db longvu-db force-pushed the dsv2-connect-pr3-joins branch from e058e6a to 6b7251d Compare May 8, 2026 19:58
Copy link
Copy Markdown
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @longvu-db. Left a comment.

@longvu-db longvu-db requested a review from andreaschat-db May 18, 2026 20:38
@longvu-db longvu-db force-pushed the dsv2-connect-pr3-joins branch from 7081cff to dd57944 Compare May 19, 2026 09:13
longvu-db added 20 commits May 20, 2026 15:08
…rios

Add 6 specific join tests with checkAnswer that mirror the classic
DataSourceV2DataFrameSuite join scenarios from PR apache#55463. These tests
verify Connect-specific behavior where both sides re-analyze on every
action, so operations that fail in classic mode (DROP COLUMN, drop/recreate
table, type change) succeed in Connect.

Co-authored-by: Isaac
These files belong in the classic test infrastructure PRs, not this
Connect-only test PR.

Co-authored-by: Isaac
Replace the client-jvm test with a server-side test using
SparkConnectServerTest and withSession pattern, matching the
style of DataSourceV2CacheConnectSuite.

Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Replace direct SQL statements (ALTER TABLE, DROP TABLE, INSERT) with
catalog API calls to simulate truly external changes between df1 and
df2 creation. This matches the pattern used in other DSv2 test sections.

Co-authored-by: Isaac
…alidation

Remove external writer that adds rows after drop/recreate and
drop/re-add column. Add schema assertions to verify the join
result has the expected column names. Update expected results
to reflect that Connect re-analyzes both sides against the new
table state (empty after recreate, null salary after drop+re-add).
No longer needed after removing external writer from Scenario 6.
A single alterTable(ident, dropCol, addCol) applies both changes
atomically, so the data migration sees salary in both old and new
schemas and retains the old value. Splitting into two separate calls
ensures the drop triggers data migration (nulling salary), matching
the behavior of ALTER TABLE DROP/ADD COLUMN SQL statements.
longvu-db added 22 commits May 20, 2026 15:08
Co-authored-by: Isaac
Remove unnecessary ClassTag from serverCatalog, add externalAppend
helper method, and use it to reduce boilerplate in test bodies.

Co-authored-by: Isaac
Scenario 5: use single alterTable call with both dropCol and addCol
(matching classic test). Data is preserved since column name and type match.
Add externalAppend to align test structure with classic.

Scenario 6: keep two separate calls because InMemoryBaseTable preserves
old INT data when names match in a single call, causing a type mismatch
at read time. Document the difference in the test comment.

Co-authored-by: Isaac
Rename existing tests to include "external" and add session-based
variants that use SQL instead of catalog API for scenarios 4-6:
- drop/recreate table
- drop+re-add same-type column
- drop+re-add different-type column

Co-authored-by: Isaac
Rename existing tests to include "external" and add session-based
variants using SQL for scenarios 1-3:
- insert (session INSERT INTO)
- ADD COLUMN (session ALTER TABLE ADD COLUMN)
- DROP COLUMN (session ALTER TABLE DROP COLUMN)

Co-authored-by: Isaac
…g ID suffixes

Rewrite Connect join tests to mirror the classic DataSourceV2DataFrameSuite
structure from PR apache#55463:
- Add catalog ID support suffixes to all test names
- Use external-only variants for scenarios 4 and 5
- Add nullidcat (4b) and nullbothidscat (4c, 5b) catalog variants
- Scenario 6: use single alterTable call aligned with classic test
- Remove same-session variants for scenarios 4, 5, 6
- Rename "session" to "same-session" for clarity

Co-authored-by: Isaac
…mments

Extract common CREATE TABLE + INSERT + cleanup into withTable helper.
Add "Datasets are re-analyzed" explanation to each scenario comment,
with details on why classic fails (pinned plan) but Connect succeeds.

Co-authored-by: Isaac
Move scenario labels (S1-S6) into test names for discoverability.
Remove externalAppend from scenarios 4a/4b/4c since the recreated
table should be empty, and expect Seq.empty in join results.

Co-authored-by: Isaac
Copy the test catalog from the classic PR (apache#55463) so the Connect
tests for scenarios 4c and 5b can compile.

Co-authored-by: Isaac
…sionFrom

Add setVersionAndValidatedVersionFrom helper to InMemoryBaseTable to
carry forward version metadata when wrapping one table into another.
Without this, V2TableRefreshUtil cannot detect changes and refresh
stale table references at execution time.

Applied to all test catalogs that create new table instances via
alterTableWithData: NullTableIdAndNullColumnIdInMemoryTableCatalog,
NullColumnIdInMemoryTableCatalog, MixedColumnIdTableCatalog,
ComposedColumnIdTableCatalog, TypeChangeResetsColIdTableCatalog,
InMemoryRowLevelOperationTableCatalog, and PartialSchemaEvolutionCatalog.
Also refactored InMemoryTable.copy() to use the same helper.

Co-authored-by: Isaac
…olumn

Apply schemaAfterDrops fix to BasicInMemoryTableCatalog.alterTable so that
combined DROP + ADD column changes properly drop old data. The re-added
column has null values for existing rows, not the old value.

Update test expectations from [1, 100] to [1, null] in 3 tests.

Co-authored-by: Isaac
A single combined alterTable call preserves the old column ID via
assignMissingIds name matching, so COLUMN_ID_MISMATCH would not fire.
Two separate calls ensure the first deletes the column (and its ID),
so the second assigns a fresh ID.

Co-authored-by: Isaac
Remove the schemaAfterDrops mechanism from InMemoryTableCatalog.alterTable
that was causing ClassCastException in InsertIntoSchemaEvolutionTests.
Restore alterTableWithData(table.data, schema) to match master behavior.
Revert checkAnswer assertions for null-ID catalogs back to Row(1, 100).
Fix scalafmt formatting in DataSourceV2JoinConnectSuite.

Co-authored-by: Isaac
@longvu-db longvu-db force-pushed the dsv2-connect-pr3-joins branch from 9d2f3df to fd70f15 Compare May 20, 2026 15:09
…rtions

1. Restore schemaAfterDrops in InMemoryTableCatalog.alterTable so that
   drop+re-add column in a single alterTable call properly clears the
   dropped column's data from existing rows.

2. Fix null/mixed column ID test assertions: with version tracking now
   propagated, the table is refreshed on re-read after drop+re-add, so
   re-added salary column has null values (Row(1, null) not Row(1, 100)).

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants