Skip to content

[SPARK-56618][SQL][TESTS] Add DSv2 join refresh tests for incrementally constructed queries#55463

Open
longvu-db wants to merge 29 commits into
apache:masterfrom
longvu-db:dsv2-pr3-joins
Open

[SPARK-56618][SQL][TESTS] Add DSv2 join refresh tests for incrementally constructed queries#55463
longvu-db wants to merge 29 commits into
apache:masterfrom
longvu-db:dsv2-pr3-joins

Conversation

@longvu-db
Copy link
Copy Markdown
Contributor

@longvu-db longvu-db commented Apr 21, 2026

What changes were proposed in this pull request?

Add join-specific tests for incrementally constructed queries to DataSourceV2DataFrameSuite, along with a new NullTableIdAndNullColumnIdInMemoryTableCatalog test catalog.

When DataFrames are analyzed at different times and then joined, the refresh phase in QueryExecution must align all table references to the same version. These tests verify 6 core scenarios with catalog ID support variants:

  • Scenario 1 (external + same-session insert): Both sides of the join see the latest version after an INSERT between df1 and df2 creation.
  • Scenario 2 (external + same-session ADD COLUMN): df1 keeps its original 2-column schema while df2 has the new 3-column schema; both see the latest data.
  • Scenario 3 (external + same-session DROP COLUMN): Removing a column that df1 references throws COLUMNS_MISMATCH.
  • Scenario 4a (external drop/recreate, table with both table and column ID support): Drop+recreate produces a new table ID; join throws TABLE_ID_MISMATCH.
  • Scenario 4b (external drop/recreate, table without table ID support, but with column ID support): Column IDs still detect the recreate; join throws COLUMN_ID_MISMATCH.
  • Scenario 4c (external drop/recreate, table without table ID support and without column ID support): Neither ID check fires; join succeeds (change goes undetected).
  • Scenario 5a (external drop+re-add column, table without table ID support, but with column ID support): Two separate alterTable calls assign a fresh column ID; join throws COLUMN_ID_MISMATCH.
  • Scenario 5b (external drop+re-add column, table without table ID support and without column ID support): Neither ID check fires; join succeeds (change goes undetected).
  • Scenario 6 (external type change, table with both table and column ID support): Schema validation catches the type change (INT to STRING); join throws COLUMNS_MISMATCH.

Why are the changes needed?

The existing tests covered scenarios 1-3 within larger multi-step tests but did not cover:

  • Table identity detection via table ID and column ID (scenarios 4a-4c)
  • Column identity detection via column ID (scenarios 5a-5b)
  • Type change detection via schema validation (scenario 6)
  • The case where neither table ID nor column ID is supported, showing the limitation (scenarios 4c, 5b)

These scenarios are important to verify the correctness of DSv2 table refresh behavior for incrementally constructed queries.

Does this PR introduce any user-facing change?

No. This PR only adds tests and a test-only catalog.

How was this patch tested?

New tests in DataSourceV2DataFrameSuite and new NullTableIdAndNullColumnIdInMemoryTableCatalog test catalog.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

@longvu-db longvu-db marked this pull request as draft April 21, 2026 20:51
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 24, 2026
…rios

Add 6 specific join tests with checkAnswer that mirror the classic
DataSourceV2DataFrameSuite join scenarios from PR apache#55463. These tests
verify Connect-specific behavior where both sides re-analyze on every
action, so operations that fail in classic mode (DROP COLUMN, drop/recreate
table, type change) succeed in Connect.

Co-authored-by: Isaac
@longvu-db longvu-db changed the title [SPARK-XXXXX][SQL][TESTS] Add DSv2 join and set operation refresh tests [SPARK-XXXXX][SQL][TESTS] Add DSv2 join refresh tests for incrementally constructed queries Apr 24, 2026
@longvu-db longvu-db changed the title [SPARK-XXXXX][SQL][TESTS] Add DSv2 join refresh tests for incrementally constructed queries [SPARK-56618][SQL][TESTS] Add DSv2 join refresh tests for incrementally constructed queries Apr 24, 2026
@longvu-db longvu-db marked this pull request as ready for review April 24, 2026 14:26
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 30, 2026
…rios

Add 6 specific join tests with checkAnswer that mirror the classic
DataSourceV2DataFrameSuite join scenarios from PR apache#55463. These tests
verify Connect-specific behavior where both sides re-analyze on every
action, so operations that fail in classic mode (DROP COLUMN, drop/recreate
table, type change) succeed in Connect.

Co-authored-by: Isaac
Remove separate test suites, test catalogs, and production code changes.
Add 3 join tests for design doc scenarios 4-6 (drop+recreate table,
drop+re-add column same type, drop+re-add column different type) directly
in DataSourceV2DataFrameSuite.

Co-authored-by: Isaac
…s 1-3

Group all 6 design doc join scenarios together at the bottom of the file,
following the design doc order (scenarios 1-6).

Co-authored-by: Isaac
Replace direct SQL statements (ALTER TABLE, DROP TABLE, INSERT) with
catalog API calls to simulate truly external changes between df1 and
df2 creation. This matches the pattern used in other DSv2 test sections.

Co-authored-by: Isaac
…pr3-joins

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSuite.scala
Co-authored-by: Isaac
Use named arguments for externalAppend calls

Co-authored-by: Isaac
EOF
)
longvu-db added a commit to longvu-db/spark that referenced this pull request May 8, 2026
…rios

Add 6 specific join tests with checkAnswer that mirror the classic
DataSourceV2DataFrameSuite join scenarios from PR apache#55463. These tests
verify Connect-specific behavior where both sides re-analyze on every
action, so operations that fail in classic mode (DROP COLUMN, drop/recreate
table, type change) succeed in Connect.

Co-authored-by: Isaac
Copy link
Copy Markdown
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @longvu-db. Left a few comments regarding test coverage.

longvu-db added 3 commits May 13, 2026 12:17
Add two new join test variants:
- Scenario 4 variant: drop+recreate table detected via column IDs
  when table ID is null (uses nullidcat catalog)
- Scenario 5 variant: drop+re-add column detected via column IDs
  when two separate alterTable calls assign a fresh column ID

Co-authored-by: Isaac
…t names

Rename existing tests to include "external" and add session-based
variants that use SQL instead of catalog API for the mutations:
- drop/recreate table with null table ID catalog
- drop+re-add column

Co-authored-by: Isaac
Rename existing tests to include "external" and add session-based
variants using SQL for scenarios 1-3:
- insert (session INSERT INTO)
- ADD COLUMN (session ALTER TABLE ADD COLUMN)
- DROP COLUMN (session ALTER TABLE DROP COLUMN)

Co-authored-by: Isaac
longvu-db added 9 commits May 13, 2026 13:10
Drop/recreate changes both id and salary column IDs, producing
a multiline errors string. Use (?s).* to match across newlines.

Co-authored-by: Isaac
Co-authored-by: Isaac
…support)

Shows that table ID still detects drop/recreate when column IDs are
null (using nullcolidcat catalog).

Co-authored-by: Isaac
…labels

Rename test name suffixes to clearly indicate catalog capabilities:
- testcat: (table with both table and column ID support)
- nullidcat: (table without table ID support, but with column ID support)
- nullcolidcat: (table with table ID support, but without column ID support)

Add sub-labels (4a-4d, 5a-5c) for scenario 4 and 5 variants.

Co-authored-by: Isaac
…ble test

Add "external" or "same-session" to all test names that were ambiguous.
Remove Scenario 5a (single alterTable preserving column ID) since it
tests an implementation detail of batched changes. Renumber 5b/5c to 5a/5b.

Co-authored-by: Isaac
…-IDs catalog

Remove same-session variants (4c, 4d, 4e, 5b, 5c) and replace with
external variants using NullTableIdAndNullColumnIdInMemoryTableCatalog:
- 4c: external drop/recreate with no IDs (undetected, join succeeds)
- 5b: external drop+re-add column with no IDs (undetected, join succeeds)

Add NullTableIdAndNullColumnIdInMemoryTableCatalog that strips both
table and column IDs to simulate connectors with no identity tracking.

Co-authored-by: Isaac
@longvu-db longvu-db requested a review from andreaschat-db May 18, 2026 20:24
longvu-db added a commit to longvu-db/spark that referenced this pull request May 18, 2026
…g ID suffixes

Rewrite Connect join tests to mirror the classic DataSourceV2DataFrameSuite
structure from PR apache#55463:
- Add catalog ID support suffixes to all test names
- Use external-only variants for scenarios 4 and 5
- Add nullidcat (4b) and nullbothidscat (4c, 5b) catalog variants
- Scenario 6: use single alterTable call aligned with classic test
- Remove same-session variants for scenarios 4, 5, 6
- Rename "session" to "same-session" for clarity

Co-authored-by: Isaac
longvu-db added a commit to longvu-db/spark that referenced this pull request May 18, 2026
Copy the test catalog from the classic PR (apache#55463) so the Connect
tests for scenarios 4c and 5b can compile.

Co-authored-by: Isaac
Add setVersionAndValidatedVersionFrom helper to InMemoryBaseTable to
carry forward version metadata when wrapping one table into another.
Without this, V2TableRefreshUtil cannot detect changes and refresh
stale table references at execution time.

Applied to all test catalogs that create new table instances via
alterTableWithData, and refactored InMemoryTable.copy() to use the
same helper.

Also simplify Scenario 4c: remove externalAppend after drop/recreate,
expect empty result since both sides refresh to the empty new table.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants