[SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 #55974
Draft
zhengruifeng wants to merge 4 commits into
…24.04 tzdata

- Switch the tz-aware fixture from the legacy alias `US/Eastern` to its canonical IANA name `America/New_York`. On Ubuntu 24.04 the system `tzdata` package no longer ships the legacy `US/*` aliases (those moved to `tzdata-legacy`), so under pandas >= 3.0 (which resolves tz via stdlib `zoneinfo` instead of bundled `pytz`), the previous fixture raised `ZoneInfoNotFoundError` in CI.
- Remap the loaded golden DataFrame in memory when running under pandas >= 3.0 so the pandas-2-generated golden columns still line up: `datetime64[ns]` -> `[us]` and `Categorical` categories `object` -> `str`. Only the column keys are remapped; the on-disk golden file is unchanged.

Generated-by: Claude Code
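The alias problem described in this commit can be probed with the standard library alone. The following is a minimal sketch, not code from the PR: the helper names `load_zone` and `same_offsets` are hypothetical, and it assumes Python 3.9+ for `zoneinfo`.

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def load_zone(key: str) -> Optional[ZoneInfo]:
    """Return the zone for `key`, or None if this system's tz database lacks it."""
    try:
        return ZoneInfo(key)
    except ZoneInfoNotFoundError:
        return None


def same_offsets(a: ZoneInfo, b: ZoneInfo) -> bool:
    """Compare the UTC offsets of two zones at one winter and one summer instant."""
    samples = [datetime(1970, 1, 1), datetime(1970, 7, 1)]
    return all(
        s.replace(tzinfo=a).utcoffset() == s.replace(tzinfo=b).utcoffset()
        for s in samples
    )


# None on systems where only the canonical names are installed.
legacy = load_zone("US/Eastern")
# None only if no tz database is available at all.
canonical = load_zone("America/New_York")

if legacy is not None and canonical is not None:
    # Where both names resolve, they are expected to describe the same zone.
    print(same_offsets(legacy, canonical))
```

On an image without the legacy aliases, `legacy` would be `None` while `canonical` still resolves, which is the situation the fixture switch works around.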
HyukjinKwon
approved these changes
May 19, 2026
Extend the pandas-3 in-memory adapter so the value comparisons also line up:

- Scale 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast, e.g. `bigint <- pd.date_range(...).values` flips from 86_400_000_000_000 to 86_400_000_000.
- Override the single `decimal(10,0) x ['12','34']@list` cell, which flipped from "X" (pandas 2 errored) to `[Decimal('12'), Decimal('34')]` (pandas 3 succeeds).

Test now passes under both pandas 2.3.3 (spark-dev-313) and pandas 3.0.2 (spark-dev-313-p3) locally.

Generated-by: Claude Code
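The nanosecond-to-microsecond scaling can be illustrated on a single cell string. This is a stdlib-only sketch under stated assumptions: the function name `ns_to_us_in_cell` is hypothetical, the pattern `\b\d{13,}\b` is one plausible reading of "13+ digit integers", and it uses `re.sub` on one string rather than operating on a whole pandas column.

```python
import re

# Assumption: any standalone run of 13 or more digits is a nanosecond value.
_NS_INT = re.compile(r"\b\d{13,}\b")


def ns_to_us_in_cell(cell: str) -> str:
    """Divide every 13+ digit integer in a golden cell by 1000 (ns -> us)."""
    return _NS_INT.sub(lambda m: str(int(m.group()) // 1000), cell)


# One day in nanoseconds becomes one day in microseconds.
print(ns_to_us_in_cell("[0, 86400000000000]"))  # prints [0, 86400000000]
```

Shorter integers and non-numeric cells such as `X` pass through unchanged, so the rewrite only touches values large enough to be nanosecond timestamps or durations.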
No behavior change. Folds `_patch_golden_for_pandas3` directly into the loader block where it is used, since it is only called once. Also replaces the local `re.sub` helper with `Series.str.replace(regex=True)` to drop the `import re`.

Generated-by: Claude Code
No behavior change. Use `self.repr_value(value)` and `self.repr_type(...)` to derive both the rename and scale targets directly from `self.test_data` and the affected Spark type, instead of grep-matching the golden column names. A single loop over `test_data` builds both `rename` and `scale_cols`.

Generated-by: Claude Code
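The column-key remapping these commits refer to has roughly the following shape. This is only an illustrative sketch: the key strings in `golden_columns` are hypothetical (the real golden keys are not shown here), and it uses plain string substitution on key names, which is closer to the approach the last commit moved away from than to the `repr_value` / `repr_type` derivation.

```python
from typing import Dict, List

# Hypothetical pandas-2-style column keys, for illustration only.
golden_columns: List[str] = [
    "example_values@ndarray[datetime64[ns]]",
    "example_values@Categorical[object]",
    "example_values@list",
]

# dtype spellings that differ between pandas 2 and pandas 3.
DTYPE_RENAMES: Dict[str, str] = {
    "datetime64[ns]": "datetime64[us]",
    "Categorical[object]": "Categorical[str]",
}


def build_rename(columns: List[str]) -> Dict[str, str]:
    """Map each pandas-2-style key to its pandas-3 spelling, skipping unchanged keys."""
    rename: Dict[str, str] = {}
    for col in columns:
        new = col
        for old_part, new_part in DTYPE_RENAMES.items():
            new = new.replace(old_part, new_part)
        if new != col:
            rename[col] = new
    return rename


rename = build_rename(golden_columns)
# With pandas, such a mapping could be applied as golden.rename(columns=rename).
```

Because only keys that actually change end up in `rename`, columns unaffected by the pandas upgrade keep their original names.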
What changes were proposed in this pull request?
Make `pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests` work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).

- Switch the tz-aware fixture from `US/Eastern` to `America/New_York`. The values returned by `pd.date_range(...).values` are identical for the two aliases (same zone, same DST rules), so the on-disk golden file does not need to be regenerated.
- Patch the loaded golden DataFrame in memory for pandas >= 3.0. The golden file was generated under pandas 2 and the on-disk content is unchanged. At load time, when running under pandas >= 3.0, the test:
  - remaps the column keys, because under pandas 3 `datetime64` is `[us]` instead of `[ns]`, and `pd.Categorical` keeps `str`-dtyped categories instead of `object`.
  - scales 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000 (e.g. `bigint <- pd.date_range(...).values` flips from `86_400_000_000_000` to `86_400_000_000`).
  - overrides the single `decimal(10,0) x ['12','34']@list` cell, which flipped from `X` (pandas 2 errored) to `[Decimal('12'), Decimal('34')]` (pandas 3 succeeds at the string -> Decimal coercion).

Why are the changes needed?
The scheduled CI run on the `python-312-pandas-3` image fails in this suite, e.g. https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root causes:

- `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises `zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'`. Pandas 3 dropped `pytz` as a hard dependency and now resolves tz names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern` because Ubuntu moved the legacy aliases out of `tzdata` into a separate `tzdata-legacy` package that the CI image does not install.
- `golden.loc[str_t, str_v]` raises `KeyError` because the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`, `Categorical(..., object)`) but the lookup keys built at runtime are pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).

Does this PR introduce any user-facing change?
No. Test-only change.
How was this patch tested?
Ran the suite locally under two envs: pandas 2.3.3 (`spark-dev-313`) and pandas 3.0.2 (`spark-dev-313-p3`).
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code