Skip to content

feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592

Open
huan233usc wants to merge 8 commits into
apache:mainfrom
huan233usc:fix/2154-isnan-complex-arg-pushdown
Open

feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592
huan233usc wants to merge 8 commits into
apache:mainfrom
huan233usc:fix/2154-isnan-complex-arg-pushdown

Conversation

@huan233usc
Copy link
Copy Markdown
Contributor

@huan233usc huan233usc commented Jun 5, 2026

Which issue does this PR close?

Part of #2154.

What changes are included in this PR?

Previously, scalar-function pushdown only handled isnan(...) when the argument was a bare column reference; complex numeric arguments such as isnan(qux + 1) were silently dropped.

This PR adds resolve_nan_preserving_reference, which resolves an isnan argument down to a single column Reference through transformations that preserve NaN-ness, so isnan(<expr>) can be pushed down as <col> IS NAN:

  • negation -x
  • numeric casts (date casts still rejected)
  • x + c, c + x, x - c, c - x for a finite literal c
  • x * c, c * x, x / c for a finite, non-zero literal c
  • arbitrary nesting of the above (e.g. isnan(-(qux + 1) * 3))

Resolving scalar-function arguments (e.g. abs(x)) is intentionally left for a follow-up PR to keep this change small; isnan(abs(x)) is not pushed down here.

Why is this sound?

Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. The pushed-down predicate therefore only needs to be implied by the original filter (it may match extra rows, but must never drop a matching one). Every supported transformation keeps NaN-ness exactly equivalent — the result is NaN iff the wrapped column is NaN — so both isnan(...) and NOT isnan(...) remain correct.

Cases that do not preserve NaN-ness are intentionally rejected (and covered by tests): x * 0 / x / 0 (±inf * 0 is NaN), c / x (0 / 0 is NaN), and multi-column expressions.

Are these changes tested?

Yes. Updated the existing isnan(qux + 1) "unsupported" test and added unit tests covering negation, additive and multiplicative forms, nested expressions, combination with other predicates, and the rejected/deferred cases. All unit tests in expr_to_predicate pass, along with clippy and rustfmt.

…ions

Previously `isnan(...)` was only pushed down to Iceberg when its argument
was a bare column; complex numeric arguments such as `isnan(qux + 1)` were
silently dropped.

This resolves the argument through NaN-preserving transformations - negation,
`abs`, numeric casts, and arithmetic with finite literals - down to a single
column reference, so `isnan(<expr>)` is pushed down as `<col> IS NAN`.

Filter pushdown is reported as Inexact, so DataFusion re-applies the original
predicate after scanning. Every supported transformation keeps NaN-ness exactly
equivalent (result is NaN iff the column is NaN), so both `isnan(...)` and
`NOT isnan(...)` stay correct. Unsound cases (`x * 0`, `x / 0`, `c / x`,
multi-column expressions) are explicitly rejected and covered by tests.

Closes apache#2154
Replace the terse `(nonzero, literal_allowed_left)` tuple with explicit
per-operator handling, and simplify the literal helpers to return booleans
(`is_finite_literal` / `is_finite_nonzero_literal`). No behavior change.
Narrow this PR's scope to references, negation, numeric casts and arithmetic
with finite literals. Resolving scalar-function arguments such as `abs(x)` is
left for a separate PR. `isnan(abs(x))` is no longer pushed down (covered by
the unsupported-args test).
Re-add `abs(x)` as a NaN-preserving argument to `isnan`, and merge the literal
f64 extraction into a single `literal_as_f64` helper. No other behavior change.
…tion

Replace the hand-rolled numeric type match with DataFusion's
`ScalarValue::cast_to(Float64)`. Less code and covers all numeric literal
types (e.g. decimals) without enumerating them. No behavior change for the
supported cases.
Use the existing `Reference::is_nan()` builder instead of constructing the
`Predicate::Unary(UnaryExpression::new(...))` by hand.
…iteral

Fold the finiteness check into `finite_literal` and inline the non-zero check
at the multiply/divide call sites, removing the two thin boolean wrappers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant