feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592
Open
huan233usc wants to merge 8 commits into
Open
feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592huan233usc wants to merge 8 commits into
huan233usc wants to merge 8 commits into
Conversation
…ions Previously `isnan(...)` was only pushed down to Iceberg when its argument was a bare column; complex numeric arguments such as `isnan(qux + 1)` were silently dropped. This resolves the argument through NaN-preserving transformations - negation, `abs`, numeric casts, and arithmetic with finite literals - down to a single column reference, so `isnan(<expr>)` is pushed down as `<col> IS NAN`. Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. Every supported transformation keeps NaN-ness exactly equivalent (result is NaN iff the column is NaN), so both `isnan(...)` and `NOT isnan(...)` stay correct. Unsound cases (`x * 0`, `x / 0`, `c / x`, multi-column expressions) are explicitly rejected and covered by tests. Closes apache#2154
Replace the terse `(nonzero, literal_allowed_left)` tuple with explicit per-operator handling, and simplify the literal helpers to return booleans (`is_finite_literal` / `is_finite_nonzero_literal`). No behavior change.
Narrow this PR's scope to references, negation, numeric casts and arithmetic with finite literals. Resolving scalar-function arguments such as `abs(x)` is left for a separate PR. `isnan(abs(x))` is no longer pushed down (covered by the unsupported-args test).
Re-add `abs(x)` as a NaN-preserving argument to `isnan`, and merge the literal f64 extraction into a single `literal_as_f64` helper. No other behavior change.
…tion Replace the hand-rolled numeric type match with DataFusion's `ScalarValue::cast_to(Float64)`. Less code and covers all numeric literal types (e.g. decimals) without enumerating them. No behavior change for the supported cases.
Use the existing `Reference::is_nan()` builder instead of constructing the `Predicate::Unary(UnaryExpression::new(...))` by hand.
…iteral Fold the finiteness check into `finite_literal` and inline the non-zero check at the multiply/divide call sites, removing the two thin boolean wrappers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Part of #2154.
What changes are included in this PR?
Previously, scalar-function pushdown only handled
isnan(...)when the argument was a bare column reference; complex numeric arguments such asisnan(qux + 1)were silently dropped.This PR adds
resolve_nan_preserving_reference, which resolves anisnanargument down to a single columnReferencethrough transformations that preserve NaN-ness, soisnan(<expr>)can be pushed down as<col> IS NAN:-xx + c,c + x,x - c,c - xfor a finite literalcx * c,c * x,x / cfor a finite, non-zero literalcisnan(-(qux + 1) * 3))Resolving scalar-function arguments (e.g.
abs(x)) is intentionally left for a follow-up PR to keep this change small;isnan(abs(x))is not pushed down here.Why is this sound?
Filter pushdown is reported as
Inexact, so DataFusion re-applies the original predicate after scanning. The pushed-down predicate therefore only needs to be implied by the original filter (it may match extra rows, but must never drop a matching one). Every supported transformation keeps NaN-ness exactly equivalent — the result is NaN iff the wrapped column is NaN — so bothisnan(...)andNOT isnan(...)remain correct.Cases that do not preserve NaN-ness are intentionally rejected (and covered by tests):
x * 0/x / 0(±inf * 0is NaN),c / x(0 / 0is NaN), and multi-column expressions.Are these changes tested?
Yes. Updated the existing
isnan(qux + 1)"unsupported" test and added unit tests covering negation, additive and multiplicative forms, nested expressions, combination with other predicates, and the rejected/deferred cases. All unit tests inexpr_to_predicatepass, along withclippyandrustfmt.