diff --git a/docs/source/user-guide/latest/compatibility/index.md b/docs/source/user-guide/latest/compatibility/index.md index 46578d1683..54a7b4ae39 100644 --- a/docs/source/user-guide/latest/compatibility/index.md +++ b/docs/source/user-guide/latest/compatibility/index.md @@ -30,3 +30,30 @@ This guide documents areas where Comet's behavior is known to differ from Spark. - **Expressions**: per-expression compatibility notes, including cast. - **JSON**: choosing between the native and Spark-compatible engines for JSON expressions. - **Spark versions**: version-specific known issues and limitations. + +## Native and codegen-dispatch implementations + +Some Spark expressions have two implementations in Comet: + +- A **codegen-dispatch** implementation that runs Spark's own generated code for the + expression inside Comet's native pipeline (via the Arrow-direct codegen dispatcher). This + produces byte-exact Spark results at the cost of one JNI round-trip per batch. It is gated + globally by `spark.comet.exec.scalaUDF.codegen.enabled` (enabled by default); when the + dispatcher is disabled, these expressions fall back to Spark. +- A **native** (Rust / DataFusion) implementation that is faster, with no JNI overhead, but + has known semantic differences from Spark for some inputs or patterns. + +Because the codegen-dispatch path matches Spark exactly, Comet uses it by **default**. The +faster native path is **opt-in per expression** via that expression's +`spark.comet.expression..allowIncompatible=true` flag, which declares that you +accept its differences from Spark. There is no global opt-in. When the native path is enabled +but a specific input or pattern has no native implementation, Comet routes that case back +through the codegen dispatcher rather than running something incompatible. + +This is the model behind the [regular expression](regex.md) and [JSON](json.md) families, +which document their per-expression configs and the specific differences to expect. + +This is distinct from expressions that have **no** codegen-dispatch path: there, the +incompatible cases fall back to Spark by default, and `allowIncompatible=true` runs the native +(incompatible) path instead. `cast` is the main example; see the +[expression reference](../expressions.md) for which expressions have incompatible cases. diff --git a/docs/source/user-guide/latest/expressions.md b/docs/source/user-guide/latest/expressions.md index 1281f3bad2..b148e60f13 100644 --- a/docs/source/user-guide/latest/expressions.md +++ b/docs/source/user-guide/latest/expressions.md @@ -26,10 +26,14 @@ transparently falls back to Spark for that part of the plan; results are unaffec Expressions marked ✅ Supported are enabled by default and produce Spark-compatible results. -Some ✅ Supported expressions have specific incompatible cases that fall back to Spark by -default. Those cases must be opted into per expression with +Some ✅ Supported expressions have specific incompatible cases that are not run by default. +Those cases must be opted into per expression with `spark.comet.expression.EXPRNAME.allowIncompatible=true` (where `EXPRNAME` is the Spark -expression class name, for example `Cast`). There is no global opt-in. +expression class name, for example `Cast`). There is no global opt-in. By default such a case +either falls back to Spark (for example `cast`) or, when the expression has a Spark-compatible +codegen-dispatch implementation, runs through that instead (for example the regex and JSON +families). See [Native and codegen-dispatch implementations](compatibility/index.md#native-and-codegen-dispatch-implementations) +for how Comet chooses. Most expressions can also be disabled with `spark.comet.expression.EXPRNAME.enabled=false`, where `EXPRNAME` is the Spark expression class name (for example `Length` or `StartsWith`). See the