feat: add experimental support for Spark regexp expressions via JVM UDF framework #4239
andygrove wants to merge 49 commits
Conversation
Also fix CometArrayExpressionSuite compilation by qualifying the Spark udf() call, which was shadowed by the new org.apache.comet.udf package.
Implements a DataFusion PhysicalExpr that evaluates child expressions, exports the results as Arrow FFI arrays, calls CometUdfBridge.evaluate() via JNI, and imports the output array. Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.
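A minimal sketch of the JVM half of that round-trip, assuming the bridge receives Arrow C Data Interface struct addresses over JNI; the method shape, the object name, and the bare `FieldVector => FieldVector` stand-in for the UDF are assumptions, not the actual `CometUdfBridge` API:

```scala
import org.apache.arrow.c.{ArrowArray, ArrowSchema, Data}
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.FieldVector

object UdfBridgeSketch {
  private val allocator = new RootAllocator()

  // Native code passes FFI struct addresses over JNI; the JVM imports the
  // input, runs the UDF, and exports the result back through out-pointers.
  def evaluate(
      udf: FieldVector => FieldVector,
      arrayAddr: Long,
      schemaAddr: Long,
      outArrayAddr: Long,
      outSchemaAddr: Long): Unit = {
    val input = Data.importVector(
      allocator, ArrowArray.wrap(arrayAddr), ArrowSchema.wrap(schemaAddr), null)
    val result = udf(input)
    Data.exportVector(
      allocator, result, null,
      ArrowArray.wrap(outArrayAddr), ArrowSchema.wrap(outSchemaAddr))
  }
}
```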
…UDF class via context classloader

Wrap the JNI body in try/finally so input ValueVectors and the result vector are always closed, even when the UDF or Arrow export throws. Resolve the CometUDF class through the thread context classloader so user-supplied UDF jars (added via spark.jars) are visible from the bridge.
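A sketch of both fixes under illustrative names; `applyUdf` and `exportResult` are hypothetical stand-ins for the bridge's dispatch and FFI export:

```scala
import org.apache.arrow.vector.FieldVector

object SafeEvaluateSketch {
  def evaluateSafely(inputs: Seq[FieldVector], udfClassName: String): Unit = {
    var result: FieldVector = null
    try {
      // Resolve through the context classloader so user UDF jars added via
      // spark.jars are visible, not just classes on the system classpath.
      val loader = Option(Thread.currentThread().getContextClassLoader)
        .getOrElse(getClass.getClassLoader)
      val udf = loader.loadClass(udfClassName)
        .getDeclaredConstructor().newInstance()
      result = applyUdf(udf, inputs)
      exportResult(result)
    } finally {
      // Always release Arrow buffers, even when the UDF or export throws.
      inputs.foreach(_.close())
      if (result != null) result.close()
    }
  }

  // Hypothetical stand-ins for the bridge's UDF dispatch and FFI export.
  private def applyUdf(udf: Any, inputs: Seq[FieldVector]): FieldVector = ???
  private def exportResult(result: FieldVector): Unit = ???
}
```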
…ns fall back to Spark

When routing RLike through the JVM UDF, reject Literal(null) and patterns that fail Pattern.compile during planning. Both cases now produce withInfo + None, letting Spark evaluate the expression instead of crashing the executor task with PatternSyntaxException or NullPointerException.
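A sketch of the planning-time guard; in the real convert path the rejected cases are paired with `withInfo` to record the fallback reason, and the shape here is illustrative:

```scala
import java.util.regex.{Pattern, PatternSyntaxException}
import org.apache.spark.sql.catalyst.expressions.{Literal, RLike}
import org.apache.spark.sql.types.StringType

object RLikePlanGuard {
  // Returns false (fall back to Spark) for null or uncompilable patterns,
  // so the executor never hits PatternSyntaxException or
  // NullPointerException at runtime.
  def patternSupported(expr: RLike): Boolean = expr.right match {
    case Literal(null, _) => false // null pattern: fall back
    case Literal(pattern, StringType) =>
      try { Pattern.compile(pattern.toString); true }
      catch { case _: PatternSyntaxException => false } // invalid: fall back
    case _ => false // non-literal pattern: fall back
  }
}
```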
Make comet_udf_bridge an Option in JVMClasses so a missing org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped org.apache.comet.udf.*) no longer crashes executor JVM init. The JVM-UDF dispatch path returns a clear ExecutionError when the bridge is unavailable. Also clarify the FFI lifetime contract on the result import.
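The lookup itself lives on the native side (the Rust JVMClasses), but the defensive pattern translates directly; a JVM-side analogue with illustrative names:

```scala
// Resolve the bridge class once and tolerate its absence; only the
// JVM-UDF dispatch path pays for the missing class, with a clear error
// instead of a crash at JVM init.
object BridgeLookupSketch {
  lazy val cometUdfBridge: Option[Class[_]] =
    try Some(Class.forName("org.apache.comet.udf.CometUdfBridge"))
    catch { case _: ClassNotFoundException => None }

  def requireBridge(): Class[_] = cometUdfBridge.getOrElse {
    throw new IllegalStateException(
      "org.apache.comet.udf.CometUdfBridge not found (possibly dropped " +
        "by shading); the JVM regexp engine is unavailable")
  }
}
```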
Replace string literals "rust"/"java" used for the regexp engine selector with named constants on CometConf. Tighten CometRLike.getSupportLevel so it only reports Compatible(None) when the pattern is a Literal, matching the actual constraint enforced by the convert path.
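A sketch of both cleanups; the constant names and the local `SupportLevel` stand-ins are assumptions mirroring Comet's serde types, not the actual definitions:

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, RLike}

object RegexpEngineConf {
  val REGEXP_ENGINE_RUST: String = "rust"
  val REGEXP_ENGINE_JAVA: String = "java"
}

sealed trait SupportLevel
case class Compatible(notes: Option[String]) extends SupportLevel
case class Incompatible(notes: Option[String]) extends SupportLevel

object CometRLikeSketch {
  // Only report Compatible when the pattern is a Literal, matching the
  // constraint the convert path actually enforces.
  def getSupportLevel(expr: RLike): SupportLevel = expr.right match {
    case _: Literal => Compatible(None)
    case _ => Incompatible(Some("only literal regexp patterns are supported"))
  }
}
```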
Literal-folded children no longer get expanded to batch-row count before crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding an O(rows) copy of values that never vary across the batch. Document the contract on CometUDF: scalar inputs arrive as length-1 vectors, vector inputs at the batch row count, and the result must match the longest input.
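A sketch of how a UDF consuming these vectors might honor that contract; the names and the single-type `VarCharVector` shape are illustrative:

```scala
import org.apache.arrow.vector.VarCharVector

object ScalarContractSketch {
  // The result must match the longest input: scalar inputs arrive at
  // length 1, vector inputs at the batch row count.
  def resultLength(inputs: Seq[VarCharVector]): Int =
    inputs.map(_.getValueCount).max

  def valueAt(v: VarCharVector, row: Int): Array[Byte] =
    if (v.getValueCount == 1) v.get(0) // scalar: one value for all rows
    else v.get(row)                    // vector: per-row value
}
```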
# Conflicts:
#   common/src/main/java/org/apache/comet/udf/CometUdfBridge.java
#   common/src/main/scala/org/apache/comet/udf/CometUDF.scala
#   native/jni-bridge/src/lib.rs
#   native/spark-expr/src/jvm_udf/mod.rs
The bridge now caches a single shared CometUDF instance per class across native worker threads, so UDF instance state must be thread-safe. Replace the per-instance LinkedHashMap LRU (which mutates on get when accessOrder is true) with ConcurrentHashMap.
Atomic and shorter. Avoids two threads compiling the same pattern under contention.
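A sketch of the resulting cache shape; `computeIfAbsent` compiles each distinct pattern at most once, even when two threads race on the same key, and reads are lock-free afterwards:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.regex.Pattern

object PatternCacheSketch {
  private val cache = new ConcurrentHashMap[String, Pattern]()

  // Compile-once lookup: the mapping function runs at most once per key.
  def compiled(regex: String): Pattern =
    cache.computeIfAbsent(regex, Pattern.compile(_))
}
```

Unlike the access-ordered LinkedHashMap it replaces, this cache has no eviction, trading bounded memory for thread safety; the working assumption is that a query contains few distinct literal patterns.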
Marking this as draft while we discuss and compare with #4267.
@mbutrovich I'm going to start reviewing your PR, but it is 7k LOC and there is a lot in there, so it may take some time. I have a couple of questions:
The native Rust engine remains the default. The JVM-backed engine is available as an experimental opt-in for workloads that need 100% Java-regex-compatible semantics.
That does raise the question of how aggressively I should review these if they'll be dead code soon. There are a number of optimizations we're leaving on the table, but maybe that doesn't matter right now.
My argument for allowing this PR in would be that it adds all the regression tests and gives us an immediate performance win today. The refactor to update the UDF implementations to use the new framework should, I assume, be a net reduction in line count and easy to review?
Likely not due to the codegen work I've done, but we'll cross that bridge when we come to it. If you mark this ready for review today I'll review it. |
These SQL test files exercise Java-only regex features (backreferences, lookahead, embedded flags) and previously relied on the JVM engine being the default. After the default reverted to the Rust engine, they need to explicitly opt in via spark.comet.exec.regexp.engine=java.
# Conflicts:
#   spark/src/main/scala/org/apache/comet/serde/strings.scala
@mbutrovich following on from our discussion about configs yesterday, I filed an issue where we can have that discussion. #4310 |
Which issue does this PR close?
Closes #.
Rationale for this change
This PR extends the JVM UDF framework (introduced in #4232) to support all Spark regular expression functions with full Java regex compatibility. The native Rust engine remains the default; the JVM-backed engine is offered as an experimental opt-in for workloads that need 100% Java-regex-compatible semantics (backreferences, lookaround, embedded flags, etc.) at the cost of a JNI round-trip per batch.
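For example, opting in for a session uses the standard Spark conf setter with the config key introduced here (a sketch, assuming a `spark` SparkSession in scope):

```scala
// Select the experimental JVM-backed engine; the default remains "rust".
spark.conf.set("spark.comet.exec.regexp.engine", "java")
```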
What changes are included in this PR?
- Support for `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, and `split`
- A `spark.comet.exec.regexp.engine` config (default `rust`) that selects between the native Rust engine and the experimental Java engine
- SQL test files (`*_rust.sql`, `*_rust_enabled.sql`, `*_java.sql`)
- Documentation in `docs/source/user-guide/latest/compatibility/regex.md`, including a note that the Java engine is experimental and may change

How are these changes tested?
- `CometRegExpJvmSuite`: 45 tests covering all regexp expressions with the Java engine explicitly enabled
- `CometExpressionSuite` and `CometStringExpressionSuite` tests continue to pass