Skip to content

[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290

Open
felipepessoto wants to merge 6 commits into
apache:mainfrom
felipepessoto:fix-fieldref-nested-array-crash
Open

[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290
felipepessoto wants to merge 6 commits into
apache:mainfrom
felipepessoto:fix-fieldref-nested-array-crash

Conversation

@felipepessoto

@felipepessoto felipepessoto commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

TL;DR fix this: https://github.com/apache/gluten/actions/runs/27487184961/job/81246596702?pr=12278
image

PR/CI with test only, before the fix: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291

image

SubstraitVeloxExprConverter::toVeloxExpr(const ::substrait::Expression::FieldReference&, const RowTypePtr&) resolves a nested struct_field reference by walking the chain and descending one level at a time with inputColumnType = asRowType(inputColumnType->childAt(idx)).

When the field path traverses a non-struct child — for example a field nested under an array — asRowType() returns nullptr, and the next loop iteration dereferenced that null RowType, crashing the whole (forked) JVM with a SIGSEGV in libvelox.so:

# C  [libvelox.so+0x214951c]  gluten::SubstraitVeloxExprConverter::toVeloxExpr(substrait::Expression_FieldReference const&, std::shared_ptr<facebook::velox::RowType const> const&)+0x9c

Because a SIGSEGV is not a catchable C++ exception, SubstraitToVeloxPlanValidator (which wraps expression conversion in try/catch (VeloxException)) could not catch it and fall back — the process died outright.

This PR replaces the unchecked navigation with VELOX_USER_CHECK guards that throw a VeloxUserError:

  • the field-reference index is within range (the raw RowType::childAt/nameOf use unchecked operator[]), and
  • the child being descended into is actually a struct/row.

With these checks, an unsupported nested reference makes plan validation fail cleanly and the query falls back to vanilla Spark instead of crashing the JVM.

How was this patch tested?

Added cpp/velox/tests/SubstraitVeloxExprConverterTest.cc, a unit test that builds the exact crashing FieldReference — a reference descending into an array column — and asserts the converter now throws a VeloxUserError (via VELOX_ASSERT_THROW) instead of crashing the process; it also covers an out-of-range field index. Before this change the same input aborts the JVM with the SIGSEGV shown above.

This is the same code path hit by Delta Lake's UpdateSuite test "nested data support - analysis error - updating array type" (an UPDATE whose field path descends through an array), which is exercised by the Gluten Delta Spark UT CI: before this change the forked test JVM aborts with the SIGSEGV; after it, validation falls back to vanilla Spark and the query is handled correctly.

clang-format-15 reports no changes.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (claude-opus-4.8)

…array/map

SubstraitVeloxExprConverter::toVeloxExpr(FieldReference) resolves a nested
struct_field chain by calling asRowType() on each child to descend a level.
When the reference traverses a non-struct child -- e.g. a field nested under
an array, as exercised by Delta's
"nested data support - analysis error - updating array type" UpdateSuite test
-- asRowType() returns null and the next loop iteration dereferenced that null
RowType, crashing the whole forked JVM with a SIGSEGV in libvelox.so
(gluten::SubstraitVeloxExprConverter::toVeloxExpr). A SIGSEGV is not catchable,
so plan validation could not fall back and the test process died outright.

Replace the unchecked navigation with VELOX_USER_CHECK guards (field-index
bounds + non-null struct child) that throw a VeloxUserError. The plan validator
already wraps expression conversion in try/catch(VeloxException), so the query
now falls back to vanilla execution cleanly instead of crashing.

Add SubstraitVeloxExprConverterTest with a regression repro: a field reference
descending into an array column, and an out-of-range field index, both of which
must now throw instead of crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 14, 2026 06:34
@github-actions github-actions Bot added the VELOX label Jun 14, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression coverage and hardens Substrait field-reference conversion to fail gracefully (VeloxUserError) instead of crashing (SIGSEGV) on invalid nested paths and out-of-range indices.

Changes:

  • Add bounds checking for direct struct_field indices during field-access conversion.
  • Reject nested field descent into non-row types with a user error (instead of nullptr deref).
  • Add unit tests covering both failure modes and wire them into the Velox test target.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
cpp/velox/tests/SubstraitVeloxExprConverterTest.cc New regression tests for non-struct nested field references and out-of-range field indices.
cpp/velox/tests/CMakeLists.txt Registers the new test file in the Velox test suite.
cpp/velox/substrait/SubstraitToVeloxExpr.cc Adds range checks and non-row descent checks to prevent SIGSEGV and throw clean user errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc
::substrait::Expression::FieldReference fieldReference;
fieldReference.mutable_direct_reference()->mutable_struct_field()->set_field(5);

VELOX_ASSERT_THROW(SubstraitVeloxExprConverter::toVeloxExpr(fieldReference, inputType), "out of range");
@felipepessoto felipepessoto changed the title [VL] Fix native crash in toVeloxExpr for nested field reference into array/map [WIP] [VL] Fix native crash in toVeloxExpr for nested field reference into array/map Jun 14, 2026
@felipepessoto felipepessoto changed the title [WIP] [VL] Fix native crash in toVeloxExpr for nested field reference into array/map [VL] Fix native crash in toVeloxExpr for nested field reference into array/map Jun 14, 2026
@felipepessoto

felipepessoto commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

@zhztheplayer @philo-he could you review this PR? I found the issue when running the Delta CI #12278.

I'm not familiar with this code, but with the fix the SIGSEGV are gone. I also created a PR without the fix and I could repro it: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291.

Thanks

@zhztheplayer

Copy link
Copy Markdown
Member

cc @rui-mo

@rui-mo rui-mo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I assume this change converts a core dump into a user-facing error. Have you looked into the cases where Spark could produce an out-of-range index? I'm wondering why Spark would generate this kind of plan.

felipepessoto added a commit to felipepessoto/gluten that referenced this pull request Jun 16, 2026
…crash

Adds VeloxDeltaNestedFieldArraySuite: a Delta MERGE that updates a field nested
under an array (`value.a` where `value` is `array<struct<a:int>>`). On an
unfixed build this drives SubstraitVeloxExprConverter::toVeloxExpr to descend
into the array, dereference the null RowType returned by asRowType(), and crash
the forked JVM with a SIGSEGV -- the Spark-level companion to the existing
SubstraitVeloxExprConverterTest.cc.

This reaches the native converter (and the crash) under the Spark 4.0/4.1
-Pdelta CI jobs (Delta 4.x). With the fix in apache#12290 the converter throws a
catchable VeloxUserError, Gluten falls back, and Delta raises the expected
AnalysisException.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@felipepessoto

Copy link
Copy Markdown
Contributor Author

Thanks for the fix! I assume this change converts a core dump into a user-facing error. Have you looked into the cases where Spark could produce an out-of-range index? I'm wondering why Spark would generate this kind of plan.

I tried multiple things, but I couldn't repro with simple unit test. The repro is at https://github.com/apache/gluten/actions/runs/27487184961/job/81246596702?pr=12278

image

So, I assume it is the next test:

https://github.com/delta-io/delta/blob/9f600c42dd9bc383c025ee6e43f8199571a2783b/spark/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala#L1291

Suite MergeIntoNestedDataSQLPathBasedSuite

The problem is this test is not a simple SQL command:

image

… CI

After validating with PR apache#12278 (all 79 CI checks passed including Delta tests),
confirmed that the explicit VELOX_USER_CHECK(idx < size) is redundant - Velox's
RowType::childAt/nameOf already have VELOX_CHECK_LT bounds checks that throw
catchable VeloxUserError, providing sufficient error handling for Gluten.

Kept the essential VELOX_USER_CHECK_NOT_NULL for the rowType, which prevents
SIGSEGV by catching null before dereferencing.

Updated test comment to reflect actual Velox behavior (built-in bounds checks).

Validation: apache@4a9d3d846
@felipepessoto

felipepessoto commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@rui-mo

For (VELOX_USER_CHECK(idx < size)) -- I have not seen Spark produce this, and I don't think a well-formed plan would: the index comes from Spark's resolved GetStructField.ordinal / bound attribute ordinals, which the analyzer keeps in range. I added it only as defense-in-depth with a clearer message. In fact, in current Velox RowType::childAt/nameOf already
VELOX_CHECK_LT(idx, size), so an out-of-range index would already throw a catchable VeloxRuntimeError and fall back even without this check.
I'm dropping that change to keep the PR focused on the real crash fix.

https://github.com/facebookincubator/velox/blob/main/velox/type/Type.h#L1213C1-L1216C4

UPDATE

The bounds check removal caused a SIGSEGV in spark-test-spark40. Investigation revealed the check was NOT redundant:

The Critical Issue:

  • tmp->field() returns int32 (signed integer)
  • The check included idx >= 0 to catch negative indices
  • When a negative int32 is cast to uint32_t for childAt(idx), it becomes a huge positive number (e.g., -1 -> 4294967295)
  • This bypasses Velox's VELOX_CHECK_LT(idx, size) check and causes undefined behavior/SIGSEGV

The test was checking for 'out of range' in the error message, but Velox's
VELOX_CHECK_LT throws with format 'Expression: idx < children_.size() (5 vs. 2)'.
Updated the test assertion to match the actual Velox error message.
Copilot AI review requested due to automatic review settings June 18, 2026 09:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread cpp/velox/tests/SubstraitVeloxExprConverterTest.cc Outdated
VELOX_ASSERT_THROW with empty string checks that an exception is thrown
without validating the message content. This makes the test less brittle
to Velox error message format changes while still verifying the bounds
check happens.
VELOX_CHECK_LT throws VeloxRuntimeError (not VeloxUserError or generic VeloxException),
so we need to use VELOX_ASSERT_RUNTIME_THROW to catch the correct exception type.
Copilot AI review requested due to automatic review settings June 19, 2026 02:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

The field() method returns an int32 which can be negative. When a negative
index is cast to uint32_t for childAt(idx), it becomes a huge positive number
that bypasses Velox's VELOX_CHECK_LT(idx, size) check and causes undefined
behavior (SIGSEGV observed in Spark 4.0 script transformation tests).

The explicit idx >= 0 check is necessary to catch negative indices before
they're cast to unsigned and cause memory corruption.

Reverts commits:
- 4070b5c [VL] Use VELOX_ASSERT_RUNTIME_THROW for bounds check test
- fab87a2 [VL] Remove message check from test - just verify exception is thrown
- b6fe845 [VL] Fix test assertion to match Velox's error message format
- 68c93dd [VL] Remove redundant index bounds check - validated by PR apache#12278 CI

Fixes apache#12307 test failure in spark-test-spark40.
@felipepessoto

Copy link
Copy Markdown
Contributor Author

@rui-mo can we merge it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants