[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307
Draft
felipepessoto wants to merge 6 commits into
Draft
[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307felipepessoto wants to merge 6 commits into
felipepessoto wants to merge 6 commits into
Conversation
….block.size Adds VeloxParquetRowGroupSuite to pin down why Delta's DeletionVectorsWithPredicatePushdownSuite aborts under Gluten (its beforeAll writes 1M rows with parquet.block.size=2MB on the Hadoop conf and asserts >1 row group, but Gluten produces a single row group). Three native-write cases (~8MB of int64, checkNativeWrite asserts the write offloads to ColumnarWriteFilesExec): - default 128MB block size -> exactly 1 row group (the Delta-failure scenario); - spark.gluten.sql.columnar.parquet.write.blockSize=1MB -> multiple row groups (the native writer honors its own conf and CAN split); - parquet.block.size=1MB set on the runtime Hadoop conf (Delta's mechanism) -> still 1 row group (it does not reach the native writer). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ts desired behavior) Flip the third case from characterizing the bug (asserting a single row group) to reproducing it: it now asserts the DESIRED behavior (> 1 row group), so it FAILS today (parquet.block.size on the runtime Hadoop conf does not reach Gluten's native writer -> single row group) and will turn green once that config is plumbed through to the native writer. The default-block-size and Gluten-conf cases remain green controls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…roupSuite - Wrap the runtime `spark.sparkContext.hadoopConfiguration` access with `// scalastyle:off/on hadoopconfiguration` (its documented escape hatch); the reproduction test must mutate the runtime Hadoop conf on purpose. - Reflow the header scaladoc to satisfy spotless/scalafmt (wrapMaxColumn=100). Verified locally: `spotless:check` and `scalastyle:check` both pass on backends-velox. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR CI showed the plain-Parquet cases in VeloxParquetRowGroupSuite all pass: a plain `INSERT OVERWRITE DIRECTORY ... USING PARQUET` write DOES honor `parquet.block.size` set on the runtime Hadoop conf. So the Delta `DeletionVectorsWithPredicatePushdownSuite` abort is not a generic native-writer problem -- it is specific to the Delta write path. Adds VeloxDeltaParquetRowGroupSuite: writes a 1M-row Delta table with `parquet.block.size` set on the same conf object Delta uses (`spark.sparkContext.hadoopConfiguration`) and asserts the data file has more than one row group. This is expected to FAIL today (single row group), reproducing the abort; the deletion-vectors variant matches the original suite. Also corrects the now-stale comments in VeloxParquetRowGroupSuite (the runtime Hadoop-conf block size does reach the native writer for plain Parquet writes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
org.apache.spark.util.Utils is private[spark] and this suite is in package org.apache.gluten.execution, so it cannot be referenced here (it also cascaded into an ambiguous `.sum`). Replace Utils.tryWithResource with a plain try/finally and make the row-group count explicitly Int. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…roupSuite Move scalastyle:off marker before withTempPath lambda to cover the entire hadoopConfiguration access on line 113.
felipepessoto
added a commit
to felipepessoto/gluten
that referenced
this pull request
Jun 19, 2026
The field() method returns an int32 which can be negative. When a negative index is cast to uint32_t for childAt(idx), it becomes a huge positive number that bypasses Velox's VELOX_CHECK_LT(idx, size) check and causes undefined behavior (SIGSEGV observed in Spark 4.0 script transformation tests). The explicit idx >= 0 check is necessary to catch negative indices before they're cast to unsigned and cause memory corruption. Reverts commits: - 4070b5c [VL] Use VELOX_ASSERT_RUNTIME_THROW for bounds check test - fab87a2 [VL] Remove message check from test - just verify exception is thrown - b6fe845 [VL] Fix test assertion to match Velox's error message format - 68c93dd [VL] Remove redundant index bounds check - validated by PR apache#12278 CI Fixes apache#12307 test failure in spark-test-spark40.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
Adds
VeloxParquetRowGroupSuitevalidating how Gluten's native (Velox) Parquet writer decides Parquet row-group boundaries, and through which config channelparquet.block.sizereaches the native writer.Motivation: Delta's
DeletionVectorsWithPredicatePushdownSuite.beforeAllwrites a 1M-row table withhadoopConf().set("parquet.block.size", 2MB)and asserts the file has > 1 row group. Under Gluten the file has a single row group, sobeforeAllthrows and the whole suite aborts (uncovered by the Delta Spark UT pipeline #12278).Three native-write cases (~8MB of int64;
checkNativeWriteasserts the write offloads toColumnarWriteFilesExec):spark.gluten.sql.columnar.parquet.write.blockSize=1MB-> multiple row groups (the native writer honors its own conf and can split).parquet.block.size=1MBset on the runtime Hadoop conf (Delta's mechanism) -> asserts > 1 row group and currently FAILS, because that value does not reach the native writer (single row group). This case is the reproduction; it should turn green once the Hadoop-conf block size is plumbed through to the native writer.Draft to run CI and confirm the reproduction. Case 3 is expected RED until a follow-up fix.
How was this patch tested?
This IS the test. The CI run on this PR exercises all three cases.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (claude-opus-4.8)