[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size by felipepessoto · Pull Request #12307 · apache/gluten

felipepessoto · 2026-06-16T16:53:11Z

What changes are proposed in this pull request?

Adds VeloxParquetRowGroupSuite validating how Gluten's native (Velox) Parquet writer decides Parquet row-group boundaries, and through which config channel parquet.block.size reaches the native writer.

Motivation: Delta's DeletionVectorsWithPredicatePushdownSuite.beforeAll writes a 1M-row table with hadoopConf().set("parquet.block.size", 2MB) and asserts the file has > 1 row group. Under Gluten the file has a single row group, so beforeAll throws and the whole suite aborts (uncovered by the Delta Spark UT pipeline #12278).

Three native-write cases (~8MB of int64; checkNativeWrite asserts the write offloads to ColumnarWriteFilesExec):

default 128MB block size -> 1 row group (control; the Delta-failure scenario).
spark.gluten.sql.columnar.parquet.write.blockSize=1MB -> multiple row groups (the native writer honors its own conf and can split).
parquet.block.size=1MB set on the runtime Hadoop conf (Delta's mechanism) -> asserts > 1 row group and currently FAILS, because that value does not reach the native writer (single row group). This case is the reproduction; it should turn green once the Hadoop-conf block size is plumbed through to the native writer.

Draft to run CI and confirm the reproduction. Case 3 is expected RED until a follow-up fix.

How was this patch tested?

This IS the test. The CI run on this PR exercises all three cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (claude-opus-4.8)

….block.size Adds VeloxParquetRowGroupSuite to pin down why Delta's DeletionVectorsWithPredicatePushdownSuite aborts under Gluten (its beforeAll writes 1M rows with parquet.block.size=2MB on the Hadoop conf and asserts >1 row group, but Gluten produces a single row group). Three native-write cases (~8MB of int64, checkNativeWrite asserts the write offloads to ColumnarWriteFilesExec): - default 128MB block size -> exactly 1 row group (the Delta-failure scenario); - spark.gluten.sql.columnar.parquet.write.blockSize=1MB -> multiple row groups (the native writer honors its own conf and CAN split); - parquet.block.size=1MB set on the runtime Hadoop conf (Delta's mechanism) -> still 1 row group (it does not reach the native writer). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ts desired behavior) Flip the third case from characterizing the bug (asserting a single row group) to reproducing it: it now asserts the DESIRED behavior (> 1 row group), so it FAILS today (parquet.block.size on the runtime Hadoop conf does not reach Gluten's native writer -> single row group) and will turn green once that config is plumbed through to the native writer. The default-block-size and Gluten-conf cases remain green controls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…roupSuite - Wrap the runtime `spark.sparkContext.hadoopConfiguration` access with `// scalastyle:off/on hadoopconfiguration` (its documented escape hatch); the reproduction test must mutate the runtime Hadoop conf on purpose. - Reflow the header scaladoc to satisfy spotless/scalafmt (wrapMaxColumn=100). Verified locally: `spotless:check` and `scalastyle:check` both pass on backends-velox. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PR CI showed the plain-Parquet cases in VeloxParquetRowGroupSuite all pass: a plain `INSERT OVERWRITE DIRECTORY ... USING PARQUET` write DOES honor `parquet.block.size` set on the runtime Hadoop conf. So the Delta `DeletionVectorsWithPredicatePushdownSuite` abort is not a generic native-writer problem -- it is specific to the Delta write path. Adds VeloxDeltaParquetRowGroupSuite: writes a 1M-row Delta table with `parquet.block.size` set on the same conf object Delta uses (`spark.sparkContext.hadoopConfiguration`) and asserts the data file has more than one row group. This is expected to FAIL today (single row group), reproducing the abort; the deletion-vectors variant matches the original suite. Also corrects the now-stale comments in VeloxParquetRowGroupSuite (the runtime Hadoop-conf block size does reach the native writer for plain Parquet writes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

org.apache.spark.util.Utils is private[spark] and this suite is in package org.apache.gluten.execution, so it cannot be referenced here (it also cascaded into an ambiguous `.sum`). Replace Utils.tryWithResource with a plain try/finally and make the row-group count explicitly Int. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…roupSuite Move scalastyle:off marker before withTempPath lambda to cover the entire hadoopConfiguration access on line 113.

The field() method returns an int32 which can be negative. When a negative index is cast to uint32_t for childAt(idx), it becomes a huge positive number that bypasses Velox's VELOX_CHECK_LT(idx, size) check and causes undefined behavior (SIGSEGV observed in Spark 4.0 script transformation tests). The explicit idx >= 0 check is necessary to catch negative indices before they're cast to unsigned and cause memory corruption. Reverts commits: - 4070b5c [VL] Use VELOX_ASSERT_RUNTIME_THROW for bounds check test - fab87a2 [VL] Remove message check from test - just verify exception is thrown - b6fe845 [VL] Fix test assertion to match Velox's error message format - 68c93dd [VL] Remove redundant index bounds check - validated by PR apache#12278 CI Fixes apache#12307 test failure in spark-test-spark40.

felipepessoto and others added 2 commits June 16, 2026 16:30

github-actions Bot added the VELOX label Jun 16, 2026

felipepessoto and others added 4 commits June 16, 2026 18:21

[VL] Fix scalastyle hadoopconfiguration violation in VeloxParquetRowG…

3f5fa84

…roupSuite Move scalastyle:off marker before withTempPath lambda to cover the entire hadoopConfiguration access on line 113.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307

[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307
felipepessoto wants to merge 6 commits into
apache:mainfrom
felipepessoto:velox-parquet-rowgroup-test

felipepessoto commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

felipepessoto commented Jun 16, 2026

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant