Skip to content

[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307

Draft
felipepessoto wants to merge 6 commits into
apache:mainfrom
felipepessoto:velox-parquet-rowgroup-test
Draft

[VL][TEST] Reproduce: native Parquet writer ignores Hadoop-conf parquet.block.size#12307
felipepessoto wants to merge 6 commits into
apache:mainfrom
felipepessoto:velox-parquet-rowgroup-test

Conversation

@felipepessoto

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

Adds VeloxParquetRowGroupSuite validating how Gluten's native (Velox) Parquet writer decides Parquet row-group boundaries, and through which config channel parquet.block.size reaches the native writer.

Motivation: Delta's DeletionVectorsWithPredicatePushdownSuite.beforeAll writes a 1M-row table with hadoopConf().set("parquet.block.size", 2MB) and asserts the file has > 1 row group. Under Gluten the file has a single row group, so beforeAll throws and the whole suite aborts (uncovered by the Delta Spark UT pipeline #12278).

Three native-write cases (~8MB of int64; checkNativeWrite asserts the write offloads to ColumnarWriteFilesExec):

  1. default 128MB block size -> 1 row group (control; the Delta-failure scenario).
  2. spark.gluten.sql.columnar.parquet.write.blockSize=1MB -> multiple row groups (the native writer honors its own conf and can split).
  3. parquet.block.size=1MB set on the runtime Hadoop conf (Delta's mechanism) -> asserts > 1 row group and currently FAILS, because that value does not reach the native writer (single row group). This case is the reproduction; it should turn green once the Hadoop-conf block size is plumbed through to the native writer.

Draft to run CI and confirm the reproduction. Case 3 is expected RED until a follow-up fix.

How was this patch tested?

This IS the test. The CI run on this PR exercises all three cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (claude-opus-4.8)

felipepessoto and others added 2 commits June 16, 2026 16:30
….block.size

Adds VeloxParquetRowGroupSuite to pin down why Delta's
DeletionVectorsWithPredicatePushdownSuite aborts under Gluten (its beforeAll
writes 1M rows with parquet.block.size=2MB on the Hadoop conf and asserts >1 row
group, but Gluten produces a single row group).

Three native-write cases (~8MB of int64, checkNativeWrite asserts the write
offloads to ColumnarWriteFilesExec):
- default 128MB block size -> exactly 1 row group (the Delta-failure scenario);
- spark.gluten.sql.columnar.parquet.write.blockSize=1MB -> multiple row groups
  (the native writer honors its own conf and CAN split);
- parquet.block.size=1MB set on the runtime Hadoop conf (Delta's mechanism) ->
  still 1 row group (it does not reach the native writer).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ts desired behavior)

Flip the third case from characterizing the bug (asserting a single row group) to
reproducing it: it now asserts the DESIRED behavior (> 1 row group), so it FAILS
today (parquet.block.size on the runtime Hadoop conf does not reach Gluten's
native writer -> single row group) and will turn green once that config is plumbed
through to the native writer. The default-block-size and Gluten-conf cases remain
green controls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the VELOX label Jun 16, 2026
felipepessoto and others added 4 commits June 16, 2026 18:21
…roupSuite

- Wrap the runtime `spark.sparkContext.hadoopConfiguration` access with
  `// scalastyle:off/on hadoopconfiguration` (its documented escape hatch);
  the reproduction test must mutate the runtime Hadoop conf on purpose.
- Reflow the header scaladoc to satisfy spotless/scalafmt (wrapMaxColumn=100).

Verified locally: `spotless:check` and `scalastyle:check` both pass on
backends-velox.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR CI showed the plain-Parquet cases in VeloxParquetRowGroupSuite all pass: a
plain `INSERT OVERWRITE DIRECTORY ... USING PARQUET` write DOES honor
`parquet.block.size` set on the runtime Hadoop conf. So the Delta
`DeletionVectorsWithPredicatePushdownSuite` abort is not a generic native-writer
problem -- it is specific to the Delta write path.

Adds VeloxDeltaParquetRowGroupSuite: writes a 1M-row Delta table with
`parquet.block.size` set on the same conf object Delta uses
(`spark.sparkContext.hadoopConfiguration`) and asserts the data file has more
than one row group. This is expected to FAIL today (single row group),
reproducing the abort; the deletion-vectors variant matches the original suite.

Also corrects the now-stale comments in VeloxParquetRowGroupSuite (the runtime
Hadoop-conf block size does reach the native writer for plain Parquet writes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
org.apache.spark.util.Utils is private[spark] and this suite is in package
org.apache.gluten.execution, so it cannot be referenced here (it also cascaded
into an ambiguous `.sum`). Replace Utils.tryWithResource with a plain
try/finally and make the row-group count explicitly Int.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…roupSuite

Move scalastyle:off marker before withTempPath lambda to cover the entire
hadoopConfiguration access on line 113.
felipepessoto added a commit to felipepessoto/gluten that referenced this pull request Jun 19, 2026
The field() method returns an int32 which can be negative. When a negative
index is cast to uint32_t for childAt(idx), it becomes a huge positive number
that bypasses Velox's VELOX_CHECK_LT(idx, size) check and causes undefined
behavior (SIGSEGV observed in Spark 4.0 script transformation tests).

The explicit idx >= 0 check is necessary to catch negative indices before
they're cast to unsigned and cause memory corruption.

Reverts commits:
- 4070b5c [VL] Use VELOX_ASSERT_RUNTIME_THROW for bounds check test
- fab87a2 [VL] Remove message check from test - just verify exception is thrown
- b6fe845 [VL] Fix test assertion to match Velox's error message format
- 68c93dd [VL] Remove redundant index bounds check - validated by PR apache#12278 CI

Fixes apache#12307 test failure in spark-test-spark40.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant