Skip to content

pxf: TableReader streaming + scan / BindRow#38

Merged
trendvidia merged 3 commits into
mainfrom
java-table-stream-scan
May 12, 2026
Merged

pxf: TableReader streaming + scan / BindRow#38
trendvidia merged 3 commits into
mainfrom
java-table-stream-scan

Conversation

@trendvidia

Copy link
Copy Markdown
Owner

Summary

Companion to protowire-go v0.74 (pxf.TableReader streaming) and v0.75 (TableReader.scan + pxf.BindRow). Reads rows from an InputStream one at a time with working-set memory bounded by the size of the largest single row — the shape consumers need for CSV-replacement datasets that don't fit in memory.

```java
try (var tr = new TableReader(in)) {
while (true) {
var b = AllTypes.newBuilder();
if (!tr.scan(b)) break;
process(b.build());
}
}
```

Cell-state semantics match BindRow: a null cell leaves the field absent (pxf.default applied, pxf.required errors), an Ast.NullVal cell clears wrappers / optional / oneof per §3.9, any other value sets the field. WKT timestamps + durations, enum-by-name, proto3 wrappers all bind correctly because BindRow re-uses the existing unmarshal pipeline (format-and-reparse).

API additions

Symbol Purpose
`TableReader(InputStream)` Reads through leading directives + header, returns reader positioned at the first row
`(*TableReader).type() / columns() / directives()` Header accessors
`(*TableReader).next()` Reads next row; returns `null` at EOF. Errors are sticky
`(*TableReader).scan(Message.Builder)` Reads next row + binds to msg via BindRow; returns false at EOF
`(*TableReader).tail()` InputStream yielding buffered + remaining bytes; chain for multi-table docs
`BindRow.bindRow(Message.Builder, List, Ast.TableRow)` Standalone helper

Implementation

  • Row-boundary scanner. Byte-level walker with string / bytes-literal / line-comment / block-comment awareness — embedded parens / ) inside literals don't trip the scan. Mirrors protowire-go's findNextRow in table_stream.go.
  • Header reuses Parser.parse(). Buffered header prefix (up through the column-list )) is parsed by the existing AST parser, so the standalone constraint, dotted-column rejection, and arity enforcement all flow through unchanged.
  • Per-row decode reuses Parser.parseTableRow. New package-private static helper Parser.parseTableRow(byte[], int) lets TableReader decode each row's byte slice without re-running the full document grammar.
  • BindRow strategy. Render each non-null cell back to PXF text (<column> = <value>\n), concat as synthetic body, run UnmarshalOptions.withSkipValidate(true).unmarshal. Reuses every existing decoder branch (WKTs, wrappers, enums, defaults, required, oneof) for zero new switch arms. Same trade protowire-go made.
  • Header byte budget caps at 64 KiB — fail-fast against a TableReader pointed at a giant body-only document with no @table ever.

Tradeoffs

  • Errors are sticky. Once next() or scan throws, subsequent calls rethrow the same exception. Matches the Go port's contract.
  • tail() semantics. Must only be called after next() returns null. Calling earlier returns bytes the current reader still intends to consume, which desynchronizes the next reader.
  • TableReader does NOT implement AutoCloseable. The underlying InputStream is the caller's responsibility; closing the reader would close `src` and prevent tail() chaining. Documented in javadoc.
  • 24 tests, lots of byte-precise coverage. The byte-at-a-time ChunkedStream reader is adversarial for any buffer-boundary bug; protowire-go's equivalent caught real issues there.

Stacking

Fourth of five Java port PRs:

  1. pxf: parser-side support for v0.72/v0.73 grammar features #35 — parser-side support for v0.72/v0.73 grammar
  2. pxf: schema reserved-name check (draft §3.13) #36 — schema reserved-name check
  3. pxf: Result.directives() / Result.tables() accessors #37 — Result accessors
  4. This PR — TableReader streaming + scan / BindRow
  5. Next — release-cut + JetBrains jar refresh

Test plan

  • `./gradlew :pxf:test` — all green (existing + 24 new `TableReaderTest`)
  • Equivalence: streaming and materializing paths produce identical cell type/value sequences for the same input
  • Byte-at-a-time `InputStream` works (`ChunkedStream` adversarial test)
  • Multi-table via `tail()` recovers correctly
  • `scan` works for all 9 cell value variants via WKT + wrapper + enum binding

Companion to protowire-go v0.74 (pxf.TableReader streaming) and v0.75
(TableReader.scan + pxf.BindRow). Reads rows from an InputStream one
at a time with working-set memory bounded by the size of the largest
single row.

API additions:
  - TableReader(InputStream) — header parsed at construction; reader
    positioned at the first row
  - type(), columns(), directives() — header accessors
  - tail() — InputStream yielding buffered + remaining bytes; use to
    chain a second TableReader for multi-table documents
  - next() — returns the next Ast.TableRow or null at EOF; errors
    are sticky
  - scan(Message.Builder) — reads next row and binds cells to msg
    fields by column name via BindRow
  - BindRow.bindRow(Message.Builder, List<String>, Ast.TableRow) —
    standalone helper for callers iterating Result.tables()

Implementation: byte-level row-boundary scanner pulls bytes from the
source InputStream on demand and slices one ( ... ) row range at a
time, handed to Parser.parseTableRow for cell decoding. The scanner
is string / bytes / line-comment / block-comment aware so embedded
parens don't trip it. Header parsing reuses Parser.parse() on the
buffered header prefix; the standalone constraint, dotted-column
rejection, and arity enforcement come from there. Header byte
budget caps at 64 KiB.

BindRow strategy mirrors protowire-go's table_bind.go: render each
non-null cell back to PXF text (`<column> = <value>\n`), concat as
synthetic body, run UnmarshalOptions.withSkipValidate(true).unmarshal.
Reuses every existing decoder branch (WKTs, wrappers, enums,
defaults, required, oneof) for zero new switch arms.

Tests in TableReaderTest (24 cases): basic streaming, three cell
states, side-channel directives before header, sticky errors,
list/block cells rejected mid-stream, strings + block + line
comments with embedded parens, byte-at-a-time InputStream
(adversarial for buffer boundaries), multi-table via tail(),
equivalence with materializing path, oversized-header rejection,
scan happy path, scan empty-cell-leaves-field-at-zero,
null-on-wrapper clearing, WKT timestamp binding, BindRow against
materializing path, arity mismatch, non-leaf-cell rejection.

Parser gained a package-private parseTableRow(byte[], int) static
helper so TableReader can decode each row's byte slice without
re-running the full document grammar.
Comment thread pxf/src/main/java/org/protowire/pxf/BindRow.java Fixed
Comment thread pxf/src/main/java/org/protowire/pxf/BindRow.java Fixed
Comment thread pxf/src/main/java/org/protowire/pxf/BindRow.java Fixed
CodeQL "Unread local variable" flagged three pattern-match bindings
in writeCellValue's switch (NullVal n, ListVal l, BlockVal bv) — all
three cases just appended a literal string or threw, with no use of
the bound variable. Java 21 standard pattern matching requires a
binding on every `case` label (unnamed-pattern `_` is preview in 21
and stable in 22; this project doesn't enable preview), so the names
couldn't be dropped.

Refactor: handle NullVal / ListVal / BlockVal via `instanceof` checks
before the switch (no binding needed), and keep the switch for the
value-using cases that actually read the bound variable. Switch is
no longer exhaustive over the sealed Ast.Value so a default arm
guards against any future leaf-value type we miss adding.

No behavior change. All BindRow / TableReader tests still pass.
Same CodeQL "Unread local variable" fix as the previous commit
applied to BindRow. Only NullVal had the issue here — ListVal and
BlockVal both use their bound variables (l.elements() /
bv.entries()).

Java 21 standard pattern matching requires a binding on every
`case` label, and unnamed-pattern `_` is preview-only. Routing the
no-binding case out via instanceof before the switch is the
cleanest fix. Default arm guards against any future leaf-value
type we miss adding.
@trendvidia trendvidia merged commit 7739e40 into main May 12, 2026
3 checks passed
@trendvidia trendvidia deleted the java-table-stream-scan branch May 12, 2026 03:57
trendvidia added a commit that referenced this pull request May 12, 2026
Promote [Unreleased] entries to [0.75.0] in advance of tagging
v0.75.0. First tagged version after v0.70.0; consolidates four
merged batches that brought the Java port up to the v0.75.0
protowire spec:

  - #35 parser-side support for v0.72 named directives + v0.73
    @entry prefix list + @table directive
  - #36 schema reserved-name check (draft §3.13)
  - #37 Result.directives() / Result.tables() accessors
  - #38 TableReader streaming + scan(Message.Builder) + BindRow

Aligns protowire-java's version number with the rest of the
protowire-* stack, which is at v0.75.0 across protowire and
protowire-go. Wire format unchanged across all four batches.

After merge + tag, the JetBrains plugin jar refresh in the spec
repo (scripts/refresh_jetbrains_parser_jar.sh) picks up the new
parser and unblocks the red squiggles on @<name>/@entry/@table
documents in the IDE.
trendvidia added a commit that referenced this pull request May 12, 2026
CodeQL was failing with "could not process any code written in
Java/Kotlin" on PRs that touched no Java source — the README PR
(#40) and the release-cut commit (#39, which only edits
CHANGELOG.md).

Diagnosis: gradle/actions/setup-gradle@v6 restores the build cache
on every run. README- and CHANGELOG-only changes leave every input
to the JavaCompile task identical, so Gradle reports the `classes`
task as UP-TO-DATE and runs zero `javac` invocations. CodeQL's
Java tracer wraps `javac`; no invocations means no extracted
classes, which surfaces as the empty-database error at the
analyze step.

Confirmed by the run-duration pattern in `gh run list --workflow=codeql.yml`:
  - failing runs (#39 release-cut, #40 README) finished in ~1 min
  - successful runs (#35-#38 feature PRs that touched src/) took ~2 min

The minute of "missing" wall-clock is exactly the javac step that
didn't run.

Fix: `--no-build-cache clean classes` forces a recompile every
CodeQL invocation. Drops a `clean` task ahead of `classes` to
guarantee no UP-TO-DATE skip, and `--no-build-cache` is
belt-and-braces against the setup-gradle action's cache restore.

Slightly slower CodeQL runs (~30s more on cold compile) but the
analyze step gets real javac output to extract from.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants