35 changes: 35 additions & 0 deletions CHANGELOG.md
@@ -18,6 +18,41 @@ format changes.

## [Unreleased]

### v1.0 spec changes

Three one-time spec changes from the protowire v1.0 freeze line
(STABILITY.md). **Breaking** — there is no alias period; v1.0 is itself
the major bump.

- `@table` directive renamed to `@dataset` (draft §3.4.4). Public API
follows: `Ast.TableDirective` → `Ast.DatasetDirective`, `Ast.TableRow`
→ `Ast.DatasetRow`, `TableReader` → `DatasetReader`,
`Document.tables()` → `Document.datasets()`, `Result.tables()` →
`Result.datasets()`. Source files `TableReader.java`,
`TableReaderTest.java`, `TableParserTest.java` renamed accordingly.
Decoder semantics unchanged.

- `@proto` directive added (draft §3.4.5). New `Ast.ProtoDirective`
record + `Ast.ProtoShape` enum (`ANONYMOUS`, `NAMED`, `SOURCE`,
`DESCRIPTOR`). Four body shapes lexically distinguished:
`@proto { ... }` (anonymous), `@proto pkg.Type { ... }` (named),
`@proto """..."""` (source), `@proto b"..."` (descriptor). Exposed
via `Document.protos()` and `Result.protos()`. Descriptor form is
the MUST-support shape per spec; this port supports all four.

- Reserved directive names expanded from 5 to 13 (draft §3.4.6).
Decoder rejects `@table`, `@datasource`, `@view`, `@procedure`,
`@function`, `@permissions` as spec-reserved (future-allocated).
`SchemaValidator.FUTURE_RESERVED_DIRECTIVES` exposes the set.
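
The four `@proto` body shapes in the list above are described as lexically distinguished, so the text that follows `@proto` should be enough to pick the shape. A minimal sketch of that classification follows. `ProtoShapeSketch` and `classify` are hypothetical names for illustration, not the port's parser, and the identifier check is a simplification.

```java
final class ProtoShapeSketch {
    // Mirrors the four shape names from the changelog entry.
    enum Shape { ANONYMOUS, NAMED, SOURCE, DESCRIPTOR }

    /** Classifies the text that follows "@proto" by its leading characters. */
    static Shape classify(String afterDirective) {
        String s = afterDirective.stripLeading();
        if (s.startsWith("\"\"\"")) {
            return Shape.SOURCE;        // @proto """..."""
        }
        // Check b" before the identifier case: a type name may also start with 'b'.
        if (s.startsWith("b\"")) {
            return Shape.DESCRIPTOR;    // @proto b"..."
        }
        if (s.startsWith("{")) {
            return Shape.ANONYMOUS;     // @proto { ... }
        }
        // Assumption: a dotted type name starts with a letter or underscore.
        if (!s.isEmpty() && (Character.isLetter(s.charAt(0)) || s.charAt(0) == '_')) {
            return Shape.NAMED;         // @proto pkg.Type { ... }
        }
        throw new IllegalArgumentException("not a recognised @proto body: " + s);
    }
}
```

The order of the checks matters only for the `b"` prefix; the other three prefixes cannot be confused with one another.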

`@dataset`'s row message type is now optional in the AST. When
omitted, the directive consumes the typed binding of a preceding
anonymous `@proto` per draft §3.4.4 Anonymous binding.
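
The anonymous-binding rule above can be read as a single pass over root-level items in document order. The sketch below is an illustration under stated assumptions: `Item`, `schemaId`, and `resolveRowTypes` are hypothetical stand-ins, and treating the binding as single-use is one reading of "consumes", not something this changelog confirms.

```java
import java.util.ArrayList;
import java.util.List;

final class AnonymousBindingSketch {
    /** Hypothetical stand-in for a root-level item: an anonymous @proto or a @dataset. */
    static final class Item {
        final boolean isAnonymousProto;
        final String schemaId;     // set for anonymous @proto items
        final String datasetType;  // set for @dataset items; "" when the type is omitted

        private Item(boolean isAnonymousProto, String schemaId, String datasetType) {
            this.isAnonymousProto = isAnonymousProto;
            this.schemaId = schemaId;
            this.datasetType = datasetType;
        }

        static Item anonymousProto(String schemaId) { return new Item(true, schemaId, null); }
        static Item dataset(String type) { return new Item(false, null, type); }
    }

    /** Returns the resolved row type for each @dataset, in document order. */
    static List<String> resolveRowTypes(List<Item> items) {
        List<String> resolved = new ArrayList<>();
        String pending = null; // most recent anonymous schema not yet consumed
        for (Item item : items) {
            if (item.isAnonymousProto) {
                pending = item.schemaId;
            } else if (!item.datasetType.isEmpty()) {
                resolved.add(item.datasetType); // explicit type: no binding needed
            } else if (pending != null) {
                resolved.add(pending);
                pending = null; // assumption: the binding is single-use
            } else {
                throw new IllegalStateException(
                    "@dataset without a type needs a preceding anonymous @proto");
            }
        }
        return resolved;
    }
}
```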

`Lexer.repositionTo(int)` added for the parser's `@proto` brace-body
skip (interior is protobuf source, not PXF, so the lexer hops past
the body rather than tokenising it).
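
Hopping past a brace body means finding the matching `}` without tokenising the interior as PXF. A minimal sketch of such a scan is below. `BraceSkipSketch` and `matchingBrace` are hypothetical names, and the handling of protobuf string literals and comments is an assumption about what a skip would need, not a description of the port's parser.

```java
final class BraceSkipSketch {
    /** Given the index of an opening '{', returns the index of its matching '}'. */
    static int matchingBrace(String src, int open) {
        if (src.charAt(open) != '{') {
            throw new IllegalArgumentException("expected '{' at " + open);
        }
        int depth = 0;
        int i = open;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipString(src, i);        // braces inside literals do not count
                continue;
            }
            if (src.startsWith("//", i)) {     // line comment: jump past the newline
                int nl = src.indexOf('\n', i);
                i = (nl < 0) ? src.length() : nl + 1;
                continue;
            }
            if (src.startsWith("/*", i)) {     // block comment: jump past "*/"
                int end = src.indexOf("*/", i + 2);
                if (end < 0) {
                    break;
                }
                i = end + 2;
                continue;
            }
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
            i++;
        }
        throw new IllegalArgumentException("unterminated brace body");
    }

    /** Returns the index just past the string literal that starts at {@code start}. */
    private static int skipString(String src, int start) {
        char quote = src.charAt(start);
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2;                        // skip the escaped character
                continue;
            }
            if (c == quote) {
                return i + 1;
            }
            i++;
        }
        return src.length();                   // unterminated: caller reports the error
    }
}
```

A caller would then resume lexing at the returned index plus one, which is presumably the kind of offset `Lexer.repositionTo(int)` accepts.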

## [0.75.0]

Catch-up release. First tagged version after v0.70.0; brings the Java
40 changes: 20 additions & 20 deletions README.md
@@ -98,9 +98,9 @@ byte[] formatted = Pxf.formatDocument(doc);

The AST is a sealed hierarchy of records (`Ast.Document`, `Ast.Entry`, `Ast.Value`); pattern matching is used throughout the formatter and decoder.

### Directives and `@table` (Result accessors)
### Directives and `@dataset` (Result accessors)

PXF documents can carry [`@<name>` directives, `@entry` bundles, and `@table` rows](https://github.com/trendvidia/protowire#directives) at the document root alongside (or instead of) a message body. `unmarshalFull` captures all three on `Result`:
PXF documents can carry [`@<name>` directives, `@entry` bundles, and `@dataset` rows](https://github.com/trendvidia/protowire#directives) at the document root alongside (or instead of) a message body. `unmarshalFull` captures all three on `Result`:

```java
Result r = Pxf.unmarshalFull(pxfBytes, b);
@@ -112,64 +112,64 @@ for (Ast.Directive d : r.directives()) {
// chosen message, chameleon's @header pattern.
}

for (Ast.TableDirective t : r.tables()) {
// t.type(), t.columns(), t.rows() List<TableRow>.
for (Ast.DatasetDirective t : r.datasets()) {
// t.type(), t.columns(), t.rows() List<DatasetRow>.
// Each row.cells().get(i) is:
// null — empty cell (field absent, pxf.default applies)
// Ast.NullVal — explicit null (field cleared per §3.9)
// any other Ast.Value — field set to that value
}
```

`r.directives()` excludes `@type` and `@table` (those have their own accessors). Order is preserved.
`r.directives()` excludes `@type` and `@dataset` (those have their own accessors). Order is preserved.
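
The three cell states listed in the comment above can be told apart with a `null` check followed by a type check. The sketch below uses hypothetical stand-in types (`Value`, `NullVal`, `IntVal`, `CellState`) so that it compiles on its own; the real code would use `Ast.Value` and `Ast.NullVal` instead.

```java
final class CellStateSketch {
    /** Hypothetical stand-ins for Ast.Value and two of its variants. */
    interface Value {}
    static final class NullVal implements Value {}
    static final class IntVal implements Value {
        final long value;
        IntVal(long value) { this.value = value; }
    }

    enum CellState { ABSENT, CLEARED, SET }

    /** Maps one row cell to its state, following the README's three cases. */
    static CellState stateOf(Value cell) {
        if (cell == null) {
            return CellState.ABSENT;   // empty cell: field absent
        }
        if (cell instanceof NullVal) {
            return CellState.CLEARED;  // explicit null: field cleared
        }
        return CellState.SET;          // any other value: field set
    }
}
```

The `null` check has to come first, because `instanceof` is simply false for `null` and would otherwise classify an empty cell as set.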

### `TableReader`: streaming `@table` consumption
### `DatasetReader`: streaming `@dataset` consumption

For datasets too large to materialize, read rows from an `InputStream` with working-set memory bounded by the size of the largest single row — not by the row sequence:

```java
try (var in = Files.newInputStream(Path.of("trades.pxf"))) {
var tr = new TableReader(in);
var tr = new DatasetReader(in);
String typ = tr.type();
List<String> cols = tr.columns();
List<Ast.Directive> hdrs = tr.directives(); // side-channel directives before the @table header
List<Ast.Directive> hdrs = tr.directives(); // side-channel directives before the @dataset header

Ast.TableRow row;
Ast.DatasetRow row;
while ((row = tr.next()) != null) {
// row.cells(): List<Ast.Value> with the three-state mapping above.
}
}
```

`NewTableReader` throws `NoSuchElementException` if the input ends before any `@table` directive. Multi-table documents chain via `tr.tail()`, which returns an `InputStream` of the buffered-but-unconsumed bytes followed by the remaining source:
The `DatasetReader` constructor throws `NoSuchElementException` if the input ends before any `@dataset` directive. Multi-dataset documents chain via `tr.tail()`, which returns an `InputStream` of the buffered-but-unconsumed bytes followed by the remaining source:

```java
var tr1 = new TableReader(src);
var tr1 = new DatasetReader(src);
// ... iterate tr1.next() until it returns null ...
var tr2 = new TableReader(tr1.tail());
var tr2 = new DatasetReader(tr1.tail());
```
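
A stream of "buffered-but-unconsumed bytes followed by the remaining source" can be assembled from two standard-library classes. The sketch below shows one way to do that; `TailSketch` and its parameters are hypothetical, and nothing here is confirmed to match how `DatasetReader.tail()` is actually implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;

final class TailSketch {
    /**
     * Builds a stream over buffer[consumed, filled) followed by everything
     * still unread in {@code source}.
     */
    static InputStream tail(byte[] buffer, int consumed, int filled, InputStream source) {
        if (consumed < 0 || consumed > filled || filled > buffer.length) {
            throw new IllegalArgumentException("invalid buffer window");
        }
        // Bytes already pulled from the source but not yet handed to a caller.
        InputStream leftover = new ByteArrayInputStream(buffer, consumed, filled - consumed);
        // SequenceInputStream drains the first stream, then continues with the second.
        return new SequenceInputStream(leftover, source);
    }
}
```

Because the leftover bytes are replayed first, the next reader sees the input exactly as if the first reader had never read ahead.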

Per-row arity and v1 cell-grammar errors (`[...]` / `{...}` cells, dotted columns) surface as the offending row is consumed, not deferred to end-of-input — see [draft §3.4.4 "Streaming consumption"](https://github.com/trendvidia/protowire/blob/main/docs/draft-trendvidia-protowire-00.txt).

### `scan` and `BindRow`: per-row binding

`TableReader.scan(builder)` reads the next row and binds its cells to the message by column name; returns `false` when the row sequence is exhausted:
`DatasetReader.scan(builder)` reads the next row and binds its cells to the message by column name; returns `false` when the row sequence is exhausted:

```java
var tr = new TableReader(in);
var tr = new DatasetReader(in);
while (true) {
var b = Trade.newBuilder();
if (!tr.scan(b)) break;
process(b.build());
}
```

`BindRow.bindRow(builder, columns, row)` is the same logic exposed standalone, for callers iterating `Result.tables()[i].rows()` on the materializing path:
`BindRow.bindRow(builder, columns, row)` is the same logic exposed standalone, for callers iterating `Result.datasets().get(i).rows()` on the materializing path:

```java
Ast.Document doc = Pxf.parse(pxfBytes);
for (Ast.TableDirective tbl : doc.tables()) {
for (Ast.TableRow row : tbl.rows()) {
for (Ast.DatasetDirective tbl : doc.datasets()) {
for (Ast.DatasetRow row : tbl.rows()) {
var b = Trade.newBuilder();
BindRow.bindRow(b, tbl.columns(), row);
process(b.build());
@@ -223,9 +223,9 @@ PXF (`:pxf`):
- ✅ AST-preserving `formatDocument`.
- ✅ **`@<name>` named directives** at document root with raw-body extraction (`Ast.Directive`, `Result.directives()`).
- ✅ **`@entry` bundle directive** (zero-or-more prefix list; four permitted shapes per draft §3.4.3).
- ✅ **`@table` directive** (the protowire-native CSV replacement) — `Ast.TableDirective`, `Ast.TableRow`, three-state cells, parser enforces row arity + dotted-column rejection + list/block-cell rejection + standalone-constraint.
- ✅ **Streaming `TableReader`** over `InputStream` for datasets too large to materialize. Working-set memory bounded by largest single row.
- ✅ **Per-row binding** via `TableReader.scan(Message.Builder)` and standalone `BindRow.bindRow(...)`.
- ✅ **`@dataset` directive** (the protowire-native CSV replacement) — `Ast.DatasetDirective`, `Ast.DatasetRow`, three-state cells, parser enforces row arity + dotted-column rejection + list/block-cell rejection + standalone-constraint.
- ✅ **Streaming `DatasetReader`** over `InputStream` for datasets too large to materialize. Working-set memory bounded by largest single row.
- ✅ **Per-row binding** via `DatasetReader.scan(Message.Builder)` and standalone `BindRow.bindRow(...)`.
- ✅ **Schema reserved-name check** (`SchemaValidator.validateFile` / `validateDescriptor`) catches schemas declaring fields/oneofs/enum values named `null`/`true`/`false`. Runs by default on every `unmarshal*` call; `UnmarshalOptions.withSkipValidate(true)` opts out.

SBE (`:sbe`):
85 changes: 68 additions & 17 deletions pxf/src/main/java/org/protowire/pxf/Ast.java
@@ -19,24 +19,28 @@ public record Comment(Position pos, String text) {}
*
* @param typeUrl body's message type from {@code @type}; may be empty
* @param directives side-channel {@code @<name>} directives at document
* root in source order (excludes {@code @type} and
* {@code @table})
* @param tables {@code @table} directives in source order. Per draft
* §3.4.4 a document with any table MUST NOT also have
* a {@code typeUrl} or body entries; the parser
* enforces this
* root in source order (excludes spec-defined
* directives: {@code @type}, {@code @dataset},
* {@code @proto}, {@code @entry})
* @param datasets {@code @dataset} directives in source order. Per
* draft §3.4.4 a document with any dataset MUST
* NOT also have a {@code typeUrl} or body entries;
* the parser enforces this
* @param protos {@code @proto} directives in source order
* (draft §3.4.5)
* @param entries message body entries
* @param leadingComments comments before any directive or body entry
*/
public record Document(
String typeUrl,
List<Directive> directives,
List<TableDirective> tables,
List<DatasetDirective> datasets,
List<ProtoDirective> protos,
List<Entry> entries,
List<Comment> leadingComments) {

public static Document of(String typeUrl, List<Entry> entries) {
return new Document(typeUrl, List.of(), List.of(), List.copyOf(entries), List.of());
return new Document(typeUrl, List.of(), List.of(), List.of(), List.copyOf(entries), List.of());
}
}

@@ -74,31 +78,78 @@ public record Directive(
List<Comment> leadingComments) {}

/**
* A {@code @table <type> ( col1, col2, ... ) row*} directive at document
* root (draft §3.4.4). Carries many instances of one message type — the
* protowire-native CSV replacement.
* A {@code @dataset <type> ( col1, col2, ... ) row*} directive at
* document root (draft §3.4.4). Carries many instances of one message
* type — the protowire-native CSV replacement.
*
* <p>v1 cell-grammar restrictions enforced by the parser: cells exclude
* list and block values; column entries are unqualified field names
* (no dotted paths); row arity equals column count; documents with any
* {@code @table} MUST NOT carry {@code @type} or body field entries.
* {@code @dataset} MUST NOT carry {@code @type} or body field entries.
*
* <p>{@code type} MAY be empty when an anonymous {@code @proto}
* directive (Section 3.4.5) precedes the dataset in document order;
* the anonymous schema is consumed as the row message type.
*/
public record TableDirective(
public record DatasetDirective(
Position pos,
String type,
List<String> columns,
List<TableRow> rows,
List<DatasetRow> rows,
List<Comment> leadingComments) {}

/**
* One parenthesized cell tuple in a {@link TableDirective}. The cells
* list has the same length as the containing table's column list.
* One parenthesized cell tuple in a {@link DatasetDirective}. The cells
* list has the same length as the containing dataset's column list.
* A {@code null} entry in cells denotes an absent field (the "empty
* cell" between two commas); a {@link NullVal} denotes a present-but-
* null field; any other {@link Value} denotes a present field with that
* value.
*/
public record TableRow(Position pos, List<Value> cells) {}
public record DatasetRow(Position pos, List<Value> cells) {}

/**
* Shape of a {@link ProtoDirective}'s body (draft §3.4.5).
*/
public enum ProtoShape {
/** {@code @proto { <message-body> }} — defines an unnamed message. */
ANONYMOUS,
/** {@code @proto <dotted-name> { <message-body> }} — single named message. */
NAMED,
/** {@code @proto """<proto-source>"""} — complete .proto source file. */
SOURCE,
/** {@code @proto b"<base64-FileDescriptorSet>"} — compiled descriptor. */
DESCRIPTOR;
}

/**
* A {@code @proto <body>} directive at document root (draft §3.4.5).
* Carries an embedded protobuf schema, making the PXF document
* self-describing. The shape distinguishes the four lexically-determined
* body forms.
*
* <p>{@code body} carries raw bytes per shape:
* <ul>
* <li>{@link ProtoShape#ANONYMOUS}, {@link ProtoShape#NAMED}: bytes
* between the opening {@code {} and matching {@code }} (both
* exclusive). The bytes are protobuf message-body source.</li>
* <li>{@link ProtoShape#SOURCE}: contents of the triple-quoted string
* (with leading-LF / dedent applied). The bytes are a complete
* {@code .proto} source file.</li>
* <li>{@link ProtoShape#DESCRIPTOR}: base64-decoded bytes of the
* bytes literal. The bytes are a serialised
* {@code google.protobuf.FileDescriptorSet}.</li>
* </ul>
*
* <p>{@code typeName} is non-empty only when {@code shape} is
* {@link ProtoShape#NAMED}.
*/
public record ProtoDirective(
Position pos,
ProtoShape shape,
String typeName,
byte[] body,
List<Comment> leadingComments) {}

/** A single entry inside a message body: assignment, map entry, or block. */
public sealed interface Entry permits Assignment, MapEntry, Block {
18 changes: 9 additions & 9 deletions pxf/src/main/java/org/protowire/pxf/BindRow.java
@@ -10,8 +10,8 @@
import java.util.List;

/**
* Per-row proto-binding helper for {@code @table} rows. Sits atop the
* streaming {@link TableReader} (via {@link TableReader#scan}) and is also
* Per-row proto-binding helper for {@code @dataset} rows. Sits atop the
* streaming {@link DatasetReader} (via {@link DatasetReader#scan}) and is also
* exported as a standalone helper for callers that iterate the
* materializing path's {@link Result#datasets()} rows.
*
@@ -49,15 +49,15 @@ private BindRow() {}
* surfaces as a "field not found" error from the underlying unmarshal
* call (unless {@link UnmarshalOptions#discardUnknown} is set).
*/
public static void bindRow(Message.Builder builder, List<String> columns, Ast.TableRow row) {
public static void bindRow(Message.Builder builder, List<String> columns, Ast.DatasetRow row) {
if (columns.size() != row.cells().size()) {
throw new IllegalArgumentException(
"BindRow: " + columns.size() + " columns vs " + row.cells().size() + " cells");
}
byte[] body = rowToPxfBody(columns, row);
// Run the synthetic body through the standard unmarshal pipeline.
// SkipValidate avoids re-running the reserved-name check per row
// (the caller's TableReader / unmarshalFull already validated the
// (the caller's DatasetReader / unmarshalFull already validated the
// descriptor once at bind time).
UnmarshalOptions.defaults().withSkipValidate(true).unmarshal(body, builder);
}
@@ -67,7 +67,7 @@ public static void bindRow(Message.Builder builder, List<String> columns, Ast.Ta
* non-{@code null} cell, in column order. Empty cells produce no
* entry — the field stays absent from the decoder's perspective.
*/
static byte[] rowToPxfBody(List<String> columns, Ast.TableRow row) {
static byte[] rowToPxfBody(List<String> columns, Ast.DatasetRow row) {
ByteArrayOutputStream out = new ByteArrayOutputStream();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < row.cells().size(); i++) {
@@ -83,12 +83,12 @@ static byte[] rowToPxfBody(List<String> columns, Ast.TableRow row) {
}

/**
* Format a single cell value as PXF text. v1 {@code @table} cells are
* Format a single cell value as PXF text. v1 {@code @dataset} cells are
* scalar-shaped (no list, no block), so only the leaf-value variants
* appear; list and block AST nodes are unreachable here because
* {@code parseTableRow} / {@code consumeRowCell} rejects them before
* {@code parseDatasetRow} / {@code consumeRowCell} rejects them before
* the streaming reader hands them to {@code bindRow}. Hand-constructed
* TableRow values bypass that check, so guard defensively.
* DatasetRow values bypass that check, so guard defensively.
*
* <p>The {@code NullVal} / {@code ListVal} / {@code BlockVal} cases
* don't need to read the bound variable, so they're checked via
@@ -105,7 +105,7 @@ static void writeCellValue(StringBuilder sb, Ast.Value v) {
if (v instanceof Ast.ListVal || v instanceof Ast.BlockVal) {
throw new IllegalArgumentException(
"BindRow: unexpected " + (v instanceof Ast.ListVal ? "list" : "block")
+ " value in cell (v1 @table cells are scalar-shaped)");
+ " value in cell (v1 @dataset cells are scalar-shaped)");
}
switch (v) {
case Ast.StringVal s ->