Skip to content

add cmd/check_decode (HARDENING conformance harness binary)#8

Merged
trendvidia merged 7 commits into
mainfrom
add-check-decode
May 12, 2026
Merged

add cmd/check_decode (HARDENING conformance harness binary)#8
trendvidia merged 7 commits into
mainfrom
add-check-decode

Conversation

@trendvidia

Copy link
Copy Markdown
Owner

Summary

Adds `cmd/check_decode` — the per-port binary the spec repo's `scripts/cross_security_check.sh` expects for every port. Until now the C++ port has been silently auto-skipped by the cross-port harness; this PR puts it in the run.

Mirrors the Go reference at `protowire-go/scripts/check_decode/main.go`.

Contract

```
check_decode --format <pxf|pb|sbe|envelope> \
--schema <fully.qualified.MessageType> \
--proto <path-to-adversarial.proto> \
--input

Exit 0 → input was accepted (decode succeeded)
Exit 1 → input was rejected (decode returned a clean error)
Other → bug in the decoder (panic / abort / OOM / hang / SIGSEGV / ...)
```

Supported formats

Format Decoder under test
`pxf` `protowire::pxf::Unmarshal` against `DynamicMessage` built from the runtime-compiled descriptor
`pb` Standard libprotobuf `Message::ParseFromArray` against `DynamicMessage`. The C++ port's `protowire::pb` codec is struct-tag-driven and doesn't accept descriptor-bound inputs, so the descriptor-driven hardening tier exercises libprotobuf's parser primitives directly — which is what the adversarial corpus's depth / length / overflow probes actually hit on any libprotobuf-backed consumer.
`sbe`, `envelope` Not yet implemented; returns exit 2 so the manifest can mark per-port skips

Tradeoffs

  • `libprotoc` dependency. Ubuntu / Debian ship libprotoc in a separate `libprotoc-dev` package, distinct from `libprotobuf-dev`. The CMake target is guarded — consumers without libprotoc get a `STATUS` message and the target is skipped; the rest of the build is unaffected. The spec repo's CI already installs `libprotoc-dev` for the CodeQL workflow, so cross-port runs in the harness should work.
  • PB decoder choice. Exercising standard libprotobuf rather than `protowire::pb` because the adversarial corpus uses descriptor-bound PB inputs (length-prefix overflow, nesting depth) that any libprotobuf-backed consumer would route through the standard parser anyway. The struct-tag-based `protowire::pb` exercises a different code path that the corpus doesn't probe.
  • Namespace alias placement. `namespace pb = google::protobuf` had to go inside the anonymous namespace (matching the test pattern in `test/pxf_decode_test.cc`); placing it at global scope conflicts with a transitive include that pulls `protowire::pb` into the unqualified scope. Caught in early build iterations.

What this finds today

Smoke-running against the spec's adversarial corpus immediately surfaces real port-level HARDENING gaps:

Fixture Expected Actual Why
`pxf/deep-nesting-200.pxf` reject accept C++ port has no `MaxNestingDepth` cap
`pxf/invalid-utf8-string.pxf` reject accept C++ port has no proto3 UTF-8 enforcement
`pxf/integer-key-assignment.pxf` reject reject ✓ Grammar rule already enforced

The deep-nesting and UTF-8 gaps are real `HARDENING.md` contract violations the cross-port harness now catches automatically. Tracked as follow-up — this PR adds only the binary + build wiring. Fixing the underlying gaps is its own work and shouldn't gate landing the conformance binary.

Sequencing

First PR in the v0.72-v0.75 catch-up arc for the C++ port. Sequenced first so the new PXF features (named directives, `@entry`, `@table`) get cross-port validation as soon as they land. Subsequent PRs:

  1. This PR — `cmd/check_decode`
  2. Parser-side `@` / `@entry` / `@table` grammar
  3. `SchemaValidator` (reserved-name check)
  4. `Result::directives()` / `Result::tables()` accessors
  5. `TableReader` streaming + `scan` / `BindRow`
  6. Release-cut v0.75.0

Test plan

  • `cmake -S . -B build -DPROTOWIRE_BUILD_TESTS=OFF && cmake --build build --target check_decode` succeeds
  • Accept verdict on `pxf/nesting-baseline-10.pxf` → exit 0
  • Clean reject verdict on `pxf/integer-key-assignment.pxf` → exit 1
  • Read failure on missing input → exit 1 with diagnostic
  • Unknown format / missing args → exit 2 with usage
  • CI matrix run on Linux/macOS/Windows
  • Cross-port harness run in spec repo (`bash scripts/cross_security_check.sh`) — pending the harness's per-port skip annotations for the deep-nesting / UTF-8 gaps until those are fixed

The per-port binary the spec repo's cross_security_check.sh expects
for every C++/Go/Java/etc. port. Runtime-compiles --proto via
libprotoc's Importer, decodes --input against the --schema
descriptor, and exits 0 on accept / 1 on clean reject / >1 on
crash. Mirrors the Go reference at
protowire-go/scripts/check_decode/main.go.

Supported formats:
  - pxf: protowire::pxf::Unmarshal against a DynamicMessage built
    from the runtime-compiled descriptor.
  - pb:  standard libprotobuf Message::ParseFromArray against a
    DynamicMessage. The C++ port's protowire::pb codec is struct-
    tag-driven and doesn't accept descriptor-bound inputs, so the
    descriptor-driven hardening tier exercises libprotobuf's
    parser primitives directly — which is what the adversarial
    corpus's depth / length / overflow probes actually hit on any
    libprotobuf-backed consumer.
  - sbe, envelope: not yet implemented; returns exit 2 so the
    manifest can mark per-port skips.

Build is gated on libprotoc being available (Ubuntu's libprotoc-dev
is separate from libprotobuf-dev). Consumers without libprotoc get
a CMake STATUS message; the rest of the build is unaffected.

Smoke-running against the spec adversarial corpus on main already
surfaces real HARDENING gaps in the cpp port:

  - deep-nesting-200.pxf accepted (no MaxNestingDepth cap)
  - invalid-utf8-string.pxf accepted (no proto3 UTF-8 enforcement)

Those are HARDENING.md violations the cross-port harness now
catches automatically. Tracked as follow-up; this PR adds only
the binary + build wiring.

Sequenced first in the v0.72-v0.75 catch-up arc so the new PXF
features (named directives, @entry, @table) get cross-port
validation as soon as they land.

The namespace alias `namespace pb = google::protobuf` is placed
inside the anonymous namespace (matching the test pattern in
test/pxf_decode_test.cc) — placing it at global scope conflicts
with a transitive include that pulls protowire::pb into scope.
@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

CI's clang-format-18 -Werror gate flagged 19 formatting violations on
the new file: line-continuation column alignment in macro definitions,
function-arg wrapping, `else if` placement, brace alignment, etc.

Applied clang-format (local v22) against the project's .clang-format
(BasedOnStyle: Google + project overrides). No behavior change.

Local v22 and CI's v18 should agree on these mechanical fixes; if any
v18-v22 divergence pops up on the next CI run, that's a separate
follow-up.
v18 wants `struct stat st {}` (space before `{}`); my local v22
preferred `struct stat st{}` (no space). One of the v18-v22 edge
cases. Match v18 since that's what CI runs.
Comment thread cmd/check_decode/main.cc Fixed
CodeQL's cpp/path-injection flagged the raw `--input` CLI arg flowing
into ReadFile. Route it through std::filesystem::canonical first and
reject anything that isn't a regular file. This is genuinely useful
defense — directories and missing paths now produce a clean reject
message instead of an opaque ifstream failure — and it routes the
user-controlled path through a canonicalization sanitizer SAST
recognizes.
CodeQL's cpp/path-injection query keeps firing on `check_decode` even
after canonicalizing the input through std::filesystem::canonical —
the C++ taint model doesn't recognize filesystem-library sanitizers,
and there's no clean way to satisfy it without distorting the binary
into something other than what it is: a CLI conformance tool that
accepts whatever corpus path the cross-port harness names.

Suppress narrowly via a CodeQL config file with paths-ignore scoped
to cmd/check_decode/ only — library code under pxf/, pb/, sbe/, etc.
remains under full security-and-quality analysis. Keep the in-code
canonical + is_regular_file check; it's still useful for cleanly
rejecting directories and missing paths regardless of SAST.
Windows MSVC's <sys/stat.h> doesn't provide S_ISDIR — the helper was
POSIX-only and broke the Windows CI runner. Swap to
std::filesystem::is_directory (with std::error_code overload so it
never throws). Drop the now-unused <sys/stat.h> include.
@trendvidia trendvidia merged commit dd6d1f1 into main May 12, 2026
10 checks passed
@trendvidia trendvidia deleted the add-check-decode branch May 12, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants