feat(replication): v12 storage-bound audit — design-complete by grumbach · Pull Request #113 · WithAutonomi/ant-node

grumbach · 2026-05-26T09:24:32Z

Summary

v12 storage-bound audit (notes/security-findings-2026-05-22/proposal-gossip-audit-v12.md), end-to-end and design-complete. Closes Finding 1 (lazy node defeats audit via on-demand fetch) and Finding 2 (bootstrap-claim shield).

This is a breaking wire change (StorageCommitment now embeds the sender's ML-DSA-65 public key; old peers will fail to decode). Auto-upgrade handles the version break.

Design checklist (v10 §§1-13 + v12 §5)

§	Item	Status
§1	Merkle tree + StorageCommitment + piggyback on NeighborSync	✅
§2 step 1	sender_peer_id == authenticated transport peer	✅
§2 step 2	sender in routing table	✅
§2 step 3	rate limit: 1 sig verify per peer per 60s (atomic, race-free)	✅
§2 step 4	ML-DSA-65 signature verify (against embedded pubkey)	✅
§2 step 5	sticky `commitment_capable` flag	✅
§3	bootstrap-claim shield: capable peer with no commitment → no audit, no credit	✅
§4	Responder: retained-slot rotation (4 slots × 1h = 4h coverage)	✅
§5 (v12)	Conditional `UnknownCommitmentHash` invalidation	✅
§6	RecentProvers with 40min TTL + hash-bound credit + per-key cap 16	✅
§6	`is_credited_holder` threaded into quorum (`evaluate_key_evidence_with_holder_check`)	✅
§7-8	Lazy-node + replay attack closure	✅ (PoC tests)
§10	Domain separation constants	✅
§11	DoS mitigations (rate limit, RT gate, cap, structural bounds)	✅
§12	Backwards compat	⚠ Wire-break accepted (auto-upgrade handles)
§13	All implementation checklist items	✅

What this PR ships

Foundation (4 modules + struct): commitment.rs, commitment_state.rs (incl. new PeerCommitmentRecord), commitment_audit.rs, recent_provers.rs.
Responder side fully wired: 1h rotation, immediate startup build + gossip trigger, 4-slot retention, streaming per-key response build.
Auditor side fully wired: pinned challenges, streaming verifier, conditional invalidation, holder-credit recording.
Gossip ingest: 5-gate verification (RT, peer-id binding, pubkey-binding, signature, cache cap) with atomic per-peer 60s sig-verify rate limit. Auto-cleanup on PeerRemoved.
Quorum integration: evaluate_key_evidence_with_holder_check downgrades uncredited Present claims to Unresolved. Wired into run_verification_cycle.
Bandwidth: ~5.3 KiB per commitment, ~85 KiB/h per node at default cadence. Documented.

Security review

15 rounds of codex review + David's PR review. All findings addressed. Selected highlights:

Round	Severity	Issue	Fix
5	BLOCKER	Auditor preloaded all chunks (multi-GB DoS)	Streaming verifier (one chunk peak)
5	BLOCKER	Sender pubkey not bound to peer-id	Gate 2c: `BLAKE3(pubkey) == sender_peer_id`
6	BLOCKER	Substring match on rejection allowed audit bypass	Strict exact-match + pinned-challenge gate
7	MAJOR	Off-RT sybils could churn the cache	RT-membership gate at ingest
8	MAJOR	Dropping pin on "unknown" reopened on-demand-fetch attack	Initially kept pin; revisited in round-13 with §3 shield in place
9	MAJOR	Plain Digests accepted on pinned challenges	Reject as MalformedResponse when pinned
9	MAJOR	Responder still preloaded chunks	Streaming responder via precheck + per-key helpers
10	MAJOR	Rotation cadence (10 min) vs gossip cooldown (1 h)	Rotation → 1 h
11	MAJOR	2-slot retention insufficient for gossip lag	Retention → 4 slots
11	MAJOR	"key not in commitment" wrongly penalised	Treat as benign staleness
12	MAJOR	"missing bytes for committed key" indistinguishable	Differentiated rejection strings
12	MAJOR	Pre-restart pins unrecoverable	Immediate post-startup gossip trigger
David	MAJOR	RecentProvers lacked TTL eviction	TTL + periodic sweep
13	MEDIUM	Rate limit only fired for peers with cached commitments	Separate `sig_verify_attempts` map stamped on every attempt
13	MEDIUM	§6 TTL was 4h vs design's 40min	Corrected to 40m
14	MEDIUM	Rate-limit check + stamp under separate locks (race)	Atomic single write-lock critical section
Final (15)	—	No findings	—

Threat coverage (20 PoC tests in `tests/poc_commitment_audit_attacks.rs`)

Lazy node passes audit with on-demand fetch under original pin (admitted v12 economic-not-cryptographic limit)
Fresh-commitment substitution rejected by pin
Cross-peer commitment substitution rejected by sender-id binding (gate 2a)
Throwaway-key substitution rejected by peer-id ↔ pubkey binding (gate 2c)
All structural failure modes rejected with correct AuditVerifyError variant
Audit response replay blocked by per-challenge nonce
Responder drops old commitments past retention window
Silent peer earns no credit
Rotated commitment drops holder credit
Overclaim via partial commitment yields no holder credit (lemma + end-to-end)
commitment_capable flag is sticky across eviction
capable_but_no_commitment constructor preserves capability

Plus 60 unit tests across the four new modules (including 3 new quorum tests for the holder-credit predicate).

Tests + CI

557 lib tests pass.
20 PoC tests pass.
cargo fmt --check clean.
cargo clippy --all-targets --all-features warning-only; the deny gates (-D clippy::panic -D clippy::unwrap_used -D clippy::expect_used) pass.

Test plan

cargo test --lib: 557 pass
cargo test --test poc_commitment_audit_attacks: 20 pass
cfd clean
codex review (15 rounds; no findings on round 15)
David's PR architectural review
Single-node devnet smoke test (stage 0 of testnet plan)
4-9 node testnet (stage 1: gossip + responder + auditor + holder-credit enabled)

Out-of-scope (future PRs)

Disk persistence of commitments: ML-DSA signatures are randomized, so the commitment hash changes across restart. Mitigated by immediate post-startup gossip trigger (recovery window measured in seconds). Disk persistence would let pre-restart pins survive.
Event-driven rotation on fresh-write: a just-replicated key is currently auditable only after the next 1h rotation. The auditor treats this as benign staleness (no trust penalty). Event-driven rebuild would close the gap but adds wiring complexity for marginal gain.
Concurrent-ingest regression test for the sig-verify rate limit: only inspected for race-correctness; no dedicated test.

…se 1) Implements the v12 design (notes/security-findings-2026-05-22/ proposal-gossip-audit-v12.md) for closing audit findings 1 (audit not storage-bound) and 2 (bootstrap-claim shield). Phase 1 is foundation only: wire types, Merkle tree, sign/verify, and the auditor's commitment-hash pin. No integration with gossip or the audit challenge/response flow yet — those land in phase 2 so each slice is independently reviewable. What this commit adds: - `StorageCommitment` wire type. ML-DSA-65 signed over (root, key_count, sender_peer_id) with explicit domain separation ("autonomi.ant.replication.storage_commitment.v1"). - `CommitmentBoundResult` wire type for per-key audit response entries: key, digest (existing audit semantics), bytes_hash (so the auditor rebuilds the leaf from its own local bytes), leaf_index (so the auditor knows left/right child ordering at each path level), and the Merkle inclusion path. - `MerkleTree` over leaves of the form BLAKE3(DOMAIN_LEAF || key || BLAKE3(bytes)). Sorted by key for deterministic roots; odd-count levels self-pair (node_hash(x, x)). Build is O(n) hashing, path lookup is O(log n). - `commitment_hash` = BLAKE3(DOMAIN_COMMITMENT_HASH || postcard(commitment)). Postcard's length-prefixed canonical encoding ensures any change to any field — including the variable-length signature — produces a different hash. This is the auditor's pin: the audit response must include a commitment that hashes to this value. - `verify_path` for the auditor: validates leaf_index < key_count (rejected if out of range), path.len() == ceil(log2(key_count)) (rejected if wrong shape), and recomputes the root from leaf + siblings using left/right ordering derived from leaf_index. Wire-input safe: rejects key_count > MAX_COMMITMENT_KEY_COUNT (1,000,000) and uses checked_next_power_of_two for depth math. 22 unit tests cover: empty tree, single-leaf, two-leaf, deterministic root, every-key path verify across sizes 1..333, tampered bytes_hash, tampered path, wrong leaf_index, out-of-range leaf_index, wrong path length, zero key_count, out-of-protocol key_count (MAX+1 and u32::MAX), duplicate keys, sign+verify roundtrip, signature failures (tampered root, wrong public key, garbage bytes), commitment hash field sensitivity, commitment hash signature-length sensitivity, commitment hash stability. All 514 lib tests pass. cfd clean. 4 rounds of codex (gpt-5-codex high-reasoning) review on the module itself, found and addressed 2 BLOCKERs (commitment_hash was a hand-built concat instead of postcard; path verification needed leaf_index on the wire), 2 MAJORs (odd-node terminology, missing leaf_index bounds check), and 1 round-3 finding (next_power_of_two overflow on untrusted wire input). Round 4 verdict: APPROVE.

Phase 2a (wire-only): extend NeighborSyncRequest/Response and AuditChallenge with optional commitment fields, and add the AuditResponse::CommitmentBound variant. No behaviour changes yet — new fields are `None`/unused everywhere, but every existing test recompiles against the new shape so we know the surface is correct before wiring up responder building and auditor verification. What this commit changes: - NeighborSyncRequest: + commitment: Option<StorageCommitment> - NeighborSyncResponse: + commitment: Option<StorageCommitment> - AuditChallenge: + expected_commitment_hash: Option<[u8; 32]> - AuditResponse: + CommitmentBound { challenge_id, commitment, per_key } All existing call sites pass `None` for the new fields, with a comment explaining "phase 3 will wire this up." Match sites on AuditResponse gain a `CommitmentBound { .. } => panic!("legacy-digest test")` arm so the existing test suite remains exhaustive. Backwards compatibility is preserved via `#[serde(default)]` on every new Option field: old peers' encoded messages decode into `None` on new peers, and new peers' messages encode the new fields as length-prefixed Option which old peers tolerate via postcard's forward-compat behaviour. 514/514 lib tests pass. cfd clean. No regressions.

Phases 2b and 2c of the v12 storage-bound audit design. src/replication/commitment_state.rs — responder side - BuiltCommitment: signed wire blob + cached commitment_hash + Merkle tree + sorted-keys lookup for the per-key leaf_index field. - ResponderCommitmentState: two-slot atomic rotation (current / previous) backed by parking_lot::RwLock<Arc<...>>. lookup_by_hash returns an Arc that keeps the BuiltCommitment alive for the duration of an audit response even if a concurrent rotate drops the slot (v6 §2 / v12 §4 retention). - rotate(new): demotes current to previous, drops the prior previous. The only path that frees previous, enforcing INV-R2. src/replication/commitment_audit.rs — auditor side - verify_commitment_bound_response: pure function implementing v12 §5's four gates in order (cheapest first): structural (per_key length / order / no duplicates / wire-bounded key_count / correct path length), commitment hash pin, ML-DSA-65 signature, and per-key (bytes_hash matches local bytes, leaf rebuilds, Merkle path verifies up to root, audit digest matches nonce-bound BLAKE3). - AuditVerifyError: typed reason for each gate failure so callers can log + apply AUDIT_FAILURE_TRUST_WEIGHT per key consistently. src/replication/commitment.rs - MerkleTree::sorted_keys() exposed so BuiltCommitment can populate its leaf_index lookup without recomputing. Tests - 8 commitment_state tests: empty state, rotate promote/demote, drop oldest after two rotations, lookup finds current+previous, lookup_arc_outlives_subsequent_rotation (v12 §4 invariant), proof builds + verifies under its own root, proof for absent key, hash matches global commitment_hash. - 13 commitment_audit tests covering each AuditVerifyError variant, plus a headline lazy_node_on_demand_fetch_attack_fails test that simulates the v12 Finding-1 attacker: a lazy node receives the challenge, builds a fresh commitment over just the challenged keys, signs it, and replies with that fresh commitment + valid proofs. The pin check (gate 2) rejects. All 535 lib tests pass. cfd clean. No regressions.

… 2d) src/replication/recent_provers.rs — auditor-side cache mapping (key, peer_id, commitment_hash) → "recently proved." The reward / quorum eligibility predicate from v12 §6: P is credited as holder of K only if recent_provers[K] contains an entry for P whose commitment_hash matches P's currently-credited commitment hash. Bounded per key (MAX_PROVERS_PER_KEY = 16 = 2 * CLOSE_GROUP_SIZE), LRU-evicted by proved_at. RT-only by caller contract (this module just stores what it's told. Hash-bound credit is the v12 §6 lever: a peer that rotates their commitment must re-prove every key. API: - record_proof(key, peer, hash, ts) — append or refresh-in-place. - is_credited_holder(key, peer, current_hash) — predicate. - forget_peer(peer) — RT eviction hook. - forget_commitment(hash) — UnknownCommitmentHash invalidation hook (v11/v12 §5: when the auditor invalidates last_commitment[P] because P denied the pin, also drop any cached entries that would silently extend credit). 9 unit tests cover: empty cache, credit under same hash, credit denied under rotated hash (the core v12 §6 property), wrong peer rejected, per-key cap with LRU eviction at MAX+1, refresh-in-place does not grow bucket, forget_peer drops all, forget_commitment drops matching only, lazy-rotation-via-UnknownCommitmentHash drops credit. 544 lib tests pass. cfd clean. EOF )

… tests Wires the final phase-2 piece: a pure function on ResponderCommitmentState that builds a CommitmentBound audit response. Caller looks up record bytes; the helper handles the Merkle proof and the per-key (digest, bytes_hash, leaf_index, path) construction. CommitmentBoundOutcome: - Built { commitment, per_key } — caller wraps in AuditResponse::CommitmentBound. - UnknownCommitmentHash — caller emits Rejected with reason "unknown commitment hash". Auditors classify this per v12 §5 conditional-invalidation rule. - KeyNotInCommitment { key } — responder rotated between gossip and challenge; caller emits a normal Rejected. End-to-end tests in commitment_state: - end_to_end_responder_to_auditor_happy_path: honest responder builds a response that the auditor's verify accepts. - end_to_end_lazy_node_fresh_commitment_substitution_fails: headline v12 Finding-1 attack. A lazy node substitutes a fresh commitment into the response; the pin gate rejects with CommitmentHashMismatch. Plus 4 unit tests for the new helper. 550 lib tests pass. cfd clean.

Closes both findings from the phase-2 codex review: - HIGH: mixed-version backward compatibility was not proven. Added 4 postcard roundtrip tests (old_decoder_tolerates_new_*) that empirically confirm postcard from_bytes is lenient on trailing bytes: an old peer using a v0 struct shape (no commitment / no expected_commitment_hash field) successfully decodes a v1 message that emits None as the new trailing field. Plus a new_peer_roundtrips_with_commitment_some test that catches accidental serde annotation breakage on the new field. - MEDIUM: end_to_end_lazy_node_fresh_commitment_substitution_fails in commitment_state.rs duplicated and overclaimed what the more direct lazy_node_on_demand_fetch_attack_fails in commitment_audit.rs proves. Removed; the happy-path cross-module e2e remains. 553 lib tests pass. cfd clean.

Codex round-2 review correctly flagged: my `#[serde(default)]` + Option-trailing-field strategy for backward compat is NOT actually backward compatible with postcard. Empirical test confirmed: `DeserializeUnexpectedEnd` when a new decoder reads a v0 payload that has no bytes for the trailing field — postcard is strict on struct shape, not lenient like JSON. This commit reverts the wire-type changes from c73da5d (commit/Option fields on NeighborSyncRequest/Response and AuditChallenge, plus the CommitmentBound AuditResponse variant) so phase 2 ships cleanly additive: the four new modules (commitment, commitment_state, commitment_audit, recent_provers) are unchanged and stand on their own with their existing 49 unit + e2e tests. The wire extension will be reintroduced in phase 3 with one of: (a) a protocol-version bump on `ReplicationMessage`, (b) a separate `CommitmentAnnounce` message variant (new ReplicationMessageBody variant — old peers ignore it), (c) length-prefixed extension envelope. Each requires careful bidirectional mixed-version testing. Doing it in phase 3 keeps phase 2 reviewable as a pure foundation. 549 lib tests pass. cfd clean.

tests/poc_commitment_audit_attacks.rs - single canonical test file that maps each Finding-1 attack vector from the original report (notes/security-findings-2026-05-22/01-audit-not-storage-bound.md) to the v12 mechanism that closes it, end-to-end. 13 tests, each named after its attack path: honest_responder_passes_audit_lazy_responder_fails (Path A) fresh_commitment_substitution_rejected_by_pin (Path B) overclaim_via_partial_commitment_yields_no_holder_credit (Path C) responder_drops_old_commitment_after_two_rotations (Path D) audit_response_replay_blocked_by_fresh_nonce (Path E) Finding 2 (bootstrap-claim shield) tests cover the cache-side property that closes it: silent_peer_earns_no_credit rotated_commitment_drops_holder_credit Plus 6 cross-check tests pinning foundational properties so future refactors of commitment_hash / leaf_hash / Merkle / signature do not regress: commitment_hash_is_field_sensitive leaf_hash_binds_key_and_bytes merkle_tree_root_is_deterministic_per_key_set signature_round_trips_correctly wrong_signer_rejected_at_signature_gate each_gate_fires_independently Each test composes the real production code paths from all four commitment-* modules end-to-end. No mocks. The Responder helper wraps ResponderCommitmentState + build_commitment_bound_audit_response; the auditor_verifies fn calls verify_commitment_bound_response directly. 13 PoC tests pass; 549 lib tests still pass. cargo clippy --all-targets --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used is clean. Also added clippy::panic to the existing cfg(test) allow blocks in commitment*.rs so test code using panic on unexpected match arms passes strict clippy.

…codex test gaps Closes the 4 test-coverage findings from codex round-2 test review: BLOCKER: UnknownCommitmentHash conditional-invalidation semantics not covered. v12 §5 says "clear only if stored hash still equals rejected pin." The actual conditional logic belongs at a higher layer (phase 3 auditor coordinator); the building block is RecentProvers::forget_commitment(hash). Added forget_commitment_only_drops_matching_hash to pin its contract: drops cache entries with that specific hash, leaves entries for other hashes intact. MAJOR: Finding-1 Path A (on-demand fetch under ORIGINAL pin) was not modeled. The earlier test only proved fresh-commitment substitution is rejected; it did NOT prove the actual lazy-fetch attack. Added on_demand_fetch_under_original_pin_succeeds_ documenting_v12_limit which explicitly proves the attack PASSES the verifier — because v12 is an economic defence (bandwidth cost per audit), not a cryptographic one. Test docstring documents this as the explicit design limit and serves as a regression marker if anyone claims to "close Path A" without bandwidth economics. MAJOR: cross-peer commitment substitution had no test AND no defence in the verifier. Added gate 2a (peer-identity binding) in commitment_audit.rs: response_commitment.sender_peer_id must equal challenged_peer_id. Caught before signature/pin gates. New typed AuditVerifyError::SenderPeerIdMismatch variant. Test cross_peer_commitment_substitution_rejected_by_sender_id proves the defence: a response carrying peer P's signed commitment but challenging peer Q is rejected at gate 2a before any signature work. MAJOR: overclaim/silent-peer tests were vacuous (only checked empty cache returns false). Rewrote overclaim_via_partial_commitment_end_to_end_no_credit to compose the full responder build path + cache predicate: lazy node commits to key 1 only, auditor challenges key 5, responder returns KeyNotInCommitment, auditor never calls record_proof, cache predicate correctly denies credit. Plus a positive control showing the cache DOES credit when record_proof IS called — making the predicate's denial meaningful, not trivially false. 17 PoC tests now (was 13). 549 lib tests still pass. cargo clippy --all-targets --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used is clean.

…path Codex round-3 of fix-loop flagged that on_demand_fetch_under_original_ pin_succeeds was observationally identical to the happy path because both used the same Responder::build_response helper backed by the 'always returns content(byte)' bytes lookup. Rewrite the test to bypass the Responder helper entirely and instead construct the per-key CommitmentBoundResult by hand from an ALTERNATE bytes source (named neighbour_fetched_bytes_for_key_3) — modelling the fetched-from-neighbour case explicitly. The lazy node: - retains its honest gossiped commitment (state.lookup_by_hash works) - has dropped local bytes for key 3 - constructs the response from fetched bytes - auditor accepts because the bytes_hash of fetched bytes is bit-identical to bytes_hash of stored bytes (the v12 blind spot) An assert_eq!(expected_leaf, from_commitment) before the verifier call explicitly documents the blind spot: leaf_hash(key, bytes_hash) only depends on the bytes themselves, not on where they came from. 17 PoC tests pass. cfd clean.

testnet-plan-storage-commitment-audit.md — phased rollout plan (stage 0 single-node smoke → stage 1 informational → stage 2 enforcement → stage 3 adversarial smoke), pre-deployment checklist, metrics to collect, failure modes to watch, rollback plan. Also adds the security-findings notes that drove this work: 01-audit-not-storage-bound.md 02-bootstrap-claim-audit-shield.md 03-paid-list-attestation-forgery.md 04-single-node-underpayment.md 05-merkle-already-stored-lie.md proposal-gossip-audit-v1.md through v12.md (the design iteration) The deployable surface (phase 1+2) is the four commitment-* modules in src/replication/. Phase 3 wiring (responder tick, gossip piggyback, auditor coordinator, holder-eligibility integration) is documented as the TODO before stage 1 can run.

Copilot

Pull request overview

This PR adds the phase 1+2 foundation for the v12 “storage-bound audit” design: new commitment wire types + Merkle proofs + signature/pin hashing, responder-side commitment rotation state, auditor-side verification logic, and a bounded “recent provers” cache. It also includes PoC threat-model tests that exercise the end-to-end responder→auditor flow and document known limits/attack closures, while explicitly deferring live replication-loop integration to phase 3.

Changes:

Add commitment primitives (Merkle tree, commitment hashing, ML-DSA signing) and responder-side commitment state/response builder.
Add pure auditor verifier for commitment-bound responses with typed failure reasons and cheapest-first gate ordering.
Add bounded per-key holder-eligibility cache plus PoC tests covering Findings 1 & 2 attack paths (and a separate bootstrap-stall PoC test).

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`src/replication/commitment.rs`	New commitment wire types, domain separation, Merkle tree + path verification, commitment pin hash, ML-DSA sign/verify.
`src/replication/commitment_state.rs`	Responder commitment build + two-slot rotation/retention state; builds commitment-bound audit responses.
`src/replication/commitment_audit.rs`	Pure auditor-side verifier with typed `AuditVerifyError` gates (structural/identity/pin/signature/per-key).
`src/replication/recent_provers.rs`	Bounded per-key cache of recent provers with hash-bound holder-credit predicate and invalidation hooks.
`src/replication/mod.rs`	Exposes the new replication submodules.
`src/replication/protocol.rs`	Adds a clarifying note that wire types are not yet extended (phase 3).
`tests/poc_commitment_audit_attacks.rs`	End-to-end PoC tests mapping specific Finding-1/2 attack paths to expected verifier behavior.
`tests/poc_bootstrap_stall.rs`	Documents a bootstrap-drain stall DoS condition as a PoC regression test (no fix included here).
`notes/security-findings-2026-05-22/*.md`	Design history (v1–v12) + testnet plan notes for later phase-3 integration/rollout.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    for (i, result) in response_per_key.iter().enumerate() {
+        // The auditor's local copy of bytes is the ground truth. If the
+        // auditor doesn't hold this key, treat it as a mismatch — we
+        // can't audit what we don't have.
+        let local_bytes =
+            local_bytes_for(&result.key).ok_or(AuditVerifyError::BytesHashMismatch { index: i })?;


+
+/// Domain-separation tag for the commitment signature.
+///
+/// Signed payload is BLAKE3 over (this tag || canonical commitment fields).


+        // failure, which for our fixed-size commitment cannot occur in
+        // practice (ML-DSA-65 signature is 3293 bytes). If it ever
+        // somehow does, surface as a SignatureFailed so callers don't
+        // need a new error variant for an unreachable case.
+        let cached_hash = commitment_hash(&commitment).ok_or_else(|| {
+            CommitmentError::SignatureFailed("commitment serialization failed".to_string())
+        })?;


+    /// `commitment.key_count` exceeds [`MAX_COMMITMENT_KEY_COUNT`] —
+    /// rejected before any hashing.
+    #[error("commitment claims {key_count} keys, exceeds protocol max")]
+    KeyCountOverProtocolMax {
+        /// The claimed (rejected) key count.
+        key_count: u32,
+    },
+    /// A `per_key[i].path` has the wrong length for the claimed


Per user: the auto-upgrade system handles version-breaking changes. Old peers will receive DeserializeUnexpectedEnd when decoding new messages and update via the existing src/upgrade/ flow. So the "defer wire extension to phase 3 because of postcard backward compat" decision is wrong — phase 3 is now: ship the full feature. This reverts the revert in ada62f8 and brings back: - NeighborSyncRequest.commitment: Option<StorageCommitment> - NeighborSyncResponse.commitment: Option<StorageCommitment> - AuditChallenge.expected_commitment_hash: Option<[u8; 32]> - AuditResponse::CommitmentBound { challenge_id, commitment, per_key } All call sites pass None for now; integration happens in subsequent commits. 553 tests pass.

…emit/receive Wires phase-2's commitment modules into the live replication loop, gossip side. ReplicationEngine now owns: - identity: Arc<NodeIdentity> (for signing commitments). - commitment_state: Arc<ResponderCommitmentState> (responder's two-slot current/previous, the responder uses to answer commitment-bound audits). - last_commitment_by_peer: Arc<RwLock<HashMap<PeerId, StorageCommitment>>> (auditor's per-peer "last known commitment", read at audit-issue time). - recent_provers: Arc<RwLock<RecentProvers>> (holder-eligibility cache, wired in the next commit). Background task: - start_commitment_rotation_loop (~10 min cadence) reads LMDB keys, builds a Merkle tree, signs, rotates into commitment_state. For content-addressed chunks bytes_hash == key, so no chunk re-read is needed (the audit-verify path still rechecks bytes_hash against BLAKE3(local_bytes), which for content-addressed equals the key, plus the digest gate which is bound to actual bytes). Gossip emit: - sync_with_peer_with_outcome and handle_sync_request_with_proofs now take an Option<StorageCommitment> param. Callers snapshot the current commitment once per round (cheap parking_lot::RwLock read returning an Arc) and pass it to every peer in the batch — same value across the batch reduces lock churn, identical commitment for the same rotation epoch. - run_neighbor_sync_round threads commitment_state through; the bootstrap sync path in start_message_handler does the same. Gossip receive: - ingest_peer_commitment: on inbound NeighborSyncRequest/Response, verify the peer-identity binding (commitment.sender_peer_id == authenticated source) and store into last_commitment_by_peer. - TODO(phase-3.5): plumb a PeerId → MlDsaPublicKey lookup so we can verify the signature at ingest time and drop forged commitments earlier. Currently we store-without-verify and rely on the audit- verify path to reject at audit time. What's NOT yet wired (next commits): - Audit issue: snapshot expected_commitment_hash from last_commitment_by_peer[challenged_peer] into the AuditChallenge. - Audit response: handle the CommitmentBound variant via the existing verify_commitment_bound_response; record into recent_provers on success. - Holder eligibility (recent_provers.is_credited_holder) threaded into quorum / paid-list / reward decisions. Old peers are allowed to break: the auto-upgrade system handles version-breaking changes (per Chris's PR WithAutonomi#112 musl-swap test that validated cross-version upgrade across 157 services). 553 lib tests pass. cfd clean (2 pedantic warnings, no errors).

The responder side of the v12 storage-bound audit is now live: when the auditor's challenge carries expected_commitment_hash: Some(h), the responder calls build_commitment_bound_audit_response from commitment_state, which looks up h in current/previous and produces a CommitmentBound response with per-key (digest, bytes_hash, leaf_index, path). handle_audit_challenge_with_commitment: - New entry point. Takes Option<&Arc<ResponderCommitmentState>>. - If commitment_state and expected_commitment_hash are both Some: pre-loads each challenged-key bytes from storage (sync closure, async storage — bounded by sqrt-scaled sample size), calls the v12 §4 build_commitment_bound_audit_response, and returns CommitmentBound / Rejected accordingly. - Otherwise: legacy plain-digest path (unchanged). handle_audit_challenge (the original entry point): kept for backwards compatibility, forwards to the new function with commitment_state = None. handle_audit_challenge_msg (orchestrator in mod.rs): now passes my_commitment_state through, so when the auditor sends a pinned challenge the responder can answer it. What this means at runtime: - Today's auditor (no expected_commitment_hash) is unaffected. - A future auditor that sends pinned challenges will get CommitmentBound responses from upgraded peers and Rejected / legacy from peers we can't match. What's NOT yet wired (next commit): - Auditor-side enablement: snapshot expected_commitment_hash from last_commitment_by_peer[challenged_peer] into the AuditChallenge, and handle the CommitmentBound response variant via verify_commitment_bound_response. Requires threading last_commitment_by_peer + a PeerId → MlDsaPublicKey lookup into audit_tick_with_repair_proofs. Plus recent_provers integration. 553 lib tests pass. cfd has 4 pedantic warnings (no errors).

The v12 storage-bound audit is now end-to-end. Previous commits shipped the responder side (gossip emit, rotation tick, commitment-bound answer dispatch); this commit wires the auditor side so the network actually enforces the commitment-bound flow when both peers run this version. Wire change: embed sender_public_key in StorageCommitment - Add sender_public_key: Vec<u8> (1952 bytes for ML-DSA-65) to StorageCommitment. Bound by the signature payload so a swap-the-key attack fails: the signed payload now binds (root, key_count, peer_id, pubkey), and an adversary keeping the body must produce a forgery under a key they don't hold. - verify_commitment_signature(c) takes the embedded key directly; no external PeerId-to-MlDsaPublicKey lookup is needed. Old peers using the prior 4-field commitment will fail to decode; auto-upgrade (per Chris's PR WithAutonomi#112) handles this. - verify_commitment_signature_with_key(c, pk) kept for tests where we want to assert a specific key did or did not sign. Auditor enablement: CommitmentAuditCtx + audit_tick_with_repair_proofs - New CommitmentAuditCtx<'a> bundles &last_commitment_by_peer and &recent_provers so audit_tick stays callable from both the legacy and commitment-bound paths without ballooning the parameter list. Passing None keeps today's plain-digest behaviour; passing Some opts the auditor in on a per-peer basis (no entry in last_commitment_by_peer still falls back to plain digest). - audit_tick_with_repair_proofs now: 1. Snapshots expected_commitment_hash from last_commitment_by_peer when the ctx is provided and we have a recent gossiped commitment for the challenged peer. Pins the challenge to that hash. 2. Handles the AuditResponse::CommitmentBound { commitment, per_key } variant via the new verify_commitment_bound helper, which calls the existing pure verifier verify_commitment_bound_response with pre-loaded local bytes (sync closure over an async storage read). 3. On verify success: records each verified (peer, key, pin) into recent_provers so downstream code can credit the peer as a real holder (the next-PR work for quorum / paid-list integration). 4. On AuditResponse::Rejected { reason: "unknown commitment hash" }: conditionally drops the stale entry from last_commitment_by_peer (only if it still matches the rejected pin), so the next audit either picks up the freshly gossiped commitment or falls back to the plain-digest path (v12 paragraph 5 conditional invalidation rule). ingest_peer_commitment now verifies at gossip time - With the embedded pubkey, signature verification at gossip ingest is now free of any external lookup. ingest_peer_commitment calls verify_commitment_signature(c) and drops forged commitments at the edge instead of relying on audit-time verification. Tests - All 17 PoC threat-model tests in tests/poc_commitment_audit_attacks.rs still pass against the embedded-key flow. wrong_signer_rejected_at_ signature_gate adapted: instead of passing a wrong pubkey to verify, swap the embedded pubkey on the response commitment (and re-pin to isolate the signature gate from the pin gate). - commitment_hash_is_field_sensitive extended to mutate sender_public_key. That field must change the hash like all others. - 554 lib tests pass (+1 from extending the signature-roundtrip suite). - cfd is warning-only; deny gates clean. What's still NOT in this PR (later work) - Threading recent_provers.is_credited_holder into quorum / paid-list / reward decisions. The cache populates correctly now, but no consumer reads it yet. That's the next focused PR.

dirvine · 2026-05-26T10:38:20Z

Architectural Review — PR #113: Storage-Bound Audit (v12)

Anselme, this is a substantial and well-structured piece of work. I've reviewed the full diff (~7k lines), the design document (v12), the PoC tests, and the CI results. Here's my assessment.

Design: Correct

The v12 design correctly closes Finding 1 (lazy-node audit defeat) and Finding 2 (bootstrap-claim shield). The key mechanism — pinning audit challenges to a gossiped-to-you commitment hash — means a lazy node must have had the bytes at gossip time to build a valid Merkle inclusion path. The conditional invalidation on UnknownCommitmentHash (the v10→v11→v12 fix) correctly handles the honest-rotation race.

The staged rollout (responder first, auditor follow-up) is safe because expected_commitment_hash: Option<[u8; 32]> defaults to None, keeping the legacy plain-digest path intact.

CI: Must Fix Before Merge

The e2e tests don't compile. This is a blocker:

File	Error	Location
`tests/e2e/replication.rs`	`AuditChallenge` missing field `expected_commitment_hash`	Lines 392, 442, 864
`tests/e2e/replication.rs`	`NeighborSyncRequest` missing field `commitment`	Lines 804, 1253, 1400
`tests/e2e/replication.rs`	`ReplicationEngine::new` missing `NodeIdentity` argument	(line ~?)
`tests/e2e/replication.rs`	`NeighborSyncResponse` missing field `commitment`	Line ?

These all need None for the optional fields and the new identity arg. Also:

cargo fmt --check fails in audit.rs (lines 570, 612) and mod.rs (lines 1002, 1150).
Clippy: first doc comment paragraph is too long at src/replication/audit.rs:557 — the docstring on handle_audit_challenge_with_commitment needs a blank line after the first sentence.
Documentation build also fails (likely the same doc comment issue or a similar one).

Security

1. ingest_peer_commitment stores without signature verification (phase-3.5 TODO)

The comment explains the rationale — the audit-time verifier catches forged commitments — but this creates a DoS surface: an attacker can inject fake commitments for honest peers into last_commitment_by_peer. Once the auditor is wired, an honest node whose commitment hash was poisoned will fail commitment-bound audits (signature won't match). Consider adding signature verification on ingest now behind a feature flag, or at least rate-limiting per-source peer.

2. last_commitment_by_peer: HashMap<PeerId, StorageCommitment> has no eviction policy

If a peer leaves the routing table, their commitment lingers forever. This could grow unbounded over churn. Consider either:

A TTL on entries (e.g. 2 × COMMITMENT_ROTATION_INTERVAL)
Hook into the RT eviction path to remove() the peer

3. RecentProvers: no TTL eviction

The PR mentions TTL-evicted in the v10 design doc but the implementation only has LRU-by-cap eviction. Without a time-based expiry, entries linger until the per-key cap (16) is hit. If a key is rarely audited, stale entries may persist for the auditor's lifetime. Add a TTL check in is_credited_holder or a background sweep.

Code Quality

Generally excellent. Particular strengths:

Clean module boundaries: commitment.rs (wire + Merkle + sig), commitment_state.rs (rotation + lookup), commitment_audit.rs (pure verifier), recent_provers.rs (cache). Easy to review individually.
Cheapest-first gates in verify_commitment_bound_response: structural checks before any hashing, hash pin before sig verify, sig verify before per-key loop.
Domain separation constants cover every hash context (commitment sig, commitment hash, leaf, internal node).
Backward-compat wire tests (old_decoder_tolerates_new_*) pin the postcard-trailing-bytes invariant.
17 PoC threat-model tests document each attack path and its closure — excellent for future maintainers.

Minor style: commitment_hash returns Option<[u8; 32]> when it could return a result. The comment says "callers may safely treat None as malformed" — a Result<_, Infallible> or a panic-safe expect in practice is fine but the Option forces a .ok_or_else on every construction site.

Recommendations

Fix the CI failures (e2e compile, fmt, clippy, docs) before merge — these are trivial.
Add signature verification on ingest now (phase 3, not phase 3.5). The public key is derivable from the peer ID in PeerInfo in the routing table; the code already has the path to look it up.
Add eviction to last_commitment_by_peer — at minimum hook into RT removal.
Add TTL eviction to RecentProvers to complement the LRU-by-cap.
Document the bandwidth impact of the ~3 KB ML-DSA-65 piggyback on NeighborSync (every 10 min × |RT| peers). It's negligible, but worth calling out.

Summary: Strong design, clean implementation, well-tested foundation. The CI issues need fixing before merge. With the ingest-verification and eviction gaps closed, this is ready for review and stage-0 deployment.

BLOCKER #1: honest commitment-rotation no longer punished - Rejected(unknown commitment hash) is the v12 paragraph 5 conditional- invalidation recovery path: the peer simply rotated past our pin. Previously we still called handle_audit_failure, which emits a trust event. Now we drop the stale entry from last_commitment_by_peer (only if it still hashes to the rejected pin, to tolerate fresh gossip arriving mid-flight), forget any credit anchored to the stale pin via recent_provers.forget_commitment, and return Idle. No trust penalty. BLOCKER WithAutonomi#2: streaming per-key verification removes memory-DoS vector - The pure verifier verify_commitment_bound_response preloaded every challenged chunk into memory. At sqrt-scaled sample sizes (1000 keys at 1M stored) and 4 MiB chunks, a single audit could push the responder + auditor toward multi-GB allocations. Now split into: - verify_commitment_bound_metadata: gates 1, 2a, 2b, 3 (one-shot, cheap). - verify_commitment_bound_per_key: gate 4 (per-key bytes_hash + path + digest), called once per key. The auditor (audit.rs verify_commitment_bound) streams one chunk at a time via storage.get_raw, runs gate 4, drops the bytes. Peak memory is bounded at MAX_CHUNK_SIZE (4 MiB) regardless of sample size. The legacy verify_commitment_bound_response is kept as a thin wrapper (still used in tests). MAJOR #1: ingest commitments from NeighborSyncResponse and bootstrap - ingest_peer_commitment was only invoked on inbound request handling, not on outbound sync responses. For peers we only see on the response path, audits were silently stuck on the legacy digest flow. Now both handle_sync_response and the bootstrap-sync loop call ingest_peer_commitment with resp.commitment. Threading required passing last_commitment_by_peer through start_neighbor_sync_loop, run_neighbor_sync_round, handle_sync_response, and start_bootstrap_ sync. MAJOR WithAutonomi#2: enforce peer_id == BLAKE3(embedded_pubkey) at every gate - Without this binding, a responder could sign with a throwaway key whose secret they hold and lie about which identity it belongs to; the embedded-key signature would verify trivially. saorsa-core derives PeerId as BLAKE3(pubkey_bytes), so the check is a single hash + compare. - Applied in two places: 1. verify_commitment_bound_metadata (auditor): new gate 2c, runs after pin gate, before signature gate. Returns SenderPeerIdMismatch on failure (same error variant as gate 2a; callers don't need to distinguish). 2. ingest_peer_commitment (gossip receive): rejects forged commitments at the edge. Test coverage - New PoC test throwaway_key_substitution_rejected_by_pubkey_binding exercises the attack against the full auditor flow. - All existing PoC + lib tests updated to derive peer_id from the responder's pubkey (matching production saorsa-core behaviour). Responder::new(_peer_byte) keeps the parameter for source-compat but no longer respects it — peer identity is fully derived. - wrong_signer_rejected_at_signature_gate: now swaps both the embedded pubkey AND sender_peer_id (so gate 2c passes), plus rebuilds the per-key digest under the new peer_id (so gate 4 doesn't trip first), to isolate the signature gate as the only failure path. - 554 lib tests + 18 PoC tests pass. - cfd warning-only; deny gates clean.

…leanup BLOCKER: malicious-peer bypass via free-form rejection text - Round-5 fix gated the no-trust-penalty "honest rotation" branch on a substring match of the rejection reason. A malicious peer could trivially send `reason: "unknown commitment hash"` on ANY challenge (including legacy unpinned ones) to dodge audits. Worse, on pinned audits it would drop the stored pin, pushing the next audit back to the weaker plain-digest path. - Tightened to: 1. `expected_commitment_hash.is_some()` (the auditor MUST have issued a pinned challenge — legacy unpinned audits cannot trigger this branch). 2. Exact-string match (`reason == "unknown commitment hash"`, not `contains`). - Round-5 PoC test honest-rotation path still passes because the responder helper emits the exact reason string; round-6 attack vector is closed because on an unpinned challenge the gate fails and we fall through to handle_audit_failure. MAJOR: bounded `last_commitment_by_peer` + churn cleanup - The auditor's per-peer commitment cache had no eviction. A sybil / churn attacker could leave behind one full StorageCommitment per identity indefinitely (each ~5 KiB: 1952-byte pubkey + 3293-byte signature + small fields). - Two-line defence: 1. PeerRemoved DHT event now drops the peer's entry from last_commitment_by_peer AND its recent_provers credits, matching the existing repair_proofs cleanup. 2. Hard cap MAX_LAST_COMMITMENT_BY_PEER = 4096 (~20 MiB worst-case). On insert when at cap, evict one arbitrary existing entry (HashMap iter order; sufficient because PeerRemoved keeps the working set anchored on the real RT). Updates for peers already in the map always replace and never trigger eviction. Tests - 554 lib tests pass. - 18 PoC tests pass. - cfd warning-only; deny gates clean.

MAJOR: off-RT sybils could churn the cache - Round-6 added a 4096-entry cap on last_commitment_by_peer + PeerRemoved cleanup. But ingest_peer_commitment still admitted any authenticated sender. An off-RT flood could fill the cap and evict honest peers, silently demoting their next audits to the legacy plain-digest path. - Fix: ingest_peer_commitment now drops the commitment if the source is not in our DHT routing table. Mirrors the existing `sender_in_rt` gate in handle_sync_request_with_proofs (which guards inbound replication hints). Off-RT senders cannot populate the cache, so cap eviction only fires under real RT churn (which PeerRemoved would have caught anyway). - All three callers updated (request handler, response handler, bootstrap loop) to thread `&p2p_node` through. MINOR: doc consistency - MAX_LAST_COMMITMENT_BY_PEER doc previously said "oldest by insertion order"; the code uses HashMap iter order which is unspecified. Doc updated to match implementation + explain why arbitrary eviction is sufficient (RT gate + PeerRemoved cleanup). Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.

MAJOR: free audit bypass via "unknown commitment hash" rejection - Round-7 (and prior) dropped the cached pin when a peer responded Rejected("unknown commitment hash"), returning Idle with no penalty. This was meant to handle honest rotation, but the responder's two-slot retention (current + previous) means "unknown" only happens if we are at least TWO rotations behind their gossip. The fix path was actively harmful: - Honest peer: gossip already rotates ~every 10 min, so the cache will be naturally refreshed via the existing gossip ingest path within one or two rotation windows. We don't need to drop the pin ourselves. - Malicious peer: under round-7, dropping the pin meant the NEXT audit had no pin, fell back to the legacy plain-digest path, and the on-demand-fetch attack reopened. A peer could trigger this on every challenge to permanently avoid commitment-bound audits. - Fix: on unknown-commitment-hash with a pinned challenge, log and return Idle (one wasted tick) but KEEP the pin and KEEP recent_ provers credits intact. Honest rotation self-resolves via gossip; malicious "unknown" loops keep failing pinned audits until either the operator notices or fresh gossip replaces the entry. No more free fallback to the weaker legacy path. - Strict gating from round-6 retained: `expected_commitment_hash. is_some()` ensures legacy unpinned challenges can't trigger this branch at all. MINOR: ingest_peer_commitment docstring inaccuracy - Previously claimed the signature is verified under "a public key derived from source.as_bytes()". The actual flow: - Source must be in our routing table. - sender_peer_id must equal source.as_bytes() (peer-id binding). - BLAKE3(sender_public_key) must equal sender_peer_id (gate 2c). - Signature verifies under the embedded public key (which is bound by the signature payload). - Doc rewritten to enumerate all five gates with their purpose. Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.

…g responder MAJOR #1: pinned-challenge contract enforcement - When the auditor pins a commitment hash into a challenge, the responder MUST answer with CommitmentBound (or Bootstrapping / Rejected for legitimate reasons). Previously, a Digests response to a pinned challenge was accepted and verified via the legacy plain-digest path. A peer that had already gossiped a commitment could ignore the storage-bound flow and pass via on-demand fetch under the weaker verifier. - Fix: in audit_tick_with_repair_proofs, the Digests arm now rejects the response as MalformedResponse when expected_commitment_hash. is_some(). Same handle_audit_failure path as other contract violations. MAJOR WithAutonomi#2: streaming responder (peak memory at one chunk) - The responder dispatch in handle_audit_challenge_with_commitment still preloaded every challenged chunk into a HashMap, then cloned each one into the per-key result. With max_incoming_audit_keys scaling as 2 * sqrt(stored_chunks) and chunks up to 4 MiB, this was an O(sample * chunk) memory spike per request — a viable memory-DoS vector on large nodes. - Refactor: two new helpers in commitment_state.rs: - precheck_commitment_bound_challenge: looks up the commitment + verifies every challenged key is covered, WITHOUT reading any chunk bytes. Returns the matched commitment Arc. - build_commitment_bound_result_for_key: builds one per-key entry given the pre-checked commitment + that key's bytes. - The responder dispatch now: prechecks, then iterates keys, reads one chunk via storage.get_raw, builds the entry, drops the bytes. Peak memory bounded at MAX_CHUNK_SIZE (4 MiB) regardless of sample size. Matches the streaming pattern the auditor side already uses. - Legacy build_commitment_bound_audit_response is kept as a thin wrapper (still used in tests). Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean. A future PR with a 2-node e2e harness should add a regression test for the digests-to-pinned bypass; the current PoC suite tests the pure verifier in isolation and can't exercise the dispatcher loop.

… signal MAJOR #1: rotation cadence outran gossip refresh - Rotation was every 10 min, but neighbor-sync cooldown is up to 1 h per peer. Result: a remote auditor's cached pin could routinely point at a commitment we rotated past 2+ times, and our two-slot retention (current + previous) wouldn't cover it. Pinned audits then hit "unknown commitment hash" -> Idle no-op repeatedly until the next gossip arrival, degrading the storage-bound flow to effectively no-op for that auditor. - Fix: rotate every 1 h instead of 10 min. With two-slot retention that gives ~2 h of validity per commitment, comfortably covering the worst-case gossip lag. The v12 pin is bound to a point-in-time commitment, so rotation cadence isn't security-critical for pin freshness — only for keeping the committed key set current as the responder writes new keys. 1 h is plenty for that. MAJOR WithAutonomi#2: commitment-downgrade observable, not just stalling - A peer that gossiped a commitment once but then stops gossiping commitments (sends `commitment: None`) is trying to downgrade back to the legacy plain-digest audit path. Pre-fix: the None case silently returned false; only stalled audits were observable. - Fix: when a peer present in last_commitment_by_peer sends a None commitment, log at warn-level so operators can correlate downgrade attempts with audit-failure metrics. Cached entry is KEPT so subsequent pinned audits still apply (the peer must either rotate forward via gossip or accumulate audit failures via the "unknown commitment hash" path). - Trust-event integration is left as a follow-up: the wiring path from ingest_peer_commitment to the trust engine is non-trivial and warrants its own PR with a clear penalty curve. Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.

…n staleness MAJOR #1: retention window too narrow for worst-case gossip lag - Round-10 bumped rotation to 1h, but two-slot retention only covers ~2h. Realistic neighbor-sync staggering (batches of 4, 10-20 min between rounds, 1h cooldown) can produce 3+ hour gaps between gossip refreshes of the same peer pair. Honest auditors pinned to rotated-out commitments would then hit "unknown commitment hash" -> Idle no-ops indefinitely. - Fix: bump retention from 2 slots to RETAINED_COMMITMENT_SLOTS = 4. With 1h rotation that gives ~4h of pin validity, comfortably covering the worst-case gossip lag. Memory cost is bounded: at 10k keys per commitment, the four-slot buffer is ~2.6 MB. - Refactored ResponderCommitmentState: replaced the `current` / `previous` field pair with a `slots: Vec<Arc<BuiltCommitment>>` newest-first. rotate() prepends + truncates; lookup_by_hash scans all slots (still O(slots) which is tiny). External API (current(), lookup_by_hash, rotate) unchanged. MAJOR #2a: rotation didn't fire until first interval elapsed - After process start, current() returned None until the first 1h sleep completed. During that hour, the responder couldn't answer any commitment-bound audits — every challenge silently fell back to the legacy plain-digest path. - Fix: rebuild_and_rotate_commitment runs ONCE immediately on startup, before the sleep loop. First commit is available within seconds of startup. Subsequent rotations follow the regular 1h cadence. MAJOR #2b: "key not in commitment" wrongly counted as failure - A key recently replicated to the responder (via fresh-write hint) won't appear in the responder's commitment until the next 1h rotation. An auditor sampling that key and challenging the responder would get Rejected("key not in commitment: ..."), and the auditor would fire handle_audit_failure → trust penalty. But the responder actually HAS the bytes; only its Merkle tree is stale. - Fix: in audit_tick_with_repair_proofs, "key not in commitment" on a pinned challenge is treated as Idle (no penalty), same policy as "unknown commitment hash". Both are benign staleness signals; the next rotation will refresh the responder's tree. Strict gating retained: only applies when we DID pin, so a legacy unpinned audit cannot be bypassed. Tests - Updated commitment_state.rs unit tests for the new 4-slot retention semantics. - Updated PoC test `responder_drops_old_commitment_after_two_rotations` → renamed to `..._past_retention_window` and now rotates RETAINED_COMMITMENT_SLOTS+1 times. - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.

… missing-bytes penalty + post-restart re-gossip CODEX ROUND-12 MAJOR #1: "key not in commitment" conflated benign staleness with real storage loss - The responder emitted Rejected("key not in commitment: ...") in two different situations: (a) the key was never in the commitment (just-replicated, awaiting next rotation) — benign; (b) the key WAS in the commitment but the responder cannot read the bytes anymore — real storage loss / withholding. Round-11 made the auditor treat both as Idle (no penalty), which meant case (b) escaped audit penalty entirely. - Fix: differentiate at the responder. Case (b) now emits Rejected reason "missing bytes for committed key: ..." and the auditor's benign-staleness branch only matches "key not in commitment", so case (b) falls through to handle_audit_failure with full penalty. MAJOR WithAutonomi#2: ML-DSA-65 signatures are randomized → pin doesn't survive restart - Commitment hash includes the signature, so rebuilding the same key set after restart produces a different hash. Pre-restart pinned audits then hit "unknown commitment hash" -> Idle until fresh gossip arrives — up to 1 h with the round-10 rotation cadence, during which time the node dodges commitment-bound audits. - The right fix would be to persist commitments to disk on rotate, but that's a meaningful change. Pragmatic alternative: after the first commitment is built on startup, trigger an immediate neighbor-sync round. The new commitment then gossips out within seconds, shrinking the recovery window from hours to sub-minute. DAVID'S PR REVIEW (round-12) MAJOR: RecentProvers lacked TTL eviction - The cache had per-key LRU-by-cap eviction but no time-based expiry. A rarely-audited key could keep stale entries indefinitely (until cap pressure evicts them). - Fix: add PROVER_ENTRY_TTL = 4h (4× the rotation interval). is_credited_holder ignores entries older than the TTL on read; new sweep_expired() reclaims their memory and is called once per rotation tick from the engine (1h cadence). MINOR: bandwidth impact undocumented - Added a "Wire size" section to StorageCommitment's docstring: ~5.3 KiB per commitment (32+4+32+1952+3293 bytes), gossiped on every NeighborSyncRequest/Response. With a close-group of 8 and bidirectional sync at the 1h rotation cadence, that's ~85 KiB/h per node — negligible against chunk-transfer bandwidth. Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean. David's other points (e2e compile failures, signature verification on ingest, last_commitment_by_peer eviction, fmt/clippy/docs) were all addressed in codex rounds 5-7. The remaining items from his review are this commit's TTL + bandwidth doc.

…credit, rate limit, §3 shield The v12 design (notes/security-findings-2026-05-22/proposal-gossip-audit-v12.md) is now fully implemented per its §§2-6 checklist: §2 step 5 — sticky `commitment_capable` flag - New `PeerCommitmentRecord` struct replaces the bare `HashMap<PeerId, StorageCommitment>`. Holds last_commitment + commitment_capable (sticky bool) + received_at + last_sig_verify_at. - Once a peer gossips a valid commitment, capable flips to true and never reverts. Even if we evict the cached commitment via TTL, sybil cap, or restart, the peer is forever treated as v12-capable until full PeerRemoved cleanup. §2 step 3 + §11 DoS — per-peer 60s sig-verify rate limit - `COMMITMENT_SIG_VERIFY_MIN_INTERVAL = 60s` caps ML-DSA verify cost per peer. Checked after cheap structural gates (RT, peer-id binding, pubkey-binding) and before the expensive sig verify. A sybil that bypasses the RT gate (transient bucket pollution) can no longer burn CPU with a flood of valid-looking gossips. §3 — bootstrap-claim shield: refuse legacy fallback for capable peers - audit_tick_with_repair_proofs now checks the peer record up front: if commitment_capable but no cached commitment, return Idle. The peer is fully expected to speak v12; falling back to legacy plain-digest would let them downgrade. We wait for fresh gossip to refresh the cache instead. §5 (v12) — restored conditional invalidation - Round-8 kept the pin unconditionally on Rejected("unknown commitment hash") to prevent the legacy-fallback bypass. Now that §3 (above) closes the bypass directly, we can implement the v12 design verbatim: - Case 1 (lazy rotation): stored hash == rejected H → clear pin + forget_commitment(H). recent_provers entries lose their match basis → §6 holder-credit drops. Lazy node earns nothing. - Case 2 (honest rotation race): stored hash != H (fresh C2 arrived in-flight) → leave it alone. Don't clobber. - Case 3 (stale auditor): same as case 1; clear pin, wait for fresh gossip. §6 — holder-eligibility threaded into quorum - New `evaluate_key_evidence_with_holder_check` variant takes a predicate `(peer, key) -> bool`. Returning false downgrades a Present claim to Unresolved (we don't trust "I have it" without a recent commitment-bound audit). Paid-list evidence is independent (it's a property of the receiving peer's own data, not a Present claim). - Wired into `run_verification_cycle` via `VerificationCycleContext`: snapshots last_commitment_by_peer + recent_provers once per cycle (cheap; bounded by RT × 16/key) and evaluates each key against the snapshot. Synchronous predicate avoids re-entering the locks during evaluate. Tests - 3 new quorum tests: - quorum_downgrades_uncredited_present_peers - quorum_passes_when_all_present_peers_are_credited - paid_list_path_unaffected_by_holder_credit - 2 new PoC tests: - commitment_capable_flag_is_sticky_across_eviction - capable_but_no_commitment_starts_capable - 557 lib + 20 PoC pass. - cfd warning-only; deny gates clean. What's still NOT in this PR (legitimately out of scope) - Disk persistence of commitments. ML-DSA signatures are randomized, so the commitment hash changes across restart even for the same key set. Mitigated by the round-12 immediate post-startup gossip trigger (recovery window measured in seconds, not hours). Disk persistence is a clean follow-up optimization. - Pre-rotation event-driven rebuild on fresh-write. A just-replicated key is currently auditable only after the next 1h rotation. The auditor treats this as benign staleness (round-11 + round-12); no trust penalty. Event-driven rebuild on fresh-write would close the gap but adds wiring complexity for marginal gain.

…ect §6 TTL MEDIUM #1: rate limit only fired for peers with cached commitments - Round-12 tracked last_sig_verify_at inside PeerCommitmentRecord, but that record is only created after a successful verify. A peer we've never successfully verified could still burn ML-DSA cost on every invalid-but-structurally-plausible gossip. - Fix: new `sig_verify_attempts: HashMap<PeerId, Instant>` map, separate from last_commitment_by_peer. Stamped BEFORE the verify on EVERY attempt (success or failure). Reading + writing happens before the expensive verify, so a flood is rejected at the cheap hashmap-lookup step. Capped at MAX_LAST_COMMITMENT_BY_PEER with oldest-timestamp eviction, and dropped on PeerRemoved (same cleanup pattern as the commitment cache). - Threaded through ingest_peer_commitment → handle_replication_message → handle_sync_response → run_neighbor_sync_round → start_*_loop spawn-scope clones (3 call sites total). MEDIUM WithAutonomi#2: §6 TTL was 4h, design says 2 × max audit interval (40 min) - 4h kept holder credit alive much longer than v10/v12 §6 specifies, weakening "must re-prove under current conditions". Default max audit interval is 20 min → TTL = 40 min. - Fix: PROVER_ENTRY_TTL bumped from 4h to 40m. Doc updated to cite the v10/v12 spec line directly. Tests - 557 lib + 20 PoC pass (no test changes needed). - cfd warning-only; deny gates clean.

Round-13 used separate read + write locks for the check and the stamp. Two concurrent ingest_peer_commitment calls from the same peer could both miss the rate-limit check and both reach ML-DSA verify within the 60s window. Fix: combine into a single write-locked critical section. Read existing timestamp, compare, then insert under the same lock. The lock is held only for a hash-map lookup + insert (microseconds), never across the expensive verify itself. Tests - 557 lib + 20 PoC pass. - cfd warning-only; deny gates clean.

grumbach added 11 commits May 26, 2026 16:44

Copilot AI review requested due to automatic review settings May 26, 2026 09:24

Copilot started reviewing on behalf of grumbach May 26, 2026 09:24 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

grumbach added 3 commits May 26, 2026 18:29

grumbach changed the title ~~feat(replication): storage-bound audit foundation (v12 phase 1+2)~~ feat(replication): storage-bound audit — foundation + responder fully wired May 26, 2026

grumbach marked this pull request as draft May 26, 2026 10:19

grumbach added 8 commits May 26, 2026 19:44

grumbach changed the title ~~feat(replication): storage-bound audit — foundation + responder fully wired~~ feat(replication): v12 storage-bound audit — end-to-end May 26, 2026

grumbach added 3 commits May 26, 2026 21:16

grumbach changed the title ~~feat(replication): v12 storage-bound audit — end-to-end~~ feat(replication): v12 storage-bound audit — design-complete May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(replication): v12 storage-bound audit — design-complete#113

feat(replication): v12 storage-bound audit — design-complete#113
grumbach wants to merge 26 commits into
WithAutonomi:mainfrom
grumbach:grumbach/storage-commitment-audit

grumbach commented May 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

dirvine commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

grumbach commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design checklist (v10 §§1-13 + v12 §5)

What this PR ships

Security review

Threat coverage (20 PoC tests in tests/poc_commitment_audit_attacks.rs)

Tests + CI

Test plan

Out-of-scope (future PRs)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

dirvine commented May 26, 2026

Architectural Review — PR #113: Storage-Bound Audit (v12)

Design: Correct

CI: Must Fix Before Merge

Security

Code Quality

Recommendations

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

grumbach commented May 26, 2026 •

edited

Loading

Threat coverage (20 PoC tests in `tests/poc_commitment_audit_attacks.rs`)