fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail by sbryngelson · Pull Request #1526 · MFlowCode/MFC

sbryngelson · 2026-06-02T00:16:48Z

Summary

Makes ./mfc.sh fp-stability (the Verrou-based FP-stability harness) trustworthy and self-explanatory: dd_line hotspots are verified and ranked; the report leads with where catastrophic cancellation originates and how badly (in decimal digits lost); precision sensitivity is kept clearly separate from cancellation origin; and pass/fail is scale-free, retiring the per-case hand-tuned thresholds.

Toolchain-only — no Fortran/solver and no build-system changes (CMakeLists.txt / build.py are untouched relative to master). gfortran is the fp-stability CI compiler.

Motivation

dd_line reported a flat list of source lines as confident "hotspots." Several ways that misled:

No check that a reported line actually reproduces the instability.
The headline "most flagrant" line was a sensitivity leader (the time integrator / final accumulation), conflated with where cancellation originates — e.g. the SSP-RK update (a convex combination) topped the list while the real cancellation lives in WENO/Riemann/EOS.
Cancellation was ranked by site count, but one site losing 14 digits outweighs many mild ones; "bits lost" also isn't intuitive.
Attribution could land on a &-continuation fragment, so the labelled line was unclear.
Per-case pass/fail used hand-tuned absolute thresholds spanning six orders of magnitude.

What it does now

Verification + ranking. A positive control perturbs only the suspect lines (Verrou --source, matching file+line+symbol via --gen-source); lines that don't reproduce the deviation are downgraded ::warning::→::notice:: (unconfirmed). Lines are ranked by their share of the single-precision deviation.

Sensitivity vs cancellation-origin. The report leads with cancellation origins (where ill-conditioning arises), ranked worst-first by decimal digits lost (a sweep of --cc-threshold-double levels buckets each site by the highest it survives; bits → digits, since a double carries ~16). Each entry shows the full statement (continuation fragments resolved to their logical-statement start). The --source sensitivity ranking — dominated by the (benign) time integrator — is relabelled as such and collapsed into a <details>, so it's no longer mistaken for a culprit-finder.

Scale-free pass/fail. The 6 hand-tuned per-case thresholds are gone. A case passes iff it retains sig_bits = -log2(max_dev / max|ref|) ≥ 24 (single-precision accuracy) — normalizing by field scale collapses the six-order spread into a tight 30–48 bit band, so one global floor suffices.

Honesty. Truncated lists report "…and N more"; a coverage caveat notes these are N 1-D cases (not a guarantee for unexercised multi-D/viscous/MHD/IGR/bubble paths); hotspots inside a #:for/#:def expansion are flagged as instance-ambiguous.

Validation (real MFC, `sod_standard`)

Cancellation origins (digits lost): the worst are the physically-expected ones — flux divergence m_rhs:921, divided differences m_weno:1090–1093, HLLC wave speeds m_riemann:3069/3100 — each losing ~14 of 16 digits. The SSP-RK time integrator is correctly not among them.
Sensitivity: m_time_steppers.fpp:510 = 100% (the integrator — expected, labelled, collapsed).
Sig-bits pass/fail: all 6 cases pass at 30.3–48.7 bits (floor 24); max_dev spans 1e-14..7e-8 but bits-retained is tight.
57 toolchain unit tests (TDD for all pure helpers); ruff + format clean; ./mfc.sh precheck all 7.

Code organization

The feature grew fp_stability.py past the 1000-line guideline, so it was de-duplicated and split (pure relocation, no behavior change) into: fp_stability_metrics.py (leaf: pure parsing/metrics), fp_stability_runners.py (Verrou orchestration), fp_stability_report.py (emitters), and fp_stability.py (CLI entry, CASES, _run_case) — each under the guideline, no import cycles.

History note: an intermediate commit added a precision-build-based per-instance disambiguation ("Tier 2"), which a later commit removes in full — it was the heaviest piece (its own module + a build flag + CMake plumbing) for the narrowest trigger, so it was dropped to keep the PR toolchain-only. The add/remove pair is visible in the commit list; the net diff contains none of it.

Caveats

The sig-bits floor is a flat regression bar: a clean case degrading from ~47 to ~25 bits would still pass. Auto-measured per-case baselines + a regression delta would add that sensitivity — deferred.
Per-site "digits lost" is a sweep lower bound (no false positives, may under-count). High local cancellation in flux/reconstruction differences is often by-design (the small difference is the physics) and only harms the result if it propagates — which the sensitivity cross-reference captures.
Job pass/fail fails a case only when it drops below the bit floor; the sensitivity/origin/severity features relabel and localize, they don't change exit semantics.

🤖 Generated with Claude Code

dd_line reports a minimal set of source lines, but presented them as a flat, equally-weighted list of confident warnings. Three problems: (1) no check that the reported lines actually reproduce the instability; (2) fypp #:for/#:def expansion collapses many generated computations onto one .fpp line, so a hit can be the wrong instance; (3) a multi-op line did not say which op was at fault. This adds, reusing the verified Verrou --source mechanism (matches file+line+symbol, captured via --gen-source): - Confirmation: perturb only the suspect lines; lines that fail to reproduce the deviation are downgraded from ::warning:: to ::notice:: (unconfirmed). - Per-line ranking: perturb each line alone and rank by the share of float-proxy it reproduces, so the dominant computation is named (e.g. m_time_steppers.fpp:510 = 100%). - Cancellation cross-reference: label dd_line hotspots that coincide with a stage-F catastrophic-cancellation site. - Macro-expansion flag: mark hotspots whose .fpp line sits inside a #:for/#:def expansion as instance-ambiguous. Surfaced in console, the GitHub step summary (ranked, tagged list), and inline annotations. Pure helpers covered by toolchain/mfc/test_fp_stability.py (22 tests, TDD). Verified end-to-end on a serial debug build.

…Tier 2) dd_line attributes to .fpp source lines, but a #:for/#:def expansion collapses many generated computations onto one line, so a macro-ambiguous hotspot cannot be pinned to a single runtime instance. This adds an opt-in precision path that resolves it. Mechanism (validated against gfortran+Verrou): a new build flag --fp-precision-lines strips the fypp line markers from each generated .f90 so the compiler attributes every expanded instance to a distinct physical line, emitting a .linemap.json sidecar mapping each line back to (.fpp file, line, instance). Marker renumbering was tried first but hit gfortran's DWARF line-number ceiling (~300k) and 700-line shadow runs; stripping avoids both and survives the cpp #if layer. fp-stability gains --precision-sim-binary: for the most flagrant macro-ambiguous hotspot, each expanded instance is perturbed alone (Verrou --source) on the precision binary and ranked, naming the responsible instance and showing its concrete generated code. The strip is gated to the simulation target only (pre/post run on CPU). Validated end-to-end: m_weno.fpp:238 (3 #:for instances) resolved to instance #0 = s_cb(i+3)-s_cb(i+1). toolchain/mfc/fp_precision_lines.py is pure + TDD'd (12 tests); normal build path is byte-identical and unaffected.

…gin; surface caps + coverage The per-line --source ranking measures sensitivity (where reduced precision most moves the output), which is structurally dominated by the time integrator / final accumulation: perturbing the last write to q_cons hits the output 1:1, while upstream errors get re-rounded there. Empirically, sod_standard's cancellation concentrates in m_weno.fpp (14 sites) and m_riemann_solvers.fpp (5), with m_time_steppers.fpp at just 1 — yet the time-stepper led the share ranking at 100%. Presenting it as 'most flagrant' conflated sensitivity with where ill-conditioning originates. Reframe: the dd_line/share view is relabeled 'single-precision sensitivity' with an explicit caveat (typically the time integrator, expected/benign, not a cancellation-origin finder); a new per-file cancellation-density line (_cancellation_by_file) headlines where cancellation actually concentrates; console + GitHub summary + inline annotations updated to keep the two signals distinct. Also: no silent caps (truncated dd_line/cancellation/float-max lists now report '…and N more'; annotations emit a dropped-count notice), and a coverage caveat in the summary header (N 1-D cases; a pass is not a guarantee for unexercised multi-D/viscous/MHD/IGR/bubble paths). _cancellation_by_file is pure + TDD'd.

…esolve continuations Two fixes prompted by review: (1) site COUNT is not severity — one catastrophic cancellation outweighs many mild ones; (2) attribution can land on a continuation fragment, so the labelled line was unclear. Severity: Verrou exposes no per-site bit-count, but --cc-threshold-double is itself a severity filter (a site is only reported if it lost >= threshold bits). A second pass at 26 bits identifies SEVERE sites with no false positives (may under-count). Severe sites are listed first and labelled; the count-by-file view is demoted with an explicit 'count != severity' caveat. On sod_standard this surfaces the real origins — flux divergence (m_rhs), divided differences and smoothness indicators (m_weno), HLLC wave speeds (m_riemann) — and correctly omits the time integrator. Continuations: _statement_bounds_in_lines follows free-form '&' continuations (leading or trailing) to the logical-statement start; cancellation sites are de-duplicated and displayed as the full statement at its canonical start line, so a hit on a fragment resolves to the whole expression. Pure helpers TDD'd (60 toolchain tests).

…hand-tuned thresholds Each case had a hand-tuned absolute L-inf threshold spanning 1e-13..2e-7 (six orders), driven by field magnitude and conditioning. Maintaining per-case thresholds is fragile. Normalizing the deviation by the field's peak magnitude removes the scale, so a single global criterion suffices. Pass/fail is now sig_bits = -log2(max_dev / max|ref|) >= MIN_SIG_BITS (24 = single precision retained under random rounding). The per-case 'threshold' field is removed from CASES; pass/fail, the VPREC FAIL marker, console, summary table, and inline annotations all report bits-retained vs the floor. The dd_sym/dd_line oracle keeps its own float-proxy-derived threshold (unchanged). Validated: max_dev spans 1e-14..7e-8 across the 6 cases but sig_bits is a tight 30.3..48.7 band, all >= 24 with margin; classification matches the prior thresholds (6/6 pass). Pure _sig_bits/_stability_pass are TDD'd (67 toolchain tests). A per-case auto-measured baseline + regression delta would add sensitivity for moderate drops; deferred as a heavier change.

… 'severe'), collapse sensitivity Review feedback on the summary: (1) it buried the interesting cancellation origins below the long, mostly-expected sensitivity list; (2) 'severe' is a binary label when an actual magnitude is far more useful; (3) 'bits lost' is not intuitive. Reorder: the cancellation-origins section now leads (right after the results table), ranked worst-first; the single-precision sensitivity list (dominated by the benign time integrator) is collapsed into a <details>. Severity as a number: a sweep of --cc-threshold-double levels [10,20,30,40,48] buckets each site by the highest it survives (_cancellation_severity), giving per-site bits lost (a lower bound). Bits are translated to decimal digits (a double carries ~16; _digits_left) so each entry reads e.g. '>= 12 digits lost (~4 of 16 left)' with the full statement. On sod_standard the worst origins (flux divergence, divided differences, HLLC wave speeds) lose ~14 of 16 digits; the sweep discriminates (23 sites >=10 bits, 11 >=48). 69 toolchain tests.

…vior change) Cleanup pass over the ~680 lines this branch added to fp_stability.py. Extracts shared helpers that had accreted across the six feature commits, with identical behavior (69 toolchain tests + ruff + precheck green; emitted console/summary text unchanged): - _resolve_source / _read_source_lines: the 'abs-path-or-glob-under-src(-then-tree)' + readlines block was repeated in _read_source_line, _macro_context, _statement_at, _get_source_context. A search_whole_tree flag preserves the one difference (only _get_source_context fell back to the whole tree). - _blank_result(name): the 15-field result dict was written verbatim twice. _find_dd_tool(verrou_bin, tool): merges _find_dd_sym/_find_dd_line. _setup_dd_run: shared dd_run.sh/dd_cmp.py setup + threshold-default for dd_sym and dd_line. _capture_gen_source: shared nearest --gen-source capture for confirmation and disambiguation. _more_md: the '…and N more' truncation footer used in three summary sections.

…(no behavior change) The PR had grown fp_stability.py well past the 1000-line soft guideline. Pure relocation — no function body, string, constant, or logic changed — into a clean dependency chain: - fp_stability_metrics.py (474, leaf): regexes/constants + pure parsing, source-reading, sig-bits, cancellation, ranking, statement-bounds helpers. Imports no sibling. - fp_stability_runners.py (530): Verrou subprocess orchestration (run/dd/vprec/cancellation/confirmation/disambiguation). Imports metrics. - fp_stability_report.py (244): GitHub summary + annotation emitters. Imports metrics. - fp_stability.py (715): CLI entry, CASES, _run_case, _blank_result; imports explicitly from the three. No import cycles. Also dropped the unused 'case' parameter from _run_cancellation_check/_run_float_max_check. Verified: 69 toolchain tests, ruff (incl. F-rules: no undefined names), precheck all 7, and a live fp-stability run confirming the cross-module orchestration (sig-bits pass/fail + cancellation sweep) is unchanged. Test import repointed to fp_stability_metrics.

Per the weight review, the precision-build-based per-instance disambiguation was the heaviest piece (its own module + a build flag + CMake plumbing + tests) for the narrowest trigger (fires only when the most-flagrant hotspot is also inside a #:for/#:def expansion). Removed in full: - deleted toolchain/mfc/fp_precision_lines.py and its tests; deleted _disambiguate_instances - reverted CMakeLists.txt and build.py to upstream (no MFC_FP_PRECISION_LINES option, no marker-strip step, no -D flag); dropped the --fp-precision-lines build arg and the --precision-sim-binary fp-stability arg - removed the E3 disambiguation stage, its docstring section, and the per-instance summary display Kept: the lightweight '#:for/#:def-expanded — may represent multiple instances' hotspot warning (cheap, honest, separate from the disambiguation machinery). 57 toolchain tests, ruff, precheck all 7 green; CMakeLists.txt and build.py are byte-identical to upstream.

…asibility guard Following the native convention (run/validate/viz take the case .py as a positional 'input'), fp-stability now does too — './mfc.sh fp-stability my_case.py' analyzes your case instead of the built-in suite; omitting it runs the suite as before. It loads the case via the shared loader (run.input.load), runs it as a single case, and auto-detects the files to diff from the reference run (_autodetect_compare: conserved-var .dat at the final step, prim fallback). Output is forced to serial .dat I/O (parallel_io=F) since the no-MPI binary is run as one process and the suite diffs serial files. Guard (Verrou is ~30x and the suite runs the sim many times): the case must be a small, short, single-process proxy — errors if cells > 100k or work (cells x t_step_stop) > 200k cell-steps, with guidance to coarsen. Validated end-to-end on a real case .py (auto-compare + sig-bits PASS + cancellation digits); guard correctly rejects 1D_sodshocktube (400k cell-steps). 60 toolchain tests, ruff, precheck all 7.

…/cancellation reframe The --help prose was stale: it claimed 'PASS/FAIL against per-case thresholds' (now scale-free sig-bits), didn't mention the case.py positional / its serial-CPU constraints / the feasibility guard, and listed an outdated case set. Rewrote the description to cover: running on a built-in suite or a user case .py (with constraints + guard), the >= 24-bit scale-free pass criterion, and the analysis passes (dd confirmation/ranking, cancellation origins by digits lost). Also updated the module-docstring Usage. The positional INPUT and a case.py example were already shown in --help; this fixes the surrounding prose.

…code, tests, comment rot From a multi-agent PR review: - Silent failures (critical): a crashed cancellation/float-max/MCA run was reported as a clean result ('none detected' / 'no overflows' / 'dev=0.0'). The run helpers now log the failure and return None (cancellation/float-max) or a completed-sample count (MCA, distinct from measured zero); _run_case reports 'run failed' instead of a false all-clear. - Crash (important): np.loadtxt(...)[:,1] raised IndexError on a single-row .dat (reachable via a 1-cell user case), aborting the whole suite. Added _dat_column using np.atleast_2d; also fixed the generated dd_cmp.py oracle. - Dead code: removed _cancellation_by_file (superseded by the digits-lost severity view) and _stability_pass (the orchestrator inlines the comparison), plus their tests. - Tests: smoke tests for the CI-only report emitters (blank + populated result, and unconfirmed->::notice:: downgrade), _digits_left clamp, and #:block/#:call/unbalanced macro cases. - Comment rot: stage-E docstring now says confirmation is set-level (not per-line); dropped the non-existent 'per-file density' reference; fixed the '48 ~ full mantissa' note (53-bit). 45 tests, ruff, precheck all 7.

…ssage Verrou is a compiled Valgrind fork (not a pip/uv/conda-friendly Python package), so it can't be installed into the venv. Add toolchain/bootstrap/verrou.sh — an explicit, opt-in installer that builds Valgrind+Verrou from source into $VERROU_HOME (default ~/.local/verrou), pinned to the same versions as the fp-stability CI workflow (Valgrind 3.26.0 + edf-hpc/verrou@a58d434). It is deliberately opt-in (never auto-built on a bare fp-stability run, since it's a ~20-min source build needing a C toolchain + autotools): idempotent (skips if already present, --force to rebuild), requires Linux (Valgrind has no modern-macOS/Apple-Silicon support), warns but proceeds on non-x86_64 (Valgrind builds on aarch64 etc., but Verrou's FP backends are best-validated on x86_64), and checks build deps up front with guidance. The fp-stability SKIP message now points at it and clarifies Verrou is not a pip package.

Replace the inline Valgrind+Verrou build in the workflow with a call to toolchain/bootstrap/verrou.sh, so the local installer and CI share one pinned recipe (no drift between them). Cache gating and the system-deps step are unchanged; the build step is still skipped on a cache hit. Tightened the verify step to '--tool=verrou --version', and noted that the cache key's pinned versions must track the script.

…ild as fallback verrou.sh now downloads a pinned, hash-verified prebuilt from sbryngelson/verrou-dist@v1 (seconds) and falls back to the ~20-min source build; idempotency check sources env.sh so a relocated prebuilt isn't re-fetched. fp_stability sets VALGRIND_LIB (via _verrou_env, reused by _dd_env) so a relocated tree's valgrind/verrou_dd_* calls resolve — harmless for source builds. CI installs zstd and sources env.sh before verifying.

…ard-fail if it can't Running ./mfc.sh fp-stability with no Verrou present now installs it via the bootstrap (downloads the pinned prebuilt from verrou-dist; source build as fallback) and proceeds, instead of SKIP+exit-0; a failed install is now a hard error. _find_verrou no longer accepts a bare system valgrind on PATH (it has no 'verrou' tool and would only fail at run time) — that case reads as 'Verrou absent' so it gets installed. CI drops the separate Install/Verify Verrou steps; the run does it. Tests added for the discovery logic.

…ROU_HOME tree, more tests verrou.sh try_prebuilt() runs as an if-condition (set -e suppressed), so a failed extract could fall through and the source build would install over a half-written tree; now extract+verify in a staging dir and swap into PREFIX atomically with explicit error checks. _find_verrou now verifies the $VERROU_HOME tree actually runs the verrou tool (with VALGRIND_LIB for a relocated prebuilt) so a broken/stale tree reads as absent and gets reinstalled, not used until it fails per-run. Fix comment rot (fp-stability now auto-installs). Add unit tests for _verrou_env (incl. preserve-user-VALGRIND_LIB), _dd_env PYTHONPATH composition, _install_verrou hard-fail guards, _has_verrou_tool OSError, and the broken-VERROU_HOME case.

…pp flag onto it The dd delta-debug stack (bisection + confirmation positive-control + sensitivity ranking) tried to pinpoint and rank the single most precision-sensitive source line, but fypp #:for/#:def expansion collapses many generated computations onto one .fpp line, so that attribution is instance-ambiguous by construction — the fragile part. Removed it (~900 lines). The cancellation pass stays and now carries the fypp instance-ambiguity flag: each cancellation origin is checked with _macro_context and, if its .fpp line sits inside a #:for/#:def, marked 'may represent multiple instances' in console, annotations, and summary. file:line attribution (cancellation origins, ranked by digits lost) is preserved; only the false-precision line-pinpointing is gone. Verified end-to-end: 27 cancellation sites, 23 flagged fypp-ambiguous.

sbryngelson added 5 commits June 1, 2026 19:01

sbryngelson changed the title ~~fp-stability: verify, rank, and resolve dd_line hotspots to the responsible instance~~ fp-stability: verify/rank/localize dd_line hotspots, per-instance disambiguation, scale-free pass/fail Jun 2, 2026

sbryngelson added 4 commits June 2, 2026 07:41

sbryngelson changed the title ~~fp-stability: verify/rank/localize dd_line hotspots, per-instance disambiguation, scale-free pass/fail~~ fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail Jun 2, 2026

sbryngelson added 9 commits June 2, 2026 10:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail#1526

fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail#1526
sbryngelson wants to merge 18 commits into
MFlowCode:masterfrom
sbryngelson:fp-stability-dd-confirm-rank

sbryngelson commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Conversation

sbryngelson commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

What it does now

Validation (real MFC, sod_standard)

Code organization

Caveats

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

sbryngelson commented Jun 2, 2026 •

edited

Loading

Validation (real MFC, `sod_standard`)