fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail#1526
Draft
sbryngelson wants to merge 18 commits into
Draft
fp-stability: verify/rank dd_line hotspots, cancellation-origin reporting, scale-free pass/fail#1526sbryngelson wants to merge 18 commits into
sbryngelson wants to merge 18 commits into
Conversation
dd_line reports a minimal set of source lines, but presented them as a flat, equally-weighted list of confident warnings. Three problems: (1) no check that the reported lines actually reproduce the instability; (2) fypp #:for/#:def expansion collapses many generated computations onto one .fpp line, so a hit can be the wrong instance; (3) a multi-op line did not say which op was at fault. This adds, reusing the verified Verrou --source mechanism (matches file+line+symbol, captured via --gen-source): - Confirmation: perturb only the suspect lines; lines that fail to reproduce the deviation are downgraded from ::warning:: to ::notice:: (unconfirmed). - Per-line ranking: perturb each line alone and rank by the share of float-proxy it reproduces, so the dominant computation is named (e.g. m_time_steppers.fpp:510 = 100%). - Cancellation cross-reference: label dd_line hotspots that coincide with a stage-F catastrophic-cancellation site. - Macro-expansion flag: mark hotspots whose .fpp line sits inside a #:for/#:def expansion as instance-ambiguous. Surfaced in console, the GitHub step summary (ranked, tagged list), and inline annotations. Pure helpers covered by toolchain/mfc/test_fp_stability.py (22 tests, TDD). Verified end-to-end on a serial debug build.
…Tier 2) dd_line attributes to .fpp source lines, but a #:for/#:def expansion collapses many generated computations onto one line, so a macro-ambiguous hotspot cannot be pinned to a single runtime instance. This adds an opt-in precision path that resolves it. Mechanism (validated against gfortran+Verrou): a new build flag --fp-precision-lines strips the fypp line markers from each generated .f90 so the compiler attributes every expanded instance to a distinct physical line, emitting a .linemap.json sidecar mapping each line back to (.fpp file, line, instance). Marker renumbering was tried first but hit gfortran's DWARF line-number ceiling (~300k) and 700-line shadow runs; stripping avoids both and survives the cpp #if layer. fp-stability gains --precision-sim-binary: for the most flagrant macro-ambiguous hotspot, each expanded instance is perturbed alone (Verrou --source) on the precision binary and ranked, naming the responsible instance and showing its concrete generated code. The strip is gated to the simulation target only (pre/post run on CPU). Validated end-to-end: m_weno.fpp:238 (3 #:for instances) resolved to instance #0 = s_cb(i+3)-s_cb(i+1). toolchain/mfc/fp_precision_lines.py is pure + TDD'd (12 tests); normal build path is byte-identical and unaffected.
…gin; surface caps + coverage The per-line --source ranking measures sensitivity (where reduced precision most moves the output), which is structurally dominated by the time integrator / final accumulation: perturbing the last write to q_cons hits the output 1:1, while upstream errors get re-rounded there. Empirically, sod_standard's cancellation concentrates in m_weno.fpp (14 sites) and m_riemann_solvers.fpp (5), with m_time_steppers.fpp at just 1 — yet the time-stepper led the share ranking at 100%. Presenting it as 'most flagrant' conflated sensitivity with where ill-conditioning originates. Reframe: the dd_line/share view is relabeled 'single-precision sensitivity' with an explicit caveat (typically the time integrator, expected/benign, not a cancellation-origin finder); a new per-file cancellation-density line (_cancellation_by_file) headlines where cancellation actually concentrates; console + GitHub summary + inline annotations updated to keep the two signals distinct. Also: no silent caps (truncated dd_line/cancellation/float-max lists now report '…and N more'; annotations emit a dropped-count notice), and a coverage caveat in the summary header (N 1-D cases; a pass is not a guarantee for unexercised multi-D/viscous/MHD/IGR/bubble paths). _cancellation_by_file is pure + TDD'd.
…esolve continuations Two fixes prompted by review: (1) site COUNT is not severity — one catastrophic cancellation outweighs many mild ones; (2) attribution can land on a continuation fragment, so the labelled line was unclear. Severity: Verrou exposes no per-site bit-count, but --cc-threshold-double is itself a severity filter (a site is only reported if it lost >= threshold bits). A second pass at 26 bits identifies SEVERE sites with no false positives (may under-count). Severe sites are listed first and labelled; the count-by-file view is demoted with an explicit 'count != severity' caveat. On sod_standard this surfaces the real origins — flux divergence (m_rhs), divided differences and smoothness indicators (m_weno), HLLC wave speeds (m_riemann) — and correctly omits the time integrator. Continuations: _statement_bounds_in_lines follows free-form '&' continuations (leading or trailing) to the logical-statement start; cancellation sites are de-duplicated and displayed as the full statement at its canonical start line, so a hit on a fragment resolves to the whole expression. Pure helpers TDD'd (60 toolchain tests).
…hand-tuned thresholds Each case had a hand-tuned absolute L-inf threshold spanning 1e-13..2e-7 (six orders), driven by field magnitude and conditioning. Maintaining per-case thresholds is fragile. Normalizing the deviation by the field's peak magnitude removes the scale, so a single global criterion suffices. Pass/fail is now sig_bits = -log2(max_dev / max|ref|) >= MIN_SIG_BITS (24 = single precision retained under random rounding). The per-case 'threshold' field is removed from CASES; pass/fail, the VPREC FAIL marker, console, summary table, and inline annotations all report bits-retained vs the floor. The dd_sym/dd_line oracle keeps its own float-proxy-derived threshold (unchanged). Validated: max_dev spans 1e-14..7e-8 across the 6 cases but sig_bits is a tight 30.3..48.7 band, all >= 24 with margin; classification matches the prior thresholds (6/6 pass). Pure _sig_bits/_stability_pass are TDD'd (67 toolchain tests). A per-case auto-measured baseline + regression delta would add sensitivity for moderate drops; deferred as a heavier change.
… 'severe'), collapse sensitivity Review feedback on the summary: (1) it buried the interesting cancellation origins below the long, mostly-expected sensitivity list; (2) 'severe' is a binary label when an actual magnitude is far more useful; (3) 'bits lost' is not intuitive. Reorder: the cancellation-origins section now leads (right after the results table), ranked worst-first; the single-precision sensitivity list (dominated by the benign time integrator) is collapsed into a <details>. Severity as a number: a sweep of --cc-threshold-double levels [10,20,30,40,48] buckets each site by the highest it survives (_cancellation_severity), giving per-site bits lost (a lower bound). Bits are translated to decimal digits (a double carries ~16; _digits_left) so each entry reads e.g. '>= 12 digits lost (~4 of 16 left)' with the full statement. On sod_standard the worst origins (flux divergence, divided differences, HLLC wave speeds) lose ~14 of 16 digits; the sweep discriminates (23 sites >=10 bits, 11 >=48). 69 toolchain tests.
…vior change) Cleanup pass over the ~680 lines this branch added to fp_stability.py. Extracts shared helpers that had accreted across the six feature commits, with identical behavior (69 toolchain tests + ruff + precheck green; emitted console/summary text unchanged): - _resolve_source / _read_source_lines: the 'abs-path-or-glob-under-src(-then-tree)' + readlines block was repeated in _read_source_line, _macro_context, _statement_at, _get_source_context. A search_whole_tree flag preserves the one difference (only _get_source_context fell back to the whole tree). - _blank_result(name): the 15-field result dict was written verbatim twice. _find_dd_tool(verrou_bin, tool): merges _find_dd_sym/_find_dd_line. _setup_dd_run: shared dd_run.sh/dd_cmp.py setup + threshold-default for dd_sym and dd_line. _capture_gen_source: shared nearest --gen-source capture for confirmation and disambiguation. _more_md: the '…and N more' truncation footer used in three summary sections.
…(no behavior change) The PR had grown fp_stability.py well past the 1000-line soft guideline. Pure relocation — no function body, string, constant, or logic changed — into a clean dependency chain: - fp_stability_metrics.py (474, leaf): regexes/constants + pure parsing, source-reading, sig-bits, cancellation, ranking, statement-bounds helpers. Imports no sibling. - fp_stability_runners.py (530): Verrou subprocess orchestration (run/dd/vprec/cancellation/confirmation/disambiguation). Imports metrics. - fp_stability_report.py (244): GitHub summary + annotation emitters. Imports metrics. - fp_stability.py (715): CLI entry, CASES, _run_case, _blank_result; imports explicitly from the three. No import cycles. Also dropped the unused 'case' parameter from _run_cancellation_check/_run_float_max_check. Verified: 69 toolchain tests, ruff (incl. F-rules: no undefined names), precheck all 7, and a live fp-stability run confirming the cross-module orchestration (sig-bits pass/fail + cancellation sweep) is unchanged. Test import repointed to fp_stability_metrics.
Per the weight review, the precision-build-based per-instance disambiguation was the heaviest piece (its own module + a build flag + CMake plumbing + tests) for the narrowest trigger (fires only when the most-flagrant hotspot is also inside a #:for/#:def expansion). Removed in full: - deleted toolchain/mfc/fp_precision_lines.py and its tests; deleted _disambiguate_instances - reverted CMakeLists.txt and build.py to upstream (no MFC_FP_PRECISION_LINES option, no marker-strip step, no -D flag); dropped the --fp-precision-lines build arg and the --precision-sim-binary fp-stability arg - removed the E3 disambiguation stage, its docstring section, and the per-instance summary display Kept: the lightweight '#:for/#:def-expanded — may represent multiple instances' hotspot warning (cheap, honest, separate from the disambiguation machinery). 57 toolchain tests, ruff, precheck all 7 green; CMakeLists.txt and build.py are byte-identical to upstream.
…asibility guard Following the native convention (run/validate/viz take the case .py as a positional 'input'), fp-stability now does too — './mfc.sh fp-stability my_case.py' analyzes your case instead of the built-in suite; omitting it runs the suite as before. It loads the case via the shared loader (run.input.load), runs it as a single case, and auto-detects the files to diff from the reference run (_autodetect_compare: conserved-var .dat at the final step, prim fallback). Output is forced to serial .dat I/O (parallel_io=F) since the no-MPI binary is run as one process and the suite diffs serial files. Guard (Verrou is ~30x and the suite runs the sim many times): the case must be a small, short, single-process proxy — errors if cells > 100k or work (cells x t_step_stop) > 200k cell-steps, with guidance to coarsen. Validated end-to-end on a real case .py (auto-compare + sig-bits PASS + cancellation digits); guard correctly rejects 1D_sodshocktube (400k cell-steps). 60 toolchain tests, ruff, precheck all 7.
…/cancellation reframe The --help prose was stale: it claimed 'PASS/FAIL against per-case thresholds' (now scale-free sig-bits), didn't mention the case.py positional / its serial-CPU constraints / the feasibility guard, and listed an outdated case set. Rewrote the description to cover: running on a built-in suite or a user case .py (with constraints + guard), the >= 24-bit scale-free pass criterion, and the analysis passes (dd confirmation/ranking, cancellation origins by digits lost). Also updated the module-docstring Usage. The positional INPUT and a case.py example were already shown in --help; this fixes the surrounding prose.
…code, tests, comment rot
From a multi-agent PR review:
- Silent failures (critical): a crashed cancellation/float-max/MCA run was reported as a clean result ('none detected' / 'no overflows' / 'dev=0.0'). The run helpers now log the failure and return None (cancellation/float-max) or a completed-sample count (MCA, distinct from measured zero); _run_case reports 'run failed' instead of a false all-clear.
- Crash (important): np.loadtxt(...)[:,1] raised IndexError on a single-row .dat (reachable via a 1-cell user case), aborting the whole suite. Added _dat_column using np.atleast_2d; also fixed the generated dd_cmp.py oracle.
- Dead code: removed _cancellation_by_file (superseded by the digits-lost severity view) and _stability_pass (the orchestrator inlines the comparison), plus their tests.
- Tests: smoke tests for the CI-only report emitters (blank + populated result, and unconfirmed->::notice:: downgrade), _digits_left clamp, and #:block/#:call/unbalanced macro cases.
- Comment rot: stage-E docstring now says confirmation is set-level (not per-line); dropped the non-existent 'per-file density' reference; fixed the '48 ~ full mantissa' note (53-bit). 45 tests, ruff, precheck all 7.
…ssage Verrou is a compiled Valgrind fork (not a pip/uv/conda-friendly Python package), so it can't be installed into the venv. Add toolchain/bootstrap/verrou.sh — an explicit, opt-in installer that builds Valgrind+Verrou from source into $VERROU_HOME (default ~/.local/verrou), pinned to the same versions as the fp-stability CI workflow (Valgrind 3.26.0 + edf-hpc/verrou@a58d434). It is deliberately opt-in (never auto-built on a bare fp-stability run, since it's a ~20-min source build needing a C toolchain + autotools): idempotent (skips if already present, --force to rebuild), requires Linux (Valgrind has no modern-macOS/Apple-Silicon support), warns but proceeds on non-x86_64 (Valgrind builds on aarch64 etc., but Verrou's FP backends are best-validated on x86_64), and checks build deps up front with guidance. The fp-stability SKIP message now points at it and clarifies Verrou is not a pip package.
Replace the inline Valgrind+Verrou build in the workflow with a call to toolchain/bootstrap/verrou.sh, so the local installer and CI share one pinned recipe (no drift between them). Cache gating and the system-deps step are unchanged; the build step is still skipped on a cache hit. Tightened the verify step to '--tool=verrou --version', and noted that the cache key's pinned versions must track the script.
…ild as fallback verrou.sh now downloads a pinned, hash-verified prebuilt from sbryngelson/verrou-dist@v1 (seconds) and falls back to the ~20-min source build; idempotency check sources env.sh so a relocated prebuilt isn't re-fetched. fp_stability sets VALGRIND_LIB (via _verrou_env, reused by _dd_env) so a relocated tree's valgrind/verrou_dd_* calls resolve — harmless for source builds. CI installs zstd and sources env.sh before verifying.
…ard-fail if it can't Running ./mfc.sh fp-stability with no Verrou present now installs it via the bootstrap (downloads the pinned prebuilt from verrou-dist; source build as fallback) and proceeds, instead of SKIP+exit-0; a failed install is now a hard error. _find_verrou no longer accepts a bare system valgrind on PATH (it has no 'verrou' tool and would only fail at run time) — that case reads as 'Verrou absent' so it gets installed. CI drops the separate Install/Verify Verrou steps; the run does it. Tests added for the discovery logic.
…ROU_HOME tree, more tests verrou.sh try_prebuilt() runs as an if-condition (set -e suppressed), so a failed extract could fall through and the source build would install over a half-written tree; now extract+verify in a staging dir and swap into PREFIX atomically with explicit error checks. _find_verrou now verifies the $VERROU_HOME tree actually runs the verrou tool (with VALGRIND_LIB for a relocated prebuilt) so a broken/stale tree reads as absent and gets reinstalled, not used until it fails per-run. Fix comment rot (fp-stability now auto-installs). Add unit tests for _verrou_env (incl. preserve-user-VALGRIND_LIB), _dd_env PYTHONPATH composition, _install_verrou hard-fail guards, _has_verrou_tool OSError, and the broken-VERROU_HOME case.
…pp flag onto it The dd delta-debug stack (bisection + confirmation positive-control + sensitivity ranking) tried to pinpoint and rank the single most precision-sensitive source line, but fypp #:for/#:def expansion collapses many generated computations onto one .fpp line, so that attribution is instance-ambiguous by construction — the fragile part. Removed it (~900 lines). The cancellation pass stays and now carries the fypp instance-ambiguity flag: each cancellation origin is checked with _macro_context and, if its .fpp line sits inside a #:for/#:def, marked 'may represent multiple instances' in console, annotations, and summary. file:line attribution (cancellation origins, ranked by digits lost) is preserved; only the false-precision line-pinpointing is gone. Verified end-to-end: 27 cancellation sites, 23 flagged fypp-ambiguous.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes
./mfc.sh fp-stability(the Verrou-based FP-stability harness) trustworthy and self-explanatory:dd_linehotspots are verified and ranked; the report leads with where catastrophic cancellation originates and how badly (in decimal digits lost); precision sensitivity is kept clearly separate from cancellation origin; and pass/fail is scale-free, retiring the per-case hand-tuned thresholds.Toolchain-only — no Fortran/solver and no build-system changes (CMakeLists.txt / build.py are untouched relative to master). gfortran is the fp-stability CI compiler.
Motivation
dd_linereported a flat list of source lines as confident "hotspots." Several ways that misled:&-continuation fragment, so the labelled line was unclear.What it does now
Verification + ranking. A positive control perturbs only the suspect lines (Verrou
--source, matching file+line+symbol via--gen-source); lines that don't reproduce the deviation are downgraded::warning::→::notice::(unconfirmed). Lines are ranked by their share of the single-precision deviation.Sensitivity vs cancellation-origin. The report leads with cancellation origins (where ill-conditioning arises), ranked worst-first by decimal digits lost (a sweep of
--cc-threshold-doublelevels buckets each site by the highest it survives; bits → digits, since a double carries ~16). Each entry shows the full statement (continuation fragments resolved to their logical-statement start). The--sourcesensitivity ranking — dominated by the (benign) time integrator — is relabelled as such and collapsed into a<details>, so it's no longer mistaken for a culprit-finder.Scale-free pass/fail. The 6 hand-tuned per-case thresholds are gone. A case passes iff it retains
sig_bits = -log2(max_dev / max|ref|) ≥ 24(single-precision accuracy) — normalizing by field scale collapses the six-order spread into a tight 30–48 bit band, so one global floor suffices.Honesty. Truncated lists report "…and N more"; a coverage caveat notes these are N 1-D cases (not a guarantee for unexercised multi-D/viscous/MHD/IGR/bubble paths); hotspots inside a
#:for/#:defexpansion are flagged as instance-ambiguous.Validation (real MFC,
sod_standard)m_rhs:921, divided differencesm_weno:1090–1093, HLLC wave speedsm_riemann:3069/3100— each losing ~14 of 16 digits. The SSP-RK time integrator is correctly not among them.m_time_steppers.fpp:510= 100% (the integrator — expected, labelled, collapsed).max_devspans1e-14..7e-8but bits-retained is tight../mfc.sh precheckall 7.Code organization
The feature grew
fp_stability.pypast the 1000-line guideline, so it was de-duplicated and split (pure relocation, no behavior change) into:fp_stability_metrics.py(leaf: pure parsing/metrics),fp_stability_runners.py(Verrou orchestration),fp_stability_report.py(emitters), andfp_stability.py(CLI entry,CASES,_run_case) — each under the guideline, no import cycles.Caveats
🤖 Generated with Claude Code