Skip to content

feat(power): multinode measured-power aggregation#1574

Open
arygupt wants to merge 15 commits into
mainfrom
feat/measured-power-multinode
Open

feat(power): multinode measured-power aggregation#1574
arygupt wants to merge 15 commits into
mainfrom
feat/measured-power-multinode

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 27, 2026

Summary

Extends single-node measured-power aggregation (#1558) to multinode srt-slurm benchmarks. Wires per-node perf_samples_<host>.csv from srt-slurm's PR #35 perfmon through the launcher into process_result.pyaggregate_power.py, which now namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collapse across nodes.

Backward compatible: aggregate_power() accepts both Path and Iterable[Path]; single-CSV callers (single-node start_gpu_monitor path) are unchanged. csv_path param name preserved.

Pipeline

srt-slurm perfmon (PR #35 on NVIDIA/srt-slurm, layered on
  SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
  → perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
  → launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
  → process_result.py expands glob → aggregate_power.run() with list
  → aggregate_power.py emits cluster-wide avg_power_w + joules_per_*_token
  → InferenceX-app ETL auto-captures (no schema change)

Files

  • utils/aggregate_power.pycsv_path widened to Path | Iterable[Path]. Per-source GPU-id namespacing only kicks in for 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (mutually exclusive with --csv).
  • utils/process_result.py — bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
  • runners/launch_gb300-cw.sh — point dynamo-sglang at our srt-slurm fork, append monitoring: block to each recipe post-copy (idempotent), write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job.
  • utils/test_aggregate_power.py — 8 new multinode cases: per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E with list.
  • utils/test_process_result.py — 3 new cases: glob aggregation, precedence over single CSV, empty-match falls through.

Test plan

  • 36/36 aggregator tests pass (28 existing + 8 new)
  • 28/28 process_result tests pass (25 existing + 3 new)
  • nvidia-smi inside sglang container on real gb300-cw emits expected columns (timestamp, gpu, power_w) — verified manually with srun --container-image=...sglang...sqsh nvidia-smi --query-gpu=...
  • First E2E multinode sweep produces avg_power_w + joules_per_*_token in the agg JSON (pending perf-changelog.yaml entry + sweep-enabled label)
  • num_gpus in agg JSON matches prefill_gpus + decode_gpus from launcher (validates per-source namespacing — without the fix, num_gpus would equal a single node's gpus_per_node)
  • Chart at inferencex.semianalysis.com renders the new data via ?unofficialrun=<run_id>

Depends on

SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon (pinned by the launcher). Tracks NVIDIA/srt-slurm PR #35 head; will rebase to upstream main once #35 merges.


Note

Medium Risk
Touches benchmark launchers, result ETL, and multinode job lifecycle; failures are mostly best-effort for telemetry, but incorrect glob/recipe injection could silently omit power data or mis-attribute energy on disagg runs.

Overview
This PR extends measured GPU power and telemetry from single-node runs into multinode benchmarks on both NVIDIA (srt-slurm / gb300-cw) and AMD MI355X (amd_utils SLURM) paths, then patches aggregated results with cluster-wide and per-worker metrics.

NVIDIA: launch_gb300-cw.sh pins a srt-slurm fork with per-node perfmon, recursively injects monitoring: enabled into overlaid recipe YAMLs (fixing a flat-glob silent no-op), and sets GPU_METRICS_CSV_GLOB after the job. aggregate_power.py accepts multiple CSVs, namespaces GPU indices per source file, parses perf_samples_<role>_w<idx>_<host>.csv, and adds temp/util/memory, workers[], and disagg per-stage joules (joules_per_input_token, decode-attributed joules_per_output_token, prefill_avg_power_w / decode_avg_power_w). process_result.py prefers the glob over single-CSV when set and passes disagg through.

AMD: Each disagg node starts start_perf_monitor via benchmark_lib.sh (richer amd-smi sampling, shared perfmon filenames), wired through job.slurm and server_sglang.sh / server_vllm.sh; launch_mi355x-amds.sh collects CSVs early, exports the glob, and hardens squeue polling. CI pre/post-run sudo rm -rf benchmark_logs avoids root-owned checkout failures on cancelled runs.

Large test additions lock multinode, disagg, vendor CSV shapes, and process_result glob behavior.

Reviewed by Cursor Bugbot for commit 6849229. Bugbot is set up for automated code reviews on this repo. Configure here.

@arygupt arygupt requested a review from a team May 27, 2026 19:23
arygupt added a commit that referenced this pull request May 27, 2026
…regation

Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when
the sweep-enabled label is added to PR #1574. The sweep produces the first
multinode agg JSONs with avg_power_w + joules_per_*_token, validating the
per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge
end-to-end on real GB300 hardware (gb300-cw cluster).
@functionstackx functionstackx changed the title feat(power): multinode measured-power aggregation feat(power): multinode measured-montiroring aggregation May 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Comment thread utils/process_result.py
Comment on lines +149 to +172
_csv_arg = None
_glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
if _glob_pattern:
_matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
if _matched:
_csv_arg = _matched
else:
print(
f'[process_result] GPU_METRICS_CSV_GLOB={_glob_pattern!r} matched no files',
file=sys.stderr,
)

if _csv_arg is None:
# Single-node path: gpu_metrics.csv written by start_gpu_monitor in the
# bench container.
_csv_candidates = [
os.environ.get('GPU_METRICS_CSV'),
'gpu_metrics.csv',
'/workspace/gpu_metrics.csv',
]
_csv_arg = next(
(Path(p) for p in _csv_candidates if p and Path(p).is_file()),
None,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 When GPU_METRICS_CSV_GLOB is set but matches no files, _csv_arg stays None and the code falls through to the single-CSV candidate list (GPU_METRICS_CSV, gpu_metrics.csv, /workspace/gpu_metrics.csv) — contradicting the comment at lines 145-148 that the glob 'Takes precedence over the single-CSV fallback'. On a persistent self-hosted runner with a stale /workspace/gpu_metrics.csv from a prior single-node run (or a leaked GPU_METRICS_CSV env var), a multinode run whose perfmon failed on every node would silently patch wrong single-node avg_power_w / joules_per_*_token values into the multinode agg JSON. Fix: when _glob_pattern is truthy, skip the single-CSV fallback regardless of whether the glob matched anything.

Extended reasoning...

The contract violation

The block at utils/process_result.py:142-159 documents the precedence contract clearly:

Takes precedence over the single-CSV fallback — if the launcher set the glob, the run was multinode and there is no single-CSV fallback to make.

But the implementation only honors that contract when the glob actually matches files. On empty match:

_csv_arg = None
_glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
if _glob_pattern:
    _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
    if _matched:
        _csv_arg = _matched
    else:
        print(..., file=sys.stderr)   # warns but doesn't prevent fallthrough

if _csv_arg is None:                  # still None — falls into single-CSV branch
    _csv_candidates = [
        os.environ.get('GPU_METRICS_CSV'),
        'gpu_metrics.csv',
        '/workspace/gpu_metrics.csv',
    ]
    ...

The else branch just logs; _csv_arg stays None, and the next if _csv_arg is None block consults the single-CSV candidates.

Step-by-step proof on a persistent self-hosted runner

  1. Single-node run on gb300-cw_N completes successfully. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV=/workspace/gpu_metrics.csv (it lives in gpu_metrics.csv in cwd too). The file is left behind because the runner is persistent across jobs.
  2. Next job is a multinode dynamo-sglang sweep. runners/launch_gb300-cw.sh (lines 297-318) writes GPU_METRICS_CSV_GLOB=$LOGS_DIR/perf_samples_*.csv to $GITHUB_ENV — but only when perf_csv_count > 0. Suppose perfmon failed to start on every node (srt-slurm PR [NVIDIA] Reduce B200 Runs & add B200 FP4 Docker Script #35 had startup issues, host driver mismatch, etc.) — perf_csv_count would be 0 and the glob env var would not be written. Fine — that path is safe.
  3. However, suppose perfmon CSVs were written at the end of the job (so the launcher writes the GLOB), but a downstream cleanup hook between launcher and process_result.py removed them, OR srt-slurm wrote the CSVs to a different path on a subsequent retry, OR a persistent env var (GPU_METRICS_CSV_GLOB from a prior job) leaks in. The glob expansion in process_result.py returns empty.
  4. process_result.py enters the else branch on line 155, prints a warning, and falls through. os.environ.get('GPU_METRICS_CSV') from the prior single-node job returns /workspace/gpu_metrics.csv (or gpu_metrics.csv in cwd is still there). Path(p).is_file() is True. _csv_arg = Path('/workspace/gpu_metrics.csv').
  5. _aggregate_power_run is called with the stale single-node CSV.

Why the bench-window timestamp filter doesn't always save us

One verifier argued the start_unix <= ts <= end_unix filter at aggregate_power.py:177-178 would reject stale samples. That's true if the window comes from explicit Unix timestamps. But this PR adds two new fallback tiers in _load_bench_window:

  • Tier 2: date field parsed as a UTC string (YYYYMMDD-HHMMSS).
  • Tier 3: bench_result_path.stat().st_mtime — the bench JSON's own mtime, which is the current run's mtime, used as bench-end with start = end - duration.

The mtime tier is exactly the danger zone: on a persistent runner the bench JSON is freshly written, so its mtime is now. If the stale gpu_metrics.csv was also written recently (within the derived [mtime - duration, mtime] window — possible if the prior single-node run finished a few minutes ago), its samples do fall inside the window. Result: silent wrong avg_power_w and joules_per_*_token patched into the multinode agg JSON, which InferenceX-app's ETL auto-captures into the dashboard.

What the test misses

The accompanying test test_multinode_csv_glob_empty_match_falls_through_silently only asserts the no-stale-file case (asserts 'avg_power_w' not in patched). It does not stage a stale fallback CSV, so it can't catch the precedence violation. test_multinode_csv_glob_takes_precedence_over_single_csv only tests precedence when the glob matches.

Fix

One-line change in the empty-match branch:

if _glob_pattern:
    _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
    if _matched:
        _csv_arg = _matched
    else:
        _csv_arg = []           # sentinel: glob attempted, fallback forbidden
        print(...)

if not _csv_arg:                # treats [] same as None for the downstream check, but…
    if _glob_pattern:
        pass                    # …skip single-CSV candidates when glob was attempted
    else:
        _csv_candidates = [...]
        _csv_arg = next(...)

Or more cleanly: guard the single-CSV block on not _glob_pattern instead of _csv_arg is None.

@github-actions
Copy link
Copy Markdown
Contributor

@arygupt arygupt changed the title feat(power): multinode measured-montiroring aggregation feat(power): multinode measured-power aggregation May 28, 2026
arygupt added a commit that referenced this pull request May 28, 2026
…regation

Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when
the sweep-enabled label is added to PR #1574. The sweep produces the first
multinode agg JSONs with avg_power_w + joules_per_*_token, validating the
per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge
end-to-end on real GB300 hardware (gb300-cw cluster).
@arygupt arygupt force-pushed the feat/measured-power-multinode branch from 3caf593 to 8d30341 Compare May 28, 2026 01:01
Comment thread utils/aggregate_power.py
bench = json.loads(bench_result_path.read_text(encoding="utf-8"))
except (OSError, json.JSONDecodeError):
return None
start = bench.get("benchmark_start_time_unix")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multinode num_gpus wrong without GPU column and clock drift

Low Severity

When multiple CSV paths are provided but none have a recognized GPU column (no match for _GPU_INDEX_COL_RE), num_gpus falls back to max(per_sample_row_count.values()). With multinode clock drift, each timestamp bucket only contains rows from a single node, so max returns one node's GPU count instead of the cluster total. This underestimates num_gpus and produces an incorrect total_system_energy_j, leading to wrong joules_per_*_token values in the agg JSON.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8d30341. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

arygupt added a commit that referenced this pull request May 28, 2026
… joules

Layers per-worker breakdown on top of the cluster-wide multinode
aggregation in the parent PR #1574. New agg JSON fields (additive — all
existing keys preserved bit-for-bit for backward compat):

  workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]
    role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx)
    aggregates across all CSVs for that worker — a multi-node TP=16
    decode worker on 4 nodes produces one workers entry with num_gpus=16.

  prefill_avg_power_w, decode_avg_power_w  (disagg only)
    Weighted per-GPU averages within each role.

  joules_per_input_token         = prefill_energy / total_input_tokens
  joules_per_output_token_decode = decode_energy  / total_output_tokens
    Disagg-only role-split metrics. Existing joules_per_output_token and
    joules_per_total_token keep their cluster-wide semantics so the chart
    won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change
on SemiAnalysisAI/srt-slurm c4c86dc) writes
`perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old
single-CSV format) silently emit empty workers list and skip the role
split — cluster-wide metrics unchanged in that case.

77/77 tests pass (68 existing + 9 new — per-worker grouping, multi-node
worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role
split, agg E2E omitting disagg-only fields, bit-for-bit backward compat
for old-format callers).
arygupt pushed a commit to SemiAnalysisAI/srt-slurm that referenced this pull request May 28, 2026
Squashed rebase of NVIDIA/srt-slurm PR NVIDIA#35 (kdhruv/gweperf_integration)
onto current main (which now includes default_bash_preamble, added since
PR NVIDIA#35 was opened on 2026-04-13). Original PR NVIDIA#35 had three commits;
their net effect is collapsed here to one because the second commit
replaced the first's gweperf integration with a built-in poller.

Adds:
  - src/srtctl/monitor/perfmon.py (new) - nvidia-smi polling, per-node
    perf_samples_<node>.csv + perf_summary_<node>.json output.
  - MonitoringConfig in src/srtctl/core/schema.py (new) - {enabled,
    sample_interval}, top-level SrtConfig field.
  - _start_perf_monitor / _stop_perf_monitor in BenchmarkStageMixin
    (new) - one process per worker node, started before bench, stopped
    SIGINT with 30s grace.
  - tests/test_monitoring.py (new) - 19 tests, all passing upstream.

Consumed by SemiAnalysisAI/InferenceX#1574 via the pinned ref
SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon. Will revert this
fork pin to NVIDIA/srt-slurm@main once PR NVIDIA#35 merges upstream.
arygupt and others added 7 commits May 28, 2026 11:10
… runs

Builds on PR #1558 (single-node measured-power) for multinode benchmarks
via srt-slurm. Pipeline:

  srt-slurm perfmon (per-node nvidia-smi sampling — PR #35 on
    NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
   perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
   launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
   process_result.py expands the glob and hands the list to
   aggregate_power.run()
   aggregate_power.py namespaces local GPU indices per source CSV stem so
   each node's local indices 0..N-1 don't collide across nodes; emits
   cluster-wide avg_power_w + joules_per_*_token
   InferenceX-app ETL auto-captures the numeric fields (no schema change)

Changes:

- utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping
  the original param name. Per-source GPU-id namespacing only kicks in when
  there are 2+ sources so single-node num_gpus is unchanged. CLI adds
  --csv-glob (Python-side glob, mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes
  precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork,
  append `monitoring:` block to each recipe post-copy (idempotent), and
  write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the
  downstream Process result step.
- 8 new multinode tests in test_aggregate_power.py (per-source namespacing,
  sub-second clock drift, asymmetric prefill/decode power, missing-CSV
  silent skip, backward-compat single-path-in-list, Iterable acceptance,
  E2E run with list). 3 new in test_process_result.py (glob aggregation,
  precedence over single CSV, empty-match falls through). 64/64 pass.

Verified data-format end-to-end on gb300 hardware: nvidia-smi
inside the sglang container emits the columns aggregate_power.py needs
timestamp, gpu, power_w.
…regation

Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when
the sweep-enabled label is added to PR #1574. The sweep produces the first
multinode agg JSONs with avg_power_w + joules_per_*_token, validating the
per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge
end-to-end on real GB300 hardware (gb300-cw cluster).
… recipes

The previous glob `$SRT_RECIPE_DST/*.yaml` only matched top-level YAMLs,
but recipes live under workload subdirectories (e.g. 8k1k/*.yaml). The
loop iterated zero times, no recipe got the monitoring: block, perfmon
never spawned, no perf_samples_*.csv were written, aggregate_power
silently skipped patching the agg JSON, and the dashboard had no power
data.

Sweep #26548110246 burned hours of GB300 time and shipped "success" with
zero power keys in every agg artifact — exactly the silent-failure chain
we should have caught earlier.

Fix: recurse via `find -type f -name '*.yaml'`. Add a loud WARNING when
zero recipes get the injection so future regressions surface immediately
instead of waiting for missing dashboard data to be noticed.
Previous Run Sweep failed because SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon
was based on PR #35's head from 2026-04-13, predating the default_bash_preamble
schema field that the launcher writes into srtslurm.yaml. srtctl rejected the
config with 'Unknown field' and the job never submitted.

Fork branch has now been rebased onto current NVIDIA/srt-slurm main (which
has default_bash_preamble), with PR #35 perfmon commits + Aryan's role-label
follow-up squashed/cherry-picked on top. Empty commit here re-fires the
Run Sweep workflow so we can validate end-to-end on real DSv4 FP4 GB300.
Prior sweep (#26548110246) on SHA 8d30341 completed green but produced
zero power data because of a flat-glob bug in the monitoring-injection
loop. Fix is on HEAD (6da2f1b) but the workflow's path filter only
fires on perf-changelog.yaml edits, so this commit re-touches that file
to re-dispatch.
Extends multinode measured-power aggregation with per-worker breakdown
and per-stage joules attribution. The cluster-wide avg_power_w +
joules_per_*_token fields stay backward-compatible; new disagg-only
fields layer on top.

New agg JSON fields:
  - power_by_worker: list of {role, worker_idx, hosts, num_gpus,
    avg_power_w} parsed from srt-slurm perfmon CSV filenames
    (`perf_samples_<role>_w<idx>_<host>.csv`). Roles: prefill, decode,
    agg, frontend. Workers spanning N nodes collapse one entry whose
    num_gpus is the cross-node sum.
  - joules_per_input_token: prefill_energy / total_input_tokens
    (disagg only — meaningless without a prefill stage).

Per-stage attribution (disagg only) replaces cluster-wide ratios for
existing fields:
  - joules_per_output_token = decode_energy / output_tokens
  - joules_per_total_token = (prefill + decode) / all_tokens
Frontend-labeled CSVs are excluded from per-stage energy but still
listed for observability. Falls back to cluster-wide math if only one
stage's CSVs survived.

process_result.py now passes DISAGG through to aggregate_power.run().
launch_gb300-cw.sh's recipe-injection loop reports found/injected
counts so a zero-recipes-found bug is distinguishable from the
benign all-already-monitored case.

Tests: 88/88 pass (68 existing + 20 new). New coverage: label parsing
across host formats, multi-node-per-worker collapse, per-stage J/token
math, frontend exclusion, single-stage fallback, zero-input degenerate,
end-to-end disagg wiring through process_result.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Workflow's paths: filter only fires on perf-changelog.yaml. This bumps
the dsv4-fp4-gb300-dynamo-sglang entry so the sweep picks up the new
per-worker power + per-stage J/token aggregation from 24f46ff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@arygupt arygupt force-pushed the feat/measured-power-multinode branch from f5b5c77 to 1af17ab Compare May 28, 2026 18:11
arygupt and others added 2 commits May 28, 2026 13:30
…+ add temp/util/mem

Realigns the per-worker / per-stage schema introduced in 06558b9 to
match the canonical METRIC_KEYS already declared in InferenceX-app
(packages/app/src/lib/metric-keys.ts). Previously this PR overrode
cluster-wide joules_per_output_token for disagg runs, which would
silently shift the meaning of a shared field. New per-stage values are
emitted as separate flat scalars so the cluster keys stay byte-stable.

Schema changes:
  - Revert disagg override on joules_per_output_token and
    joules_per_total_token — both are now ALWAYS cluster-wide
    (total_system_energy / token_count), matching single-node math
    and the frontend's existing axis labels.
  - Add new disagg-only flat scalars (already in frontend METRIC_KEYS):
      prefill_avg_power_w           cluster mean across prefill workers
      decode_avg_power_w            cluster mean across decode workers
      joules_per_output_token_decode  decode_energy / output_tokens
    joules_per_input_token unchanged (prefill_energy / input_tokens).
  - Rename power_by_worker[] -> workers[] to match
    InferenceX-app's BenchmarkRow.workers / WorkerPower interface.
  - Each workers[] entry extended with per-worker telemetry:
      avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb
  - Add matching cluster-wide telemetry scalars (per-GPU mean, omitted
    when CSV lacks the column).

Implementation:
  - _read_samples + _aggregate_rows refactored to extract all metric
    columns in one pass (single-vendor regex per metric, gracefully
    degrades when a column is absent).
  - aggregate_power() preserved as a thin compat wrapper returning the
    old (power, num_gpus) tuple so external callers don't break.
  - Per-stage prefill_avg_power_w / decode_avg_power_w use weighted
    mean by num_gpus (matches how cluster avg_power_w is computed).
  - Frontend-labeled CSVs still excluded from per-stage energy
    attribution; included in cluster totals.

Tests: 107/107 pass (88 existing baseline preserved, 14 new telemetry
tests, 5 schema-renamed tests updated in place). New coverage: temp /
util / mem extraction across NVIDIA + AMD + srt-slurm CSV schemas,
peak vs avg distinction, missing-column graceful degradation, per-
worker telemetry, per-stage weighted-mean scalars.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror the NVIDIA gb300/srt-slurm measured-power path on the AMD
multi-node disaggregated inference path. With no orchestrator perfmon,
each SGLang/vLLM disagg node starts its own amd-smi monitor via
start_perf_monitor (benchmark_lib.sh), writing
perf_samples_<role>_w<idx>_<host>.csv into the NFS-shared
/benchmark_logs/perfmon mount; launch_mi355x-amds.sh collects them and
exports GPU_METRICS_CSV_GLOB so the existing vendor-agnostic
utils/aggregate_power.py produces per-worker + per-stage power.

AMD perfmon wiring:
- benchmark_lib.sh: start_perf_monitor helper; case-insensitive amd-smi
  header filter; log captured CSV header for schema-mismatch visibility
- amd_utils/job.slurm: PERFMON_OUTPUT_DIR + interval into each container
- amd_utils/server_sglang.sh / server_vllm.sh: per-node role + worker-idx
  classification (matches each engine's own placement); monitor start +
  stop on every exit path
- runners/launch_mi355x-amds.sh: collect per-node CSVs immediately after
  job completion (before result-processing early-exits / EXIT-trap wipe),
  export GPU_METRICS_CSV_GLOB
- utils/aggregate_power.py: docstring documents the AMD source (logic
  already vendor-agnostic)
- utils/test_aggregate_power.py: AMD amd-smi multinode tests (per-worker,
  per-stage J/token, multi-node-per-worker collapse, vLLM topology)
- perf-changelog.yaml: trigger the 6 mi355x disagg sweeps (sglang+vllm)

Also lands the concurrent per-metric telemetry extension in
aggregate_power.py / tests: temp/util/mem aggregation, workers[] schema,
and flat per-stage scalars (prefill_avg_power_w, decode_avg_power_w,
joules_per_input_token, joules_per_output_token_decode).

Verified locally: 107 utils tests pass; bash syntax + shellcheck clean;
role mapping + filename contract + full amd-smi->agg pipeline validated;
adversarial review findings addressed (CSV collection moved ahead of
early exits; case-insensitive amd-smi header).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread utils/aggregate_power.py Fixed
Comment thread utils/aggregate_power.py Outdated
Comment thread utils/aggregate_power.py Outdated
Resolve perf-changelog.yaml append conflict by keeping all three new
entries: main's #1579 (qwen3.5-fp4-mi355x-sglang-disagg) plus this
branch's #1574 re-trigger and the AMD multinode measured-power entry.
Append-only file (process_changelog rejects deletions); no lines removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

- _MEM_EXCLUDE_RE now excludes "clock" and "util" (not just "total"), so
  nvidia-smi's clocks.current.memory (a frequency) and utilization.memory
  (a percent) are no longer mislabeled as avg_mem_used_mb. (cursor[bot] Medium)
- Remove dead _disagg_stage_energies shim — no callers. (cursor[bot] Low)
- Add regression test: mem detection ignores clock/util memory columns.

108 utils tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt enabled auto-merge (squash) May 28, 2026 21:33
Comment thread utils/aggregate_power.py Fixed
@github-actions
Copy link
Copy Markdown
Contributor

Comment thread utils/aggregate_power.py
per_sample_mean = [
total / max(len(per_sample_gpus.get(ts, ())), 1)
for ts, total in totals.items()
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heterogeneous CSV schemas produce wrong cluster-wide metric averages

Medium Severity

When CSVs have heterogeneous schemas (e.g. one node's CSV has a temperature.gpu column but another's doesn't), _avg_per_gpu for optional metrics like temp/util/mem divides by the total GPU count in the bucket (per_sample_gpus) rather than by the number of GPUs that actually reported that metric. The per_sample_gpus set is populated from all sources regardless of which metrics they carry, so the denominator is inflated by GPUs that contributed no data for that metric, silently halving (or worse) the reported cluster-wide avg_temp_c, avg_util_pct, or avg_mem_used_mb.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit dea49cd. Configure here.

Per reviewer: in disagg serving, attribute each token type to only its
stage's GPUs — input tokens to prefill GPUs, output tokens to decode GPUs
(symmetric). joules_per_output_token is now decode_energy / output_tokens
for disagg (was cluster-wide); joules_per_input_token already used prefill
energy / input_tokens. joules_per_total_token stays cluster-wide (overall
efficiency). Single-node / non-disagg / single-stage keep the cluster-wide
output ratio so the field is always populated.

Removes the now-redundant joules_per_output_token_decode key (folded into
joules_per_output_token). Docstring, CLI help, and tests updated; 108 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread utils/aggregate_power.py Fixed
@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

The AMD multinode container runs as root and writes benchmark_logs/. If a
job is cancelled (e.g. concurrency supersede), its cleanup trap never runs,
leaving root-owned dirs. actions/checkout (clean: true) then can't rmdir
them (EACCES) and fails BEFORE the job starts — poison-failing every job
scheduled onto that runner. Add `sudo rm -rf $GITHUB_WORKSPACE/benchmark_logs`
to the shared Slurm-cleanup anchor (runs pre-checkout AND post-run) so a
dirty runner self-heals.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

A full sweep floods slurmctld, so `squeue` intermittently returns
"slurm_load_jobs error: Socket timed out". The old liveness check
(`! squeue ... | grep -q $JOB_ID`) treated that empty/failed output as
"job died" and exit 1'd — a false failure on a healthy job (observed on
dsr1-fp8-mi355x-sglang-disagg conc 1024x2048).

Add job_alive(): a non-zero squeue exit is treated as "still alive" (don't
false-fail on a scheduler blip); only a SUCCESSFUL squeue that omits the
job — re-checked once to avoid a single-sample race — counts as gone. Used
by both the wait-for-log loop and the completion poll.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 488ce46. Configure here.

Comment thread utils/aggregate_power.py
# Per-worker rollup is best-effort: only emitted when CSV filenames carry
# the perfmon role/index encoding. Single-node `gpu_metrics.csv` won't
# parse, so aggregate_power_by_worker returns None and the field is omitted.
workers = aggregate_power_by_worker(paths, start, end)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double CSV I/O: every file read twice unnecessarily

Low Severity

The run() function calls aggregate_metrics(paths, start, end) and then aggregate_power_by_worker(paths, start, end) unconditionally. Both internally call _read_samples() on every CSV path, meaning every perfmon CSV is opened, parsed, and filtered twice. For multinode runs with many large per-node CSVs, this doubles the I/O and CPU cost of the aggregation step. The first call's parsed samples could be reused by the second.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 488ce46. Configure here.

…pot)

The amd-smi monitor only ran `metric -p -c -t -u`, so no VRAM column was
emitted and avg_mem_used_mb never populated on AMD. It also used util/mem
column matchers tuned for NVIDIA/srt-slurm names, which miss amd-smi's
conventions — so avg_util_pct and avg_temp_c silently dropped too.

- benchmark_lib.sh: add `-m` (mem-usage) to the amd-smi command so a
  used_vram column is captured.
- aggregate_power.py column detection:
  - util: also match amd-smi `gfx_activity` (umc/mm_activity excluded).
  - mem: match positively on memory/vram + "used" instead of broad "mem"
    minus a growing exclude list — picks memory.used / mem_used_mb /
    used_vram while rejecting mem_temperature, mem_voltage, total/free_vram,
    the memory clock, and utilization.memory.
  - temp: prefer hotspot/junction over the first temp column, since edge
    temperature reads N/A on data-center AMD parts (MI300/MI355).

NVIDIA and srt-slurm detection is unchanged (verified by existing tests).
Adds AMD-header detection tests; full suite 111 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread utils/aggregate_power.py


# Back-compat shim — some external callers may have imported _parse_power.
_parse_power = _parse_numeric_cell
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant