Skip to content

[MoRI short term patch] add qwen3.5-fp4 MI355X SGLang PD-disaggregation#1579

Merged
functionstackx merged 1 commit into
mainfrom
yichaozhu/sglang-disagg-qwen3.5-fp4
May 28, 2026
Merged

[MoRI short term patch] add qwen3.5-fp4 MI355X SGLang PD-disaggregation#1579
functionstackx merged 1 commit into
mainfrom
yichaozhu/sglang-disagg-qwen3.5-fp4

Conversation

@YukioZzz
Copy link
Copy Markdown
Collaborator

@YukioZzz YukioZzz commented May 28, 2026

Add Qwen3.5-397B-A17B-MXFP4 MI355X SGLang PD-disaggregation

  • Bump image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523, 1P1D TP8/EP1, dp-attn false, conc [8..512]
  • MoRI conn.py overlay (48e459b) via job.slurm; launcher qwen3.5_fp4_mi355x_sglang-disagg.sh

Note

Low Risk
Benchmark and CI configuration only; no application runtime or security-sensitive code paths.

Overview
Adds MI355X prefill–decode disaggregation benchmarks for Qwen3.5-397B-A17B MXFP4 on SGLang, parallel to the existing FP8 disagg entry.

A new qwen3.5-fp4-mi355x-sglang-disagg matrix entry uses lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523, model amd/Qwen3.5-397B-A17B-MXFP4, 1P1D with prefill/decode TP8/EP1, dp-attn false, and fixed-seq sweeps at 1k1k and 8k1k with conc 8–512 (no MTP).

Supporting changes add Qwen3.5-397B-A17B-MXFP4 server flags in benchmarks/multi_node/amd_utils/models.yaml (MoRI transfer, aiter attention, prefill/decode tuning) and launcher qwen3.5_fp4_mi355x_sglang-disagg.sh, which maps CI env vars and calls amd_utils/submit.sh like the FP8 disagg script. perf-changelog.yaml documents the config and MoRI conn.py overlay via job.slurm.

Reviewed by Cursor Bugbot for commit 9e26be6. Bugbot is set up for automated code reviews on this repo. Configure here.

@YukioZzz YukioZzz requested a review from a team May 28, 2026 09:53
@YukioZzz YukioZzz changed the title Yichaozhu/sglang disagg qwen3.5 fp4 [AMD][MI35X]: add qwen3.5-fp4 MI355X SGLang PD-disaggregation May 28, 2026
Comment thread benchmarks/multi_node/amd_utils/patches/mori_conn.py Fixed
Comment thread benchmarks/multi_node/amd_utils/patches/mori_conn.py Fixed
Comment thread benchmarks/multi_node/amd_utils/patches/mori_conn.py Fixed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5845a91. Configure here.

Comment thread benchmarks/multi_node/amd_utils/server.sh Outdated
Comment thread benchmarks/multi_node/amd_utils/setup_deps.sh
Comment thread perf-changelog.yaml
Comment on lines +2310 to +2317

- config-keys:
- qwen3.5-fp4-mi355x-sglang-disagg
description:
- "Add Qwen3.5-397B-A17B-MXFP4 MI355X SGLang PD-disaggregation"
- "Bump image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523, 1P1D TP8/EP1, dp-attn false, conc [8..512]"
- "MoRI conn.py overlay (48e459bd) via job.slurm; launcher qwen3.5_fp4_mi355x_sglang-disagg.sh"
pr-link: XXX
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new perf-changelog entry at the bottom of perf-changelog.yaml has pr-link: XXX (a literal placeholder; every other entry uses a real https://github.com/SemiAnalysisAI/InferenceX/pull/N URL), and its config-keys lists only qwen3.5-fp4-mi355x-sglang-disagg even though the same PR also adds two new top-level keys to amd-master.yaml: qwen3.5-fp8-mi355x-sglang-disagg and glm5-fp8-mi355x-sglang-disagg. Please replace XXX with this PR's URL and either add the two missing keys to config-keys (with their own description bullets) or split them into their own changelog entries.

Extended reasoning...

What's wrong

The new changelog entry added at the bottom of perf-changelog.yaml (lines 2310–2317) has two issues:

  1. Placeholder pr-link: XXX. Every other entry in this file uses a real GitHub PR URL (e.g. pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1300 on line 2309, and similarly at lines 2285, 2293, 2301). XXX is unique to this entry — grep confirms it appears on exactly one line in the file. It's clearly a stale placeholder the author forgot to fill in before pushing.

  2. config-keys is missing two of the three new configs. The amd-master.yaml diff in this same PR adds three new top-level keys:

    • qwen3.5-fp4-mi355x-sglang-disagg (line 351 of the new file)
    • qwen3.5-fp8-mi355x-sglang-disagg (line 384)
    • glm5-fp8-mi355x-sglang-disagg (line 537)

    Plus the corresponding models.yaml entries (Qwen3.5-397B-A17B-MXFP4, Qwen3.5-397B-A17B-FP8, GLM-5-FP8) and launcher scripts (qwen3.5_fp4_…, qwen3.5_fp8_…, glm5_fp8_…). All three are clearly intended to ship in this PR. The changelog block only lists the FP4 one under config-keys and the description bullets only mention the FP4 launcher.

Why it matters

pr-link: XXX is not a valid URL — any downstream tooling that parses pr-link as a hyperlink (changelog renderers, release-note generators, dashboards that link back to the PR) will at best render a broken link and at worst error out. The pydantic ChangelogEntry schema in utils/matrix_logic/validation.py only types pr-link as str, so it won't fail schema validation — making this exactly the kind of issue that slips past CI and surfaces in published artifacts.

The missing config-keys entries mean the FP8 Qwen3.5 and GLM-5 disagg configurations are introduced silently: anyone looking at the changelog to understand what changed for those configs will see nothing, even though the PR is clearly shipping them (full amd-master.yaml scenarios, models.yaml model entries with model-specific flags and tuning, dedicated launcher scripts, and GLM-5-specific env vars in env.sh).

Step-by-step verification

  1. grep -n 'pr-link: XXX' perf-changelog.yaml → matches exactly one line (2317).
  2. grep -n 'pr-link:' perf-changelog.yaml | head → every other entry has a real https://github.com/... URL.
  3. In the amd-master.yaml hunk of this PR's diff, count the new top-level keys ending in -sglang-disagg added by this PR (not present before): three keys, as listed above.
  4. The new entry's config-keys: list contains exactly one item — qwen3.5-fp4-mi355x-sglang-disagg. The other two are absent.

How to fix

Two one-line edits in perf-changelog.yaml:

- config-keys:
    - qwen3.5-fp4-mi355x-sglang-disagg
    - qwen3.5-fp8-mi355x-sglang-disagg
    - glm5-fp8-mi355x-sglang-disagg
  description:
    - "Add Qwen3.5-397B-A17B-MXFP4 MI355X SGLang PD-disaggregation"
    - "Add Qwen3.5-397B-A17B-FP8 MI355X SGLang PD-disaggregation (dp-attn=false; see amd-master.yaml comment for the upstream MoRI/is_deepep_class_backend issue)"
    - "Add GLM-5-FP8 MI355X SGLang PD-disaggregation (NSA attention; sets SGLANG_ROCM_FUSED_DECODE_MLA=0 in env.sh, installs transformers glm_moe_dsa via setup_deps.sh)"
    - "Bump image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523, 1P1D TP8/EP1, conc [8..512]"
    - "MoRI conn.py overlay (48e459bd) via job.slurm; launchers qwen3.5_fp4_/qwen3.5_fp8_/glm5_fp8_mi355x_sglang-disagg.sh"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1579

Or alternatively, split into three separate changelog entries (one per new config key) if you prefer to keep each entry narrowly scoped — that's also consistent with how other multi-config PRs in this file are structured.

Severity

Documentation/changelog hygiene only — no runtime impact, no risk of breaking benchmarks. But pr-link: XXX is a known-bad placeholder that should always be replaced before merging, and the omitted config-keys cause real misattribution in any automation that consumes this file. Easy pre-merge fix.

Comment on lines +71 to +77
$ISL $OSL "${CONC_LIST// /x}" inf \
${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \
${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \
${PREFILL_TP} ${DECODE_TP} \
${RANDOM_RANGE_RATIO})

if [[ $? -ne 0 ]]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The glm5-fp8 disagg launcher (added in this PR) invokes submit.sh with only 15 positional args, omitting the 16th ${NODE_LIST:-} arg that the two sibling qwen3.5 disagg launchers in the same PR pass. Since submit.sh reads NODE_LIST=${16} to populate --nodelist for sbatch, any caller exporting NODE_LIST to pin glm5 disagg jobs to specific hostnames will be silently ignored — only for glm5. Append ${NODE_LIST:-} to match the qwen launchers (one-line fix).

Extended reasoning...

What the bug is

This PR adds three sibling launcher scripts for the new sglang-disagg configs:

  • benchmarks/multi_node/qwen3.5_fp4_mi355x_sglang-disagg.sh
  • benchmarks/multi_node/qwen3.5_fp8_mi355x_sglang-disagg.sh
  • benchmarks/multi_node/glm5_fp8_mi355x_sglang-disagg.sh

All three call amd_utils/submit.sh with the same positional-argument schema. The two qwen launchers pass 16 positional args, terminating with ${RANDOM_RANGE_RATIO} \\\n ${NODE_LIST:-}. The glm5 launcher stops at ${RANDOM_RANGE_RATIO} — only 15 positional args.

Code path that triggers it

benchmarks/multi_node/amd_utils/submit.sh line 76 reads:

NODE_LIST=${16}

This is a positional-arg read, not an env-var read. Bash does not auto-fill positional args from the environment, so when the glm5 launcher omits the 16th arg, ${16} expands to the empty string regardless of whether the caller exported NODE_LIST=... before invoking the launcher.

Further down (around submit.sh line 130–145), NODE_LIST is parsed as comma-separated hostnames and used to build an sbatch --nodelist <csv> flag. So with the glm5 launcher, --nodelist is never set; SLURM picks any matching idle nodes.

Why existing code doesn't prevent it

The schema is purely positional (no getopts/named args), so there is no validation that the launcher and submit.sh agree on the trailing args. The two qwen scripts and the glm5 script are otherwise identical templates, so the divergence is almost certainly a copy-paste omission rather than intentional.

Impact

Asymmetric, surprising behavior between three sibling launchers added in the same PR:

  • NODE_LIST=node07,node08 bash qwen3.5_fp8_mi355x_sglang-disagg.sh → sbatch is invoked with --nodelist node07,node08 (pinned).
  • NODE_LIST=node07,node08 bash glm5_fp8_mi355x_sglang-disagg.sh → sbatch is invoked without --nodelist (any nodes; the env var is silently dropped).

Today nothing in the in-tree workflow plumbing sets NODE_LIST, so the feature is dormant for all three. But the moment a caller (CI step, runner script, or a human running locally) tries to pin GLM-5 disagg jobs using the documented escape hatch that already works for qwen, they'll get a silent failure with no error message — the job will land on whatever nodes SLURM picks, which on a shared cluster may be the wrong hardware class entirely.

Fix

One-line change in benchmarks/multi_node/glm5_fp8_mi355x_sglang-disagg.sh — append ${NODE_LIST:-} to the submit.sh call to match the two qwen siblings:

JOB_ID=$(bash ./submit.sh $PREFILL_NODES \
    $PREFILL_NUM_WORKERS \
    $DECODE_NODES \
    $DECODE_NUM_WORKERS \
    $ISL $OSL "${CONC_LIST// /x}" inf \
    ${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \
    ${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \
    ${PREFILL_TP} ${DECODE_TP} \
    ${RANDOM_RANGE_RATIO} \
    ${NODE_LIST:-})

Step-by-step proof

  1. User exports NODE_LIST=smci355-ccs-aus-07,smci355-ccs-aus-08 and runs bash benchmarks/multi_node/glm5_fp8_mi355x_sglang-disagg.sh.
  2. The launcher executes bash ./submit.sh $1 ... ${RANDOM_RANGE_RATIO} — 15 positional args. The exported NODE_LIST is in the child shell's env but not in $@.
  3. Inside submit.sh, NODE_LIST=${16} runs. $16 is unset (only 15 args were passed), so NODE_LIST is set to the empty string. The shell-level assignment masks any env-inherited NODE_LIST because NODE_LIST=${16} is a plain assignment (not NODE_LIST=${16:-$NODE_LIST}).
  4. The if [[ -n "${NODE_LIST//[[:space:]]/}" ]] guard around line 130 evaluates false (empty after stripping whitespace), so NODELIST_OPT stays empty.
  5. sbatch is invoked without --nodelist; SLURM picks any matching idle nodes from the partition.

Same five-step recipe with qwen3.5_fp8_mi355x_sglang-disagg.sh substituted at step 1: step 2 passes 16 args, step 3 sets NODE_LIST=smci355-ccs-aus-07,smci355-ccs-aus-08, step 4 builds --nodelist smci355-ccs-aus-07,smci355-ccs-aus-08, and step 5 pins sbatch to those exact nodes. The only difference between the two paths is the missing 16th argument in the glm5 launcher.

Comment on lines +211 to +222
PREFILL_DECODE_DIFFERENT_TP=""
if [[ "$PREFILL_ENABLE_DP" != "$DECODE_ENABLE_DP" ]]; then
if is_truthy "$ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS"; then
if [[ "$DECODE_ENABLE_DP" == "true" ]]; then
PREFILL_DECODE_DIFFERENT_TP="--disaggregation-decode-tp ${DECODE_TP_SIZE} --disaggregation-decode-dp ${DECODE_TP_SIZE}"
else
PREFILL_DECODE_DIFFERENT_TP="--disaggregation-decode-tp ${DECODE_TP_SIZE} --disaggregation-decode-dp 1"
fi
else
echo "[INFO] Skipping --disaggregation-decode-tp/--disaggregation-decode-dp; set ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS=true to enable them."
fi
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 PREFILL_DECODE_DIFFERENT_TP is populated at server.sh:211-217 but never appended to PREFILL_MODE_FLAGS, PREFILL_SERVER_CONFIG, or the PREFILL_CMD heredoc, so the new ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS opt-in plumbed through submit.sh → job.slurm → server.sh is dead code: --disaggregation-decode-tp / --disaggregation-decode-dp never reach sglang.launch_server. The fix is to append $PREFILL_DECODE_DIFFERENT_TP to the prefill flags (e.g. into PREFILL_MODE_FLAGS). The configs added in this PR all use matching prefill/decode dp-attn so the outer guard doesn't trigger today, but the documented feature is non-functional as-merged.

Extended reasoning...

What the bug is. The new block at benchmarks/multi_node/amd_utils/server.sh:208-222 introduces ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS, an opt-in switch threaded all the way from submit.sh (line 101) → job.slurm (lines 92, 306, 436) → server.sh (line 27). When the user opts in and PREFILL_ENABLE_DP != DECODE_ENABLE_DP, the block computes a flag string into PREFILL_DECODE_DIFFERENT_TP:

PREFILL_DECODE_DIFFERENT_TP="--disaggregation-decode-tp ${DECODE_TP_SIZE} --disaggregation-decode-dp ${DECODE_TP_SIZE}"
# or
PREFILL_DECODE_DIFFERENT_TP="--disaggregation-decode-tp ${DECODE_TP_SIZE} --disaggregation-decode-dp 1"

But that variable is never read again. PREFILL_MODE_FLAGS (line 227), build_server_config (lines ~336-405), PREFILL_SERVER_CONFIG (line ~409), and the PREFILL_CMD heredoc that actually invokes sglang.launch_server (lines ~474, ~620) all proceed without it. Grepping the entire repo confirms only three references to PREFILL_DECODE_DIFFERENT_TP exist (the init and the two assignments inside the if/else), and --disaggregation-decode-tp / --disaggregation-decode-dp appear nowhere else in the repo.

Why existing code doesn't prevent it. The [INFO] Skipping ... log in the else branch is suppressed when the user opts in, which makes the bug worse: the user is told implicitly that the flags were enabled, but they were silently dropped before reaching sglang.

Step-by-step proof. Suppose a user picks a heterogeneous P/D config and sets ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS=true, PREFILL_ENABLE_DP=false, DECODE_ENABLE_DP=true, DECODE_TP_SIZE=8:

  1. submit.sh:101 exports ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS=true into the env.
  2. job.slurm:92 reads it, :306 re-exports it, and :436 injects it into the docker -e block.
  3. Inside the container, server.sh:27 re-reads it; the block at L208 triggers because false != true; is_truthy returns 0; the DECODE_ENABLE_DP=true branch fires, so PREFILL_DECODE_DIFFERENT_TP="--disaggregation-decode-tp 8 --disaggregation-decode-dp 8".
  4. Execution falls through to L227 where PREFILL_MODE_FLAGS is constructed without $PREFILL_DECODE_DIFFERENT_TP.
  5. build_server_config "prefill" ... (called at the line just below DECODE_SERVER_CONFIG=...) concatenates parallel_args + MODEL_BASE_FLAGS + MODEL_DP_FLAGS + PREFILL_MODE_FLAGS — none of which contain the variable.
  6. PREFILL_CMD="... python3 -m sglang.launch_server ... ${PREFILL_SERVER_CONFIG}" launches sglang without the disaggregation-decode flags.

Impact. The opt-in feature is silently a no-op. The configs added by this PR (qwen3.5-fp4/fp8 + glm5-fp8 sglang-disagg) all use dp-attn=false on both prefill and decode, so the outer if [[ "$PREFILL_ENABLE_DP" != "$DECODE_ENABLE_DP" ]] guard never fires for the in-tree YAMLs, meaning this PR's sweeps are unaffected. However, the entire new plumbing was added explicitly to enable heterogeneous P/D configurations — and every future user who flips the switch will silently get a mis-launched sglang. Worse, the user will not see the [INFO] Skipping ... log that would otherwise tip them off, so they will believe the flags took effect.

How to fix. Append $PREFILL_DECODE_DIFFERENT_TP to the prefill flag bundle. Minimal patch:

PREFILL_MODE_FLAGS="--mem-fraction-static ${PREFILL_MEM_FRACTION_STATIC} ... --cuda-graph-bs ${prefill_cuda_graph_bs[*]} ${PREFILL_DECODE_DIFFERENT_TP} "

Or thread it through build_server_config and the specific_config branch for prefill. Either way, the variable needs to actually reach PREFILL_CMD.

@YukioZzz YukioZzz force-pushed the yichaozhu/sglang-disagg-qwen3.5-fp4 branch from 5845a91 to 5053106 Compare May 28, 2026 10:27
@github-actions
Copy link
Copy Markdown
Contributor

Add qwen3.5-fp4-mi355x-sglang-disagg recipe; MXFP4 model; MoRI overlay.
@functionstackx functionstackx force-pushed the yichaozhu/sglang-disagg-qwen3.5-fp4 branch from 5053106 to 9e26be6 Compare May 28, 2026 16:35
@functionstackx functionstackx changed the title [AMD][MI35X]: add qwen3.5-fp4 MI355X SGLang PD-disaggregation [MoRI short term patch] add qwen3.5-fp4 MI355X SGLang PD-disaggregation May 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx functionstackx added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed sweep-enabled labels May 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx
Copy link
Copy Markdown
Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 34528e3 into main May 28, 2026
65 checks passed
@functionstackx functionstackx deleted the yichaozhu/sglang-disagg-qwen3.5-fp4 branch May 28, 2026 18:55
arygupt added a commit that referenced this pull request May 28, 2026
Resolve perf-changelog.yaml append conflict by keeping all three new
entries: main's #1579 (qwen3.5-fp4-mi355x-sglang-disagg) plus this
branch's #1574 re-trigger and the AMD multinode measured-power entry.
Append-only file (process_changelog rejects deletions); no lines removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim)

Projects

Development

Successfully merging this pull request may close these issues.

2 participants