[NV] Update H100 Qwen3.5 SGLang agg config #1544

Open
anish-shanbhag wants to merge 3 commits into main from codex/qwen35-h100-sglang-pareto-upstream
@anish-shanbhag
Collaborator

@anish-shanbhag anish-shanbhag commented May 21, 2026

Updates Qwen3.5-397B-A17B-FP8 H100 SGLang agg recipes with tuned configs for 1k/1k and 8k/1k.


Note

Low Risk
Benchmark and CI config tuning only; no production runtime or auth/data paths.

Overview
Retunes the Qwen3.5-397B-A17B-FP8 aggregate SGLang recipe on H100 for 1k/1k and 8k/1k fixed-seq-len sweeps.

The NVIDIA master config now runs two parallelism layouts instead of a single TP8/EP8 band capped at conc 32: TP8/EP1 for conc 1–8, and TP8/EP8 for conc 16–256. The launcher script matches that split (the expert-parallel flag is passed only when EP > 1) and maps scheduler-recv-interval per concurrency tier.

Server flags move away from the prior H100 “tight memory” defaults: higher max-running-requests, chunked-prefill-size, and mem-fraction-static, plus enable-symm-mem and concurrency-scaled cuda-graph-max-bs. perf-changelog records the recipe change.
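The conditional expert-parallel behaviour described above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual script: the helper name `build_parallel_args` and its positional parameters are hypothetical, and `--tp-size` / `--ep-size` are assumed to be the SGLang server flags the launcher uses.

```shell
# Hypothetical helper (not taken from the PR): print the parallelism flags
# for a given tensor-parallel size and expert-parallel size.
# Assumption: the launcher passes --tp-size / --ep-size to the SGLang server.
build_parallel_args() (
    tp="$1"
    ep_size="$2"
    args="--tp-size $tp"
    # Expert parallelism is only requested when EP > 1, so the
    # TP8/EP1 layout never receives an --ep-size flag at all.
    if [ "$ep_size" -gt 1 ]; then
        args="$args --ep-size $ep_size"
    fi
    printf '%s\n' "$args"
)
```

Under these assumptions, `build_parallel_args 8 1` yields only the tensor-parallel flag, while `build_parallel_args 8 8` appends the expert-parallel flag as well.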

Reviewed by Cursor Bugbot for commit 213a1d2. Bugbot is set up for automated code reviews on this repo.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@anish-shanbhag anish-shanbhag changed the title from "Tune H100 Qwen SGLang Pareto recipe" to "[NV] Update H100 Qwen3.5 SGLang agg config" on May 21, 2026
@anish-shanbhag anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch 2 times, most recently from 21edeff to 394e886, on May 21, 2026 00:35
@anish-shanbhag anish-shanbhag marked this pull request as ready for review May 21, 2026 00:38
@anish-shanbhag anish-shanbhag requested a review from a team May 21, 2026 00:38
@claude
Contributor

claude Bot commented May 21, 2026

Claude finished @anish-shanbhag's task in 3m 54s.


Review of PR #1544

  • Read changed files and gather context
  • Validate master config and perf-changelog
  • Review benchmark script changes
  • Post review summary

LGTM - no blocking issues found

The PR cleanly restructures the H100 Qwen3.5 SGLang aggregate recipe to sweep across three parallelism tiers (TP8/EP1 for low concurrency, TP8/EP8 for mid, TP8/EP8+DP-attention for high). Verified:

  • nvidia-master.yaml modified and perf-changelog.yaml updated with entry appended at the end ✅
  • EP is conditionally enabled via EP_SIZE > 1 (correct SGLang pattern) ✅
  • DP-attention handling (--dp-size "$TP" --enable-dp-attention) is consistent with other scripts (e.g., dsv4_fp4_b300_sglang_mtp.sh) ✅
  • Server launch command properly formatted with line continuations ✅
  • Container image lmsysorg/sglang:v0.5.12-cu130 is publicly accessible on Docker Hub ✅
  • case statement for scheduler-recv-interval covers all concurrency values reachable from the config search-space, with a safe * fallback that exits on unexpected values ✅
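The `case` pattern with a failing `*` fallback mentioned in the last item might look like the sketch below. The function name `recv_interval_for_conc` and the interval values are placeholders chosen for illustration only; the tuned values and the exact tier boundaries live in the PR's script and are not reproduced here.

```shell
# Hypothetical sketch: map a concurrency level to a scheduler-recv-interval.
# The interval values (1 and 2) are placeholders, NOT the PR's tuned numbers.
recv_interval_for_conc() (
    conc="$1"
    case "$conc" in
        1|2|4|8)
            echo 1
            ;;
        16|32|64)
            echo 2
            ;;
        *)
            # Fail loudly rather than silently falling back to a default.
            echo "unsupported concurrency: $conc" >&2
            exit 1
            ;;
    esac
)
```

Because the fallback exits non-zero, a concurrency value added to the search-space without a matching `case` arm would stop the run instead of benchmarking with an untuned setting.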


@claude claude Bot left a comment

LGTM, straightforward benchmark recipe tuning.

Extended reasoning...

Overview

This PR tunes the Qwen3.5-397B-A17B-FP8 H100 SGLang aggregate recipe for both 1k/1k and 8k/1k sweeps. It replaces a single search-space row with three tiered rows (TP8/EP1 conc 1-8, TP8/EP8 conc 16-64, TP8/EP8 + DP-attention conc 128-256), updates the benchmark shell script to thread DP_ATTENTION/EP_SIZE through optional CLI args, adds per-concurrency --scheduler-recv-interval tuning, and bumps max-running-requests/chunked-prefill-size/mem-fraction-static back up from the conservative H200-derived defaults. A changelog entry is included.
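As a rough illustration of threading DP-attention through optional CLI args, the sketch below emits the `--dp-size "$TP" --enable-dp-attention` pair only when DP-attention is requested. The helper name `build_dp_attention_args` is hypothetical, and representing the toggle as the string `true` or `false` is an assumption about how the `DP_ATTENTION` variable is set.

```shell
# Hypothetical helper (not taken from the PR): print the DP-attention flags
# when enabled, and print nothing otherwise.
# Assumption: the toggle arrives as the literal string "true" or "false".
build_dp_attention_args() (
    tp="$1"
    dp_attention="${2:-false}"
    if [ "$dp_attention" = "true" ]; then
        # The DP size mirrors the TP size, as in --dp-size "$TP".
        printf '%s\n' "--dp-size $tp --enable-dp-attention"
    fi
)
```

A launcher following this shape would append the output to the server command line, so the low- and mid-concurrency tiers run without any DP-attention flags.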

Security risks

None. This is a benchmark configuration and shell script change in a non-production, single-node benchmark recipe. No auth, crypto, networking, or input-handling surface area is touched.

Level of scrutiny

Low. This is benchmark recipe tuning isolated to one model/hardware combo. The script changes follow the established DP_ATTENTION env-var pattern used by other recipes (e.g. dsv4_fp4_b200_vllm.sh), the dp-attn search-space key is already supported by utils/matrix_logic/generate_sweep_configs.py, and the YAML edit is local to the qwen3.5-fp8-h100-sglang block.

Other factors

The new case "$CONC" covers every concurrency value the search-space will generate for the non-DP branches (1, 2, 4, 8, 16, 32, 64) and explicitly errors on unsupported values, so silent misconfiguration is unlikely. No bugs were flagged by the bug hunting system, and the prior commit 394e886 on main is on this same tuning track.

@anish-shanbhag anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch from 394e886 to 4587b6e on May 29, 2026 00:42
@anish-shanbhag
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals

@github-actions
Contributor

@anish-shanbhag Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26614204211
Command: test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals
Pinned ref: 213a1d2
Approval: not required (trusted collaborator).
