[NV] Update H100 Qwen3.5 SGLang agg config #1544
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
21edeff to 394e886
|
Claude finished @anish-shanbhag's task in 3m 54s
Review of PR #1544
LGTM: no blocking issues found. The PR cleanly restructures the H100 Qwen3.5 SGLang aggregate recipe to sweep across three parallelism tiers (TP8/EP1 for low concurrency, TP8/EP8 for mid, TP8/EP8+DP-attention for high). Verified:
|
LGTM, straightforward benchmark recipe tuning.
Extended reasoning...
Overview
This PR tunes the Qwen3.5-397B-A17B-FP8 H100 SGLang aggregate recipe for both 1k/1k and 8k/1k sweeps. It replaces a single search-space row with three tiered rows (TP8/EP1 conc 1-8, TP8/EP8 conc 16-64, TP8/EP8 + DP-attention conc 128-256), updates the benchmark shell script to thread DP_ATTENTION/EP_SIZE through optional CLI args, adds per-concurrency --scheduler-recv-interval tuning, and bumps max-running-requests/chunked-prefill-size/mem-fraction-static back up from the conservative H200-derived defaults. A changelog entry is included.
Security risks
None. This is a benchmark configuration and shell script change in a non-production, single-node benchmark recipe. No auth, crypto, networking, or input-handling surface area is touched.
Level of scrutiny
Low. This is benchmark recipe tuning isolated to one model/hardware combo. The script changes follow the established DP_ATTENTION env-var pattern used by other recipes (e.g. dsv4_fp4_b200_vllm.sh), the dp-attn search-space key is already supported by utils/matrix_logic/generate_sweep_configs.py, and the YAML edit is local to the qwen3.5-fp8-h100-sglang block.
Other factors
The new case "$CONC" covers every concurrency value the search-space will generate for the non-DP branches (1, 2, 4, 8, 16, 32, 64) and explicitly errors on unsupported values, so silent misconfiguration is unlikely. No bugs were flagged by the bug hunting system, and the prior commit 394e886 on main is on this same tuning track.
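For readers without the diff open, the `case "$CONC"` pattern described above might look roughly like the sketch below. The function name `recv_interval_for_conc` is hypothetical and the interval numbers are placeholders, not the tuned values from this PR; only the `--scheduler-recv-interval` flag name and the concurrency values 1 through 64 come from the review text.

```shell
# Sketch only: the interval numbers are placeholders, not the PR's tuned values.
# recv_interval_for_conc is a hypothetical helper name used for illustration.
recv_interval_for_conc() {
  case "$1" in
    1|2|4|8)   echo 1 ;;   # placeholder value for the low-concurrency tier
    16|32|64)  echo 2 ;;   # placeholder value for the mid-concurrency tier
    *)
      # Fail loudly instead of silently falling back to a default.
      echo "Unsupported CONC for non-DP branch: $1" >&2
      return 1
      ;;
  esac
}

CONC="${CONC:-4}"
if interval="$(recv_interval_for_conc "$CONC")"; then
  echo "--scheduler-recv-interval $interval"
fi
```

The explicit `*)` arm is what makes misconfiguration visible: a concurrency value that no tier covers produces an error on stderr and a non-zero status rather than an untuned launch.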
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26198417664 |
394e886 to 4587b6e
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26610973668 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296 |
|
/sweep test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals |
|
@anish-shanbhag Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26614204211 |
|
Updates Qwen3.5-397B-A17B-FP8 H100 SGLang agg recipes with tuned configs for 1k/1k and 8k/1k.
Note
Low Risk
Benchmark and CI config tuning only; no production runtime or auth/data paths.
Overview
Retunes the Qwen3.5-397B-A17B-FP8 aggregate SGLang recipe on H100 for 1k/1k and 8k/1k fixed-seq-len sweeps.
The NVIDIA master config now runs two parallel layouts instead of a single TP8/EP8 band capped at conc 32: TP8/EP1 for conc 1–8, and TP8/EP8 for conc 16–256. The launcher script matches that split—expert parallel is only passed when EP > 1—and maps scheduler-recv-interval per concurrency tier.
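A minimal sketch of the "only pass expert parallel when EP > 1" behavior follows. The helper name `optional_parallel_args` is hypothetical. `EP_SIZE` and `DP_ATTENTION` are the variable names mentioned in the review comment, and the flag spellings `--ep-size` and `--enable-dp-attention` are assumed here and should be checked against SGLang's server arguments rather than taken from this sketch.

```shell
# Hypothetical helper: print the optional parallelism flags, one token per line.
# Flag spellings are assumptions; verify them against SGLang's server arguments.
optional_parallel_args() {
  ep_size="${1:-1}"            # expert-parallel size; 1 means "do not pass the flag"
  dp_attention="${2:-false}"   # "true" enables the DP-attention flag

  if [ "$ep_size" -gt 1 ]; then
    printf '%s\n' "--ep-size" "$ep_size"
  fi
  if [ "$dp_attention" = "true" ]; then
    printf '%s\n' "--enable-dp-attention"
  fi
}
```

Called as `optional_parallel_args "${EP_SIZE:-1}" "${DP_ATTENTION:-false}"`, it prints nothing for the TP8/EP1 layout, so the launcher's base command stays unchanged at low concurrency.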
Server flags move away from the prior H100 “tight memory” defaults: higher max-running-requests, chunked-prefill-size, and mem-fraction-static, plus enable-symm-mem and concurrency-scaled cuda-graph-max-bs. perf-changelog records the recipe change.
Reviewed by Cursor Bugbot for commit 213a1d2. Bugbot is set up for automated code reviews on this repo.