-
Notifications
You must be signed in to change notification settings - Fork 178
[NV] B300 (Agg): migrate model path #1539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
6377d35
629c2ac
b46504a
5d2e1d4
bb1a4ae
c036986
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,6 +18,18 @@ check_env_vars \ | |
| DP_ATTENTION \ | ||
| EP_SIZE | ||
|
|
||
| # `hf download` creates the target dir if missing and is itself idempotent. | ||
| # When MODEL_PATH is unset (stand-alone runs), fall back to the HF_HUB_CACHE | ||
| # Either way, MODEL_PATH is what the server is launched with. | ||
| if [[ -n "${MODEL_PATH:-}" ]]; then | ||
| if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then | ||
| hf download "$MODEL" --local-dir "$MODEL_PATH" | ||
| fi | ||
| else | ||
| hf download "$MODEL" | ||
| export MODEL_PATH="$MODEL" | ||
| fi | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
@@ -47,10 +59,6 @@ sanitize_slurm_mpi_env_for_trtllm | |
| export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}" | ||
| echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE" | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then | ||
| hf download "$MODEL" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| SERVER_LOG="$PWD/server.log" | ||
|
|
@@ -108,7 +116,7 @@ start_gpu_monitor --output "$PWD/gpu_metrics.csv" | |
|
|
||
| set -x | ||
| SERVE_CMD=( | ||
| trtllm-serve "$MODEL" \ | ||
| trtllm-serve "$MODEL_PATH" \ | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TRT scripts missing
|
||
| --host 0.0.0.0 \ | ||
| --port "$PORT" \ | ||
| --trust_remote_code \ | ||
|
|
||


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 The reduced search-space for
dsr1-fp8-b300-sglangis a single point attp=4, isl=1024, osl=1024, butbenchmarks/single_node/dsr1_fp8_b300.shlines 47-51 contain a guard that doesexit 1whenTP=4is combined with anything other thanISL=8192/OSL=1024. Every run of this config will exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point totp: 8, switching toisl: 8192, or relaxing the script guard.Extended reasoning...
What the bug is
The PR shrinks the
dsr1-fp8-b300-sglangsweep in.github/configs/nvidia-master.yamlto one point:But the bench script the launcher resolves for this config,
benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:With
TP=4andISL=1024,$ISL -ne 8192short-circuits true, the script prints the rejection message and exits 1 beforesglang.launch_serveris ever invoked.Code path
runners/launch_b300-nv.sh(agg branch) resolves the bench script asbenchmarks/single_node/dsr1_fp8_b300_sglang.shfirst, then falls back todsr1_fp8_b300.sh(LEGACY_FW_SUFFIXis empty for sglang). Onlydsr1_fp8_b300.shexists, so that is what runs..github/workflows/benchmark-tmpl.ymlexportsTPfrom the matrix entry, soTP=4is passed into the script.Why pre-existing code doesn't prevent it
Before this PR the only
ISL=1024/OSL=1024entry underdsr1-fp8-b300-sglangwastp: 8, which is handled by the TP=8 branch above the guard.TP=4only appeared underISL=8192— the one combination the guard explicitly permits. The PR's reduction removed the safeTP=8point and replaced it with the exact(TP=4, ISL=1024)combination the guard rejects.Impact
The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for
dsr1-fp8-b300-sglangwill exit 1 before launching the server, the model-path wiring for this config will not be verified at all — the script bails before reaching theMODEL=…path read,nvidia-smi, orsglang.launch_server --model-path $MODEL.Step-by-step proof
tp=4, ep=1, conc=4, isl=1024, osl=1024.benchmark-tmpl.ymlexportsTP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1into the container.launch_b300-nv.sh(agg branch) selectsbenchmarks/single_node/dsr1_fp8_b300.shand execs it.check_env_vars(passes), prints SLURM banner, then runsnvidia-smi.$TP -eq 8is false;$TP -eq 4is true.$ISL -ne 8192→ true (ISL is 1024). Short-circuit OR makes the outer test true.TP=4 not yet supported for ISL=1024 OSL=1024!andexit 1.sglang.launch_serveris never invoked; the universal MODEL rewrite the PR adds is never exercised.Fix
Three equivalent options:
tp: 8(matches the previous ISL=1024 sweep).islto8192(the one ISL the TP=4 branch permits).dsr1_fp8_b300.shso TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).