Bump Ray 2.55.1 / CUDA 12.9.1 / torch 2.10 / vllm 0.18; HAProxy ingress#1895
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/claude review
/ok to test 9e5b992
Code looks good — dependency bumps are clean, HAProxy gating logic is solid (the socat check is a nice catch), and the runtime-dir anchor under Ray's session dir makes sense for the benchmarking runner's session_latest pickup.
One comment on test coverage for the new _merge_package_runtime_env dict-form merge paths.
Greptile Summary
This PR bumps the core stack (CUDA 12.9.1, torch 2.10, Ray 2.55.1, vLLM capped at <0.19) and adds HAProxy ingress for Ray Serve.

Confidence Score: 4/5
Safe to merge with attention to the config merge gap; all other changes are well-structured and tested. One P1 finding: the config key in merge_runtime_envs is shallow-merged, which silently drops the critical 1800s setup timeout whenever a user includes any config dict in their runtime_env. Everything else — HAProxy integration, runtime_dir anchoring, version bumps, vLLM API adaptation — is clean and well-tested.

Important Files Changed
nemo_curator/core/serve/base.py — the config merge gap affects the new 1800s flash-attn timeout.
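To make the P1 finding concrete, here is a minimal sketch (illustrative names only, not the actual merge_runtime_envs implementation) of how a plain {**base, **override} merge drops the nested default, and what a one-level-deeper merge of the config key would preserve:

```python
def shallow_merge(base: dict, override: dict) -> dict:
    # What the current merge effectively does for the "config" key.
    return {**base, **override}


def merge_with_deep_config(base: dict, override: dict) -> dict:
    # Hypothetical variant: recurse one level into "config" so nested keys
    # supplied by the backend default (e.g. setup_timeout_seconds) survive.
    merged = shallow_merge(base, override)
    if isinstance(base.get("config"), dict) and isinstance(override.get("config"), dict):
        merged["config"] = {**base["config"], **override["config"]}
    return merged


base = {"config": {"setup_timeout_seconds": 1800}}  # backend default
override = {"config": {"eager_install": False}}     # any user-supplied config dict

print(shallow_merge(base, override)["config"])
# {'eager_install': False}  -> the 1800 s setup timeout is silently dropped

print(merge_with_deep_config(base, override)["config"])
# {'setup_timeout_seconds': 1800, 'eager_install': False}
```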
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["init_cluster()"] --> B{haproxy + socat on PATH?}
B -- yes --> C[Set RAY_SERVE_ENABLE_HA_PROXY=1
RAY_SERVE_HAPROXY_METRICS_PORT=free_port]
B -- no --> D[Log debug: fallback to Python proxy]
C --> E[Popen ray start]
D --> E
E --> F["DynamoBackend.start()"]
F --> G["ray.init(namespace=nemo_curator_dynamo)"]
G --> H[Anchor runtime_dir under Ray session dir]
H --> I[_deploy_and_healthcheck]
I -- success --> J[Backend running]
I -- failure --> K[_teardown_actors_and_pgs
runtime_dir retained for logs]
L["merge_runtime_envs(base, override)"] --> M["{**base, **override}"]
M --> N[deep-merge env_vars]
M --> O[deep-merge pip/uv via _merge_package_runtime_env]
M --> P["config: shallow only ⚠️"]
/ok to test 972245d
/ok to test 1133e3d
/ok to test 78c883d
/claude review
/ok to test 632f341
/ok to test bb58caf
/ok to test 7dc9341
| "torchaudio>=2.8.0", # Override whisperx's torchaudio~=2.8.0 | ||
| "torchvision>=0.23.0", # Match torch>=2.8.0 | ||
| "torchcodec>=0.9.0; platform_machine == 'x86_64' and platform_system != 'Darwin'", # Must match torch 2.9.x; override pyannote-audio's >=0.7.0 floor; gated since aarch64 lacks wheels | ||
| "torch==2.10.0", # Override whisperx's <2.9 cap to match cu129 / vllm 0.18.x |
I guess going forward the override here always has to strictly match what vllm supports?
Yup that's my understanding, unless we start doing sub venvs
/ok to test d8535ff
- docker/Dockerfile: install haproxy + socat (via install_haproxy.sh) and set RAY_SERVE_HAPROXY_BINARY_PATH so Ray Serve can find the binary when RAY_SERVE_ENABLE_HA_PROXY=1.
- core/utils.py: opportunistically opt into HAProxy ingress in init_cluster when both haproxy and socat are on PATH. Both are required at runtime — Ray Serve runs HAProxy as a subprocess and uses socat to query the admin socket for is_running()/stats health checks; if socat is missing the controller's healthcheck silently fails and trips a 5s timeout. Pin a free metrics port (default 9101) so multiple clusters on one host don't fight over HAProxy's prometheus bind.
- core/constants.py: DEFAULT_RAY_SERVE_HAPROXY_METRICS_PORT seed.
- CUDA bump 12.8.1 -> 12.9.1 to match the broader dep upgrade.
Signed-off-by: Praateek <praateekm@gmail.com>
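A minimal sketch of the opportunistic opt-in this commit describes; the function names are illustrative rather than the actual core/utils.py code, and the 9101 seed is the default mentioned above:

```python
import logging
import os
import shutil
import socket

logger = logging.getLogger(__name__)

# Assumed seed value, per the DEFAULT_RAY_SERVE_HAPROXY_METRICS_PORT constant above.
DEFAULT_HAPROXY_METRICS_PORT = 9101


def _find_free_port(preferred: int) -> int:
    # Try the preferred port first; otherwise let the OS hand out a free one.
    for candidate in (preferred, 0):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", candidate))
                return sock.getsockname()[1]
            except OSError:
                continue
    raise RuntimeError("no free port available")


def maybe_enable_haproxy_ingress(env: dict) -> None:
    # Ray Serve runs HAProxy as a subprocess and drives its admin socket via
    # socat for is_running()/stats health checks, so both must resolve on PATH;
    # otherwise silently fall back to the default Python proxy.
    if shutil.which("haproxy") and shutil.which("socat"):
        port = _find_free_port(DEFAULT_HAPROXY_METRICS_PORT)
        env["RAY_SERVE_ENABLE_HA_PROXY"] = "1"
        env["RAY_SERVE_HAPROXY_METRICS_PORT"] = str(port)
        logger.info("HAProxy ingress enabled, metrics port %d", port)
    else:
        logger.debug("haproxy/socat not both on PATH, using the Python proxy")


# Ordering matters: mutate the env before Popen(["ray", "start", ...]) so the
# head-node process inherits the variables.
env = os.environ.copy()
maybe_enable_haproxy_ingress(env)
```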
- ray[default,data]>=2.55.1 (was 2.54) and ray[serve,llm]>=2.55.1 to align inference_server with the Ray Serve HAProxy ingress added in the previous commit.
- torch / torchaudio == 2.10.0 + torchvision == 0.25.0 in override-dependencies to match the CUDA 12.9.1 base image and vllm 0.18's wheels.
- pytorch wheel index moved cu128 -> cu129 to match the new torch.
- video_cuda12 + build dependency-group: bump torch upper bound 2.9.1 -> 2.10.0.
- inference_server vllm cap moved from <0.16.0 to <0.19 (i.e. allow the latest 0.18.x). vllm 0.19+ regresses embedding-generation throughput by ~10-30% on the bench (verified end-to-end against google/embeddinggemma-300m + ndd_ray_serve_dp4 + gpt-oss-20b), and 0.18.x requires transformers <5 — kept that constraint plus hf-hub <1.0 override.
Signed-off-by: Praateek <praateekm@gmail.com>
Ray accepts pip/uv runtime_env entries in two shapes: the legacy list
form ``["pkg1", "pkg2"]`` or the structured dict form
``{"packages": [...], "uv_pip_install_options": [...]}``. The previous
merger only handled list+list. Backends that own dict-form defaults
(e.g. injecting ``--reinstall-package`` for a build-from-source dep)
need a user's list-form override to append to ``packages`` without
dropping the installer options.
The new ``_merge_package_runtime_env`` static method handles all four
input shape combinations (None, list, dict) and preserves
``uv_pip_install_options`` / ``pip_install_options`` across the merge.
Existing list+list call sites are unaffected — the resulting list-form
output is byte-for-byte identical to the prior implementation.
Signed-off-by: Praateek <praateekm@gmail.com>
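A minimal sketch of the merge behavior this message describes; the dedup and ordering choices are assumptions, while the two input shapes and the preserved install-options keys come from the text above:

```python
import copy


def merge_package_runtime_env(base, override):
    """Merge two pip/uv runtime_env entries (None, list form, or dict form)."""
    if base is None:
        return copy.deepcopy(override)
    if override is None:
        return copy.deepcopy(base)

    # list + list keeps the legacy list-form output.
    if isinstance(base, list) and isinstance(override, list):
        return base + [pkg for pkg in override if pkg not in base]

    def as_dict(value):
        # Promote list form to dict form; dict form passes through untouched.
        return copy.deepcopy(value) if isinstance(value, dict) else {"packages": list(value)}

    merged = as_dict(base)
    for key, value in as_dict(override).items():
        if key == "packages" or key.endswith("_install_options"):
            items = merged.setdefault(key, [])
            items.extend(v for v in value if v not in items)
        else:
            merged[key] = value
    return merged


# A backend's dict-form default keeps its installer options when the user
# passes a plain list-form override:
base = {
    "packages": ["flash-attn"],
    "uv_pip_install_options": ["--reinstall-package=flash-attn"],
}
print(merge_package_runtime_env(base, ["transformers<5"]))
# {'packages': ['flash-attn', 'transformers<5'],
#  'uv_pip_install_options': ['--reinstall-package=flash-attn']}
```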
…kspace per-run
- DynamoBackend.start(): the runtime dir (housing per-actor subprocess logs and the deployment manifest) was a process-local ``tempfile.mkdtemp``, which left it stranded when the benchmarking runner copied ``session_latest/`` to the persistent results path before tearing down ``ray_temp_dir``. Move it under ``<ray_temp_dir>/<session_name>/nemo_curator_dynamo_<short_id>/`` so ``benchmarking/runner/ray_cluster.py`` picks the dir up automatically.
- DynamoBackend.stop(): drop the explicit ``shutil.rmtree`` cleanup — the runner's session_latest copy needs the dir to still exist when ``ray_client.stop()`` runs. Ray's session lifecycle then reaps the dir at session end.
- dynamo/vllm.py: anchor FlashInfer's workspace (``FLASHINFER_WORKSPACE_BASE``) under the Dynamo runtime dir via a small ``_worker_subprocess_env`` helper. FlashInfer's default cache can pick up cubins compiled by a prior Ray session whose actor venv has since been replaced; per-run isolation keeps that off the path.
- tests/.../test_backend.py: replace the ``tempfile.mkdtemp`` mock with a ``ray.get_runtime_context`` mock that returns deterministic ``temp_dir`` and ``session_name`` values, plus an ``os.makedirs`` stub so the unit test stays hermetic.
Signed-off-by: Praateek <praateekm@gmail.com>
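A minimal sketch of the anchoring described above, assuming hypothetical helpers that receive Ray's temp_dir and session_name as plain arguments (how DynamoBackend actually obtains them is not shown here):

```python
import os
import uuid


def dynamo_runtime_dir(ray_temp_dir: str, session_name: str) -> str:
    # Anchor the per-run workspace under Ray's session dir so the benchmarking
    # runner's session_latest copy picks it up, and Ray's session lifecycle
    # (not an explicit shutil.rmtree in stop()) reaps it at session end.
    short_id = uuid.uuid4().hex[:8]
    path = os.path.join(ray_temp_dir, session_name, f"nemo_curator_dynamo_{short_id}")
    os.makedirs(path, exist_ok=True)
    return path


def worker_subprocess_env(runtime_dir: str) -> dict:
    # Per-run FlashInfer workspace so cubins compiled under a prior actor venv
    # are never picked up by a fresh session.
    return {"FLASHINFER_WORKSPACE_BASE": os.path.join(runtime_dir, "flashinfer")}


# e.g. /tmp/ray/<session_name>/nemo_curator_dynamo_<short_id>/
```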
torchcodec 0.11.x is built against torch 2.11; our torch is pinned at 2.10.0 (vllm 0.18.1 requires that exact version). The ABI mismatch manifests at runtime as ``OSError: undefined symbol: torch_dtype_float4_e2m1fn_x2`` when audio stages (e.g. SplitLongAudioStage in tests/stages/audio/tagging/e2e/test_tts_e2e.py) try to load torchcodec's native library. torchcodec doesn't declare a torch dep in its requires_dist, so the resolver can't enforce ABI matching; we have to do it in override-dependencies. Signed-off-by: Praateek <praateekm@gmail.com>
- backend.py stop(): remove ``self._runtime_dir = None`` and its comment. Nothing downstream branched on the reset (``_write_manifest`` is only called during start()), so it was a no-op with a misleading comment.
- utils.py: collapse the HAProxy-enable rationale from 8 lines to 6 — keep the load-bearing socat-or-5s-timeout warning and the pre-Popen ordering note.
- base.py: trim the ``_merge_package_runtime_env`` docstring to a 4-line comment that captures the only non-obvious point (list-form override must not drop dict-form base's installer options).
Signed-off-by: Praateek <praateekm@gmail.com>
- tests/backends/test_integration.py: Ray 2.55 added a ``strict`` flag to ``StreamingRepartition``'s repr; teach ``test_ray_data_execution_plan`` about both pre- and post-2.55 forms.
- tests/core/serve/test_runtime_env.py: add ``TestMergePackageRuntimeEnv`` exercising every shape combination of ``BaseModelConfig._merge_package_runtime_env`` (dict+list, list+dict, dict+dict, None+value, value+None) per claude[bot] review.
- nemo_curator/core/serve/base.py: short comment on the intentional asymmetry between dict-base (extra keys ride along via ``deepcopy``) and dict-override (extra keys folded in by the loop) per greptile review.
- docker/common/install_haproxy.sh: ``apt-get purge --auto-remove`` the build-only deps (build-essential, libc6-dev, liblua5.3-dev, libpcre3-dev, libssl-dev, zlib1g-dev) after compilation per greptile review. Keep liblua5.3-0 + socat — HAProxy links those at runtime.
Signed-off-by: Praateek <praateekm@gmail.com>
``apt-get purge --auto-remove libpcre3-dev`` also reaped ``libpcre3`` (the runtime lib HAProxy linked against), so ``haproxy -v`` failed in the image with ``libpcre.so.3: cannot open shared object file``. Same risk for libssl3 / zlib1g / libc6. Plain ``apt-get purge`` removes the ``-dev`` packages but leaves their auto-installed runtime deps in place. We give up reaping the few MB of orphan auto-installed deps in exchange for a working binary. Signed-off-by: Praateek <praateekm@gmail.com>
Drop the ``<0.19`` cap from the bare ``[vllm]`` extra so a standalone ``pip install nemo-curator[vllm]`` keeps tracking the latest. The Ray Serve LLM compat constraint only applies when ``[inference_server]`` is in the install set, so move the cap there with an inline reason. Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Dynamo's actor venv rebuilds flash-attn from source on first launch (forced by ``--reinstall-package flash-attn`` in ``DYNAMO_VLLM_RUNTIME_ENV``) — about 10 min on H100. Combined with 4-replica gpt-oss-20b weight load, the prior 700 s budget left no room for the actual inference work and the runner killed the entry mid-build. Signed-off-by: Praateek <praateekm@gmail.com>
The benchmark ndd_dynamo_dp4 (gpt-oss-20b) crashed at engine init with
``ImportError: undefined symbol:
c10::cuda::c10_cuda_check_implementation`` — the prebuilt flash-attn
wheel ai-dynamo[vllm] pulls is built against a torch ABI that doesn't
match the actor venv's torch (ai-dynamo 1.0.2's ``[vllm]`` extra pins
``vllm[flashinfer,runai]==0.16.0``, which itself pins
``torch==2.9.1``, while our base image torch is 2.10).
DYNAMO_VLLM_RUNTIME_ENV moves to dict-form ``uv`` payload so we can
inject ``--reinstall-package flash-attn`` +
``--no-build-isolation-package flash-attn`` — forces flash-attn to
build from source against whatever torch ends up in the actor venv.
Verified: ndd_dynamo_dp4 now passes (success=True, throughput
3.03 rows/sec, requirements_not_met={}). flash-attn rebuild dominates
``serve_startup_s`` (~22 min); follow-up to reduce that via a
prebuilt actor venv tracked separately.
Tests:
- ``TestDynamoRuntimeEnv``: existing list-form assertions move to
the dict-form output that the new ``DYNAMO_VLLM_RUNTIME_ENV``
produces after merge; new
``test_default_carries_flash_attn_rebuild_flags`` is a regression
guard so future edits to the constant can't silently drop the
rebuild flags.
- ``TestMergeModelRuntimeEnvs``:
``test_user_dict_form_uv_concatenates_install_options`` covers the
path where a user supplies their own dict-form ``uv`` override,
ensuring base ``uv_pip_install_options`` survive.
- ``TestDynamoSingleGpuServer.test_actor_runtime_env_imports_flash_attn``
spawns a Ray actor with the same ``DYNAMO_VLLM_RUNTIME_ENV`` Dynamo
uses and asserts ``flash_attn`` imports cleanly. The integration
test's SmolLM2-135M model doesn't exercise vLLM's flash-attn rotary
path, so this assertion is what surfaces regressions.
Notes:
- Pinning ``ai-dynamo==1.0.2`` in ``inference_server`` is deferred
(TODO comment in pyproject): forcing a re-resolve while
Lightning-AI/pytorch-lightning#21691 (PyPI quarantine) is open
can't currently find a satisfiable lightning version for
nemo-toolkit + pyannote.
- ``uvloop<0.22`` and ``flashinfer-cubin==0.6.3`` from NVIDIA-NeMo#1889 are not
re-added: existing GPU integration tests parametrize over
``RayDataExecutor`` and pass on Ray 2.55.1 (so uvloop incompat
looks fixed upstream), and SmolLM2 doesn't use FlashInfer (so
cubin pin's necessity is unverified). If the benchmark fails on
those, their error signatures are distinct from this commit's
symbol error and we'll add them back with evidence.
Signed-off-by: Praateek <praateekm@gmail.com>
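A minimal sketch of the dict-form uv payload this commit describes; the package list and the `=`-style option spelling are assumptions, only the two flash-attn flags come from the text above:

```python
# Hypothetical shape of the dict-form entry; the real DYNAMO_VLLM_RUNTIME_ENV
# lives in the Dynamo backend's constants.
DYNAMO_VLLM_RUNTIME_ENV = {
    "uv": {
        "packages": [
            "ai-dynamo[vllm]",  # illustrative package list
            "flash-attn",
        ],
        "uv_pip_install_options": [
            "--reinstall-package=flash-attn",           # ignore the prebuilt wheel
            "--no-build-isolation-package=flash-attn",  # build against the actor venv's torch
        ],
    },
}
```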
``runtime_resources/`` holds Ray's runtime_env-resolved venvs (uv/pip/conda) which can be many GB per actor — copying them into every benchmark artifact archive bloats the result without aiding debugging. Treat it the same as ``sockets`` and skip during ``_copy_session_contents``.
Signed-off-by: Praateek <praateekm@gmail.com>
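A minimal sketch of the skip, assuming the copy is a straight shutil.copytree (the real ``_copy_session_contents`` may walk the tree differently):

```python
import shutil

# session_latest/ subdirectories that add bulk without aiding debugging:
# actor sockets and the runtime_env-resolved venvs (uv/pip/conda).
SKIP_DIRS = ("sockets", "runtime_resources")


def copy_session_contents(session_dir: str, dest_dir: str) -> None:
    shutil.copytree(
        session_dir,
        dest_dir,
        ignore=shutil.ignore_patterns(*SKIP_DIRS),
        dirs_exist_ok=True,
    )
```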
- ``docker/common/install_haproxy.sh``: TODO to drop the source-build dance once we're on Ray 2.56+, where ``ray-project/ray-haproxy`` ships HAProxy as a bundled distribution instead.
- ``nemo_curator/core/utils.py``: log the HAProxy metrics port we picked so multi-cluster hosts can confirm the bind without grepping for ``RAY_SERVE_HAPROXY_METRICS_PORT`` in the environment.
Signed-off-by: Praateek <praateekm@gmail.com>
- ``torchcodec~=0.10.0`` (was ``>=0.9.0,<0.11``). torchcodec wheels are torch-version-specific and don't declare a torch dep — torchcodec 0.9.x is built against torch 2.9 and would fail with the same ``undefined symbol`` profile as 0.11.x did. Our ``torch==2.10.0`` pin only matches torchcodec 0.10.x. Per @ayushdg's review.
- Pin ``ai-dynamo==1.0.2`` in ``[inference_server]`` so the Dynamo actor venv resolves to the same release we test against. The earlier TODO blocked on the pytorch-lightning PyPI quarantine (Lightning-AI/pytorch-lightning#21691); resolving cleanly now.
Signed-off-by: Praateek <praateekm@gmail.com>
The greptile-suggested ``apt-get purge`` of ``build-essential`` + ``libc6-dev`` made the image leaner but stripped tools that the next Dockerfile stage needs: ``uv sync`` builds ``fasttext`` from source (C++17 compiler required) and falls over with ``RuntimeError: Unsupported compiler -- at least C++17 support is needed``. Image bloat is a P2 trade-off; build failure is a P0. Drop the purge entirely. We can revisit a more surgical cleanup that preserves the C/C++ toolchain in a follow-up. Signed-off-by: Praateek <praateekm@gmail.com>
Ray's default ``setup_timeout_seconds`` is 600 s, but ``DYNAMO_VLLM_RUNTIME_ENV`` forces flash-attn to rebuild from source (``--no-build-isolation-package flash-attn``) which alone takes ~15 min on the dev machine. The 600 s ceiling was cancelling installs with ``RuntimeEnvSetupError`` before the actor could come up, surfacing in the nightly bench as ndd_dynamo_dp4 ❌ Run Failed in 1220 s. Set ``config.setup_timeout_seconds=1800`` so the install fits its worst-case wall time. Proper fix is the prebuilt-venv plan tracked in ``ai_agent_notes/dynamo-vllm-prebuild/findings.md``; this is the unblocker. Signed-off-by: Praateek <praateekm@gmail.com>
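For reference, the timeout rides on Ray's documented config key inside the runtime_env; a minimal, illustrative shape:

```python
runtime_env = {
    # ... the dict-form uv payload with the flash-attn rebuild flags goes here ...
    "config": {
        # Ray's default is 600 s; the from-source flash-attn build alone takes
        # ~15 min, so budget the worst-case wall time instead of failing with
        # RuntimeEnvSetupError mid-install.
        "setup_timeout_seconds": 1800,
    },
}
```

Note that, per the review finding above, a user-supplied config dict currently shallow-merges over this default and would drop the timeout.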
Summary
- Bump torch/torchaudio/torchvision to the 2.10 line (cu129 wheels). Move the pytorch wheel index to cu129.
- Bump Ray to >=2.55.1 ([default,data] and [serve,llm] extras).
- Cap vllm<0.19 to keep the stack on the transformers<5 / huggingface-hub<1.0 path. Resolves to vLLM 0.18.1.
- Cap torchcodec<0.11 to match torch 2.10's ABI (torchcodec 0.11 is built for torch 2.11). Resolves to torchcodec 0.10.0+cu129.
- Install haproxy (built from source) + socat in the Curator image and set RAY_SERVE_HAPROXY_BINARY_PATH. init_cluster opportunistically opts in via RAY_SERVE_ENABLE_HA_PROXY=1 when both binaries resolve on PATH. Both are required: Ray Serve uses socat to drive HAProxy's admin socket — without it, the controller's healthcheck silently returns False and trips a 5s timeout.
- Pin RAY_SERVE_HAPROXY_METRICS_PORT so multiple Curator clusters on a single host don't collide on HAProxy's prometheus bind.
- Extend BaseModelConfig.merge_runtime_envs to handle Ray's dict-form pip/uv runtime_env entries (previously list+list only). Backwards compatible — existing list-form call sites continue to produce list-form output.
- Anchor the Dynamo runtime dir under Ray's session dir (<ray_temp_dir>/<session_name>/nemo_curator_dynamo_<short_id>/) so subprocess logs / manifests sit alongside Ray's own session logs. Pin FLASHINFER_WORKSPACE_BASE per-run so cubin artifacts compiled by a prior actor venv don't leak into a fresh session.

Test plan
- bash benchmarking/tools/build_docker.sh.
- init_cluster emits the env vars when both binaries are on PATH, falls back silently otherwise.

Note on vLLM 0.19
We also explored bumping vLLM to 0.19.1, which transitively requires transformers>=5 and huggingface-hub>=1.0 (and follow-on Curator-side fixes for the boundary changes those introduce). We backed off because vLLM 0.19.1 isn't compatible with Ray 2.55.1 — ray.llm._infer_supports_vision crashes during LLMServer.start() for the model configs we hit. Sticking with vLLM 0.18.1 keeps the upgrade scope minimal until the upstream Ray ↔ vLLM 0.19 path stabilizes.