
chore: Update dynamo to 1.1.0 #1957

Open
praateekmahajan wants to merge 5 commits into NVIDIA-NeMo:main from praateekmahajan:praateek/dynamo-110

Conversation

@praateekmahajan
Contributor

@praateekmahajan praateekmahajan commented May 7, 2026

Description

Dynamo 1.1.0 now uses vLLM 0.19 while the Curator base is on vLLM 0.18. Since the torch version remains the same across the two, FA2 no longer needs to be rebuilt, so the dynamo setup time is a fraction of what it was before.

We also see a perf improvement for the dynamo workers (likely because we went from vLLM 0.16 to vLLM 0.19); see the benchmark screenshot attached to the PR.

A minor hack is needed because dynamo[vllm] now pins ray to 2.55 while Curator is on 2.55.1, so we create a file on all the nodes containing a uv override.

We also pulled the per-node fan-out shared by execute_setup_on_node and the new override-file write into a small run_on_each_node helper in nemo_curator/utils/ray_utils.py (with get_head_node_id / is_head_node colocated there). Both call sites now go through it, so the override write respects CURATOR_IGNORE_RAY_HEAD_NODE the same way executor scheduling does.
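For context, here is a minimal sketch of how these pieces fit together, based on the description above rather than the exact code (paths, signatures, and the handling of plain vs. already-remote callables are assumptions; for example, the real _write_actor_overrides_file is a module-scope @ray.remote and the override path constant is fixed at import time):

# Rough sketch of the pieces described above; the real code in
# nemo_curator/utils/ray_utils.py and nemo_curator/core/serve/dynamo/vllm.py
# may differ in these details.
from pathlib import Path

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

_ACTOR_VENV_OVERRIDES_PATH = "/tmp/curator_actor_venv_overrides.txt"  # illustrative path
_HEAD_NODE_ID_CACHE = None


def get_head_node_id() -> str:
    # Assumption for this sketch: the driver is connected on the head node.
    # The real helper keeps a module-level cache of the lookup.
    global _HEAD_NODE_ID_CACHE
    if _HEAD_NODE_ID_CACHE is None:
        _HEAD_NODE_ID_CACHE = ray.get_runtime_context().get_node_id()
    return _HEAD_NODE_ID_CACHE


def run_on_each_node(fn, *args, ignore_head_node: bool = False, num_cpus: int = 0, num_gpus: int = 0):
    """Run fn(*args) once on every alive node and wait for all tasks to finish."""
    head_id = get_head_node_id()
    remote_fn = ray.remote(num_cpus=num_cpus, num_gpus=num_gpus)(fn)
    refs = []
    for node in ray.nodes():
        if not node.get("Alive"):
            continue
        if ignore_head_node and node["NodeID"] == head_id:
            continue
        refs.append(
            remote_fn.options(
                scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
            ).remote(*args)
        )
    return ray.get(refs)


def _write_actor_overrides_file(path: str, body: str) -> None:
    Path(path).write_text(body)


def ensure_actor_overrides_on_all_nodes(ignore_head_node: bool = False) -> None:
    # Pin ray in actor venvs to the version the cluster is already running,
    # overriding the transitive pin from ai-dynamo[vllm].
    run_on_each_node(
        _write_actor_overrides_file,
        _ACTOR_VENV_OVERRIDES_PATH,
        f"ray=={ray.__version__}\n",
        ignore_head_node=ignore_head_node,
    )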

Usage

No user-facing API changes; the uv override handling runs internally when the Dynamo backend starts.
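A rough sketch of what the Dynamo backend does on start with this change in place (not code a user needs to write; module paths follow the PR description and may differ):

# Illustrative sketch of the internal flow; names follow the PR description.
import ray

from nemo_curator.core.serve.dynamo.vllm import ensure_actor_overrides_on_all_nodes

ray.init()

# Before any dynamo worker actor is created, write a one-line uv override file
# ("ray==<cluster version>") on every alive node.
ensure_actor_overrides_on_all_nodes()

# Each actor spawned with DYNAMO_VLLM_RUNTIME_ENV then resolves its venv with
# something equivalent to:
#   uv pip install "ai-dynamo[vllm]" --override /tmp/...overrides.txt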

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@praateekmahajan praateekmahajan requested review from a team as code owners May 7, 2026 16:54
@praateekmahajan praateekmahajan requested review from meatybobby and removed request for a team May 7, 2026 16:54
@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR upgrades ai-dynamo from 1.0.2 to 1.1.0 — which ships with vllm 0.19 and no longer requires a from-source flash-attn rebuild — cutting actor venv setup time significantly. It introduces a per-node uv --override file mechanism to pin ray to the cluster's installed version (working around dynamo's hard ray pin) and extracts get_head_node_id / run_on_each_node into a new nemo_curator/utils/ray_utils.py module used by both the backend setup path and the new override fan-out.

  • New ray_utils.py module: consolidates get_head_node_id, is_head_node, and the new run_on_each_node helper (with NodeAffinitySchedulingStrategy), replacing duplicated logic in backends/utils.py.
  • ensure_actor_overrides_on_all_nodes: materializes a ray=={ray.__version__} constraints file at a fixed tmp path on every alive node before any dynamo worker is spawned, so uv can resolve the pinned ray dependency without network access.
  • Simplified DYNAMO_VLLM_RUNTIME_ENV: drops the flash-attn reinstall/no-build-isolation flags and the 1800 s timeout, leaving a clean ai-dynamo[vllm] install with the override reference.

Confidence Score: 5/5

Safe to merge — the change is a well-scoped dependency bump plus a targeted refactor that extracts shared Ray helpers and replaces the flash-attn rebuild hack with a uv override file.

The core logic paths are all well-covered by existing and new tests. The override file fan-out correctly uses module-scope @ray.remote and ray.get to confirm writes on every node before actors land. The _setup_stage_on_node simplification (deriving NodeInfo from runtime context rather than passing it from the driver) is strictly safer. No new data-correctness or scheduling issues were found in the changed paths.

No files require special attention. nemo_curator/backends/utils.py is worth a quick read to confirm the log-then-schedule split in execute_setup_on_node matches your expectations for observability.

Important Files Changed

Filename Overview
nemo_curator/utils/ray_utils.py: New utility module; clean implementation of get_head_node_id (with module-level cache), is_head_node, and run_on_each_node with NodeAffinitySchedulingStrategy fan-out and ray.get(refs) collection.
nemo_curator/core/serve/dynamo/vllm.py: Removes flash-attn rebuild flags; adds _ACTOR_VENV_OVERRIDES_PATH (fixed at import time), @ray.remote _write_actor_overrides_file at module scope, and ensure_actor_overrides_on_all_nodes fan-out; DYNAMO_VLLM_RUNTIME_ENV simplified to a single package + --override reference.
nemo_curator/backends/utils.py: Removes get_head_node_id / is_head_node (moved to ray_utils); execute_setup_on_node now delegates fan-out to run_on_each_node per stage; _setup_stage_on_node now derives NodeInfo from runtime context rather than accepting it as an argument.
nemo_curator/core/serve/dynamo/backend.py: Adds ensure_actor_overrides_on_all_nodes call before _sweep_orphan_actors so the override file exists on every node before any actor with DYNAMO_VLLM_RUNTIME_ENV lands.
pyproject.toml: Bumps ai-dynamo from 1.0.2 to 1.1.0; all other dependency pins unchanged.
tests/utils/test_ray_utils.py: New test file covering run_on_each_node (result count, ignore_head_node filtering) and head node cache lifecycle.
tests/core/serve/dynamo/test_vllm.py: Adds TestEnsureActorOverridesOnAllNodes verifying the file is written with the correct ray version; patches _ACTOR_VENV_OVERRIDES_PATH cleanly via mock.patch.object (a sketch follows this table).
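Based on that last row, the override test amounts to roughly this (a sketch; the real test uses the shared_ray_client fixture and may differ in structure, and the module path is assumed from the file path above):

# Sketch of TestEnsureActorOverridesOnAllNodes; illustrative only.
from pathlib import Path
from unittest import mock

import ray

from nemo_curator.core.serve.dynamo import vllm as dynamo_vllm


def test_override_file_contains_running_ray_version(tmp_path: Path) -> None:
    # Assumes a Ray cluster is already initialized (the real test relies on the
    # shared_ray_client fixture for this).
    override_path = tmp_path / "overrides.txt"
    with mock.patch.object(dynamo_vllm, "_ACTOR_VENV_OVERRIDES_PATH", str(override_path)):
        dynamo_vllm.ensure_actor_overrides_on_all_nodes()
    assert f"ray=={ray.__version__}" in override_path.read_text()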

Sequence Diagram

sequenceDiagram
    participant D as Driver (DynamoBackend.start)
    participant RU as ray_utils.run_on_each_node
    participant N1 as Node 1
    participant N2 as Node N

    D->>RU: ensure_actor_overrides_on_all_nodes()
    RU->>N1: _write_actor_overrides_file(path, "ray==X.Y.Z")
    RU->>N2: _write_actor_overrides_file(path, "ray==X.Y.Z")
    RU-->>D: ray.get([refs]) - both writes confirmed

    D->>D: _sweep_orphan_actors()
    D->>D: remove_named_pgs_with_prefix()

    D->>N1: spawn actor (DYNAMO_VLLM_RUNTIME_ENV)
    Note over N1: uv install ai-dynamo[vllm]<br/>--override /tmp/...overrides.txt<br/>(file already present on node)
    D->>N2: spawn actor (DYNAMO_VLLM_RUNTIME_ENV)
    Note over N2: uv install ai-dynamo[vllm]<br/>--override /tmp/...overrides.txt<br/>(file already present on node)

Reviews (4): Last reviewed commit: "Fix dynamo runtime_env tests after dropp..."

Comment thread on nemo_curator/core/serve/dynamo/vllm.py (Outdated)
Comment on lines +94 to +96
alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
if not alive_nodes:
    return
P2 If ray.nodes() returns an empty list and the function exits early, DYNAMO_VLLM_RUNTIME_ENV still carries --override <path>, but the file is never written. Any actor spawned on any node will then fail with a uv "file not found" error for the override path. Inside ray.init() the head node is always alive, so this shouldn't trigger in production — but it silently swallows the failure rather than surfacing it. A log warning would make this condition observable without changing behaviour.

Suggested change
-alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
-if not alive_nodes:
-    return
+alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
+if not alive_nodes:
+    logger.warning(
+        "ensure_actor_overrides_on_all_nodes: no alive Ray nodes found; "
+        "override file was NOT written — actor venv installs will fail if workers are spawned."
+    )
+    return

Comment thread on nemo_curator/core/serve/dynamo/vllm.py (Outdated)
Comment on lines +90 to +92
@ray.remote(num_cpus=0)
def _write_override(path: str, body: str) -> None:
    Path(path).write_text(body)
P2 Remote function re-registered on every call

_write_override is decorated with @ray.remote inside the function body, so each call to ensure_actor_overrides_on_all_nodes re-registers a new remote function object. In Ray, re-registering a remote function with the same name (e.g., after an autoscale event triggers a re-call) can produce spurious warnings and, in some Ray versions, cause serialization issues. Moving the definition to module scope avoids repeated registration and makes the intent clearer.
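To make the point concrete, here is an illustrative contrast (not the PR's exact code):

# Illustrative only: contrast between per-call and module-scope registration.
from pathlib import Path

import ray

# Problematic: a fresh remote function is registered on every call.
def ensure_overrides_nested(path: str, body: str) -> None:
    @ray.remote(num_cpus=0)
    def _write_override(p: str, b: str) -> None:
        Path(p).write_text(b)

    ray.get(_write_override.remote(path, body))

# Preferred: register once at import time and reuse the same remote function.
@ray.remote(num_cpus=0)
def _write_override(path: str, body: str) -> None:
    Path(path).write_text(body)

def ensure_overrides_module_scope(path: str, body: str) -> None:
    ray.get(_write_override.remote(path, body))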

Comment on lines +214 to +221
for stage in stages:
    run_on_each_node(
        _setup_stage_on_node,
        stage,
        ignore_head_node=ignore_head_node,
        num_cpus=stage.resources.cpus if stage.resources is not None else 1,
        num_gpus=stage.resources.gpus if stage.resources is not None else 0,
    )
P1 Stage setup serialized instead of parallelized

The original code collected all (stage, node) remote tasks into a single list and awaited them with one ray.get call, so all stages ran concurrently across all nodes. The refactored version calls run_on_each_node — which internally calls ray.get(refs) before returning — once per stage, so each stage must complete on every node before the next stage starts. With multiple ProcessingStage entries, setup time scales as the sum of per-stage times rather than the maximum, turning what was a parallel fan-out into a sequential pipeline. If stages are independent (which is the common case), this is an unintended performance regression.
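A sketch of the pre-refactor pattern the comment describes, reusing the names from the snippet above (illustrative, not the exact original code):

# Illustrative sketch: fan out every (stage, node) task first, then block once,
# so stage setup runs concurrently across stages and nodes.
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

def setup_all_stages_in_parallel(stages, nodes) -> None:
    refs = []
    for stage in stages:
        remote_setup = ray.remote(
            num_cpus=stage.resources.cpus if stage.resources is not None else 1,
            num_gpus=stage.resources.gpus if stage.resources is not None else 0,
        )(_setup_stage_on_node)
        for node in nodes:  # alive nodes, optionally excluding the head node
            refs.append(
                remote_setup.options(
                    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
                ).remote(stage)
            )
    ray.get(refs)  # single barrier: setup time scales with the slowest stage, not the sum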

ai-dynamo 1.1.0 pins vllm 0.19.0 with the same torch (==2.10.0) as
Curator's base vllm 0.18, so the inherited flash-attn wheel keeps a
matching ABI and the ~15 min source rebuild that 1.0.2 required is no
longer needed. setup_timeout drops from 1800s back to 600s.

The [vllm] extra hard-pins ray to a specific patch, conflicting with
Curator's own ray pin. Ray refuses actor venvs whose ray version
differs from the cluster head's, so we override the transitive pin
via a constraints file. uv has no inline override syntax, so the file
is materialized on every alive node by ensure_actor_overrides_on_all_nodes(),
which fans out a NodeAffinity-scheduled task per node from
DynamoBackend.start() before any worker is spawned.

Signed-off-by: Praateek <praateekm@gmail.com>
The override file content was hardcoded to "ray==2.55.1\n", which would
silently drift after the next Curator ray bump. Read the version from
the running interpreter at fan-out time so future bumps need no edit
here.

Tests:
- TestEnsureActorOverridesOnAllNodes uses shared_ray_client + tmp_path
  (same pattern as TestExecuteSetupOnNode in tests/backends/test_utils.py)
  to assert the file ends up containing the running ray version at the
  configured path.
- TestDynamoBackendStart now also asserts ensure_actor_overrides_on_all_nodes
  runs before any worker-spawning step, so the actor venv install on each
  worker can read the file from the local filesystem.

Signed-off-by: Praateek <praateekm@gmail.com>
Move per-node fan-out logic shared by execute_setup_on_node and
ensure_actor_overrides_on_all_nodes into a single
nemo_curator/utils/ray_utils.run_on_each_node helper, and relocate
get_head_node_id / is_head_node / _HEAD_NODE_ID_CACHE there alongside it.
Plumb ignore_head_node through ensure_actor_overrides_on_all_nodes and
have the Dynamo backend pass ignore_ray_head_node() at the call site so
the actor-venv override file isn't materialized on the head when head
scheduling is disabled.

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Two CPU tests asserted the now-removed workaround was wired up
(setup_timeout >= 1800 s, --reinstall-package/--no-build-isolation-package
flash-attn flags). Delete them since the rebuild is gone.

test_actor_runtime_env_imports_flash_attn was passing only because the
old runtime_env force-installed flash-attn into the actor venv. Without
the force-install, the actor venv inherits flash-attn only when Ray's
virtualenv clone (virtualenv-clone or --system-site-packages) picks it
up from the driver venv -- which CI's GPU-serve doesn't have, since
[inference_server] extra doesn't include flash-attn. Add a driver-side
pytest.importorskip so the guard skips when it cannot run, but still
fires on dev venvs that include flash-attn, catching the case where
ai-dynamo[vllm]'s transitive deps bump torch and break the inherited .so.

Signed-off-by: Praateek <praateekm@gmail.com>
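For reference, the driver-side guard described above amounts to roughly the following (a sketch; the real test body differs):

# Sketch of the guard described in the commit message above; illustrative only.
import pytest

def test_actor_runtime_env_imports_flash_attn() -> None:
    # Skip when the driver venv has no flash-attn to inherit (e.g. CI's GPU-serve
    # environment); still runs on dev venvs that include it.
    pytest.importorskip("flash_attn")
    ...  # spawn an actor with DYNAMO_VLLM_RUNTIME_ENV and import flash_attn inside it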
