
chore: Update dynamo to 1.1.0 #1957

Open
praateekmahajan wants to merge 5 commits into NVIDIA-NeMo:main from praateekmahajan:praateek/dynamo-110

Conversation

@praateekmahajan
Contributor

@praateekmahajan praateekmahajan commented May 7, 2026

Description

Dynamo 1.1.0 now uses vLLM 0.19 while the Curator base is on vLLM 0.18. Since the torch version remains the same across the two, FA2 no longer needs to be rebuilt, so the dynamo setup time is a fraction of what it was before.

We also see a perf improvement for the dynamo workers (likely because we went from vLLM 0.16 to vLLM 0.19); see the benchmark screenshot attached to the PR.

A minor hack is needed because dynamo[vllm] now pins ray to 2.55 while Curator is on 2.55.1, so we create a file on all the nodes containing a uv override.

We also pulled the per-node fan-out shared by execute_setup_on_node and the new override-file write into a small run_on_each_node helper in nemo_curator/utils/ray_utils.py (with get_head_node_id / is_head_node colocated there). Both call sites now go through it, so the override write respects CURATOR_IGNORE_RAY_HEAD_NODE the same way executor scheduling does.
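For context, here is a minimal sketch of how these pieces fit together, based on the description above rather than the exact code (paths, signatures, and the handling of plain vs. already-remote callables are assumptions; for example, the real _write_actor_overrides_file is a module-scope @ray.remote and the override path constant is fixed at import time):

# Rough sketch of the pieces described above; the real code in
# nemo_curator/utils/ray_utils.py and nemo_curator/core/serve/dynamo/vllm.py
# may differ in these details.
from pathlib import Path

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

_ACTOR_VENV_OVERRIDES_PATH = "/tmp/curator_actor_venv_overrides.txt"  # illustrative path
_HEAD_NODE_ID_CACHE = None


def get_head_node_id() -> str:
    # Assumption for this sketch: the driver is connected on the head node.
    # The real helper keeps a module-level cache of the lookup.
    global _HEAD_NODE_ID_CACHE
    if _HEAD_NODE_ID_CACHE is None:
        _HEAD_NODE_ID_CACHE = ray.get_runtime_context().get_node_id()
    return _HEAD_NODE_ID_CACHE


def run_on_each_node(fn, *args, ignore_head_node: bool = False, num_cpus: int = 0, num_gpus: int = 0):
    """Run fn(*args) once on every alive node and wait for all tasks to finish."""
    head_id = get_head_node_id()
    remote_fn = ray.remote(num_cpus=num_cpus, num_gpus=num_gpus)(fn)
    refs = []
    for node in ray.nodes():
        if not node.get("Alive"):
            continue
        if ignore_head_node and node["NodeID"] == head_id:
            continue
        refs.append(
            remote_fn.options(
                scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
            ).remote(*args)
        )
    return ray.get(refs)


def _write_actor_overrides_file(path: str, body: str) -> None:
    Path(path).write_text(body)


def ensure_actor_overrides_on_all_nodes(ignore_head_node: bool = False) -> None:
    # Pin ray in actor venvs to the version the cluster is already running,
    # overriding the transitive pin from ai-dynamo[vllm].
    run_on_each_node(
        _write_actor_overrides_file,
        _ACTOR_VENV_OVERRIDES_PATH,
        f"ray=={ray.__version__}\n",
        ignore_head_node=ignore_head_node,
    )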

Usage

No user-facing API changes; the uv override handling runs internally when the Dynamo backend starts.
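A rough sketch of what the Dynamo backend does on start with this change in place (not code a user needs to write; module paths follow the PR description and may differ):

# Illustrative sketch of the internal flow; names follow the PR description.
import ray

from nemo_curator.core.serve.dynamo.vllm import ensure_actor_overrides_on_all_nodes

ray.init()

# Before any dynamo worker actor is created, write a one-line uv override file
# ("ray==<cluster version>") on every alive node.
ensure_actor_overrides_on_all_nodes()

# Each actor spawned with DYNAMO_VLLM_RUNTIME_ENV then resolves its venv with
# something equivalent to:
#   uv pip install "ai-dynamo[vllm]" --override /tmp/...overrides.txt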

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@praateekmahajan praateekmahajan requested review from a team as code owners May 7, 2026 16:54
@praateekmahajan praateekmahajan requested review from meatybobby and removed request for a team May 7, 2026 16:54
@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR upgrades ai-dynamo from 1.0.2 to 1.1.0 — which ships with vllm 0.19 and no longer requires a from-source flash-attn rebuild — cutting actor venv setup time significantly. It introduces a per-node uv --override file mechanism to pin ray to the cluster's installed version (working around dynamo's hard ray pin) and extracts get_head_node_id / run_on_each_node into a new nemo_curator/utils/ray_utils.py module used by both the backend setup path and the new override fan-out.

  • New ray_utils.py module: consolidates get_head_node_id, is_head_node, and the new run_on_each_node helper (with NodeAffinitySchedulingStrategy), replacing duplicated logic in backends/utils.py.
  • ensure_actor_overrides_on_all_nodes: materializes a ray=={ray.__version__} constraints file at a fixed tmp path on every alive node before any dynamo worker is spawned, so uv can resolve the pinned ray dependency without network access.
  • Simplified DYNAMO_VLLM_RUNTIME_ENV: drops the flash-attn reinstall/no-build-isolation flags and the 1800 s timeout, leaving a clean ai-dynamo[vllm] install with the override reference.

Confidence Score: 5/5

Safe to merge — the change is a well-scoped dependency bump plus a targeted refactor that extracts shared Ray helpers and replaces the flash-attn rebuild hack with a uv override file.

The core logic paths are all well-covered by existing and new tests. The override file fan-out correctly uses module-scope @ray.remote and ray.get to confirm writes on every node before actors land. The _setup_stage_on_node simplification (deriving NodeInfo from runtime context rather than passing it from the driver) is strictly safer. No new data-correctness or scheduling issues were found in the changed paths.

No files require special attention. nemo_curator/backends/utils.py is worth a quick read to confirm the log-then-schedule split in execute_setup_on_node matches your expectations for observability.

Important Files Changed

Filename Overview
nemo_curator/utils/ray_utils.py: New utility module; clean implementation of get_head_node_id (with module-level cache), is_head_node, and run_on_each_node with NodeAffinitySchedulingStrategy fan-out and ray.get(refs) collection.
nemo_curator/core/serve/dynamo/vllm.py: Removes flash-attn rebuild flags; adds _ACTOR_VENV_OVERRIDES_PATH (fixed at import time), @ray.remote _write_actor_overrides_file at module scope, and ensure_actor_overrides_on_all_nodes fan-out; DYNAMO_VLLM_RUNTIME_ENV simplified to a single package + --override reference.
nemo_curator/backends/utils.py: Removes get_head_node_id / is_head_node (moved to ray_utils); execute_setup_on_node now delegates fan-out to run_on_each_node per stage; _setup_stage_on_node now derives NodeInfo from runtime context rather than accepting it as an argument.
nemo_curator/core/serve/dynamo/backend.py: Adds ensure_actor_overrides_on_all_nodes call before _sweep_orphan_actors so the override file exists on every node before any actor with DYNAMO_VLLM_RUNTIME_ENV lands.
pyproject.toml: Bumps ai-dynamo from 1.0.2 to 1.1.0; all other dependency pins unchanged.
tests/utils/test_ray_utils.py: New test file covering run_on_each_node (result count, ignore_head_node filtering) and head node cache lifecycle.
tests/core/serve/dynamo/test_vllm.py: Adds TestEnsureActorOverridesOnAllNodes verifying the file is written with the correct ray version; patches _ACTOR_VENV_OVERRIDES_PATH cleanly via mock.patch.object (a sketch follows this table).
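Based on that last row, the override test amounts to roughly this (a sketch; the real test uses the shared_ray_client fixture and may differ in structure, and the module path is assumed from the file path above):

# Sketch of TestEnsureActorOverridesOnAllNodes; illustrative only.
from pathlib import Path
from unittest import mock

import ray

from nemo_curator.core.serve.dynamo import vllm as dynamo_vllm


def test_override_file_contains_running_ray_version(tmp_path: Path) -> None:
    # Assumes a Ray cluster is already initialized (the real test relies on the
    # shared_ray_client fixture for this).
    override_path = tmp_path / "overrides.txt"
    with mock.patch.object(dynamo_vllm, "_ACTOR_VENV_OVERRIDES_PATH", str(override_path)):
        dynamo_vllm.ensure_actor_overrides_on_all_nodes()
    assert f"ray=={ray.__version__}" in override_path.read_text()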

Sequence Diagram

sequenceDiagram
    participant D as Driver (DynamoBackend.start)
    participant RU as ray_utils.run_on_each_node
    participant N1 as Node 1
    participant N2 as Node N

    D->>RU: ensure_actor_overrides_on_all_nodes()
    RU->>N1: _write_actor_overrides_file(path, "ray==X.Y.Z")
    RU->>N2: _write_actor_overrides_file(path, "ray==X.Y.Z")
    RU-->>D: ray.get([refs]) - both writes confirmed

    D->>D: _sweep_orphan_actors()
    D->>D: remove_named_pgs_with_prefix()

    D->>N1: spawn actor (DYNAMO_VLLM_RUNTIME_ENV)
    Note over N1: uv install ai-dynamo[vllm]<br/>--override /tmp/...overrides.txt<br/>(file already present on node)
    D->>N2: spawn actor (DYNAMO_VLLM_RUNTIME_ENV)
    Note over N2: uv install ai-dynamo[vllm]<br/>--override /tmp/...overrides.txt<br/>(file already present on node)

Reviews (4): Last reviewed commit: "Fix dynamo runtime_env tests after dropp..."

Comment thread on nemo_curator/core/serve/dynamo/vllm.py (Outdated)
Comment on lines +94 to +96
alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
if not alive_nodes:
    return
P2 If ray.nodes() returns an empty list and the function exits early, DYNAMO_VLLM_RUNTIME_ENV still carries --override <path>, but the file is never written. Any actor spawned on any node will then fail with a uv "file not found" error for the override path. Inside ray.init() the head node is always alive, so this shouldn't trigger in production — but it silently swallows the failure rather than surfacing it. A log warning would make this condition observable without changing behaviour.

Suggested change
-alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
-if not alive_nodes:
-    return
+alive_nodes = [n for n in ray.nodes() if n.get("Alive")]
+if not alive_nodes:
+    logger.warning(
+        "ensure_actor_overrides_on_all_nodes: no alive Ray nodes found; "
+        "override file was NOT written — actor venv installs will fail if workers are spawned."
+    )
+    return

Comment thread on nemo_curator/core/serve/dynamo/vllm.py (Outdated)
Comment on lines +90 to +92
@ray.remote(num_cpus=0)
def _write_override(path: str, body: str) -> None:
    Path(path).write_text(body)
P2 Remote function re-registered on every call

_write_override is decorated with @ray.remote inside the function body, so each call to ensure_actor_overrides_on_all_nodes re-registers a new remote function object. In Ray, re-registering a remote function with the same name (e.g., after an autoscale event triggers a re-call) can produce spurious warnings and, in some Ray versions, cause serialization issues. Moving the definition to module scope avoids repeated registration and makes the intent clearer.
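To make the point concrete, here is an illustrative contrast (not the PR's exact code):

# Illustrative only: contrast between per-call and module-scope registration.
from pathlib import Path

import ray

# Problematic: a fresh remote function is registered on every call.
def ensure_overrides_nested(path: str, body: str) -> None:
    @ray.remote(num_cpus=0)
    def _write_override(p: str, b: str) -> None:
        Path(p).write_text(b)

    ray.get(_write_override.remote(path, body))

# Preferred: register once at import time and reuse the same remote function.
@ray.remote(num_cpus=0)
def _write_override(path: str, body: str) -> None:
    Path(path).write_text(body)

def ensure_overrides_module_scope(path: str, body: str) -> None:
    ray.get(_write_override.remote(path, body))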

Comment on lines +214 to +221
for stage in stages:
    run_on_each_node(
        _setup_stage_on_node,
        stage,
        ignore_head_node=ignore_head_node,
        num_cpus=stage.resources.cpus if stage.resources is not None else 1,
        num_gpus=stage.resources.gpus if stage.resources is not None else 0,
    )
P1 Stage setup serialized instead of parallelized

The original code collected all (stage, node) remote tasks into a single list and awaited them with one ray.get call, so all stages ran concurrently across all nodes. The refactored version calls run_on_each_node — which internally calls ray.get(refs) before returning — once per stage, so each stage must complete on every node before the next stage starts. With multiple ProcessingStage entries, setup time scales as the sum of per-stage times rather than the maximum, turning what was a parallel fan-out into a sequential pipeline. If stages are independent (which is the common case), this is an unintended performance regression.
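A sketch of the pre-refactor pattern the comment describes, reusing the names from the snippet above (illustrative, not the exact original code):

# Illustrative sketch: fan out every (stage, node) task first, then block once,
# so stage setup runs concurrently across stages and nodes.
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

def setup_all_stages_in_parallel(stages, nodes) -> None:
    refs = []
    for stage in stages:
        remote_setup = ray.remote(
            num_cpus=stage.resources.cpus if stage.resources is not None else 1,
            num_gpus=stage.resources.gpus if stage.resources is not None else 0,
        )(_setup_stage_on_node)
        for node in nodes:  # alive nodes, optionally excluding the head node
            refs.append(
                remote_setup.options(
                    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
                ).remote(stage)
            )
    ray.get(refs)  # single barrier: setup time scales with the slowest stage, not the sum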

ai-dynamo 1.1.0 pins vllm 0.19.0 with the same torch (==2.10.0) as
Curator's base vllm 0.18, so the inherited flash-attn wheel keeps a
matching ABI and the ~15 min source rebuild that 1.0.2 required is no
longer needed. setup_timeout drops from 1800s back to 600s.

The [vllm] extra hard-pins ray to a specific patch, conflicting with
Curator's own ray pin. Ray refuses actor venvs whose ray version
differs from the cluster head's, so we override the transitive pin
via a constraints file. uv has no inline override syntax, so the file
is materialized on every alive node by ensure_actor_overrides_on_all_nodes(),
which fans out a NodeAffinity-scheduled task per node from
DynamoBackend.start() before any worker is spawned.

Signed-off-by: Praateek <praateekm@gmail.com>
The override file content was hardcoded to "ray==2.55.1\n", which would
silently drift after the next Curator ray bump. Read the version from
the running interpreter at fan-out time so future bumps need no edit
here.

Tests:
- TestEnsureActorOverridesOnAllNodes uses shared_ray_client + tmp_path
  (same pattern as TestExecuteSetupOnNode in tests/backends/test_utils.py)
  to assert the file ends up containing the running ray version at the
  configured path.
- TestDynamoBackendStart now also asserts ensure_actor_overrides_on_all_nodes
  runs before any worker-spawning step, so the actor venv install on each
  worker can read the file from the local filesystem.

Signed-off-by: Praateek <praateekm@gmail.com>
Move per-node fan-out logic shared by execute_setup_on_node and
ensure_actor_overrides_on_all_nodes into a single
nemo_curator/utils/ray_utils.run_on_each_node helper, and relocate
get_head_node_id / is_head_node / _HEAD_NODE_ID_CACHE there alongside it.
Plumb ignore_head_node through ensure_actor_overrides_on_all_nodes and
have the Dynamo backend pass ignore_ray_head_node() at the call site so
the actor-venv override file isn't materialized on the head when head
scheduling is disabled.

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Two CPU tests asserted the now-removed workaround was wired up
(setup_timeout >= 1800 s, --reinstall-package/--no-build-isolation-package
flash-attn flags). Delete them since the rebuild is gone.

test_actor_runtime_env_imports_flash_attn was passing only because the
old runtime_env force-installed flash-attn into the actor venv. Without
the force-install, the actor venv inherits flash-attn only when Ray's
virtualenv clone (virtualenv-clone or --system-site-packages) picks it
up from the driver venv -- which CI's GPU-serve doesn't have, since
[inference_server] extra doesn't include flash-attn. Add a driver-side
pytest.importorskip so the guard skips when it cannot run, but still
fires on dev venvs that include flash-attn, catching the case where
ai-dynamo[vllm]'s transitive deps bump torch and break the inherited .so.

Signed-off-by: Praateek <praateekm@gmail.com>
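For reference, the driver-side guard described above amounts to roughly the following (a sketch; the real test body differs):

# Sketch of the guard described in the commit message above; illustrative only.
import pytest

def test_actor_runtime_env_imports_flash_attn() -> None:
    # Skip when the driver venv has no flash-attn to inherit (e.g. CI's GPU-serve
    # environment); still runs on dev venvs that include it.
    pytest.importorskip("flash_attn")
    ...  # spawn an actor with DYNAMO_VLLM_RUNTIME_ENV and import flash_attn inside it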
