Skip to content

remove nccl pd#1341

Open
hiworldwzj wants to merge 4 commits into
mainfrom
wzj_del_pd
Open

remove nccl pd#1341
hiworldwzj wants to merge 4 commits into
mainfrom
wzj_del_pd

Conversation

@hiworldwzj

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the prefill-decode (PD) disaggregation deployment by removing the legacy prefill and decode run modes in favor of nixl_prefill and nixl_decode. It also introduces a new NCCL-based transport backend for the NIXL path and cleans up associated legacy files, scripts, and documentation. The reviewer feedback correctly identifies critical issues: a missing import and exposure of NIXLDPForDecodeNode that would cause a runtime NameError, and a potential race condition in nccl_kv_transporter.py due to modifying self._comms without acquiring self._comm_create_lock.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 25 to 27
NIXLChunckedPrefillForPrefillNode,
NIXLDPChunkedForPrefillNode,
NIXLDecodeNode,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The class NIXLDPForDecodeNode is used in model_rpc.py (when args.dp > 1 and run_mode == 'nixl_decode') but is not imported in this file. This will result in a NameError at runtime. Please import NIXLDPForDecodeNode from the backend module.

Suggested change
NIXLChunckedPrefillForPrefillNode,
NIXLDPChunkedForPrefillNode,
NIXLDecodeNode,
NIXLChunckedPrefillForPrefillNode,
NIXLDPChunkedForPrefillNode,
NIXLDecodeNode,
NIXLDPForDecodeNode,

from .continues_batch.pd_mode.prefill_node_impl.prefill_impl_for_dp import DPChunkedForPrefillNode
from .pd_nixl.prefill_node_impl.prefill_impl import NIXLChunckedPrefillForPrefillNode
from .pd_nixl.prefill_node_impl.prefill_impl_for_dp import NIXLDPChunkedForPrefillNode
from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The class NIXLDPForDecodeNode is not exposed in mode_backend/__init__.py. Please import and expose it so that it can be imported by model_rpc.py.

Suggested change
from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode
from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode
from .pd_nixl.decode_node_impl.decode_impl_for_dp import NIXLDPForDecodeNode

Comment on lines +201 to +207
def remove_remote_agent(self, peer_name: str):
if peer_name in self.remote_agents:
self.remote_agents.pop(peer_name, None)
self.remote_stores.pop(peer_name, None)
comm = self._comms.pop(peer_name, None)
if comm is not None:
comm.destroy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In remove_remote_agent, self._comms.pop(peer_name, None) is called without acquiring self._comm_create_lock. Since self._comms is modified concurrently by other threads in _ensure_comm and _drop_comm under the lock, accessing/modifying it here without the lock can lead to race conditions and inconsistent state. Please wrap the pop and destroy operations in the _comm_create_lock context manager.

Suggested change
def remove_remote_agent(self, peer_name: str):
if peer_name in self.remote_agents:
self.remote_agents.pop(peer_name, None)
self.remote_stores.pop(peer_name, None)
comm = self._comms.pop(peer_name, None)
if comm is not None:
comm.destroy()
def remove_remote_agent(self, peer_name: str):
if peer_name in self.remote_agents:
self.remote_agents.pop(peer_name, None)
self.remote_stores.pop(peer_name, None)
with self._comm_create_lock:
comm = self._comms.pop(peer_name, None)
if comm is not None:
comm.destroy()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant