remove nccl pd#1341
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the prefill-decode (PD) disaggregation deployment by removing the legacy prefill and decode run modes in favor of nixl_prefill and nixl_decode. It also introduces a new NCCL-based transport backend for the NIXL path and cleans up associated legacy files, scripts, and documentation. The reviewer feedback correctly identifies critical issues: a missing import and exposure of NIXLDPForDecodeNode that would cause a runtime NameError, and a potential race condition in nccl_kv_transporter.py due to modifying self._comms without acquiring self._comm_create_lock.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| NIXLChunckedPrefillForPrefillNode, | ||
| NIXLDPChunkedForPrefillNode, | ||
| NIXLDecodeNode, |
There was a problem hiding this comment.
The class NIXLDPForDecodeNode is used in model_rpc.py (when args.dp > 1 and run_mode == 'nixl_decode') but is not imported in this file. This will result in a NameError at runtime. Please import NIXLDPForDecodeNode from the backend module.
| NIXLChunckedPrefillForPrefillNode, | |
| NIXLDPChunkedForPrefillNode, | |
| NIXLDecodeNode, | |
| NIXLChunckedPrefillForPrefillNode, | |
| NIXLDPChunkedForPrefillNode, | |
| NIXLDecodeNode, | |
| NIXLDPForDecodeNode, |
| from .continues_batch.pd_mode.prefill_node_impl.prefill_impl_for_dp import DPChunkedForPrefillNode | ||
| from .pd_nixl.prefill_node_impl.prefill_impl import NIXLChunckedPrefillForPrefillNode | ||
| from .pd_nixl.prefill_node_impl.prefill_impl_for_dp import NIXLDPChunkedForPrefillNode | ||
| from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode |
There was a problem hiding this comment.
The class NIXLDPForDecodeNode is not exposed in mode_backend/__init__.py. Please import and expose it so that it can be imported by model_rpc.py.
| from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode | |
| from .pd_nixl.decode_node_impl.decode_impl import NIXLDecodeNode | |
| from .pd_nixl.decode_node_impl.decode_impl_for_dp import NIXLDPForDecodeNode |
| def remove_remote_agent(self, peer_name: str): | ||
| if peer_name in self.remote_agents: | ||
| self.remote_agents.pop(peer_name, None) | ||
| self.remote_stores.pop(peer_name, None) | ||
| comm = self._comms.pop(peer_name, None) | ||
| if comm is not None: | ||
| comm.destroy() |
There was a problem hiding this comment.
In remove_remote_agent, self._comms.pop(peer_name, None) is called without acquiring self._comm_create_lock. Since self._comms is modified concurrently by other threads in _ensure_comm and _drop_comm under the lock, accessing/modifying it here without the lock can lead to race conditions and inconsistent state. Please wrap the pop and destroy operations in the _comm_create_lock context manager.
| def remove_remote_agent(self, peer_name: str): | |
| if peer_name in self.remote_agents: | |
| self.remote_agents.pop(peer_name, None) | |
| self.remote_stores.pop(peer_name, None) | |
| comm = self._comms.pop(peer_name, None) | |
| if comm is not None: | |
| comm.destroy() | |
| def remove_remote_agent(self, peer_name: str): | |
| if peer_name in self.remote_agents: | |
| self.remote_agents.pop(peer_name, None) | |
| self.remote_stores.pop(peer_name, None) | |
| with self._comm_create_lock: | |
| comm = self._comms.pop(peer_name, None) | |
| if comm is not None: | |
| comm.destroy() |
No description provided.