feat: add multi-platform support with ascend and maca by zhangts20 · Pull Request #1335 · ModelTC/LightLLM

zhangts20 · 2026-06-08T06:49:30Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces a unified platform abstraction layer (lightllm/platform) to support multiple hardware backends (CUDA, Ascend/NPU, MUSA, MACA) and implements a new Paged FlashAttention-3 backend (PagedFa3AttBackend). It refactors memory management, attention states, and multimodal models to use device-agnostic target devices and platform-specific runtimes. Feedback on these changes highlights several critical issues: returning CPU tensors instead of device tensors in ReqManager's paged allocation, using a fixed-size attention mask in PagedFa3PrefillAttState that risks out-of-bounds errors, allocating softmax_lse with an incorrect size of 1 during NPU graph capture, omitting data type conversion in BaseLayerWeight._to_device, and missing pinned memory allocation for CPU reference tensors in TpPartBaseModel which introduces synchronization overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-08T06:54:37Z

+        mask = ~need_new_page_mask
+        if mask.any():
+            token_idxs[mask] = b_last_mem_index[mask] + 1
+        return token_idxs


In the second branch of _alloc_paged_mem_indices, token_idxs is created and computed on the CPU. However, alloc_mem_indices is expected to return a device tensor (as it does in the first branch and when page_size == 1). Returning a CPU tensor here will cause a device mismatch crash during the model forward pass or when asserting model_input.mem_indexes.device == self.target_device. The returned tensor must be moved to the target device.

Suggested change

return token_idxs

return token_idxs.to(device=self.mem_manager.target_device)

gemini-code-assist · 2026-06-08T06:54:37Z

+        if self.atten_mask is None:
+            self.atten_mask = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(dtype=torch.int8, device=self.infer_state.input_ids.device)


The attention mask self.atten_mask is initialized with a fixed size of [2048, 2048]. If the input sequence length (max_kv_seq_len) exceeds 2048, this will cause out-of-bounds errors or incorrect attention computation. It should be dynamically sized or resized when max_kv_seq_len exceeds the current mask size. Additionally, allocating the tensor directly on the target device avoids unnecessary host-to-device copy overhead.

Suggested change

if self.atten_mask is None:

self.atten_mask = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(dtype=torch.int8, device=self.infer_state.input_ids.device)

if self.atten_mask is None or self.atten_mask.shape[0] < self.infer_state.max_kv_seq_len:

mask_size = max(2048, self.infer_state.max_kv_seq_len)

self.atten_mask = torch.triu(

torch.ones([mask_size, mask_size], dtype=torch.int8, device=self.infer_state.input_ids.device),

diagonal=1,

)

gemini-code-assist · 2026-06-08T06:54:37Z

+            v = v.view(-1, self.backend.page_size, N_KV * HEAD_DIM)
+
+            output = torch.empty_like(q)
+            softmax_lse = torch.empty(1, dtype=torch.float16, device=q.device)


During NPU graph capture, tensors must be pre-allocated with their exact shapes. Allocating softmax_lse with size 1 and passing it to .out will cause out-of-bounds memory writes during graph replay, leading to silent memory corruption or crashes. It should be allocated with the correct shape (e.g., [q.shape[0], q.shape[1], 1]) and dtype (typically torch.float32).

Suggested change

softmax_lse = torch.empty(1, dtype=torch.float16, device=q.device)

softmax_lse = torch.empty((q.shape[0], q.shape[1], 1), dtype=torch.float32, device=q.device)

gemini-code-assist · 2026-06-08T06:54:37Z

+    def _to_device(self, cpu_tensor: torch.Tensor) -> torch.Tensor:
+        return cpu_tensor.contiguous().to(device=self.target_device)


The new _to_device method does not convert the tensor to the target data type (self.data_type_), unlike the original _cuda method. This can lead to weight tensors remaining in float32 (or whatever their source format was), causing runtime type mismatches during forward pass or doubling the memory usage.

Suggested change

def _to_device(self, cpu_tensor: torch.Tensor) -> torch.Tensor:

return cpu_tensor.contiguous().to(device=self.target_device)

def _to_device(self, cpu_tensor: torch.Tensor) -> torch.Tensor:

return cpu_tensor.contiguous().to(device=self.target_device, dtype=self.data_type_)

gemini-code-assist · 2026-06-08T06:54:37Z

+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),


Allocating tensors on CPU and then moving them to the device using .to(get_current_device_id()) is inefficient and causes unnecessary host-to-device copies. It is much better to allocate them directly on the target device using device=self.model.target_device.

Suggested change

torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),

torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),

torch.empty(shared_len, dtype=torch.int32, device=self.model.target_device),

torch.empty(shared_len, dtype=torch.int32, device=self.model.target_device),

gemini-code-assist · 2026-06-08T06:54:37Z

+            self.b1_cu_q_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)
+            self.b_cu_kv_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)


The CPU reference tensors self.b1_cu_q_seq_len_cpu_ref and self.b_cu_kv_seq_len_cpu_ref are used for asynchronous copying and updating attention parameters on NPU streams. To avoid host-device synchronization overhead and enable fast asynchronous transfers, these tensors should be allocated with pin_memory=True.

Suggested change

self.b1_cu_q_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)

self.b_cu_kv_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)

self.b1_cu_q_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32, pin_memory=True)

self.b_cu_kv_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32, pin_memory=True)

gemini-code-assist · 2026-06-08T06:54:37Z

+            total_pages_needed, p_token_len = self._expand_by_page_size(b_token_len, page_size)
+            paged_token_idxs = self.mem_manager.alloc(total_pages_needed * page_size)
+            pages = paged_token_idxs.view(-1, page_size)
+            mask = torch.arange(page_size, device=p_token_len.device) < p_token_len.unsqueeze(1)


The boolean mask mask is created on the CPU because p_token_len is on the CPU. Indexing the device tensor pages with a CPU mask causes synchronous host-device transfers and slow CPU-GPU synchronization. The mask should be created directly on the device of pages.

Suggested change

mask = torch.arange(page_size, device=p_token_len.device) < p_token_len.unsqueeze(1)

mask = torch.arange(page_size, device=pages.device) < p_token_len.to(device=pages.device).unsqueeze(1)

zhangtaoshan added 18 commits May 22, 2026 22:23

Initial commit

3de45c4

Initial commit of BackendGraph

6468f7c

Fix BackendRuntime

84c1d86

Initial commit of BackendOps

98f3d47

Add dist_backend in BackendRuntime

c8b6a9a

Fix some Backend calls

b7e7988

Initial commit of ascend backend

f4d1eb5

support non-ascend platforms

6893d7e

optimize graph calls

daa9a22

fix non-ascend platforms

c7524db

fix about multimodal

f654ad9

add weight/infer ops

841a04d

fix about graph

ce503ab

misc

4c173be

add maca backend

c358abf

add register_host_register_worker

f09144b

fix about multimodal radix cache

ea3c0bd

refactor ops

8fa6213

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add multi-platform support with ascend and maca#1335

feat: add multi-platform support with ascend and maca#1335
zhangts20 wants to merge 18 commits into
ModelTC:mainfrom
zhangts20:refactor_platform_dev

zhangts20 commented Jun 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	return token_idxs
	return token_idxs.to(device=self.mem_manager.target_device)

		if self.atten_mask is None:
		self.atten_mask = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(dtype=torch.int8, device=self.infer_state.input_ids.device)

-        if self.atten_mask is None:
-            self.atten_mask = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(dtype=torch.int8, device=self.infer_state.input_ids.device)
+        if self.atten_mask is None or self.atten_mask.shape[0] < self.infer_state.max_kv_seq_len:
+            mask_size = max(2048, self.infer_state.max_kv_seq_len)
+            self.atten_mask = torch.triu(
+                torch.ones([mask_size, mask_size], dtype=torch.int8, device=self.infer_state.input_ids.device),
+                diagonal=1,
+            )

	softmax_lse = torch.empty(1, dtype=torch.float16, device=q.device)
	softmax_lse = torch.empty((q.shape[0], q.shape[1], 1), dtype=torch.float32, device=q.device)

		def _to_device(self, cpu_tensor: torch.Tensor) -> torch.Tensor:
		return cpu_tensor.contiguous().to(device=self.target_device)

		torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
		torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),

		self.b1_cu_q_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)
		self.b_cu_kv_seq_len_cpu_ref = torch.zeros(self.graph_max_batch_size, dtype=torch.int32)

	mask = torch.arange(page_size, device=p_token_len.device) < p_token_len.unsqueeze(1)
	mask = torch.arange(page_size, device=pages.device) < p_token_len.to(device=pages.device).unsqueeze(1)

Conversation

zhangts20 commented Jun 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant