DAOS-18541 cart: flush RMA puts in Mercury na_ucx to ensure remote co…#18378

Closed
wangshilong wants to merge 1 commit into release/2.8 from shilongw/DAOS-18541-2.8

Conversation

@wangshilong
Contributor

…mpletion

The ucx+dc_x data path completes a bulk fetch (server-side RDMA put) on
the put's local completion. For ucp_put_nbx, local completion only means
that the source buffer can be reused, not that the data is visible in the
remote process's memory. The server then replies, and the client verifies
the checksum against a destination buffer that is still empty or holds
stale (reused) bytes, producing a calculated checksum of 0 and a
spurious DER_CSUM (-2021). This race window widens under
congestion/timeouts, matching the observed rebuild checksum bursts.

Patch Mercury na_ucx so that, on a put's local completion, the endpoint
is flushed via ucp_ep_flush_nbx() to force remote completion, deferring
the NA completion to the flush callback. Gets are unaffected since their
local completion already implies data delivery. The behavior can be
disabled with NA_UCX_RMA_FLUSH=0 for comparison.
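
The deferral described above can be sketched as a small state model. This is
not the patch itself and does not call UCX or Mercury: every `sim_*` name is a
hypothetical stand-in, and the exact parsing of NA_UCX_RMA_FLUSH in the real
patch is an assumption (here, anything other than "0" leaves flushing on). The
point is only the ordering: a put's local completion posts a flush, and the NA
completion is reported from the flush callback; a get completes immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names exist in Mercury or UCX. */
enum sim_op_type { SIM_OP_PUT, SIM_OP_GET };

struct sim_op {
    enum sim_op_type type;
    bool flush_pending; /* a flush was posted and has not completed yet */
    bool na_completed;  /* completion was reported to the upper layer */
};

/* Assumed toggle semantics: unset or anything but "0" means "flush". */
static bool sim_parse_flush_toggle(const char *val)
{
    return val == NULL || strcmp(val, "0") != 0;
}

/* Reads the environment variable named in the PR description. */
static bool sim_rma_flush_enabled(void)
{
    return sim_parse_flush_toggle(getenv("NA_UCX_RMA_FLUSH"));
}

/* Stand-in for the flush callback: remote completion has been reached,
 * so the NA completion can now be reported. */
static void sim_on_flush_complete(struct sim_op *op)
{
    assert(op != NULL && op->flush_pending);
    op->flush_pending = false;
    op->na_completed = true;
}

/* Stand-in for the RMA local-completion callback. */
static void sim_on_local_complete(struct sim_op *op, bool flush_enabled)
{
    assert(op != NULL);
    if (op->type == SIM_OP_PUT && flush_enabled) {
        /* The real patch would post ucp_ep_flush_nbx() on the endpoint
         * here and return without reporting completion. */
        op->flush_pending = true;
        return;
    }
    /* Gets, or puts with flushing disabled, complete right away. */
    op->na_completed = true;
}
```

With flushing disabled, a put in this model completes on local completion,
which corresponds to the pre-patch behavior the toggle is meant to reproduce
for comparison.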

Add deps/patches/mercury/0004_na_ucx_rma_remote_completion.patch and
register it in utils/build.config.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'Rebuild stuck on Bear cluster'
Status is 'Awaiting Verification'
Labels: 'scrubbed_2.8,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18541

@daosbuild3
Collaborator

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong force-pushed the shilongw/DAOS-18541-2.8 branch from d747119 to 3621374 Compare May 29, 2026 08:39
@wangshilong wangshilong closed this Jun 3, 2026
@wangshilong
Contributor Author

Replaced.

@wangshilong wangshilong deleted the shilongw/DAOS-18541-2.8 branch June 3, 2026 06:18