DAOS-18541 cart: flush RMA puts in Mercury na_ucx to ensure remote co…#18378
Closed
wangshilong wants to merge 1 commit into
Ticket title is 'Rebuild stuck on Bear cluster'
Collaborator
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18378/2/testReport/
…mpletion

The ucx+dc_x data path completes a bulk fetch (server-side RDMA put) on the put's local completion. For ucp_put_nbx, local completion only means the source buffer can be reused, not that the data is visible in the remote process memory. The server then replies and the client verifies the checksum against a destination buffer that is still empty or holds stale (reused) bytes, producing a calculated checksum of 0 and a spurious DER_CSUM (-2021). The window widens under congestion/timeouts, matching the observed rebuild checksum bursts.

Patch Mercury na_ucx so that, on a put's local completion, the endpoint is flushed via ucp_ep_flush_nbx() to force remote completion, deferring the NA completion to the flush callback. Gets are unaffected since their local completion already implies data delivery. The behavior can be disabled with NA_UCX_RMA_FLUSH=0 for comparison.

Add deps/patches/mercury/0004_na_ucx_rma_remote_completion.patch and register it in utils/build.config.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
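The race described in the commit message can be illustrated with a toy model in plain C. This is only a sketch and does not use UCX or Mercury: `toy_put`, `toy_flush`, `toy_csum`, `flush_enabled`, and `toy_fetch_and_verify` are hypothetical names invented for illustration, the checksum is a simple byte sum rather than the DAOS checksum, and the rule that any value other than "0" leaves the flush enabled is an assumption about how NA_UCX_RMA_FLUSH is parsed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64

/* Toy model of one RDMA put: the data leaves the source buffer ("local
 * completion") before it lands in the destination ("remote completion"). */
struct toy_link {
    uint8_t in_flight[BUF_SIZE]; /* bytes sent but not yet visible remotely */
    size_t  in_flight_len;
    uint8_t remote[BUF_SIZE];    /* client-side destination buffer */
};

/* Hypothetical stand-in for ucp_put_nbx(): on return the source buffer is
 * reusable, but the remote buffer has NOT been updated yet. */
static void toy_put(struct toy_link *l, const uint8_t *src, size_t len)
{
    assert(len <= BUF_SIZE);
    memcpy(l->in_flight, src, len);
    l->in_flight_len = len;
}

/* Hypothetical stand-in for ucp_ep_flush_nbx(): forces remote completion
 * by delivering everything that is still in flight. */
static void toy_flush(struct toy_link *l)
{
    memcpy(l->remote, l->in_flight, l->in_flight_len);
    l->in_flight_len = 0;
}

/* Additive checksum, illustrative only (not the DAOS checksum). */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Assumed parsing of NA_UCX_RMA_FLUSH: flush unless the value is "0". */
static int flush_enabled(const char *env_value)
{
    return env_value == NULL || strcmp(env_value, "0") != 0;
}

/* Server-side put followed by the client's checksum verification.
 * Returns 0 on a match and -1 on a mismatch (the spurious checksum error). */
static int toy_fetch_and_verify(const uint8_t *src, size_t len, int do_flush)
{
    struct toy_link l;

    memset(&l, 0, sizeof(l));
    toy_put(&l, src, len);       /* local completion only */
    if (do_flush)
        toy_flush(&l);           /* remote completion before replying */
    /* The server would reply here; the client then checks its buffer. */
    return toy_csum(l.remote, len) == toy_csum(src, len) ? 0 : -1;
}
```

Calling `toy_fetch_and_verify(data, len, flush_enabled(getenv("NA_UCX_RMA_FLUSH")))` (with `<stdlib.h>` included for `getenv`) mirrors the comparison switch: with the flush skipped, the destination is still zero-filled, so the computed checksum is 0 and verification fails for any payload with a non-zero sum.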
d747119 to 3621374
Collaborator
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18378/3/testReport/
Contributor
Author
Replaced.