
DAOS-18976 rebuild: migration fetch/enumerate should retry for network error #18387

Open
gnailzenh wants to merge 1 commit into release/2.6 from liang/b2_6_migrate_retry


@gnailzenh (Collaborator) commented May 30, 2026

  • There is a clear asymmetry between the scan (push) side and the pull (fetch) side:

    . Scan side (rebuild_objects_send_ult): already retries on every error matched by
    daos_crt_network_error(), so transient network errors are handled properly when
    pushing OID lists to pullers.

    . Pull side (mrone_obj_fetch_internal): does NOT retry on network errors when
    fetching data from the source.

    . This patch makes the two sides consistent: both always retry on network errors.

  • A rebuild IV refresh can arrive out of order; make sure it does not revert the global done flag.
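
The two changes described above can be illustrated with a minimal, standalone C sketch. It is not the code from this patch: every `sk_`/`SK_` name is hypothetical, `sk_is_network_error()` only stands in for `daos_crt_network_error()`, and the retry bound is an assumption made so the example terminates (the description says the patch always retries on network errors). The second part shows one way to keep a done flag from being reverted by a stale refresh, by treating it as sticky.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch; DAOS uses its own DER_* values. */
enum { SK_OK = 0, SK_ERR_NET = -1, SK_ERR_FATAL = -2 };

/* Stand-in for daos_crt_network_error(): true for errors treated as transient. */
static bool sk_is_network_error(int rc)
{
	return rc == SK_ERR_NET;
}

typedef int (*sk_fetch_fn)(void *arg);

/*
 * Call fetch() and repeat it while it fails with a network error.
 * max_retries bounds the extra attempts (an assumption of this sketch).
 * Any other result, success or a non-network error, is returned at once.
 */
static int sk_fetch_with_retry(sk_fetch_fn fetch, void *arg, int max_retries)
{
	int rc;
	int attempt = 0;

	assert(fetch != NULL);
	do {
		rc = fetch(arg);
	} while (sk_is_network_error(rc) && attempt++ < max_retries);
	return rc;
}

/* Simulated data source: fails with a network error a few times first. */
struct sk_flaky {
	int failures_left; /* simulated network failures still to return */
	int final_rc;      /* result once the failures are exhausted */
	int calls;         /* total invocations, for inspection */
};

static int sk_flaky_fetch(void *arg)
{
	struct sk_flaky *f = arg;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return SK_ERR_NET;
	}
	return f->final_rc;
}

/* Minimal stand-in for the rebuild state carried by IV refreshes. */
struct sk_rebuild_state {
	bool global_done;
};

/*
 * Apply a refresh. The flag is sticky: once set it stays set, so a stale
 * refresh that arrives late with done == false cannot clear it.
 */
static void sk_apply_iv_refresh(struct sk_rebuild_state *st, bool incoming_done)
{
	st->global_done = st->global_done || incoming_done;
}
```

With this shape, a source that fails twice and then succeeds is fetched in three calls, a non-network error is returned after a single call, and applying a `false` refresh after a `true` one leaves `global_done` set.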

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@gnailzenh gnailzenh requested review from a team as code owners May 30, 2026 08:58
@github-actions

Ticket title is 'Aurora rebuild failing with DER_HG / DER_SHUTDOWN'
Status is 'Awaiting Verification'
Labels: 'test_2.6.5rc1'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18976

@github-actions github-actions Bot added the priority Ticket has high priority (automatically managed) label May 30, 2026
…k error

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@gnailzenh gnailzenh force-pushed the liang/b2_6_migrate_retry branch from 86b2454 to eb3f0d2 May 31, 2026 15:26
@gnailzenh gnailzenh changed the title from "DAOS-18976 rebuild: migration fetch should retry for network error" to "DAOS-18976 rebuild: migration fetch/enumerate should retry for network error" May 31, 2026
@gnailzenh gnailzenh requested a review from wangshilong June 1, 2026 02:25
Labels

priority Ticket has high priority (automatically managed)

3 participants