[ci_dcn_site] Add retry logic to Nova aggregate creation #3968

Open
tosky wants to merge 1 commit into openstack-k8s-operators:main from tosky:dcn_nova_aggregate_creation_retry

tosky commented May 29, 2026

Add retry logic (10 attempts, 30s delay) to the Nova aggregate creation task to handle transient MessageDeliveryFailure errors during RabbitMQ restarts or queue rebalancing.

This aligns with the existing defensive coding pattern used throughout the ci_dcn_site role, where similar k8s_exec and Kubernetes API operations already include retry logic (see pre-ceph.yml, post-ceph.yml, etc.).

Root cause: DataPlaneDeployment triggers a RabbitMQ queue rebalance during DCN deployment, causing rolling restarts. Nova aggregate creation can fail with MessageDeliveryFailure if it is attempted during this window.

This patch provides reactive recovery through retries. Total retry time is up to 5 minutes (10 × 30s), which covers typical RabbitMQ restart windows observed in CI.
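
For illustration only, the retry budget described above can be sketched in Python. This is not the role's code; `retry`, `TransientError`, and the injectable `sleep` parameter are hypothetical names, and the sketch assumes one initial attempt followed by up to 10 retries, so the waiting time alone is bounded by 10 × 30s = 300s (command execution time is not counted).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a transient failure such as MessageDeliveryFailure."""


def retry(
    op: Callable[[], T],
    retries: int = 10,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op until it succeeds, sleeping `delay` seconds between attempts.

    Assumption: one initial attempt plus up to `retries` retries, so at most
    `retries * delay` seconds are spent sleeping.
    """
    for attempt in range(retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == retries:
                raise  # budget exhausted, surface the last error
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing a recording function as `sleep` (for example `slept.append`) makes it possible to check the total waiting time without actually sleeping.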

Related-Issue: DCN deployment failure with MessageDeliveryFailure

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jokke-ilujo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The Nova aggregate create API call can return HTTP 500 (MessageDeliveryFailure)
when RabbitMQ restarts during a DCN deployment triggered by a queue rebalance.
However, the aggregate is written to the Nova DB before the scheduler fanout
fails, so the resource actually exists despite the error response.

Retrying the create then fails permanently with HTTP 409 (ConflictException:
Aggregate already exists), exhausting all retry attempts without ever succeeding.

Fix this by following the established check-then-create pattern used across
this role and in roles/federation/tasks/run_openstack_setup.yml:
- Mark the create task with ignore_errors: true (consistent with the
  surrounding tasks in this file: aggregate show at line 19, add host at
  line 45), so a transient 500 does not abort the play.
- Add a dedicated verification task that uses the existing retry pattern
  (retries/delay/until: rc == 0) to confirm the aggregate exists, polling
  until the RabbitMQ-induced transient failure has passed. This task is
  gated on the same when condition so it only runs when a creation was
  attempted.
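
The failure mode and the fix described above can be sketched with a small in-memory model. This is a hedged illustration, not the role's Ansible tasks: `FakeNova`, `ApiError`, `naive_retry_create`, `create_then_verify`, and the aggregate name `az1` are all hypothetical, and the model additionally assumes that `show` can also fail while the broker is down, purely to exercise the polling loop.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical error carrying an HTTP-like status code."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"HTTP {status}: {message}")
        self.status = status


class FakeNova:
    """In-memory stand-in for the aggregate API (not a real client)."""

    def __init__(self, broker_down_calls: int) -> None:
        self.aggregates = set()
        # Number of upcoming calls during which the broker is unavailable.
        self.broker_down_calls = broker_down_calls

    def _broker_up(self) -> bool:
        if self.broker_down_calls > 0:
            self.broker_down_calls -= 1
            return False
        return True

    def create(self, name: str) -> None:
        if name in self.aggregates:
            raise ApiError(409, "Aggregate already exists")
        self.aggregates.add(name)  # DB write happens before the fanout
        if not self._broker_up():
            raise ApiError(500, "MessageDeliveryFailure")

    def show(self, name: str) -> str:
        if not self._broker_up():  # simplifying assumption, see lead-in
            raise ApiError(500, "MessageDeliveryFailure")
        if name not in self.aggregates:
            raise ApiError(404, "Aggregate not found")
        return name


def naive_retry_create(api: FakeNova, name: str, retries: int = 10,
                       delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Retry the create itself: gets stuck on 409 after a persisted 500."""
    for attempt in range(retries + 1):
        try:
            api.create(name)
            return
        except ApiError:
            if attempt == retries:
                raise
            sleep(delay)


def create_then_verify(api: FakeNova, name: str, retries: int = 10,
                       delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Create once, ignore its error, then poll until the aggregate is visible."""
    try:
        api.create(name)
    except ApiError:
        pass  # plays the role of ignore_errors: true
    for attempt in range(retries + 1):
        try:
            api.show(name)
            return
        except ApiError:
            if attempt == retries:
                raise
            sleep(delay)
```

In this model, `naive_retry_create(FakeNova(1), "az1", sleep=lambda s: None)` ends with a 409 error, while `create_then_verify(FakeNova(3), "az1", sleep=lambda s: None)` returns normally once the simulated broker is back.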

Root cause: DataPlaneDeployment applies Glance az0 config, triggering a
RabbitMQ queue rebalance and rolling restart. Nova aggregate creation is
attempted during this window and the scheduler fanout fails.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Related-Issue: DCN deployment failure with MessageDeliveryFailure
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
tosky force-pushed the dcn_nova_aggregate_creation_retry branch from 172af74 to 7d45b2a May 31, 2026 20:32
@centosinfra-prod-github-app

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/e8f469accb2a4100add91ed3e9dc1830

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 27m 08s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 27m 21s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 35m 49s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 2h 12m 54s
✔️ cifmw-pod-zuul-files SUCCESS in 6m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 19s
❌ cifmw-pod-pre-commit FAILURE in 11m 20s
✔️ cifmw-molecule-ci_dcn_site SUCCESS in 2m 41s
