[ci_dcn_site] Add retry logic to Nova aggregate creation by tosky · Pull Request #3968 · openstack-k8s-operators/ci-framework

tosky · 2026-05-29T07:38:56Z

Add retry logic (10 attempts, 30s delay) to the Nova aggregate creation task to handle transient MessageDeliveryFailure errors during RabbitMQ restarts or queue rebalancing.

This aligns with the existing defensive coding pattern used throughout the ci_dcn_site role, where similar k8s_exec and Kubernetes API operations already include retry logic (see pre-ceph.yml, post-ceph.yml, etc.).

Root cause: DataPlaneDeployment triggers RabbitMQ queue rebalance during DCN deployment, causing rolling restarts. Nova aggregate creation can fail with MessageDeliveryFailure if attempted during this window.

This patch provides reactive recovery through retries. Total retry time is up to 5 minutes (10 × 30s), which covers typical RabbitMQ restart windows observed in CI.
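The retry budget described above can be sketched outside Ansible as a fixed-delay retry loop. This is only an illustration of the arithmetic, not the task from the patch: `retry`, `TransientError`, and the injectable `sleep` parameter are hypothetical names, and the sketch assumes one initial attempt followed by up to 10 retries, each preceded by a 30 s wait, which gives the 10 × 30s = 300s worst case quoted in the description.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a transient failure such as MessageDeliveryFailure."""


def retry(operation: Callable[[], T], retries: int = 10, delay: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on TransientError with a fixed delay.

    Assumption: one initial attempt plus up to `retries` retries, so the
    worst-case time spent waiting is retries * delay seconds.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # budget exhausted, surface the last error
            sleep(delay)  # wait before the next attempt
    raise AssertionError("unreachable")
```

Passing a fake `sleep` (for example `waits.append`) makes the budget easy to inspect without waiting in real time.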

Related-Issue: DCN deployment failure with MessageDeliveryFailure

openshift-ci · 2026-05-29T07:39:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jokke-ilujo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

roles/ci_dcn_site/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The Nova aggregate create API call can return HTTP 500 (MessageDeliveryFailure) when RabbitMQ restarts during a DCN deployment triggered by a queue rebalance. However, the aggregate is written to the Nova DB before the scheduler fanout fails, so the resource actually exists despite the error response. Retrying the create then fails permanently with HTTP 409 (ConflictException: Aggregate already exists), exhausting all retry attempts without ever succeeding.

Fix this by following the established check-then-create pattern used across this role and in roles/federation/tasks/run_openstack_setup.yml:

- Mark the create task with ignore_errors: true (consistent with the surrounding tasks in this file: aggregate show at line 19, add host at line 45), so a transient 500 does not abort the play.
- Add a dedicated verification task that uses the existing retry pattern (retries/delay/until: rc == 0) to confirm the aggregate exists, polling until the RabbitMQ-induced transient failure has passed. This task is gated on the same when condition so it only runs when a creation was attempted.

Root cause: DataPlaneDeployment applies Glance az0 config, triggering a RabbitMQ queue rebalance and rolling restart. Nova aggregate creation is attempted during this window and the scheduler fanout fails.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Related-Issue: DCN deployment failure with MessageDeliveryFailure
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
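The failure mode described in the commit message can be reproduced with a toy model. The sketch below is not the Nova API or the Ansible tasks from this patch: `FakeNova`, `naive_retry_create`, and `create_then_verify` are hypothetical names, and the model assumes only what the message states, namely that the DB write happens before the fanout, plus the additional assumption that lookups can also fail while the broker is down.

```python
class MessageDeliveryFailure(Exception):
    """Stand-in for the HTTP 500 seen while RabbitMQ is restarting."""


class Conflict(Exception):
    """Stand-in for HTTP 409 (aggregate already exists)."""


class NotFound(Exception):
    """Stand-in for a lookup of an aggregate that does not exist."""


class FakeNova:
    """Toy model: create() persists the aggregate before the fanout."""

    def __init__(self, outage_calls):
        self.aggregates = set()
        self.outage_calls = outage_calls  # upcoming calls that hit the outage

    def _broker_up(self):
        if self.outage_calls > 0:
            self.outage_calls -= 1
            return False
        return True

    def create(self, name):
        if name in self.aggregates:
            raise Conflict(name)
        self.aggregates.add(name)  # written to the DB first
        if not self._broker_up():
            raise MessageDeliveryFailure(name)  # fanout fails afterwards

    def show(self, name):
        if not self._broker_up():  # assumption: lookups may fail transiently
            raise MessageDeliveryFailure(name)
        if name not in self.aggregates:
            raise NotFound(name)
        return name


def naive_retry_create(nova, name, retries=10):
    """Retry the create itself: after the first 500, every retry gets a 409."""
    errors = []
    for _ in range(retries + 1):
        try:
            nova.create(name)
            return errors
        except (MessageDeliveryFailure, Conflict) as exc:
            errors.append(type(exc).__name__)
    raise RuntimeError(f"create never succeeded: {errors}")


def create_then_verify(nova, name, retries=10):
    """Ignore the create error, then poll until the aggregate is visible."""
    try:
        nova.create(name)
    except (MessageDeliveryFailure, Conflict):
        pass  # plays the role of ignore_errors: true
    for _ in range(retries + 1):
        try:
            return nova.show(name)
        except (MessageDeliveryFailure, NotFound):
            continue  # a real task would wait `delay` seconds here
    raise RuntimeError(f"aggregate {name!r} not visible after retries")
```

In this model the verification step still fails when the aggregate was never persisted, so ignoring the create error does not hide a genuine failure.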

centosinfra-prod-github-app · 2026-05-31T23:10:35Z

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/e8f469accb2a4100add91ed3e9dc1830

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 27m 08s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 27m 21s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 35m 49s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 2h 12m 54s
✔️ cifmw-pod-zuul-files SUCCESS in 6m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 19s
❌ cifmw-pod-pre-commit FAILURE in 11m 20s
✔️ cifmw-molecule-ci_dcn_site SUCCESS in 2m 41s

github-actions Bot added the Ready For Review label May 29, 2026

tosky force-pushed the dcn_nova_aggregate_creation_retry branch from 172af74 to 7d45b2a on May 31, 2026, 20:32

github-actions Bot removed the Ready For Review label May 31, 2026

tosky wants to merge 1 commit into openstack-k8s-operators:main from tosky:dcn_nova_aggregate_creation_retry
