[ci_dcn_site] Add retry logic to Nova aggregate creation#3968
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
The Nova aggregate create API call can return HTTP 500 (MessageDeliveryFailure) when RabbitMQ restarts during a DCN deployment triggered by a queue rebalance. However, the aggregate is written to the Nova DB before the scheduler fanout fails, so the resource actually exists despite the error response. Retrying the create then fails permanently with HTTP 409 (ConflictException: Aggregate already exists), exhausting all retry attempts without ever succeeding. Fix this by following the established check-then-create pattern used across this role and in roles/federation/tasks/run_openstack_setup.yml: - Mark the create task with ignore_errors: true (consistent with the surrounding tasks in this file: aggregate show at line 19, add host at line 45), so a transient 500 does not abort the play. - Add a dedicated verification task that uses the existing retry pattern (retries/delay/until: rc == 0) to confirm the aggregate exists, polling until the RabbitMQ-induced transient failure has passed. This task is gated on the same when condition so it only runs when a creation was attempted. Root cause: DataPlaneDeployment applies Glance az0 config, triggering a RabbitMQ queue rebalance and rolling restart. Nova aggregate creation is attempted during this window and the scheduler fanout fails. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Related-Issue: DCN deployment failure with MessageDeliveryFailure Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
172af74 to
7d45b2a
Compare
|
Build failed (check pipeline). Post ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 27m 08s |
Add retry logic (10 attempts, 30s delay) to the Nova aggregate creation task to handle transient MessageDeliveryFailure errors during RabbitMQ restarts or queue rebalancing.
This aligns with the existing defensive coding pattern used throughout the ci_dcn_site role, where similar k8s_exec and Kubernetes API operations already include retry logic (see pre-ceph.yml, post-ceph.yml, etc.).
Root cause: DataPlaneDeployment triggers RabbitMQ queue rebalance during DCN deployment, causing rolling restarts. Nova aggregate creation can fail with MessageDeliveryFailure if attempted during this window.
This patch provides reactive recovery through retries. Total retry time is up to 5 minutes (10 × 30s), which covers typical RabbitMQ restart windows observed in CI.
Related-Issue: DCN deployment failure with MessageDeliveryFailure