Fix workspace deletion race condition in AMPLS private endpoint #4764
Copilot wants to merge 25 commits into
…PLS private endpoints Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
/test-extended c32a87a
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/19553208613 (with refid (in response to this comment from @marrobi)
…o fix deletion ordering Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
The test revealed a different root cause. The error was that Private DNS Zones couldn't be deleted because their Virtual Network Links still existed. I've added |
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
/test-extended d66e33f
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/19571335811 (with refid (in response to this comment from @marrobi)
/test-extended d66e33f
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/19576408604 (with refid (in response to this comment from @marrobi)
Unit Test Results
0 tests 0 ✅ 0s ⏱️
Results for commit 101932b. ♻️ This comment has been updated with latest results.
@copilot please resolve: Code update without version change
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
/test-extended 963842a
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/21690761743 (with refid (in response to this comment from @marrobi)
@copilot still seeing Step 1: Main step for 615ec8ca-dcac-4083-ba47-7a620f503bac
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Updated in commit e1caf5f. Changed the azapi_resource configuration to use
@copilot not sure jsonencode is valid for the latest versions of Az API provider.
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Reverted to plain HCL syntax in commit cb12424. Removed
/test-extended a5550fc
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/22372694381 (with refid (in response to this comment from @marrobi)
@copilot The current approach still won't fix the deletion failure because it's missing the actual root cause. Please make the following change.
Root cause (important): This is NOT an ordering problem. The error is a transient, retryable 409
Required change in
Add these two things to that resource:
retry = {
  error_message_regex  = ["AnotherOperationInProgress", "Another operation on this or dependent resource is in progress"]
  interval_seconds     = 15
  max_interval_seconds = 90
}
timeouts {
  create = "30m"
  delete = "30m"
}
Also please:
Context on why prior runs failed: the azapi extraction and the ordering tweaks were correct structure but never added retry-on-conflict, so the same 409 kept surfacing on destroy. The retry is the missing piece.
A couple of notes:
Done in the latest commit. Added
retry = {
  error_message_regex  = ["AnotherOperationInProgress", "Another operation on this or dependent resource is in progress"]
  interval_seconds     = 15
  max_interval_seconds = 90
}
timeouts {
  create = "30m"
  delete = "30m"
}
/test-extended
🤖 pr-bot 🤖 (in response to this comment from @marrobi)
/test-extended 101932b
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/28264307435 (with refid (in response to this comment from @marrobi)
Resolves #3194
What is being addressed
Workspace deletions fail intermittently when Terraform attempts to delete Azure Monitor resources. The issue manifests as a transient 409 AnotherOperationInProgress error when deleting the AMPLS private DNS zone group: Azure performs internal cleanup of record sets and zone links that overlaps with the private endpoint deletion, all within a single delete API call. This is not an ordering problem but a retryable transient conflict.
How is this addressed
Extracted the private_dns_zone_group block from within the azurerm_private_endpoint resource and created it as a separate azapi_resource using the Azure Resource Manager API (Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2023-11-01). This uses the already-required azapi provider and is compatible with the azurerm provider version used in the porter bundle build environment.
Added a retry attribute (azapi v2 object syntax) to azapi_resource.azure_monitor_dns_zone_group so that transient AnotherOperationInProgress 409 responses during delete are automatically retried rather than failing immediately.
Added a 30-minute create and delete timeout to bound the retry window.
Added two dependencies (ampls_app_insights and ampls_log_anaytics) to the private endpoint's depends_on list to ensure proper creation and deletion ordering.
Technical Details
Key resource configuration:
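A minimal sketch of how that resource might look, assembled from the details given in this PR (resource type and API version, retry settings, timeouts). The zone group name, the private endpoint and DNS zone references, and the single zone config shown are assumptions for illustration, not the merged code:

```hcl
resource "azapi_resource" "azure_monitor_dns_zone_group" {
  type = "Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2023-11-01"

  # Assumed name and parent reference; substitute the workspace's own values.
  name      = "azure-monitor-private-dns-zone-group"
  parent_id = azurerm_private_endpoint.azure_monitor_private_endpoint.id

  # Plain HCL object body (azapi v2 style), not jsonencode().
  body = {
    properties = {
      privateDnsZoneConfigs = [
        {
          # Hypothetical config; the PR describes five zone configs in total.
          name = "monitor"
          properties = {
            privateDnsZoneId = data.azurerm_private_dns_zone.azure_monitor.id
          }
        }
      ]
    }
  }

  # Retry transient 409 conflicts instead of failing on the first response.
  retry = {
    error_message_regex  = ["AnotherOperationInProgress", "Another operation on this or dependent resource is in progress"]
    interval_seconds     = 15
    max_interval_seconds = 90
  }

  # Upper bound on how long retries may continue.
  timeouts {
    create = "30m"
    delete = "30m"
  }
}
```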
The retry attribute ensures that when Azure returns a transient 409 conflict during the DNS zone group delete (due to internal cleanup of its 5 zone configs), the azapi provider retries automatically within the 30-minute delete timeout window.
References