Enable soft delete for workspace backup Recovery Services vaults, fix core public IP lifecycle issues, and fix e2e token expiry#4907
Conversation
Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/c847d284-ba5e-406d-ba8d-c5c4eaa47d84 Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
|
@copilot no, dont use azapi: Enabling soft delete does not prevent you from deleting the Backup Vault itself, as long as there are no protected (or soft‑deleted) backup items still in it. So set it explicitly to on. But consider side effects and need to maintain current functionality. Also ensure the azureRm provider is the latest release (pinned). |
Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/09dce1c2-3f0c-4c8a-b9bc-950d78b75e6d Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Addressed in 3d11658. The vault is back on AzureRM, Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
There was a problem hiding this comment.
Pull request overview
This PR updates the workspace base Terraform template to comply with Azure’s requirement that soft delete is enabled on newly created Recovery Services Vaults, while preserving current workspace deletion/cleanup behavior by ensuring protected items are purged on destroy.
Changes:
- Enable
soft_delete_enabled = truefor the workspace backupazurerm_recovery_services_vault. - Configure AzureRM provider
recovery_servicefeatures to purge protected items and avoid retaining VM backup data on destroy. - Bump the workspace base AzureRM provider pin/lockfile to
4.58.0and update the template version + changelog entry.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| templates/workspaces/base/terraform/providers.tf | Bumps AzureRM to 4.58.0 and configures provider recovery services destroy behavior. |
| templates/workspaces/base/terraform/backup/backup.tf | Enables soft delete on the workspace backup Recovery Services Vault. |
| templates/workspaces/base/terraform/.terraform.lock.hcl | Updates AzureRM provider lock to 4.58.0 and corresponding hashes. |
| templates/workspaces/base/porter.yaml | Bumps tre-workspace-base bundle version to 2.8.4. |
| CHANGELOG.md | Documents the fix under BUG FIXES. |
Files not reviewed (1)
- templates/workspaces/base/terraform/.terraform.lock.hcl: Language not supported
Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/fb2a7fcc-0cc4-47ff-acac-961ef1cb9334 Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
/test-extended 43c865c |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26244326090 (with refid (in response to this comment from @marrobi) |
…file_share not vice versa Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/9233b422-0f60-414c-af10-9fe74ed1f471 Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Fixed the dependency direction in commit 2834a83. |
|
/test-extended 2834a83 |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26467689287 (with refid (in response to this comment from @marrobi) |
|
@copilot The error message in the latest run confirms the real container name uses lowercase …but the script has capital container_name="StorageContainer;Storage;${self.input.resource_group_name};${self.input.storage_account_name}"So |
|
/test-extended 6cf6d9b |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid (in response to this comment from @marrobi) |
2 similar comments
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid (in response to this comment from @marrobi) |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid (in response to this comment from @marrobi) |
|
🤖 pr-bot 🤖
(in response to this comment from @marrobi) |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid (in response to this comment from @marrobi) |
1 similar comment
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid (in response to this comment from @marrobi) |
|
The poll approach can't work because the issue isn't transient. Walking through what happens on destroy with the current code:
Polling can never recover from this — the container won't spontaneously leave FixReplace the passive poll with an active pre-destroy step that:
Replace resource "terraform_data" "prepare_backup_for_destroy" {
count = var.enable_backup ? 1 : 0
input = {
resource_group_name = azurerm_resource_group.ws.name
vault_name = module.backup[0].vault_name
storage_account_name = azurerm_storage_account.stg.name
subscription_id = data.azurerm_client_config.current.subscription_id
}
provisioner "local-exec" {
when = destroy
interpreter = ["/bin/bash", "-c"]
command = <<EOT
set -euo pipefail
az login --identity
az account set --subscription "${self.input.subscription_id}"
vault="${self.input.vault_name}"
rg="${self.input.resource_group_name}"
container_name="StorageContainer;storage;${self.input.resource_group_name};${self.input.storage_account_name}"
echo "Disabling soft delete on Recovery Services vault '$vault' so destroy can hard-delete protected items..."
az backup vault backup-properties set \
--name "$vault" --resource-group "$rg" \
--soft-delete-feature-state Disable --output none
for attempt in 1 2 3 4 5 6; do
state=$(az backup vault backup-properties show \
--name "$vault" --resource-group "$rg" \
--query softDeleteFeatureState -o tsv || echo "")
echo "Attempt $attempt: softDeleteFeatureState='$state'"
[ "$state" = "Disabled" ] && break
sleep 10
done
status=$(az backup container show \
--name "$container_name" --resource-group "$rg" --vault-name "$vault" \
--backup-management-type AzureStorage \
--query properties.registrationStatus -o tsv 2>/dev/null || echo "NotFound")
echo "Container status before destroy: '$status'"
if [ "$status" = "SoftDeleted" ]; then
echo "Container is soft-deleted; re-registering before AzureRM destroy."
az backup container re-register \
--resource-group "$rg" --vault-name "$vault" \
--backup-management-type AzureStorage --workload-type AzureFileShare \
--container-name "$container_name" --yes --output none || echo "re-register failed, continuing"
for attempt in 1 2 3 4 5 6 7 8 9 10; do
status=$(az backup container show \
--name "$container_name" --resource-group "$rg" --vault-name "$vault" \
--backup-management-type AzureStorage \
--query properties.registrationStatus -o tsv 2>/dev/null || echo "NotFound")
echo "Attempt $attempt: container status='$status'"
[ "$status" = "Registered" ] || [ "$status" = "NotFound" ] && break
sleep 30
done
fi
echo "Backup vault prepared for destroy."
EOT
}
depends_on = [
azurerm_backup_container_storage_account.storage_account,
azurerm_backup_protected_file_share.file_share,
]
}The The provider |
Disable vault soft delete before destroy and re-register soft-deleted containers so AzureRM's Disable Protection + Delete Protected Item can hard-delete rather than soft-delete, preventing BMSUserErrorContainerNotUndeleted.
|
/test-extended ac12799 |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26575855619 (with refid (in response to this comment from @marrobi) |
|
@copilot the token is expiring during tests. On the e2e harness side: refresh the bearer token before each check_deployment poll. We migyt revisot this once PR is complete. |
Add optional token_fn callable to wait_for, post_resource, and disable_and_delete_resource. When provided, token_fn() is called before each poll to obtain fresh auth headers, preventing token expiry failures during long-running operations (e.g. workspace destroy with backup). Update disable_and_delete_tre_resource and disable_and_delete_ws_resource in conftest.py to pass a token_fn that re-fetches a token for the correct scope before each poll.
|
/test-extended 11334d7 |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26593745738 (with refid (in response to this comment from @marrobi) |
What is being addressed
Workspace creation with backup enabled can fail when the workspace backup Recovery Services Vault is created with soft delete disabled. Azure now requires soft delete to be enabled for new vaults, so attempts to create the vault with
soft_delete_enabled = falsecan be rejected and leave the workspace deployment stuck.This change also needs to preserve current workspace delete behavior so backup-related resources do not block cleanup. Workspace destroy fails with
BMSUserErrorContainerNotUndeletedbecause when vault soft delete is enabled (Azure's default), AzureRM'sDisable Protectioncall moves the protected item intoSoftDeletedstate and the subsequentDelete Protected Itemcall is then rejected by the Backup service. Thepurge_protected_items_from_vault_on_destroyprovider feature does not reliably handle this case (see hashicorp/terraform-provider-azurerm#28264).Additionally, Azure automatically sets
ip_tagson public IPs (e.g."FirstPartyUsage": "/Unprivileged"), which causes Terraform to force replacement of the firewall transit public IP and the application gateway public IP on subsequent applies. This cascades into unnecessary recreation of dependent resources such as private DNS zones.Finally, the e2e test harness bearer token can expire during long-running operations (such as workspace destroy with backup), causing poll failures mid-wait.
How is this addressed
Recovery Services Vault configuration
azurerm_recovery_services_vaultsoft_delete_enabledargument (removed in AzureRM v4.64.0+); soft delete is now enabled by default on the platformDelete behavior
recovery_servicefeatures to purge protected items from the vault on destroy and setvm_backup_stop_protection_and_retain_data_on_destroy = falseas belt-and-bracesterraform_data.prepare_backup_for_destroyresource instorage.tfwith awhen = destroyprovisioner that:az backup vault backup-properties set --soft-delete-feature-state Disable) before AzureRM's destroy runs, soDisable Protection+Delete Protected Itemhard-deletes rather than soft-deletes the backup itemaz backup container re-register) and waits forRegistered/NotFoundstatus, self-healing workspaces stuck from earlier failed destroy attemptsprepare_backup_for_destroydepends onazurerm_backup_protected_file_share.file_shareso that on destroy Terraform's reversed dependency ordering causes the provisioner to run first (before the file share unprotect)storagein the backup container name (StorageContainer;storage;<rg>;<account>) to match the actual Azure container naming conventionCore infrastructure — public IP lifecycle fixes
ip_tagstolifecycle { ignore_changes = [...] }onazurerm_public_ip.fwtransitincore/terraform/firewall/firewall.tfto prevent forced replacement of the firewall transit public IPip_tagstolifecycle { ignore_changes = [...] }onazurerm_public_ip.appgwpipincore/terraform/appgateway/appgateway.tfto prevent forced replacement of the application gateway public IP and the cascading recreation of dependent DNS zonesE2e test harness — token refresh
token_fncallable parameter towait_for,post_resource, anddisable_and_delete_resourceine2e_tests/resources/resource.py; when provided,token_fn()is called before each poll iteration to obtain fresh auth headers, preventing token expiry failures during long-running operationsdisable_and_delete_tre_resourceanddisable_and_delete_ws_resourceine2e_tests/conftest.pyto pass atoken_fnlambda that re-fetches a token for the correct scope (admin or workspace-scoped) before eachcheck_deploymentpollProvider/version updates
4.57.0to4.73.0(latest)core/version.txtto0.16.17to satisfy CI version-change enforcement for the core Terraform changesTemplate/versioning
tre-workspace-baseto2.8.5CHANGELOG.mdto describe the AzureRM soft-delete-based fix, public IP lifecycle changes, and the active pre-destroy backup preparation step