Skip to content

Enable soft delete for workspace backup Recovery Services vaults, fix core public IP lifecycle issues, and fix e2e token expiry#4907

Open
Copilot wants to merge 17 commits into
mainfrom
copilot/fix-recovery-service-vault-deployment
Open

Enable soft delete for workspace backup Recovery Services vaults, fix core public IP lifecycle issues, and fix e2e token expiry#4907
Copilot wants to merge 17 commits into
mainfrom
copilot/fix-recovery-service-vault-deployment

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 21, 2026

What is being addressed

Workspace creation with backup enabled can fail when the workspace backup Recovery Services Vault is created with soft delete disabled. Azure now requires soft delete to be enabled for new vaults, so attempts to create the vault with soft_delete_enabled = false can be rejected and leave the workspace deployment stuck.

This change also needs to preserve current workspace delete behavior so backup-related resources do not block cleanup. Workspace destroy fails with BMSUserErrorContainerNotUndeleted because when vault soft delete is enabled (Azure's default), AzureRM's Disable Protection call moves the protected item into SoftDeleted state and the subsequent Delete Protected Item call is then rejected by the Backup service. The purge_protected_items_from_vault_on_destroy provider feature does not reliably handle this case (see hashicorp/terraform-provider-azurerm#28264).

Additionally, Azure automatically sets ip_tags on public IPs (e.g. "FirstPartyUsage": "/Unprivileged"), which causes Terraform to force replacement of the firewall transit public IP and the application gateway public IP on subsequent applies. This cascades into unnecessary recreation of dependent resources such as private DNS zones.

Finally, the e2e test harness bearer token can expire during long-running operations (such as workspace destroy with backup), causing poll failures mid-wait.

How is this addressed

  • Recovery Services Vault configuration

    • Keep the workspace backup vault on azurerm_recovery_services_vault
    • Remove the deprecated soft_delete_enabled argument (removed in AzureRM v4.64.0+); soft delete is now enabled by default on the platform
    • Preserve the existing Zone Redundant storage configuration
    • Preserve the existing CMK configuration on the AzureRM vault resource
  • Delete behavior

    • Configure the workspace base AzureRM provider recovery_service features to purge protected items from the vault on destroy and set vm_backup_stop_protection_and_retain_data_on_destroy = false as belt-and-braces
    • Add terraform_data.prepare_backup_for_destroy resource in storage.tf with a when = destroy provisioner that:
      1. Disables vault soft delete (az backup vault backup-properties set --soft-delete-feature-state Disable) before AzureRM's destroy runs, so Disable Protection + Delete Protected Item hard-deletes rather than soft-deletes the backup item
      2. Re-registers soft-deleted containers (az backup container re-register) and waits for Registered/NotFound status, self-healing workspaces stuck from earlier failed destroy attempts
    • prepare_backup_for_destroy depends on azurerm_backup_protected_file_share.file_share so that on destroy Terraform's reversed dependency ordering causes the provisioner to run first (before the file share unprotect)
    • Use lowercase storage in the backup container name (StorageContainer;storage;<rg>;<account>) to match the actual Azure container naming convention
  • Core infrastructure — public IP lifecycle fixes

    • Add ip_tags to lifecycle { ignore_changes = [...] } on azurerm_public_ip.fwtransit in core/terraform/firewall/firewall.tf to prevent forced replacement of the firewall transit public IP
    • Add ip_tags to lifecycle { ignore_changes = [...] } on azurerm_public_ip.appgwpip in core/terraform/appgateway/appgateway.tf to prevent forced replacement of the application gateway public IP and the cascading recreation of dependent DNS zones
  • E2e test harness — token refresh

    • Add an optional token_fn callable parameter to wait_for, post_resource, and disable_and_delete_resource in e2e_tests/resources/resource.py; when provided, token_fn() is called before each poll iteration to obtain fresh auth headers, preventing token expiry failures during long-running operations
    • Update disable_and_delete_tre_resource and disable_and_delete_ws_resource in e2e_tests/conftest.py to pass a token_fn lambda that re-fetches a token for the correct scope (admin or workspace-scoped) before each check_deployment poll
  • Provider/version updates

    • Update the workspace base AzureRM provider pin from 4.57.0 to 4.73.0 (latest)
    • Update the workspace base Terraform lock file to match the pinned AzureRM provider version
    • Bump core/version.txt to 0.16.17 to satisfy CI version-change enforcement for the core Terraform changes
  • Template/versioning

    • Bump tre-workspace-base to 2.8.5
    • Update CHANGELOG.md to describe the AzureRM soft-delete-based fix, public IP lifecycle changes, and the active pre-destroy backup preparation step

Copilot AI changed the title [WIP] Fix workspace deployment failure on Recovery Service Vault Use AzAPI for workspace backup Recovery Services vault creation May 21, 2026
Copilot AI requested a review from marrobi May 21, 2026 15:23
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 21, 2026

@copilot no, dont use azapi:

Enabling soft delete does not prevent you from deleting the Backup Vault itself, as long as there are no protected (or soft‑deleted) backup items still in it.
What's actually happening
Azure now requires soft delete to be enabled on Microsoft.DataProtection/BackupVaults — the API rejects vaults created with soft_delete = "Off". That's why your azurerm_data_protection_backup_vault deploy is failing. The supported values are:
• AlwaysOn – soft delete enabled and immutable (cannot be turned off). ⚠️ This makes the vault very hard to remove.
• On – soft delete enabled (default) but can be disabled later. ✅ Safe choice.
• Off – no longer accepted by the platform for new vaults.

So set it explicitly to on. But consider side effects and need to maintain current functionality. Also ensure the azureRm provider is the latest release (pinned).

Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 21, 2026

@copilot no, dont use azapi:

Enabling soft delete does not prevent you from deleting the Backup Vault itself, as long as there are no protected (or soft‑deleted) backup items still in it....

Addressed in 3d11658. The vault is back on AzureRM, soft_delete_enabled is set to true, the workspace base AzureRM pin/lock were updated to 4.58.0, and the provider now enables recovery_service.purge_protected_items_from_vault_on_destroy with non-retaining VM cleanup so workspace deletes keep the expected behavior.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkpoint-api.hashicorp.com
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform fmt -check -recursive (dns block)
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform -chdir=templates/workspaces/base/terraform init -backend=false -upgrade (dns block)
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform -chdir=templates/workspaces/base/terraform validate (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title Use AzAPI for workspace backup Recovery Services vault creation Enable soft delete for workspace backup Recovery Services vaults May 21, 2026
Comment thread templates/workspaces/base/terraform/providers.tf Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the workspace base Terraform template to comply with Azure’s requirement that soft delete is enabled on newly created Recovery Services Vaults, while preserving current workspace deletion/cleanup behavior by ensuring protected items are purged on destroy.

Changes:

  • Enable soft_delete_enabled = true for the workspace backup azurerm_recovery_services_vault.
  • Configure AzureRM provider recovery_service features to purge protected items and avoid retaining VM backup data on destroy.
  • Bump the workspace base AzureRM provider pin/lockfile to 4.58.0 and update the template version + changelog entry.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
templates/workspaces/base/terraform/providers.tf Bumps AzureRM to 4.58.0 and configures provider recovery services destroy behavior.
templates/workspaces/base/terraform/backup/backup.tf Enables soft delete on the workspace backup Recovery Services Vault.
templates/workspaces/base/terraform/.terraform.lock.hcl Updates AzureRM provider lock to 4.58.0 and corresponding hashes.
templates/workspaces/base/porter.yaml Bumps tre-workspace-base bundle version to 2.8.4.
CHANGELOG.md Documents the fix under BUG FIXES.
Files not reviewed (1)
  • templates/workspaces/base/terraform/.terraform.lock.hcl: Language not supported

Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 21, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkpoint-api.hashicorp.com
    • Triggering command: ./terraform ./terraform version (dns block)
    • Triggering command: /tmp/terraform /tmp/terraform providers lock -platform=linux_amd64 -platform=linux_arm64 -platform=darwin_amd64 -platform=darwin_arm64 -platform=windows_amd64 hashicorp/azurerm (dns block)
    • Triggering command: /tmp/terraform /tmp/terraform fmt -check -recursive (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI requested a review from marrobi May 21, 2026 18:02
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 21, 2026

/test-extended 43c865c

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26244326090 (with refid 14403c22)

(in response to this comment from @marrobi)

…file_share not vice versa

Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/9233b422-0f60-414c-af10-9fe74ed1f471

Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 26, 2026

@copilot

Fix not working — destroy ordering issue (and possible container name casing)...

Fixed the dependency direction in commit 2834a83. wait_for_backup_container_ready now depends on azurerm_backup_protected_file_share.file_share instead of the other way around. On destroy, Terraform reverses the dependency graph, so wait_for_backup_container_ready is destroyed first (running the when = destroy poller), and only then is file_share deleted — ensuring the container is in a ready state before the unprotect operation.

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

/test-extended 2834a83

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26467689287 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

@copilot The error message in the latest run confirms the real container name uses lowercase storage:

Protection Container Name: "StorageContainer;storage;rg-***-ws-6e1c;stgws6e1c"

…but the script has capital Storage:

container_name="StorageContainer;Storage;${self.input.resource_group_name};${self.input.storage_account_name}"

So az backup container show returns NotFound immediately, the poll exits as a no-op, and the file_share delete proceeds straight into the race. Change Storagestorage in storage.tf

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

/test-extended 6cf6d9b

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

2 similar comments
@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

⚠️ The specified SHA 2834a8346fa023fed0fc4f40f73808526290ea38 is not the latest commit on the PR. Please validate the latest commit and re-run /test

(in response to this comment from @marrobi)

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

1 similar comment
@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 28, 2026

@copilot

The poll approach can't work because the issue isn't transient. Walking through what happens on destroy with the current code:

  1. wait_for_backup_container_ready runs → container is Registered → exits immediately ✅
  2. AzureRM calls Disable Protection on the file share
  3. Because the vault has soft delete enabled (Azure's default since we removed soft_delete_enabled = false), the protected item goes into SoftDeleted state instead of being hard-deleted
  4. AzureRM then calls Delete Protected Item → Backup service refuses with BMSUserErrorContainerNotUndeleted because the item is soft-deleted
  5. purge_protected_items_from_vault_on_destroy = true is meant to handle this but doesn't reliably fire — see hashicorp/terraform-provider-azurerm#28264

Polling can never recover from this — the container won't spontaneously leave SoftDeleted. The vault-level soft delete behaviour is the cause.

Fix

Replace the passive poll with an active pre-destroy step that:

  1. Disables soft delete on the vault (allowed post-creation via az backup vault backup-properties set --soft-delete-feature-state Disable) so AzureRM's Disable Protection + Delete Protected Item calls hard-delete rather than soft-delete
  2. If the container is already SoftDeleted from a previous failed destroy attempt, re-registers it (az backup container re-register) so the next delete can proceed — this self-heals workspaces stuck from earlier failures
  3. Then lets AzureRM's normal delete path run — it will succeed

Replace wait_for_backup_container_ready in templates/workspaces/base/terraform/storage.tf with:

resource "terraform_data" "prepare_backup_for_destroy" {
  count = var.enable_backup ? 1 : 0

  input = {
    resource_group_name  = azurerm_resource_group.ws.name
    vault_name           = module.backup[0].vault_name
    storage_account_name = azurerm_storage_account.stg.name
    subscription_id      = data.azurerm_client_config.current.subscription_id
  }

  provisioner "local-exec" {
    when        = destroy
    interpreter = ["/bin/bash", "-c"]
    command     = <<EOT
set -euo pipefail
az login --identity
az account set --subscription "${self.input.subscription_id}"

vault="${self.input.vault_name}"
rg="${self.input.resource_group_name}"
container_name="StorageContainer;storage;${self.input.resource_group_name};${self.input.storage_account_name}"

echo "Disabling soft delete on Recovery Services vault '$vault' so destroy can hard-delete protected items..."
az backup vault backup-properties set \
  --name "$vault" --resource-group "$rg" \
  --soft-delete-feature-state Disable --output none

for attempt in 1 2 3 4 5 6; do
  state=$(az backup vault backup-properties show \
    --name "$vault" --resource-group "$rg" \
    --query softDeleteFeatureState -o tsv || echo "")
  echo "Attempt $attempt: softDeleteFeatureState='$state'"
  [ "$state" = "Disabled" ] && break
  sleep 10
done

status=$(az backup container show \
  --name "$container_name" --resource-group "$rg" --vault-name "$vault" \
  --backup-management-type AzureStorage \
  --query properties.registrationStatus -o tsv 2>/dev/null || echo "NotFound")
echo "Container status before destroy: '$status'"

if [ "$status" = "SoftDeleted" ]; then
  echo "Container is soft-deleted; re-registering before AzureRM destroy."
  az backup container re-register \
    --resource-group "$rg" --vault-name "$vault" \
    --backup-management-type AzureStorage --workload-type AzureFileShare \
    --container-name "$container_name" --yes --output none || echo "re-register failed, continuing"

  for attempt in 1 2 3 4 5 6 7 8 9 10; do
    status=$(az backup container show \
      --name "$container_name" --resource-group "$rg" --vault-name "$vault" \
      --backup-management-type AzureStorage \
      --query properties.registrationStatus -o tsv 2>/dev/null || echo "NotFound")
    echo "Attempt $attempt: container status='$status'"
    [ "$status" = "Registered" ] || [ "$status" = "NotFound" ] && break
    sleep 30
  done
fi

echo "Backup vault prepared for destroy."
EOT
  }

  depends_on = [
    azurerm_backup_container_storage_account.storage_account,
    azurerm_backup_protected_file_share.file_share,
  ]
}

The depends_on direction is the same as the current code — destroy reverses it so this provisioner runs before the file share delete.

The provider recovery_service feature flags can stay as belt-and-braces, but they become redundant once vault soft delete is actively disabled before destroy.

Disable vault soft delete before destroy and re-register soft-deleted
containers so AzureRM's Disable Protection + Delete Protected Item can
hard-delete rather than soft-delete, preventing BMSUserErrorContainerNotUndeleted.
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 28, 2026

/test-extended ac12799

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26575855619 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 28, 2026

@copilot the token is expiring during tests.

On the e2e harness side: refresh the bearer token before each check_deployment poll. We migyt revisot this once PR is complete.

Add optional token_fn callable to wait_for, post_resource, and
disable_and_delete_resource. When provided, token_fn() is called before
each poll to obtain fresh auth headers, preventing token expiry failures
during long-running operations (e.g. workspace destroy with backup).

Update disable_and_delete_tre_resource and disable_and_delete_ws_resource
in conftest.py to pass a token_fn that re-fetches a token for the correct
scope before each poll.
Copilot AI changed the title Enable soft delete for workspace backup Recovery Services vaults and fix core public IP lifecycle issues Enable soft delete for workspace backup Recovery Services vaults, fix core public IP lifecycle issues, and fix e2e token expiry May 28, 2026
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 28, 2026

/test-extended 11334d7

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26593745738 (with refid 14403c22)

(in response to this comment from @marrobi)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workspace deployment fails on Recovery Service Vault - TF Provider issue.

4 participants