Skip to content

fix(openbao): improve install command#430

Open
Jcing95 wants to merge 10 commits into
mainfrom
openbao/improve-idempotency
Open

fix(openbao): improve install command#430
Jcing95 wants to merge 10 commits into
mainfrom
openbao/improve-idempotency

Conversation

@Jcing95

@Jcing95 Jcing95 commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add --namespace / -n flag to oms install openbao to allow deploying OpenBao to a custom Kubernetes namespace (defaults to vault for backward compatibility)
  • Fix multi-namespace support: the Bank-Vaults Operator is cluster-scoped, so skip re-deployment if it already exists in another namespace (prevents Helm ownership conflicts on ClusterRole resources)
  • Ensure the target namespace is created before applying the Vault CR, since Helm no longer implicitly creates it when the operator lives elsewhere
  • Fix unchecked os.Setenv return value flagged by the linter

@Jcing95 Jcing95 self-assigned this May 19, 2026
@Jcing95 Jcing95 changed the title fix(openbao): improve backup strategy fix(openbao): improve install command May 19, 2026
@Jcing95 Jcing95 requested a review from Copilot May 21, 2026 09:11

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the oms install openbao workflow to support deploying OpenBao into a configurable Kubernetes namespace while avoiding Helm ownership conflicts when the (cluster-scoped) Bank-Vaults Operator is already installed elsewhere. It also adjusts the install flow to create the target namespace explicitly, improves DR-restore handling of unseal keys, and introduces an interactive confirmation for destructive fresh initialization.

Changes:

  • Add --namespace/-n, --age-key-file/-k, and --yes/-y flags to the CLI and document them.
  • Update the OpenBao installer to be namespace-aware, skip operator redeploy when already present cluster-wide, ensure namespace creation before applying the Vault CR, and refine initialization/cleanup logic.
  • Enable preFlightChecks: true in the Vault CR manifest and update installer tests accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
NOTICE Fix jsonpatch/v2 license URL.
internal/tmpl/NOTICE Keep NOTICE template in sync with corrected license URL.
internal/installer/openbao.go Namespace-aware install flow, operator redeploy avoidance, DR restore secret handling, new pod readiness wait, expanded cleanup.
internal/installer/openbao_test.go Update/add tests for new operator deployment behavior and manifest expectations.
internal/installer/manifests/openbao/vault-cr.yaml Enable bank-vaults unseal preFlightChecks.
docs/oms_install_openbao.md Document new CLI flags (--namespace, --age-key-file, --yes).
cli/cmd/install_openbao.go Wire new flags, set SOPS key file env var, add destructive-operation confirmation prompt.
Comments suppressed due to low confidence (1)

internal/installer/openbao.go:618

  • hasExistingDeployment lists PVCs in the target namespace. If the namespace doesn't exist yet (common for a new --namespace), the List call can return a NotFound error and currently bubbles up, aborting the install during the confirmation check. It should treat a NotFound namespace as “no existing deployment” and return (false, nil).
	// Vault CR gone but PVCs may linger (e.g. CR was manually deleted).
	pvcList, err := o.Clientset.CoreV1().PersistentVolumeClaims(o.Config.Namespace).List(
		o.ctx, metav1.ListOptions{LabelSelector: "vault_cr=openbao"},
	)
	if err != nil {
		return false, fmt.Errorf("listing PVCs: %w", err)
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/installer/openbao.go Outdated
Comment thread internal/installer/openbao.go Outdated
Comment thread internal/installer/openbao.go
Comment thread cli/cmd/install_openbao.go
Comment thread internal/installer/openbao.go
@Jcing95 Jcing95 force-pushed the openbao/improve-idempotency branch 2 times, most recently from 3614c52 to 8ac7141 Compare May 27, 2026 14:00
@Jcing95 Jcing95 requested a review from Copilot May 27, 2026 14:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread internal/installer/openbao.go
Comment thread internal/installer/openbao.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread internal/installer/openbao.go
Comment thread internal/installer/openbao.go Outdated
Comment thread internal/installer/openbao.go
@Jcing95 Jcing95 force-pushed the openbao/improve-idempotency branch from 42eba86 to 6044373 Compare May 28, 2026 15:45
@Jcing95 Jcing95 requested a review from Copilot May 28, 2026 15:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread internal/installer/openbao.go
Comment thread internal/installer/openbao.go
Comment thread cli/cmd/install_openbao.go Outdated
Comment thread docs/oms_install_openbao.md Outdated
- Fix retry_join and BAO_CLUSTER_ADDR/BAO_API_ADDR DNS format: the
  headless service segment was missing, producing addresses of the form
  openbao-N.<namespace>.svc.cluster.local instead of the correct
  openbao-N.openbao.<namespace>.svc.cluster.local. Introduce
  openBaoHeadlessService constant to avoid repeating the string.

- Default cfg.Namespace to DefaultOpenBaoNamespace in NewOpenBaoInstaller
  before constructing the Helm client, so the client is never initialised
  with an empty namespace when --namespace is omitted.

- Wait for stale PVCs to be fully deleted after issuing the Delete calls
  in CleanStaleInstallState. Add waitForPVCsGone helper (mirrors
  waitForVaultPodsGone) to poll until no vault_cr=openbao PVCs remain,
  preventing races between terminating PVCs and new StatefulSet creation.

- Correct --yes flag description: confirmation is only prompted when an
  existing deployment is detected without a DR backup, not on every fresh
  install. Update both the CLI flag and the docs page.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread internal/installer/openbao.go Outdated
Comment thread internal/installer/openbao.go
Jcing95 and others added 2 commits June 10, 2026 10:45
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Track which resources were actually deleted and report them. On a fresh
install into a new namespace where nothing existed, the previous message
'Stale install state cleaned (CR, pods, PVCs, unseal secret)' was
unconditionally printed even when every delete was a no-op (NotFound).

Now logs either:
  'Cleaned stale resources: Vault CR, 2 PVC(s), unseal secret'
  or
  'No stale install state found in namespace "my-new-ns"'

@OliverTrautvetter OliverTrautvetter left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments to clean the code a bit

Comment on lines +449 to +452
_, err = secretsClient.Create(o.ctx, secret, metav1.CreateOptions{})
if err != nil && !k8serrors.IsAlreadyExists(err) {
return fmt.Errorf("creating unseal secret from backup: %w", err)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Create fails with AlreadyExists, we return nil without updating the existing secret's data. But I think the secret can have wrong/empty data

// 2. Wait for all vault pods to terminate
// 3. Delete PVCs (removes stale Raft data that would confuse initialization)
// 4. Delete the unseal-keys Secret
func (o *OpenBaoInstaller) CleanStaleInstallState() error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has not tests?

Comment on lines +177 to +183
err = o.Logger.Step("Cleaning stale install state", o.CleanStaleInstallState)
if err != nil {
return fmt.Errorf("failed to remove stale unseal keys: %w", err)
return fmt.Errorf("failed to clean stale install state: %w", err)
}
}

err = o.Logger.Step("Ensuring namespace exists", func() error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we clean before we ensure that it exists? Sounds slightly weird 🤔 maybe we should ensure first?

if nsErr == nil {
rel, err := o.Helm.FindRelease(o.Config.Namespace, cfg.ReleaseName)
if err != nil {
return err

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should wrap it like

return fmt.Errorf("finding release %s in namespace %s: %w", cfg.ReleaseName, o.Config.Namespace, err)

return nil
}

fmt.Printf("\nWARNING: No DR backup found at: %s\n", c.Opts.DRBackupPath)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use log. Can we change all the fmt.Print stuff to log.Print

corev1 "k8s.io/api/core/v1"
)

// Test helpers — exported only during `go test` so external test packages

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like at least backupUnsealKeys is missing and not tested

// handles the case where the bank-vaults operator deletes or recreates the
// secret during Vault CR reconciliation — we simply re-apply it until the
// operator settles and the sidecar can successfully unseal.
func (o *OpenBaoInstaller) WaitForInitialization() error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no new tests for the new logic in "WaitForInitialization"

return fmt.Errorf("failed to read confirmation: %w", err)
}
if strings.TrimSpace(strings.ToLower(input)) != "yes" {
return fmt.Errorf("aborted: type 'yes' to continue or pass --yes")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confusing message. If it is aborted, how can we continue. Aborted sounds like a big error, but it is just a typo / failed confirm

Comment on lines +54 to +56
if c.Opts.AgeKeyFile != "" {
if err := os.Setenv("SOPS_AGE_KEY_FILE", c.Opts.AgeKeyFile); err != nil {
return fmt.Errorf("setting SOPS_AGE_KEY_FILE: %w", err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes global process state permanently.

We should pass the key file path through the OpenBaoInstallerConfig

Comment on lines 579 to 588
if delErr != nil && !k8serrors.IsNotFound(delErr) {
return fmt.Errorf("deleting Vault CR: %w", delErr)
}
if delErr == nil {
cleaned = append(cleaned, "Vault CR")
}

if err := o.waitForVaultPodsGone(); err != nil {
return err
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to wait for a clear if Vault CR didn't exist.

We should only wait for pods when a Vault CR was actually deleted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants