Skip to content

ops(deploy): rolling-update via GitHub Actions over Tailscale#616

Open
bootjp wants to merge 15 commits into
mainfrom
ops/deploy-via-tailscale
Open

ops(deploy): rolling-update via GitHub Actions over Tailscale#616
bootjp wants to merge 15 commits into
mainfrom
ops/deploy-via-tailscale

Conversation

@bootjp

@bootjp bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner

Summary

Move rolling deploys from "operator SSHes from workstation" to a GitHub-Actions workflow_dispatch that joins the Tailnet and invokes the existing scripts/rolling-update.sh unchanged. Adds audit trail, approval gate, dry-run preview, and centralized secret handling.

Precondition (operator responsibility): Tailscale already installed + logged in on every cluster node with tag:elastickv-node. Runbook documents this in §1.

Motivation

  • No audit trail for production rollouts today (each operator's local shell history is the record).
  • No approval gate — a typo in NODES takes out the cluster before anyone else sees it.
  • No dry-run — operators can't preview targeting without reading the script.
  • Secret sprawl: SSH keys, Tailscale auth, S3 creds live on operator workstations; rotation == manual shuffling.

What the workflow does

  • workflow_dispatch only (no auto-deploy on push/tag).
  • Inputs: ref (required), image_tag (rollback override), nodes (raft-id subset filter), dry_run (bool, default true).
  • Joins Tailnet via tailscale/github-action@v3 (ephemeral OAuth, tag:ci-deploy).
  • Renders NODES + SSH_TARGETS from production environment vars; filters to requested subset.
  • Verifies image manifest on ghcr.io before starting the roll.
  • tailscale ping on every target.
  • Dry-run stops there and prints the plan; non-dry-run invokes scripts/rolling-update.sh.
  • environment: production → required reviewers on non-dry-run.
  • concurrency: rolling-update with cancel-in-progress: false.

Design and operator docs

  • Design: docs/design/2026_04_24_proposed_deploy_via_tailscale.md
  • Runbook: docs/deploy_via_tailscale_runbook.md — ACL config, OAuth client, required GitHub secrets/variables, first-run sequence, rollback, troubleshooting.

What's NOT changed

  • No changes to scripts/rolling-update.sh. The script's env-var contract is the integration boundary; keeping the script stable means operators can still invoke it manually from a workstation during incidents.
  • No changes to cluster nodes. Nodes keep their existing authorized_keys entries. v1 is plain SSH over Tailscale; Tailscale SSH (keyless) is called out as a follow-up (§2.4 of design).

Test plan

  • actionlint passes (0 errors)
  • YAML parses
  • Run with dry_run: true, nodes: n1 — confirms tailnet join, SSH config, image availability, mapping render without touching prod
  • Run with dry_run: false, nodes: n1 — single-node roll, verify cluster stays healthy and correct image lands
  • Run with dry_run: false, nodes: (empty) — full roll

Follow-ups (out of scope for v1, tracked in the design doc)

  • Post-deploy health verification beyond tailnet reachability (probe Prometheus / Redis after each node)
  • Auto-rollback on script failure
  • Jepsen-gating before allowing deploy
  • cosign verify on the image
  • Tailscale SSH (option A in design §2.4)
  • Shared deploy user with restricted sudo

Before merge

Operator setup is in the runbook (§3, §4). PR doesn't unblock the workflow by itself — it's dead until the secrets/variables in the production environment are populated. That's a deliberate choice so merging is harmless.

Summary by CodeRabbit

  • New Features
    • Added a manual production rollouts workflow using Tailscale for node access, including image existence checks, rollout plan rendering, and SSH reachability preflight.
    • Introduced --dry-run support for the rolling update script to preview rollout order and targets without performing remote container changes.
  • Documentation
    • Added an operator runbook for Tailscale + GitHub Actions production setup, approvals, rollback, and troubleshooting.
    • Added a design document for the proposed automated deployment approach.
  • Chores
    • Updated Docker image publishing to push both latest and commit-specific tags to GHCR.

Today's rolling-update flow is manual: operators SSH from a workstation
with required env vars and invoke scripts/rolling-update.sh. That has
no audit trail, no approval gate, no dry-run, and relies on per-operator
secret handling.

This change adds a workflow_dispatch workflow that joins the Tailnet
via tailscale/github-action (ephemeral OAuth node, tag:ci-deploy),
SSHes into each cluster node over MagicDNS, and invokes the existing
scripts/rolling-update.sh — unchanged. Nodes stay as they are; the
script's env-var contract is the integration boundary.

Highlights:
- workflow_dispatch only (no auto-deploy); production GitHub environment
  gates non-dry-run runs on required reviewers.
- inputs: ref (required), image_tag (rollback override), nodes (subset
  filter), dry_run (default true).
- Dry-run does everything up to the container touch: renders NODES +
  SSH_TARGETS from env variables, verifies the image exists on ghcr.io,
  and tailscale-pings every target. Catches typo'd inputs before any
  production effect.
- Concurrency group "rolling-update", cancel-in-progress: false, so
  parallel invocations queue rather than stomp.
- No node-side changes required: nodes are assumed to already run
  tailscaled and expose authorized_keys for the deploy user. Plain SSH
  over Tailscale; Tailscale SSH (keyless) is called out as a follow-up.

Design doc: docs/design/2026_04_24_proposed_deploy_via_tailscale.md
Operator runbook: docs/deploy_via_tailscale_runbook.md

Validation:
- actionlint passes (0 errors)
- YAML parses
- No changes to scripts/rolling-update.sh; the workflow calls it with
  the same env var contract already documented in the script's usage().

Out of scope (follow-ups): post-deploy health verification, auto-
rollback on script failure, Jepsen-gating, image-signature verification,
Tailscale SSH (keyless), shared deploy user.
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c2201d9e-089b-4d2d-87f9-dc140e568cc5

📥 Commits

Reviewing files that changed from the base of the PR and between e96ac64 and bee82be.

📒 Files selected for processing (4)
  • .github/workflows/docker-image.yml
  • .github/workflows/rolling-update.yml
  • docs/deploy_via_tailscale_runbook.md
  • docs/design/2026_04_24_proposed_deploy_via_tailscale.md
💤 Files with no reviewable changes (2)
  • docs/deploy_via_tailscale_runbook.md
  • docs/design/2026_04_24_proposed_deploy_via_tailscale.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/docker-image.yml
  • .github/workflows/rolling-update.yml

📝 Walkthrough

Walkthrough

A new manual rolling-deploy workflow is added for Tailscale-based production rollouts, with trusted checkout, image verification, rollout planning, reachability checks, and dry-run/live execution paths. The Docker workflow now emits SHA tags, and the script and docs add dry-run and operator guidance.

Changes

Rolling Deploy via Tailscale + GitHub Actions

Layer / File(s) Summary
Docker image SHA tagging
.github/workflows/docker-image.yml
Computes a lowercase GHCR image name and publishes both latest and commit-SHA image tags.
rolling-update.sh dry-run support
scripts/rolling-update.sh
Adds --dry-run handling, DRY_RUN validation, dry-run plan output, and an early exit before helper setup or SSH/container updates.
Workflow definition, trusted ref, image verification, and Tailscale/SSH setup
.github/workflows/rolling-update.yml
Defines dispatch inputs, permissions, concurrency, production environment settings, trusted checkout resolution, GHCR manifest checks, Tailscale join, and SSH key/known-hosts setup.
Rollout planning, checks, and execution
.github/workflows/rolling-update.yml
Validates rollout maps, filters node subsets, computes order and SSH targets, checks SSH reachability, optionally preflights S3 access, logs the rendered configuration, and branches into dry-run or live execution.
Design doc and operator runbook
docs/design/2026_04_24_proposed_deploy_via_tailscale.md, docs/deploy_via_tailscale_runbook.md
Adds the deployment design document and operator runbook covering Tailscale setup, ACLs, OAuth client setup, GitHub environment configuration, workflow use, rollback, cancellation recovery, limitations, and troubleshooting.

Sequence Diagram(s)

sequenceDiagram
  participant Operator
  participant GitHubActions as GitHub Actions
  participant GHCR
  participant TailscaleNetwork as Tailscale Tailnet
  participant ClusterNodes as Cluster Nodes

  Operator->>GitHubActions: workflow_dispatch (ref, image_tag, nodes, dry_run)
  GitHubActions->>GitHubActions: enforce default-branch ref via gh api
  GitHubActions->>GHCR: docker manifest inspect IMAGE_BASE:TAG
  GHCR-->>GitHubActions: manifest present / fail-fast on 404
  GitHubActions->>TailscaleNetwork: join tailnet
  TailscaleNetwork-->>GitHubActions: connected
  GitHubActions->>GitHubActions: validate maps, compute rollout order and targets
  loop each SSH target
    GitHubActions->>ClusterNodes: ssh reachability check
    ClusterNodes-->>GitHubActions: OK / fail
  end
  alt dry_run == true
    GitHubActions->>GitHubActions: rolling-update.sh --dry-run
  else dry_run == false
    GitHubActions->>ClusterNodes: rolling-update.sh live rollout
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through tags both shiny and new,
Dry runs first, then the live rollout too.
Tailscale twinkles, the tailnet hums bright,
And carrots of SHA tags guide me tonight.
With checks and runbooks neatly in place,
This little rabbit deploys with grace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: moving rolling-update deployments to GitHub Actions over Tailscale.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ops/deploy-via-tailscale

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design proposal and an operational runbook for migrating the manual rolling update process to an automated GitHub Actions workflow utilizing Tailscale for secure connectivity. The review feedback identifies critical configuration errors in the documentation examples, specifically regarding the inclusion of ports in the NODES_RAFT_MAP which would cause deployment script failures, and a mismatch in the required Tailscale OAuth scopes.

entries. Prevents the first-connect TOFU prompt.

**Variables (non-secret):**
- `NODES_RAFT_MAP` — `n1=kv01:50051,n2=kv02:50051,...` (advertised addresses

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The example for NODES_RAFT_MAP includes the port, which is incompatible with the scripts/rolling-update.sh script as it appends the RAFT_PORT automatically. It is better to use the hostnames only, preferably the MagicDNS FQDNs for reliable resolution. Furthermore, as this design document outlines a production-grade deployment and migration, it must detail strategies for live or zero-downtime cutovers and provide mitigation strategies for operational risks, such as potential failures during rolling upgrades.

Suggested change
- `NODES_RAFT_MAP``n1=kv01:50051,n2=kv02:50051,...` (advertised addresses
- NODES_RAFT_MAP — n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,... (advertised addresses
References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Admin console → Settings → OAuth clients → New client:

- Description: `elastickv GitHub Actions deploy`
- Scopes: `auth_keys` (write)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The OAuth client scope should be devices (write) instead of auth_keys (write). The tailscale/github-action requires the devices scope to register the ephemeral node in the tailnet using OAuth. This is correctly identified in the design document (§2.3) but misstated here.

Suggested change
- Scopes: `auth_keys` (write)
- Scopes: devices (write)

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
|------|-------|---------|
| `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` |
| `SSH_USER` | SSH login on every node | `bootjp` |
| `NODES_RAFT_MAP` | Comma-separated `raftId=host:port` | `n1=kv01:50051,n2=kv02:50051,n3=kv03:50051,n4=kv04:50051,n5=kv05:50051` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The example for NODES_RAFT_MAP includes the port (e.g., :50051), but the scripts/rolling-update.sh script automatically appends the RAFT_PORT to the host value. Providing the port here will result in an invalid address like host:50051:50051 when the script constructs the --address flag. The value should only contain the hostnames or IPs. Additionally, using the MagicDNS FQDN is recommended for consistency with the SSH_TARGETS_MAP example.

- NODES_RAFT_MAP examples no longer include ":50051". scripts/rolling-update.sh
  auto-appends RAFT_PORT to each entry, so a port in NODES produces
  host:port:port when the script constructs --address (Gemini HIGH).
- OAuth client scope: design doc §2.3 said "devices:write" while the
  runbook §3 said "auth_keys (write)". Aligned both on "auth_keys (write)"
  which is what tailscale/github-action requires to mint the ephemeral
  join key. Added a note that newer action versions may additionally
  need "devices:core" (write) and to consult the action's README as
  the authoritative scope list (addresses the Gemini MEDIUM comment
  while hedging against action-version drift).
@bootjp

bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner Author

Addressed review feedback in 6322748:

  • HIGH: removed :50051 from NODES_RAFT_MAP examples in both design doc and runbook. The script auto-appends RAFT_PORT, so including the port produced host:50051:50051 in --address.
  • MEDIUM: aligned OAuth-scope guidance. Design doc §2.3 (was "devices:write") now matches runbook §3 on "auth_keys (write)", which is what tailscale/github-action actually needs to mint the ephemeral key. Added a note that newer action versions may additionally require devices:core, pointing operators to the action README as the authoritative source.

Note on the CI failure (test job, ubuntu-latest): the failing tests are TestLua_ZSCORE_DelThenZScoreReturnsNil, TestLua_ZSCORE_TTLExpired, TestLua_ZRANGEBYSCORE_FastPathHitWithScores — unrelated to this PR (docs + YAML only, no Go code). Likely pre-existing flake on main; will re-run after the review pass.

/gemini review
@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a design and runbook for automating rolling updates via GitHub Actions and Tailscale. The feedback highlights a missing workflow file, a contradiction in dry-run implementation, and the need for more detailed mitigation and cutover strategies. Additionally, improvements are suggested for the runbook regarding SSH configuration clarity, security hygiene for host scanning, and environment variable consistency.

Comment on lines +127 to +128
- The script is invoked with `--help` + the rendered env is printed as a
collapsed log group.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a contradiction regarding dry-run support; using --help is insufficient for validation. A native --dry-run flag should be implemented. Furthermore, as this design document covers a production-grade migration and rolling upgrades, it must include mitigation strategies for operational risks (e.g., 'bridge' or 'proxy' mode) and detail zero-downtime cutover strategies (e.g., blue/green or dual-write proxy) to comply with repository requirements.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.


## 5. Implementation plan

1. Write `.github/workflows/rolling-update.yml` implementing §2.1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The implementation plan lists the creation of .github/workflows/rolling-update.yml as the first step, and the PR description indicates that the workflow has been verified with actionlint. However, the workflow file itself is missing from this pull request. Please include the workflow file to ensure the proposed deployment mechanism can be reviewed and tested.

Comment on lines +14 to +17
sudo tailscale up \
--ssh=false \
--advertise-tags=tag:elastickv-node \
--accept-routes=false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The instructions specify --ssh=false, which disables Tailscale SSH. This approach requires the standard system SSH daemon (sshd) to be running and correctly configured to accept connections on the Tailscale interface. It would be safer to explicitly mention this requirement to prevent operators from being locked out of the deployment flow if they rely solely on Tailscale SSH for access.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
| `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 |
| `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 |
| `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user |
| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net …` output (one host per line) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Using ssh-keyscan without the -H flag stores hostnames in plain text in the known_hosts file. For better security hygiene, it is recommended to use the -H flag to hash hostnames, which helps prevent host discovery if the secrets or the runner environment are compromised.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +89 to +90
| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |
| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The environment variable names NODES_RAFT_MAP and SSH_TARGETS_MAP differ from the names expected by the scripts/rolling-update.sh script (NODES and SSH_TARGETS). While the workflow likely performs a mapping, this discrepancy can lead to confusion if an operator attempts to run the script manually using the variables defined in this runbook. Aligning the names or explicitly noting the mapping would improve maintainability and reduce friction during manual troubleshooting.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6322748fe4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +30 to +32
permissions:
contents: read
id-token: write # required by tailscale/github-action OIDC flow

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add packages:read for GHCR manifest check

This workflow narrows GITHUB_TOKEN to contents and id-token, which implicitly removes package scope, but the Verify image exists on ghcr.io step authenticates to GHCR and inspects a manifest with that token. In environments where the image is private (or package auth is otherwise required), docker login/docker manifest inspect will fail with authorization errors before rollout begins, so deploys are blocked even when the image exists.

Useful? React with 👍 / 👎.

- workflow: add `packages: read` to the job permissions so the
  `Verify image exists on ghcr.io` step's `docker manifest inspect`
  call works against private ghcr.io images (Codex P1).
- runbook §1: explain that `--ssh=false` disables Tailscale SSH and
  the workflow relies on the system sshd — operators who use
  Tailscale SSH elsewhere need to keep that in mind (Gemini Medium).
- runbook §4: change `ssh-keyscan` example + troubleshooting to
  `ssh-keyscan -H` so known_hosts entries are hashed and the secret
  does not leak tailnet topology in plaintext (Gemini Security
  Medium).
- runbook §4 variables: document that `NODES_RAFT_MAP` /
  `SSH_TARGETS_MAP` are workflow-side names the render step maps to
  the script's `NODES` / `SSH_TARGETS`; manual invocation from a
  workstation must use the script-side names (Gemini Medium).

Not addressed: Gemini HIGH claim that the workflow file is missing
(line 187) — it IS included at `.github/workflows/rolling-update.yml`
in this PR; the reviewer misread the file list.

Not addressed: Gemini HIGH re native --dry-run flag + zero-downtime
strategy (line 128) — dry-run is deliberately a workflow-level
input, not a script-level flag, so the script stays invokable from
a workstation without CI-specific options; zero-downtime cutover is
outside the scope of a CI wrapper and is tracked in the
resilience-roadmap follow-ups.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad00bdc0ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/rolling-update.yml Outdated
Comment on lines +133 to +135
if [[ -z "$NODES_RAFT_MAP" ]]; then
echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP"
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject partially invalid node filters

The filter logic silently drops unknown raft IDs and only errors when no IDs match, so an input like nodes: n1,n9 proceeds as a rollout of n1 only. That can leave operators believing multiple nodes were upgraded when one was skipped, which is risky for staged deploys/rollbacks and can leave the cluster in an unintended mixed-version state. Treat inputs.nodes as an exact set and fail if any requested ID is missing from the configured map(s).

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner Author

Addressed round-2 in ad00bdc:

  • Codex P1 packages:read: added to workflow permissions for ghcr.io private manifest inspect.
  • Gemini Medium sshd note: documented that --ssh=false requires the system sshd.
  • Gemini Security-Medium ssh-keyscan -H: runbook now uses -H so known_hosts entries are hashed.
  • Gemini Medium variable-name discrepancy: documented that NODES_RAFT_MAP/SSH_TARGETS_MAP are workflow-side names the render step maps to the script's NODES/SSH_TARGETS; manual workstation invocation uses the script-side names directly.

Not addressed: HIGH claim that the workflow file is missing — it IS in this PR at .github/workflows/rolling-update.yml (reviewer misread file list). HIGH re native --dry-run flag + zero-downtime — dry-run is deliberately a workflow-level input to keep the script clean for workstation use; zero-downtime cutover is outside the CI wrapper's scope.

/gemini review
@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design proposal and an operational runbook for migrating manual rolling updates to a GitHub Actions workflow using Tailscale. The feedback identifies that the workflow file itself is missing from the PR and points out technical issues regarding GitHub environment protection rules, DNS resolution for Raft nodes, and OAuth scope requirements. Additionally, it suggests adding mitigation strategies for risks associated with workflow cancellation.


## 5. Implementation plan

1. Write `.github/workflows/rolling-update.yml` implementing §2.1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The pull request title, description, and test plan indicate that the GitHub Actions workflow has been implemented and verified (e.g., actionlint passes). However, the .github/workflows/rolling-update.yml file is missing from the changes in this pull request. If this PR is intended to implement the deployment flow, please include the workflow file.

Comment on lines +138 to +139
reviewers. A non-dry-run deploy will pause until approved; the dry-run run
itself does not need approval (it only needs the tailnet join).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a contradiction between the workflow definition (line 54) and the dry-run semantics described here. If the job is associated with the production environment, GitHub will enforce the environment's protection rules (such as required reviewers) for every run, including dry-runs. To allow dry-runs to proceed without approval, the workflow should dynamically set the environment or use a separate job for the dry-run/preview phase.

Comment on lines +153 to +156
**Gap:** there is no "stop mid-rollout" control today. If the workflow is
cancelled via GitHub UI during a roll, the in-flight node may be mid-recreate.
`rolling-update.sh` is supposed to be idempotent and crash-safe, but this
should be verified before we call the workflow production-ready.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design identifies a significant operational risk regarding workflow cancellation during a rollout. Per the general rules, such risks should be accompanied by detailed mitigation strategies. Consider adding a section to the runbook for manual recovery or updating the deployment script to handle termination signals gracefully to ensure the cluster is not left in an inconsistent state.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
|------|-------|---------|
| `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` |
| `SSH_USER` | SSH login on every node | `bootjp` |
| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The example for NODES_RAFT_MAP uses short hostnames (e.g., kv01). For reliable cluster communication over Tailscale, it is recommended to use the full MagicDNS names (e.g., kv01.<tailnet>.ts.net). Short names may not be resolvable by the nodes themselves depending on their local DNS search domains, which could prevent the Raft cluster from forming correctly.

Suggested change:
| NODES_RAFT_MAP | Comma-separated raftId=host (no port — the script appends RAFT_PORT). The workflow renders this into the script's NODES env var. | n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,n3=kv03.<tailnet>.ts.net,n4=kv04.<tailnet>.ts.net,n5=kv05.<tailnet>.ts.net |

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +60 to +63
- Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions
may additionally require `devices:core` (write); enable that if the
join step fails with an authorization error. The action's README is
the definitive source for current scope requirements.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When using an OAuth client with the tailscale/github-action, the devices scope with write access is generally required to allow the action to register the ephemeral runner as a node. I recommend updating the instructions to include this scope as a requirement to avoid authorization errors during the tailnet join step.

Suggested change:
- Scopes: auth_keys (write) and devices (write). The devices scope (or devices:core in newer API versions) is required for the action to register the ephemeral runner as a node in your tailnet.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +75 to +76
rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts
friction for previews).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

GitHub's native environment protection rules do not support conditional auto-approval based on workflow inputs. Implementing this would require a custom GitHub Action for Deployment Protection Rules. It would be more accurate to state that dry-runs will still require manual approval if they target the production environment, or suggest using a separate, unprotected environment for previews.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad00bdc0ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/rolling-update.yml Outdated
Comment on lines +133 to +135
if [[ -z "$NODES_RAFT_MAP" ]]; then
echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP"
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail when requested node IDs are only partially matched

The nodes filter currently errors only when zero IDs match, so a typo like n1,n9 (or whitespace variant n1, n2) silently drops unmatched IDs and proceeds with a partial rollout. That means operators can believe they updated a specific subset while one or more intended nodes were skipped, which is risky during security or incident-driven deploys. Please validate that every requested ID is present in NODES_RAFT_MAP and fail if any are missing.

Useful? React with 👍 / 👎.

Comment on lines +152 to +154
IFS=',' read -r -a entries <<< "$SSH_TARGETS"
failed=0
for e in "${entries[@]}"; do

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check reachability for all rollout nodes, not only SSH map entries

The reachability step iterates only over SSH_TARGETS, but rolling-update.sh resolves missing SSH mappings by falling back to each node host from NODES (ssh_target_by_id). If SSH_TARGETS_MAP is incomplete, dry-run can report success without probing some actual rollout targets, and the job can then fail mid-roll when it reaches an unvalidated node. Preflight should derive targets from NODES + SSH_TARGETS using the same fallback semantics or enforce one-to-one mapping coverage first.

Useful? React with 👍 / 👎.

- workflow nodes filter (Codex P2): reject any raft ID in the
  `nodes` input that does not appear in NODES_RAFT_MAP. Previously
  a typo like `n1,n9` silently rolled n1 only; now the workflow
  fails fast with a list of known IDs so the operator sees the
  typo before touching prod.
- runbook section 4 (Gemini Medium x2): GitHub's native environment
  protection rules cannot be made conditional on workflow inputs,
  so the previous "auto-approve dry-run" guidance was wrong.
  Documented the three workable options: accept the prompt for
  dry-runs too (v1 default), split into a second unprotected
  environment, or install a deployment-protection-rule GitHub App.
- runbook section 4 NODES_RAFT_MAP example (Gemini Medium): use full
  MagicDNS FQDNs instead of short hostnames so every node can
  resolve its peers regardless of local DNS search domains.
- runbook section 6 (Gemini Medium): added "If a running workflow is
  cancelled mid-rollout" recovery steps — find the in-flight node
  from logs, finish the recreate by hand, confirm leader, rerun
  scoped. Filed as a tracked gap to teach the workflow per-node
  start-markers in a follow-up.

Not addressed: Gemini HIGH line 187 claiming the workflow file is
missing — the file IS present at .github/workflows/rolling-update.yml
and has been since the first push of this PR. Third time the bot has
flagged this (same finding in rounds 1 and 2); leaving as-is since
responding further would just be repeating the same correction.
@bootjp

bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner Author

Round 3 in 894bce9:

  • Codex P2: workflow rejects unknown raft IDs in nodes filter — n1,n9 now fails fast with the known-IDs list instead of silently rolling only n1.
  • Gemini Medium ×2: corrected dry_run environment-approval guidance (GitHub can't make protection rules conditional on inputs). Listed the three workable options; v1 ships with approach 1 (prompt on every run).
  • Gemini Medium: NODES_RAFT_MAP example now uses MagicDNS FQDNs, not short names.
  • Gemini Medium: added "If a running workflow is cancelled mid-rollout" recovery runbook section.

Not addressed: Gemini HIGH line 187 (workflow file missing) — the file IS in this PR; third round the bot has flagged it despite the response; leaving as-is.

/gemini review
@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design document and a comprehensive runbook for migrating the manual rolling-update process to an automated GitHub Actions workflow using Tailscale for secure connectivity. The documentation covers setup, security configurations, and operational procedures. Feedback focuses on correcting technical inaccuracies, such as fixing a YAML syntax error in the workflow snippet, updating Tailscale OAuth scope names, and ensuring that log patterns and command visibility descriptions in the runbook accurately reflect the behavior of the underlying deployment scripts. Additionally, an inconsistency regarding the use of FQDNs for node mapping was identified between the design and operational docs.

entries. Prevents the first-connect TOFU prompt.

**Variables (non-secret):**
- `NODES_RAFT_MAP` — `n1=kv01,n2=kv02,...` (advertised hostnames as seen

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is an inconsistency between the design document and the runbook. The runbook correctly recommends using full MagicDNS FQDNs for NODES_RAFT_MAP to ensure reliable resolution across nodes with different search domains, whereas this example uses short hostnames. Furthermore, as this design document proposes a production-grade system migration, it must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Suggested change
- `NODES_RAFT_MAP``n1=kv01,n2=kv02,...` (advertised hostnames as seen
- NODES_RAFT_MAP — n1=kv01.tailnet.ts.net,n2=kv02.tailnet.ts.net,... (advertised hostnames as seen
References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated

- Description: `elastickv GitHub Actions deploy`
- Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions
may additionally require `devices:core` (write); enable that if the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The scope devices:core is not a standard Tailscale OAuth scope. For the tailscale/github-action to successfully join the tailnet and manage ephemeral nodes, it typically requires the devices:write scope (along with auth_keys:write). Using an incorrect scope name will cause the authentication step to fail.

Suggested change
may additionally require `devices:core` (write); enable that if the
may additionally require devices (write); enable that if the

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
hazard that needs manual cleanup.

1. **Look at the last log line from the `Roll cluster` step.** The script
logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The runbook describes a log line that does not exist in the current version of scripts/rolling-update.sh. The script actually logs ==> [$node_id@$node_host] start at the beginning of each node's update process. Providing the correct log pattern is important for operators to accurately identify the state of a cancelled rollout.

Suggested change
logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before
logs ==> [n<id>@<host>] start before each node recreate. Whatever

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
flight when the cancel signal landed.
2. **SSH into that node** over Tailscale and run `docker ps`. If the
container is absent or `Exited`, finish the recreate by hand with the
docker run arguments the script emitted (which you can see in the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script scripts/rolling-update.sh does not emit the full docker run command to the logs; it redirects the output of the docker run command to /dev/null. Operators will not be able to see the arguments in the workflow log for the Roll cluster step. They should instead refer to the environment variables rendered in the preceding workflow step to reconstruct the command.

Suggested change
docker run arguments the script emitted (which you can see in the
docker run arguments (which you can reconstruct from the rendered


```
name: Rolling update
on: workflow_dispatch:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The YAML snippet contains a syntax error. on: workflow_dispatch: on a single line is invalid when followed by nested keys. It should be split across lines to be a valid GitHub Actions workflow trigger definition.

Suggested change
on: workflow_dispatch:
on:
workflow_dispatch:

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 894bce93d9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +188 to +189
Expected for non-dry-run deploys — a reviewer from the `production` environment
must click Approve. Check the "Required reviewers" list in the environment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix approval troubleshooting text for dry-run workflows

This troubleshooting guidance is incorrect for the workflow as implemented: .github/workflows/rolling-update.yml sets environment: production unconditionally, so dry-run and non-dry-run executions both pause for environment approval in v1. Telling operators this is only expected for non-dry-runs can send them down the wrong debugging path when a dry-run is legitimately waiting for approval.

Useful? React with 👍 / 👎.

- Gemini HIGH (design line 82): switch NODES_RAFT_MAP example to full
  MagicDNS FQDNs so it matches the runbook; bare hostnames resolve
  differently per node.
- Gemini Medium (design line 45): fix YAML — on/workflow_dispatch/inputs
  must be nested, not on a single line, and the fence is labelled yaml.
- Gemini Medium (runbook §3, design §2.3): retract devices:core — not a
  valid Tailscale OAuth scope; note devices:write as the standard one.
- Gemini Medium (runbook §6, line 153-156): correct the cancelled-job
  log pattern to what the script actually emits (`==> [<id>@<host>]
  start`, scripts/rolling-update.sh:398), not the fictitious
  `[rolling-update] rolling n<id>: ...`.
- Gemini Medium (runbook §6, line 156-160): clarify that docker run
  stdout/stderr is redirected to /dev/null, so operators reconstruct
  the invocation from the step-level env log, not from the docker-run
  argv.
- Codex P2 (runbook §8 approval troubleshooting): clarify that both
  dry-run and non-dry-run runs pause for approval in v1 because
  `environment: production` is unconditional; reference §4 for the
  second-environment upgrade path.
@bootjp

bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner Author

Round 4 in 165116b:

  • Gemini HIGH (design line 82): NODES_RAFT_MAP example now uses full MagicDNS FQDNs to match the runbook; a cross-reference between the two docs keeps the convention visible.
  • Gemini Medium (design line 45): fixed YAML — on: / workflow_dispatch: / inputs: are now correctly nested, and the fence is tagged yaml.
  • Gemini Medium (runbook §3 + design §2.3): retracted devices:core as invalid; noted devices:write as the standard scope and explicitly called out the prior wrong scope so future readers don't cargo-cult it.
  • Gemini Medium (runbook §6 lines 153-160): corrected the cancelled-job log pattern to ==> [<raft-id>@<host>] start (what the script actually emits at scripts/rolling-update.sh:398) and noted that docker run output is redirected to /dev/null, so operators reconstruct the invocation from the step-level env log, not from the docker-run argv.
  • Codex P2 (runbook §8 approval troubleshooting): clarified that both dry-run and non-dry-run runs pause for approval in v1 because environment: production is unconditional; reference to §4 for the second-environment upgrade path.

Doc-only PR; no build needed. The test (ubuntu-latest) CI failure on round-3's commit was unrelated to docs; will re-run after this push.

No other actionable feedback this round.

/gemini review
@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design document and a runbook for automating rolling updates via Tailscale and GitHub Actions. The feedback highlights that the actual workflow file is missing from the PR, identifies a contradiction regarding dry-run approvals in the environment settings, and corrects inaccuracies in the documentation concerning manual recovery variables and script idempotency.


## 5. Implementation plan

1. Write `.github/workflows/rolling-update.yml` implementing §2.1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The implementation plan lists the creation of .github/workflows/rolling-update.yml as the first step, but this file is missing from the current pull request. Since this workflow is the core of the proposed automation, it should be included for review.

Comment on lines +145 to +146
reviewers. A non-dry-run deploy will pause until approved; the dry-run run
itself does not need approval (it only needs the tailnet join).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a contradiction between this design document and the runbook regarding dry-run approvals. The runbook (Line 76) correctly notes that GitHub environment protections are not conditional and will pause dry-runs for approval. This section should be updated to reflect that limitation.

Suggested change
reviewers. A non-dry-run deploy will pause until approved; the dry-run run
itself does not need approval (it only needs the tailnet join).
reviewers. Note that in v1, both dry-run and non-dry-run executions will pause
for approval because GitHub's environment protection rules are not conditional.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +162 to +164
the resolved env instead: the step logs `NODES_RAFT_MAP`,
`EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and
`DATA_DIR` before invoking the script — those are sufficient to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The list of variables mentioned here is insufficient for manually reconstructing the docker run command. It misses critical networking and S3 configuration (e.g., RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3) required by the script. Also, GOMEMLIMIT and CONTAINER_MEMORY_LIMIT are mentioned but are not directly used by scripts/rolling-update.sh.

Suggested change
the resolved env instead: the step logs `NODES_RAFT_MAP`,
`EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and
`DATA_DIR` before invoking the script — those are sufficient to
the resolved env instead: the step logs the rendered NODES,
SSH_TARGETS, IMAGE, DATA_DIR, EXTRA_ENV, and all PORT
settings before invoking the script — those are sufficient to

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +175 to +178
The script is idempotent for the "container exists and is up" case, so
re-running the workflow with the same `ref` after confirming the
interrupted node is healthy is safe — the script will stop+recreate
each node in turn regardless of whether it was touched before.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The description of the script's behavior on re-run is inaccurate. scripts/rolling-update.sh (lines 794-798) skips nodes where the image ID is unchanged and the gRPC endpoint is healthy. It does not "stop+recreate each node in turn regardless".

Suggested change
The script is idempotent for the "container exists and is up" case, so
re-running the workflow with the same `ref` after confirming the
interrupted node is healthy is safe — the script will stop+recreate
each node in turn regardless of whether it was touched before.
The script is idempotent: if the image ID is unchanged and the gRPC
endpoint is healthy, it skips the node. Re-running the workflow with
the same ref after confirming the interrupted node is healthy is
safe — the script will skip completed nodes and only process those
requiring an update.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

bootjp added a commit that referenced this pull request Apr 24, 2026
…#617)

## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:

- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the
existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new
`CONTAINER_MEMORY_LIMIT`)

Both are env-var-controlled with empty-string opt-out (`${VAR-default}`
so unset uses the default, but an explicit empty string disables it).

## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169
times in 24h under a traffic spike. Each kill risked WAL-tail truncation
and triggered election storms, cascading into p99 GET spikes to 6-8s.
The runtime defense was applied by hand during the incident; this PR
makes it the script default so future rollouts inherit it.

- `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the
limit, keeping RSS below the container ceiling.
- `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap
growth), the kill is scoped to the container, not host processes like
`qemu-guest-agent` or `systemd`.

## Behavior changes

| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT`| `2500m` | `CONTAINER_MEMORY_LIMIT=""` |

Operator-supplied `EXTRA_ENV` keys override matching keys in
`DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over
the default).

## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL
corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this
script

## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the
incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ...
--memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node

## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 —
capacity/runtime defenses).
- Gemini Medium (design §2.6 line 146): the design doc contradicted
  the runbook by claiming dry-runs do NOT need approval. GitHub's
  environment-protection rules cannot be made conditional on workflow
  inputs, so `environment: production` pauses BOTH dry-run and
  non-dry-run executions in v1. Aligned the design-doc wording with
  the runbook and cross-referenced §4 for the second-environment
  upgrade path.
- Gemini Medium (runbook §6 line 164 env list): the list of
  reconstruction vars was incomplete. Listed every env var the
  workflow actually exports (IMAGE, DATA_DIR, RAFT_PORT, REDIS_PORT,
  S3_PORT, ENABLE_S3, NODES, SSH_TARGETS, EXTRA_ENV) and called out
  the script-level defaults for anything not overridden, plus noted
  GOMEMLIMIT / CONTAINER_MEMORY_LIMIT are propagated via EXTRA_ENV
  once PR #617 lands.
- Gemini Medium (runbook §6 line 178 idempotency): corrected the
  "stop+recreate every node regardless" claim. The script
  (scripts/rolling-update.sh:794-798) skips nodes whose running
  image id matches the target AND whose gRPC endpoint is healthy,
  so re-running after a partial roll is safe because already-rolled
  nodes are no-ops, not stops.

Declining again Gemini HIGH "workflow file missing" — the file IS
in this PR at .github/workflows/rolling-update.yml; this is the
fourth round the bot has flagged its own misread. See prior rounds
for rationale; no change.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24383b2db1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/rolling-update.yml Outdated
Comment on lines +52 to +54
uses: actions/checkout@v6
with:
ref: ${{ inputs.ref }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict deploy script checkout to trusted refs

The workflow checks out ${{ inputs.ref }} and later executes ./scripts/rolling-update.sh after loading production credentials (SSH key, Tailscale OAuth secret) and joining the tailnet, so ref is effectively arbitrary code execution, not just a deployment selector. If someone can dispatch runs and supply a branch/tag they control, a modified script in that ref can exfiltrate secrets or perform unintended actions once the environment is approved. Keep execution on a trusted protected revision (or strictly validate allowed refs) and use the input only for image selection.

Useful? React with 👍 / 👎.

Codex P1: the workflow hands the checked-out tree a Tailscale
OAuth secret and an SSH key, then executes scripts/rolling-update.sh
from that tree. Anyone who can dispatch runs could previously point
inputs.ref at a branch containing a malicious script modification
and exfiltrate the secrets.

Mitigations:
- New 'Validate ref is default branch or a tag' step rejects any
  ref that is not the default branch (by name or HEAD sha) or an
  existing tag. A sha reachable from elsewhere is still accepted
  (the subsequent checkout does its own verification) but non-
  default branches fail closed with an operator-visible error.
- actions/checkout now pins persist-credentials: false so the
  GITHUB_TOKEN is not left in the runner's git config for the
  deploy script to harvest. The token is still explicitly exposed
  to the ghcr verification step via env:, which is the only place
  it needs to be readable.
@bootjp

bootjp commented Apr 24, 2026

Copy link
Copy Markdown
Owner Author

Addressed in 3ddcd42:

Codex P1 (arbitrary code execution via inputs.ref): new pre-checkout step validates that ref resolves to the repo's default branch (by name or HEAD sha), an existing tag, or a sha; any other branch fails closed with an operator-visible error before the tailnet join / SSH key load. Also pinned actions/checkout to persist-credentials: false so the GITHUB_TOKEN is not left in the runner's git config for the deploy script to harvest.

Gemini HIGH / Medium comments already addressed in previous rounds (bot is reading stale state):

  • HIGH line 200 (workflow file missing): present at .github/workflows/rolling-update.yml since this PR's first push. Fourth round the bot has flagged this despite the correction; declining silently.
  • Medium: dry-run approval contradiction: §2.6 of the design doc and §4 of the runbook both now state GitHub's native protection rules do not condition on inputs, so dry-run pauses for approval in v1. Fixed in round 4 (commit 165116b).
  • Medium: vars list insufficient to reconstruct docker run: runbook §6 recovery section now lists the full set (IMAGE, DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3, NODES, SSH_TARGETS, merged EXTRA_ENV) with the script's defaults for what the workflow doesn't export. Fixed in round 4.
  • Medium: re-run description inaccurate: runbook now says scripts/rolling-update.sh:794-798 skips a node when its image id matches the target and the gRPC endpoint is healthy. Fixed in round 4.

/gemini review
@codex review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2598fc8e5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- name: Verify image exists on ghcr.io
env:
IMAGE_BASE: ${{ vars.IMAGE_BASE }}
IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Publish immutable image tags before deploying refs

When operators follow the documented default and enter a commit SHA/tag in ref, this resolves the deployment image tag from that value, but I checked the only image publishing workflow and it pushes only ghcr.io/${{ github.REPOSITORY }}:latest (.github/workflows/docker-image.yml:43). As a result docker manifest inspect fails for SHA/tag deploys before any rollout, and the documented rollback path using a previous SHA cannot work unless someone manually creates those tags; publish immutable tags in the build workflow or stop defaulting the deploy tag to ref.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/rolling-update.yml Outdated
host="${host%%:*}"
# strip user@ if present
host="${host##*@}"
if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preflight SSH instead of Tailscale ping

This preflight only runs tailscale ping, but Tailscale documents default ping as DISCO path discovery, while tailscale nc <hostname-or-IP> <port> is the command that actually connects to a port (https://tailscale.com/docs/reference/ping-types#disco, https://tailscale.com/docs/reference/tailscale-cli#nc). When sshd, TCP/22 ACLs, host keys, or authorized keys are wrong, dry-run can still report ok here and a live rollout can fail only after earlier nodes have already been updated; preflight the rendered SSH targets with the same SSH options the script will use, or at least check TCP/22.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/rolling-update.yml Outdated
echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: $(known_ids_csv "$NODES_RAFT_MAP")"
exit 1
fi
NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep full cluster map when targeting a subset

When nodes is set for a staged rollout, this replaces NODES with only the requested subset, but rolling-update.sh treats NODES as the full cluster definition: it derives RAFT_TO_REDIS_MAP/RAFT_TO_S3_MAP from it and uses the full host list to find leadership-transfer candidates. A documented nodes: n1 run on a multi-node cluster therefore starts n1 with leader-routing maps that omit the rest of the cluster, and if n1 is currently leader the script has no healthy peer candidate and aborts before rolling it. Keep the full NODES_RAFT_MAP and pass the requested subset as ROLLING_ORDER/a separate rollout target instead of filtering the cluster map itself.

Useful? React with 👍 / 👎.

{
"action": "accept",
"src": ["tag:ci-deploy"],
"dst": ["tag:elastickv-node:22"],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow node-to-node tailnet traffic for Raft

This ACL only grants the CI runner SSH access to the nodes, but the same runbook tells operators to put MagicDNS FQDNs in NODES_RAFT_MAP, and rolling-update.sh uses those hosts as each container's --address and to derive RAFT_TO_REDIS_MAP/RAFT_TO_S3_MAP. In a least-privilege tailnet following this snippet, tag:elastickv-node has no access to tag:elastickv-node on the Raft/adapter ports, so leader-transfer probes and the restarted node's cluster traffic can fail once the first node is rolled; add explicit node-to-node grants for the cluster ports while keeping tag:ci-deploy limited to SSH.

Useful? React with 👍 / 👎.

NODES: ${{ steps.render.outputs.NODES }}
SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
SSH_USER: ${{ vars.SSH_USER }}
IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Provide registry auth to remote pulls

The workflow authenticates only the Actions runner to GHCR for manifest inspection, then the live rollout passes just IMAGE to rolling-update.sh; the actual docker pull runs later on each remote host. For private GHCR packages, which this workflow explicitly accounts for with packages: read, nodes without preexisting Docker credentials will fail on the first target even though verification passed, blocking approved deploys; either document/enforce node-side registry login or perform a remote login with a deploy-scoped token before invoking the script.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

対応しました。

  • 部分ロールアウト時も NODES は full cluster map のまま維持し、対象ノードは ROLLING_ORDER として分離しました。nodes 入力の順序も保持します。
  • preflight を tailscale ping ではなく、実際の rollout 対象への ssh -o BatchMode=yes ... true に変更しました。
  • Docker image workflow で latest に加えて commit SHA tag も publish するようにしました。
  • runbook に node-to-node ACL、private GHCR の node 側 login 前提、SSH reachability の説明を追記しました。

検証:

  • actionlint .github/workflows/rolling-update.yml .github/workflows/docker-image.yml
  • bash -n scripts/rolling-update.sh
  • NODES=... SSH_TARGETS=... ROLLING_ORDER=n3,n1 IMAGE=... DRY_RUN=true ./scripts/rolling-update.sh --dry-run
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee2213ad68

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +65 to +69
- name: Checkout trusted deploy script
uses: actions/checkout@v6
with:
ref: ${{ steps.trusted-ref.outputs.checkout_ref }}
persist-credentials: false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require dispatches to run from the trusted branch

When a manual workflow_dispatch run is started from a non-default branch (via the UI branch selector or gh workflow run --ref), the workflow YAML from that branch is already what GitHub executes; this checkout only changes the workspace contents afterward. Under the documented production environment setup, an unreviewed branch could modify later steps to use DEPLOY_SSH_PRIVATE_KEY/Tailscale secrets after approval, so checking out the default branch here does not actually keep the deploy workflow trusted. Add an early guard that fails unless the dispatch ref is the default branch, or require the production environment to allow deployments only from that branch.

Useful? React with 👍 / 👎.

Comment on lines +327 to +331
env:
NODES: ${{ steps.render.outputs.NODES }}
SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }}
SSH_USER: ${{ vars.SSH_USER }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Forward S3 security settings into the rollout

For deployments that currently set ENABLE_S3=false or S3_CREDENTIALS_FILE in the manual rollout environment, this workflow does not forward those script settings into rolling-update.sh. The script defaults ENABLE_S3=true, and when S3_CREDENTIALS_FILE is empty it omits --s3CredentialsFile, causing a restarted node to expose the S3 adapter without the configured SigV4 credential file; forward these vars from the production environment (or fail closed) before making this workflow the canonical deploy path.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/rolling-update.yml Outdated
if [[ "$target" != *@* ]]; then
target="${SSH_USER:-$(id -un)}@$target"
fi
if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Retry SSH reachability after joining Tailnet

In tailnets where the ephemeral GitHub runner has not propagated immediately after the Tailscale join step, this single SSH attempt can fail the entire dry run or live rollout even though the node becomes reachable seconds later; Tailscale's action docs call out this eventual consistency and recommend connectivity checks that wait (their ping option waits up to 3 minutes). Wrap the ssh ... true probe in a retry/backoff instead of one ConnectTimeout=10 attempt.

Useful? React with 👍 / 👎.

Comment on lines +52 to +55
"dst": [
"tag:elastickv-node:50051", // Raft / raftadmin
"tag:elastickv-node:6379", // Redis adapter, if enabled
"tag:elastickv-node:9000", // S3 adapter, if enabled

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow node-to-node DynamoDB traffic

When the DynamoDB-compatible adapter is in use, the rollout script starts it on DYNAMO_PORT (default 8000) and follower requests are proxied to the current leader, but this ACL example only permits Raft, Redis, and S3 between tag:elastickv-node peers. Operators who copy this rule will block node-to-node DynamoDB leader proxying after the workflow rolls nodes, so include the DynamoDB port or call out that DynamoDB deployments need an additional ACL entry.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the current P1 deployment findings:

  • added an early guard requiring rolling-update workflow_dispatch runs to start from the repository default branch, and documented the production environment branch restriction as the actual trust boundary.
  • forwarded ENABLE_S3 and S3_CREDENTIALS_FILE from the production environment into both dry-run and live rollout paths, defaulting missing ENABLE_S3 to false and failing closed when ENABLE_S3=true without S3_CREDENTIALS_FILE.

Validation: actionlint .github/workflows/rolling-update.yml; bash -n scripts/rolling-update.sh; git diff --check; commit hook golangci-lint --config=.golangci.yaml run --fix.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (6)
docs/deploy_via_tailscale_runbook.md (4)

12-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block (MD040).

The shell script block starting at line 12 is missing a language identifier. Add bash or shell for syntax highlighting and consistency:

-```
+```bash
 # on each kv0X node
 sudo tailscale up \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deploy_via_tailscale_runbook.md` around lines 12 - 18, The fenced code
block containing the tailscale command sequence is missing a language
identifier, which is required for markdown linting (MD040). Add the language
specifier `bash` immediately after the opening triple backticks on the line that
currently reads just ``` before the comment about the kv0X node, so it becomes
```bash to enable proper syntax highlighting and pass the linting check.

29-32: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block (MD040).

The shell command block starting at line 29 is missing a language identifier. Add bash or shell:

-```
+```bash
 tailscale status | grep kv0X
 ping kv0X.<tailnet>.ts.net
-```
+```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deploy_via_tailscale_runbook.md` around lines 29 - 32, The fenced code
block containing the tailscale commands (tailscale status and ping) is missing a
language specifier, which violates markdown linting rules. Add the bash language
identifier by changing the opening triple backticks from ``` to ```bash on the
line before the tailscale status command.

176-180: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Emphasize the node-level GHCR credential requirement for private images.

Line 176–180 correctly documents that each node must be logged into ghcr.io with a deploy-scoped read token. This is critical and easy to overlook. Consider making it more prominent in the "Rollback" section or in a dedicated "Prerequisites" subsection, as it is a silent failure mode: the workflow's manifest check succeeds (runner is logged in) but the remote docker pull on the node fails (node is not logged in).

[optional]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deploy_via_tailscale_runbook.md` around lines 176 - 180, The
documentation about node-level GHCR credential requirements for private images
is currently in the main text flow and could be easily missed by readers,
especially since it represents a silent failure mode where the workflow's
manifest check succeeds but the actual docker pull on the node fails. Move or
reorganize the content discussing the requirement that each node must be logged
into ghcr.io with a deploy-scoped read token to make it more prominent. Consider
adding a dedicated "Prerequisites" subsection before the deployment steps, or
reorganizing the "Rollback" section to highlight this requirement with emphasis
formatting to ensure operators understand this critical configuration step and
the silent failure behavior it prevents.

115-122: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Clarify DEPLOY_KNOWN_HOSTS hashing and update guidance.

Line 122 correctly notes:

Use -H to hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised.

This is good security guidance. However, consider also noting that operators should regenerate this secret whenever the node list changes (e.g., adding a new node, decommissioning a node) or when known-hosts entries are out of sync with current Tailscale topology. A brief note in the runbook like "Regenerate when the node list changes or if SSH Host key verification failed" (which is already in troubleshooting §8) would help operators maintain this over time.

[optional]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deploy_via_tailscale_runbook.md` around lines 115 - 122, The
DEPLOY_KNOWN_HOSTS environment secret documentation in the table correctly
advises using the `-H` flag for security, but lacks maintenance guidance for
operators. Add clarification to the DEPLOY_KNOWN_HOSTS value description noting
that this secret should be regenerated whenever the node list changes (such as
adding or removing nodes from the Tailscale topology) or if SSH Host key
verification failed errors are encountered during deployment. This guidance
helps operators maintain consistency between the known_hosts entries and the
actual current Tailscale infrastructure.
docs/design/2026_04_24_proposed_deploy_via_tailscale.md (1)

128-142: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Clarify the scope of the dry-run and its limitations.

Section 2.5 describes the dry-run flow well, but consider adding a note that the script's dry-run mode validates node maps, rollout order, and service maps without executing any container operations. The design already states "The actual docker stop/rm/run loop does NOT execute," which is clear, but operators may benefit from knowing that the script also skips pulling the image and other side effects.

[optional]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md` around lines 128 -
142, The dry-run semantics section in 2.5 clearly states that the docker
stop/rm/run loop does not execute, but it lacks explicit clarification about
other skipped operations. Add a note after the existing docker operation
statement that explicitly lists other side effects that are skipped during
dry-run mode (such as image pulling and other non-validation operations), so
operators understand the full scope of what is and is not executed when running
with dry_run: true.
.github/workflows/docker-image.yml (1)

43-45: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Optional: normalize the image path to lowercase.

ghcr.io rejects image references containing uppercase characters. github.repository echoes the owner/repo as registered, so if the owner or repo name ever contains uppercase letters the docker push fails. This is safe for the current bootjp/elastickv slug, but lowercasing makes it robust to org/repo renames. The github.REPOSITORY casing is also non-idiomatic (expression context keys are case-insensitive, so it works, but github.repository is the convention).

Consider deriving a lowercased image name in an earlier step (e.g. via ${IMAGE,,} in a run step writing to $GITHUB_OUTPUT) and referencing it here. SHA tagging itself looks correct.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/docker-image.yml around lines 43 - 45, The tags section in
the docker-image.yml workflow uses the raw github.REPOSITORY variable which may
contain uppercase characters that ghcr.io rejects. Add a preceding step that
normalizes the repository name to lowercase using bash parameter expansion
(${IMAGE,,} or similar) and writes it to GITHUB_OUTPUT, then update both tag
references in the tags section to use this new lowercased output variable
instead of github.REPOSITORY while keeping the github.sha reference unchanged.
Also use the conventional casing of github.repository instead of
github.REPOSITORY for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/rolling-update.yml:
- Around line 81-97: The docker login command in the run script directly
interpolates `${{ github.actor }}` which creates a template injection
vulnerability. Move the github.actor value to an environment variable by adding
`GITHUB_ACTOR: ${{ github.actor }}` to the env section at the top of the step,
then in the docker login command replace the template expansion `${{
github.actor }}` with the quoted shell variable `"$GITHUB_ACTOR"` to safely
reference it.
- Around line 78-79: The "Install jq" step at lines 78-79 appears to be unused
since the file only uses GitHub CLI's built-in jq processor (gh api --jq) and
contains no other jq invocations. Either remove this step entirely as
unnecessary, or if it is genuinely needed for future use cases, add an apt-get
update command before the apt-get install command to prevent intermittent
failures from stale package indexes that would abort the deployment.
- Around line 72-73: The actions checkout and tailscale/github-action are
referenced by version tags rather than immutable commit SHAs, creating a
security vulnerability in this privileged deployment workflow that accesses
production secrets. Replace the version tag references with their corresponding
commit SHAs: change actions/checkout@v6 to use commit
df4cb1c069e1874edd31b4311f1884172cec0e10 and tailscale/github-action@v3 to use
commit 6cae46e2d796f265265cfcf628b72a32b4d7cade. This pins the actions to
immutable commits and prevents tag-moving attacks that could inject malicious
code into the deployment process.

In `@docs/deploy_via_tailscale_runbook.md`:
- Line 25: The runbook contains inconsistent spelling of the word "authorize"
using both UK English "authorised" (line 25) and US English "authorized" (line
121). Replace all instances of "authorised" with "authorized" throughout the
entire runbook to maintain consistent US English spelling convention that
matches the codebase. Specifically, update the phrase on line 25 that currently
reads "authorised the system sshd" to use "authorized" instead, and verify that
the remaining occurrences throughout the document (such as the reference on line
121 and any others) all use the "authorized" spelling consistently.

In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md`:
- Around line 128-142: In section 2.5 Dry-run semantics, the reachability check
description is outdated. Replace the bullet point that states "tailscale ping is
run against each SSH target to confirm reachability" with an accurate
description of the SSH-based reachability check, clarifying that it uses SSH
connectivity validation (such as ssh -o BatchMode=yes -o ConnectTimeout=10
<target> true) rather than a network-layer ping, and explain that this approach
catches SSH authentication and sshd availability failures that ICMP ping cannot
detect.

---

Nitpick comments:
In @.github/workflows/docker-image.yml:
- Around line 43-45: The tags section in the docker-image.yml workflow uses the
raw github.REPOSITORY variable which may contain uppercase characters that
ghcr.io rejects. Add a preceding step that normalizes the repository name to
lowercase using bash parameter expansion (${IMAGE,,} or similar) and writes it
to GITHUB_OUTPUT, then update both tag references in the tags section to use
this new lowercased output variable instead of github.REPOSITORY while keeping
the github.sha reference unchanged. Also use the conventional casing of
github.repository instead of github.REPOSITORY for consistency.

In `@docs/deploy_via_tailscale_runbook.md`:
- Around line 12-18: The fenced code block containing the tailscale command
sequence is missing a language identifier, which is required for markdown
linting (MD040). Add the language specifier `bash` immediately after the opening
triple backticks on the line that currently reads just ``` before the comment
about the kv0X node, so it becomes ```bash to enable proper syntax highlighting
and pass the linting check.
- Around line 29-32: The fenced code block containing the tailscale commands
(tailscale status and ping) is missing a language specifier, which violates
markdown linting rules. Add the bash language identifier by changing the opening
triple backticks from ``` to ```bash on the line before the tailscale status
command.
- Around line 176-180: The documentation about node-level GHCR credential
requirements for private images is currently in the main text flow and could be
easily missed by readers, especially since it represents a silent failure mode
where the workflow's manifest check succeeds but the actual docker pull on the
node fails. Move or reorganize the content discussing the requirement that each
node must be logged into ghcr.io with a deploy-scoped read token to make it more
prominent. Consider adding a dedicated "Prerequisites" subsection before the
deployment steps, or reorganizing the "Rollback" section to highlight this
requirement with emphasis formatting to ensure operators understand this
critical configuration step and the silent failure behavior it prevents.
- Around line 115-122: The DEPLOY_KNOWN_HOSTS environment secret documentation
in the table correctly advises using the `-H` flag for security, but lacks
maintenance guidance for operators. Add clarification to the DEPLOY_KNOWN_HOSTS
value description noting that this secret should be regenerated whenever the
node list changes (such as adding or removing nodes from the Tailscale topology)
or if SSH Host key verification failed errors are encountered during deployment.
This guidance helps operators maintain consistency between the known_hosts
entries and the actual current Tailscale infrastructure.

In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md`:
- Around line 128-142: The dry-run semantics section in 2.5 clearly states that
the docker stop/rm/run loop does not execute, but it lacks explicit
clarification about other skipped operations. Add a note after the existing
docker operation statement that explicitly lists other side effects that are
skipped during dry-run mode (such as image pulling and other non-validation
operations), so operators understand the full scope of what is and is not
executed when running with dry_run: true.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c29d949-a9e4-4ab0-8c53-1f5ca4952ae4

📥 Commits

Reviewing files that changed from the base of the PR and between 8e3bb37 and e96ac64.

📒 Files selected for processing (5)
  • .github/workflows/docker-image.yml
  • .github/workflows/rolling-update.yml
  • docs/deploy_via_tailscale_runbook.md
  • docs/design/2026_04_24_proposed_deploy_via_tailscale.md
  • scripts/rolling-update.sh

Comment thread .github/workflows/rolling-update.yml Outdated
Comment thread .github/workflows/rolling-update.yml Outdated
Comment thread .github/workflows/rolling-update.yml
Comment thread docs/deploy_via_tailscale_runbook.md
Comment thread docs/design/2026_04_24_proposed_deploy_via_tailscale.md

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e96ac64ca7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +340 to +344
env:
NODES: ${{ steps.render.outputs.NODES }}
SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }}
SSH_USER: ${{ vars.SSH_USER }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward non-default rollout settings

In deployments that rely on any non-default rolling-update.sh settings (for example RAFT_PORT, REDIS_PORT, DYNAMO_PORT, DATA_DIR, RAFT_ENGINE, S3_REGION, or EXTRA_ENV), this env block drops those overrides and the script falls back to its built-in defaults when recreating the container. That can make a node advertise the wrong ports or use a different data directory/engine than the current manual rollout configuration, so the workflow should either forward the existing script contract from environment vars or fail closed when unsupported overrides are required.

Useful? React with 👍 / 👎.

Comment thread scripts/rolling-update.sh
Comment on lines +923 to +925
if [[ "$DRY_RUN" == "true" ]]; then
print_dry_run_plan
exit 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate extra env before dry-run exits

When operators use the new --dry-run with DEFAULT_EXTRA_ENV or EXTRA_ENV set, this early exit skips the normalization/validation that the live path performs later. A malformed default can pass dry-run and fail only when the real rollout starts, and malformed user EXTRA_ENV can still reach the remote run_container path after the old container is removed, so the preview can give false confidence for exactly the deploy configuration it is meant to validate.

Useful? React with 👍 / 👎.

Comment on lines +351 to +353
if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then
echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment"
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check remote S3 credential files before rollout

When ENABLE_S3=true but the configured S3_CREDENTIALS_FILE is absent or unreadable on one target, this only checks that the path string is non-empty. The script's actual readability check happens inside run_container after stop_container, so a live rollout can remove a healthy container and then abort before starting its replacement; use the SSH preflight to run test -r on each rollout target (or move the script check before stopping).

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest CodeRabbit review on the current head:

  • pinned the privileged deployment workflow actions to immutable SHAs (actions/checkout and tailscale/github-action)
  • removed the unused jq install step
  • moved github.actor into an env var before docker login
  • added retry/backoff to the SSH reachability pre-check
  • updated the design/runbook docs for SSH-based reachability, node GHCR credentials, DynamoDB node-to-node ACLs, known_hosts maintenance, and markdown fence/spelling nits
  • normalized docker image tags through a lowercased image output

Validation: actionlint .github/workflows/rolling-update.yml .github/workflows/docker-image.yml, bash -n scripts/rolling-update.sh, and git diff --check all pass.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fcaac654b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +359 to +361
if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then
echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment"
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preflight remote S3 credential files

When ENABLE_S3=true and S3_CREDENTIALS_FILE is set but the file is missing or unreadable on a node, this check passes and the workflow proceeds to scripts/rolling-update.sh. That script validates the file inside run_container only after it has already removed the existing container, so a stale path in GitHub vars can leave the first targeted node down before the rollout aborts. Since this workflow already renders ROLLING_SSH_TARGETS, add a preflight SSH test -r "$S3_CREDENTIALS_FILE" for each target before invoking the script.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/rolling-update.yml Outdated
echo "::error::configure the production environment to allow deployments only from the default branch"
exit 1
fi
echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin checkout to the dispatch SHA

When a production run waits for environment approval, the default branch can advance before these steps execute; because this output is the branch name, actions/checkout later fetches whatever is at main at that moment rather than the workflow run's github.sha. That can make an approved/audited rollout execute a different scripts/rolling-update.sh than the one associated with the dispatched workflow, so resolve and checkout the immutable dispatch SHA instead of the moving branch ref.

Useful? React with 👍 / 👎.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +200 to +202
environment — the workflow exports `IMAGE`, `DATA_DIR`,
`RAFT_PORT`, `REDIS_PORT`, `S3_PORT`, `ENABLE_S3`, `NODES`,
`SSH_TARGETS`, and the merged `EXTRA_ENV` before invoking the

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Point recovery at values the workflow actually emits

When a rollout is canceled mid-node, this recovery step tells operators to reconstruct the docker run from the Roll cluster step's rendered environment, but that workflow step does not export or log DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, or the merged EXTRA_ENV it names here. In the cancellation scenario described here, especially if production uses non-default ports or directories, following the runbook can leave the operator without the values needed to restart the removed container; either emit those values in the workflow or point the instructions to the actual defaults/config source.

Useful? React with 👍 / 👎.

Comment thread docs/deploy_via_tailscale_runbook.md Outdated
Comment on lines +78 to +80
for current scope requirements. `devices:core` is NOT a valid
Tailscale OAuth scope — earlier drafts of this runbook named it and
would have produced an auth failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Correct the Tailscale OAuth scope guidance

The Tailscale OAuth docs list devices:core as a valid Devices Core scope, so this warning can send operators away from a real permission while suggesting devices:write, which may not match the current UI. If the join step needs device permissions beyond auth_keys, following these lines can leave tailscale/github-action failing with 403 during first setup; remove the "not valid" warning or name the exact scope required by the pinned action docs.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Updated in fdb65de:

  • Checkout now pins the trusted workflow/script checkout to github.sha instead of a moving default-branch ref after approval.
  • Added remote S3 credential preflight over SSH for every rollout target before invoking rolling-update.sh.
  • Added a rollout configuration log step and updated cancellation recovery docs to point at values the workflow actually emits.
  • Reworded Tailscale OAuth scope guidance to defer device-scope names to the pinned action docs/Tailscale UI.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. What shall we delve into next?

Reviewed commit: fdb65de92f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp

bootjp commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bee82be34e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

S3_REGION: ${{ vars.S3_REGION || 'us-east-1' }}
S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }}
S3_PATH_STYLE_ONLY: ${{ vars.S3_PATH_STYLE_ONLY || 'true' }}
DEFAULT_EXTRA_ENV: ${{ vars.DEFAULT_EXTRA_ENV || 'GOMEMLIMIT=1800MiB' }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward opt-in service settings

Fresh evidence in the current diff is that the job-level environment stops after the S3 settings and never maps vars.ENABLE_SQS/SQS_* (nor the other script-supported opt-in service knobs) into scripts/rolling-update.sh. In a production environment that already runs the SQS adapter via the manual rollout env, this Actions path will restart each node with the script default ENABLE_SQS=false, so the SQS listener and its leader-proxy map disappear during an otherwise routine roll; add those vars here or fail closed when they are configured but unsupported.

Useful? React with 👍 / 👎.

Comment on lines +53 to +56
"tag:elastickv-node:50051", // Raft / raftadmin
"tag:elastickv-node:6379", // Redis adapter, if enabled
"tag:elastickv-node:8000", // DynamoDB adapter, if enabled
"tag:elastickv-node:9000", // S3 adapter, if enabled

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow the SQS peer port in the Tailnet ACL

When the SQS adapter is enabled, rolling-update.sh starts it on SQS_PORT (default 9324) and derives RAFT_TO_SQS_MAP for follower-to-leader proxying, but this node-to-node ACL example only permits Raft, Redis, DynamoDB, and S3 ports. Operators copying this runbook for an SQS deployment will block node-to-node SQS proxy traffic even though the restarted containers advertise those peer endpoints, so add the SQS port here or call out that SQS needs an extra ACL entry.

Useful? React with 👍 / 👎.

echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment"
exit 1
fi
./scripts/rolling-update.sh

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Set up Go before invoking the live rollout

When dry_run=false, this step invokes rolling-update.sh; I checked the script's live path and it builds ./cmd/raftadmin with go build before rolling nodes, while the dry-run path exits before that build. This workflow never runs actions/setup-go for the go.mod toolchain (go1.26.4), unlike the repo's test workflows, so an approved live deploy can fail on a fresh ubuntu-latest runner because the required Go toolchain is missing or must be fetched uncached even though the dry-run passed; install Go before this step or provide a prebuilt RAFTADMIN_BIN.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant