From 5351546e46edfca07f1cfcc8aba3f57bd3f404c0 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Fri, 24 Apr 2026 19:57:37 +0900 Subject: [PATCH 01/12] ops(deploy): rolling-update via GitHub Actions over Tailscale MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Today's rolling-update flow is manual: operators SSH from a workstation with required env vars and invoke scripts/rolling-update.sh. That has no audit trail, no approval gate, no dry-run, and relies on per-operator secret handling. This change adds a workflow_dispatch workflow that joins the Tailnet via tailscale/github-action (ephemeral OAuth node, tag:ci-deploy), SSHes into each cluster node over MagicDNS, and invokes the existing scripts/rolling-update.sh — unchanged. Nodes stay as they are; the script's env-var contract is the integration boundary. Highlights: - workflow_dispatch only (no auto-deploy); production GitHub environment gates non-dry-run runs on required reviewers. - inputs: ref (required), image_tag (rollback override), nodes (subset filter), dry_run (default true). - Dry-run does everything up to the container touch: renders NODES + SSH_TARGETS from env variables, verifies the image exists on ghcr.io, and tailscale-pings every target. Catches typo'd inputs before any production effect. - Concurrency group "rolling-update", cancel-in-progress: false, so parallel invocations queue rather than stomp. - No node-side changes required: nodes are assumed to already run tailscaled and expose authorized_keys for the deploy user. Plain SSH over Tailscale; Tailscale SSH (keyless) is called out as a follow-up. Design doc: docs/design/2026_04_24_proposed_deploy_via_tailscale.md Operator runbook: docs/deploy_via_tailscale_runbook.md Validation: - actionlint passes (0 errors) - YAML parses - No changes to scripts/rolling-update.sh; the workflow calls it with the same env var contract already documented in the script's usage(). 
Out of scope (follow-ups): post-deploy health verification, auto- rollback on script failure, Jepsen-gating, image-signature verification, Tailscale SSH (keyless), shared deploy user. --- .github/workflows/rolling-update.yml | 199 ++++++++++++++++++ docs/deploy_via_tailscale_runbook.md | 150 +++++++++++++ ...026_04_24_proposed_deploy_via_tailscale.md | 191 +++++++++++++++++ 3 files changed, 540 insertions(+) create mode 100644 .github/workflows/rolling-update.yml create mode 100644 docs/deploy_via_tailscale_runbook.md create mode 100644 docs/design/2026_04_24_proposed_deploy_via_tailscale.md diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml new file mode 100644 index 000000000..25a1c992f --- /dev/null +++ b/.github/workflows/rolling-update.yml @@ -0,0 +1,199 @@ +name: Rolling update + +# Manually-triggered production rollout. Joins the Tailnet, SSHes over +# MagicDNS into each node, and invokes scripts/rolling-update.sh. +# See docs/design/2026_04_24_proposed_deploy_via_tailscale.md. + +on: + workflow_dispatch: + inputs: + ref: + description: Git ref (tag or sha) to deploy. Also used as the image tag unless image_tag is set. + required: true + type: string + image_tag: + description: Override the image tag (default = ref). Used for rollbacks. + required: false + type: string + default: "" + nodes: + description: Comma-separated raft IDs to roll (e.g. "n1,n2"). Empty = all nodes in NODES_RAFT_MAP. + required: false + type: string + default: "" + dry_run: + description: Render the plan and run a reachability check only; do NOT touch containers. + required: true + type: boolean + default: true + +permissions: + contents: read + id-token: write # required by tailscale/github-action OIDC flow + +concurrency: + group: rolling-update + cancel-in-progress: false + +jobs: + deploy: + runs-on: ubuntu-latest + # Approval gate — see GitHub environment settings for required reviewers. 
+ # Dry-runs also use this environment so the secret wiring is identical; + # the environment's approval rule should be configured to auto-approve + # dry-runs if that distinction is desired (GitHub UI: "Deployment + # protection rules"). + environment: production + timeout-minutes: 60 + + steps: + - name: Checkout + uses: actions/checkout@v6 + with: + ref: ${{ inputs.ref }} + + - name: Install jq + run: sudo apt-get install -y --no-install-recommends jq + + - name: Verify image exists on ghcr.io + env: + IMAGE_BASE: ${{ vars.IMAGE_BASE }} + IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }} + GHCR_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + set -euo pipefail + if [[ -z "$IMAGE_BASE" ]]; then + echo "::error::IMAGE_BASE repository variable is not set" + exit 1 + fi + echo "Checking $IMAGE_BASE:$IMAGE_TAG" + echo "$GHCR_TOKEN" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin >/dev/null + if ! docker manifest inspect "$IMAGE_BASE:$IMAGE_TAG" >/dev/null; then + echo "::error::image $IMAGE_BASE:$IMAGE_TAG not found on ghcr.io" + exit 1 + fi + + - name: Join Tailnet (ephemeral) + uses: tailscale/github-action@v3 + with: + oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }} + oauth-secret: ${{ secrets.TS_OAUTH_SECRET }} + tags: tag:ci-deploy + + - name: Configure SSH + env: + SSH_KEY: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY }} + KNOWN_HOSTS: ${{ secrets.DEPLOY_KNOWN_HOSTS }} + run: | + set -euo pipefail + mkdir -p ~/.ssh + chmod 700 ~/.ssh + printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519 + chmod 600 ~/.ssh/id_ed25519 + printf '%s\n' "$KNOWN_HOSTS" > ~/.ssh/known_hosts + chmod 644 ~/.ssh/known_hosts + # Sanity: no stray CRLF in the key, no empty file. 
+ test -s ~/.ssh/id_ed25519 || { echo "::error::DEPLOY_SSH_PRIVATE_KEY is empty"; exit 1; } + ssh-keygen -lf ~/.ssh/id_ed25519 >/dev/null + + - name: Render NODES and SSH_TARGETS + id: render + env: + NODES_RAFT_MAP: ${{ vars.NODES_RAFT_MAP }} + SSH_TARGETS_MAP: ${{ vars.SSH_TARGETS_MAP }} + NODES_FILTER: ${{ inputs.nodes }} + run: | + set -euo pipefail + if [[ -z "$NODES_RAFT_MAP" || -z "$SSH_TARGETS_MAP" ]]; then + echo "::error::NODES_RAFT_MAP or SSH_TARGETS_MAP is not set in the production environment variables" + exit 1 + fi + if [[ -n "$NODES_FILTER" ]]; then + # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset. + filter_csv() { + local all="$1" + local filter="$2" + local out="" + IFS=',' read -r -a entries <<< "$all" + IFS=',' read -r -a wanted <<< "$filter" + for e in "${entries[@]}"; do + key="${e%%=*}" + for w in "${wanted[@]}"; do + if [[ "$key" == "$w" ]]; then + out+="${e}," + break + fi + done + done + echo "${out%,}" + } + NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")" + SSH_TARGETS_MAP="$(filter_csv "$SSH_TARGETS_MAP" "$NODES_FILTER")" + if [[ -z "$NODES_RAFT_MAP" ]]; then + echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP" + exit 1 + fi + fi + { + echo "NODES=$NODES_RAFT_MAP" + echo "SSH_TARGETS=$SSH_TARGETS_MAP" + } >> "$GITHUB_OUTPUT" + echo "::group::Deploy plan" + echo "NODES=$NODES_RAFT_MAP" + echo "SSH_TARGETS=$SSH_TARGETS_MAP" + echo "::endgroup::" + + - name: Tailscale reachability check + env: + SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + run: | + set -euo pipefail + IFS=',' read -r -a entries <<< "$SSH_TARGETS" + failed=0 + for e in "${entries[@]}"; do + host="${e##*=}" + host="${host%%:*}" + # strip user@ if present + host="${host##*@}" + if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then + echo " ok $host" + else + echo "::error::$host not reachable over tailnet" + failed=1 + fi + done + if [[ "$failed" -ne 0 ]]; then + exit 1 + fi + + - name: 
Dry-run summary + if: ${{ inputs.dry_run }} + env: + NODES: ${{ steps.render.outputs.NODES }} + SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + IMAGE_BASE: ${{ vars.IMAGE_BASE }} + IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }} + SSH_USER: ${{ vars.SSH_USER }} + run: | + set -euo pipefail + cat <.ts.net +``` + +## 2. Tailscale ACL + +In the Tailscale admin console, add the deploy rule to the tailnet ACL: + +```jsonc +"tagOwners": { + "tag:ci-deploy": ["autogroup:admin"], + "tag:elastickv-node": ["autogroup:admin"], +}, +"acls": [ + { + "action": "accept", + "src": ["tag:ci-deploy"], + "dst": ["tag:elastickv-node:22"], + }, +], +``` + +`tag:ci-deploy` must NOT have access to any other port on the tailnet. The +deploy workflow only needs SSH. + +## 3. Tailscale OAuth client + +Admin console → Settings → OAuth clients → New client: + +- Description: `elastickv GitHub Actions deploy` +- Scopes: `auth_keys` (write) +- Tags: `tag:ci-deploy` + +Copy the client ID and secret; they go into GitHub in the next step. + +## 4. GitHub environment: `production` + +Repo → Settings → Environments → New environment: `production`. + +### Required reviewers +Configure "Required reviewers" on the environment. Non-dry-run deploys will +pause until one of the reviewers approves. Configure "Deployment protection +rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts +friction for previews). + +### Environment secrets + +| Name | Value | +|------|-------| +| `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 | +| `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 | +| `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user | +| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01..ts.net kv02..ts.net …` output (one host per line) | + +The SSH key should be ed25519, dedicated to CI (not a reused developer key). +Regenerate on operator rotation. 
+ +### Environment variables + +| Name | Value | Example | +|------|-------|---------| +| `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | +| `SSH_USER` | SSH login on every node | `bootjp` | +| `NODES_RAFT_MAP` | Comma-separated `raftId=host:port` | `n1=kv01:50051,n2=kv02:50051,n3=kv03:50051,n4=kv04:50051,n5=kv05:50051` | +| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01..ts.net,n2=kv02..ts.net,...` | + +## 5. Running a deploy + +Actions tab → "Rolling update" → Run workflow. + +Inputs: + +- `ref` — the git tag or sha to deploy (also used as the container image tag) +- `image_tag` — override only for rollbacks (e.g., deploy tag `v1.2.3` of a + commit that was also `v1.2.3`) +- `nodes` — subset of raft IDs, e.g., `n1,n2`. Empty rolls all nodes. +- `dry_run` — default `true`. Renders the plan and checks reachability without + touching containers. + +Recommended first-run sequence: + +1. `dry_run: true`, `nodes: n1`, `ref: ` — confirms tailnet join, + SSH config, image availability, target mapping. No production impact. +2. `dry_run: false`, `nodes: n1` — roll a single node, verify the cluster + stays healthy and the image is correct. +3. `dry_run: false`, `nodes:` (empty) — full roll. + +## 6. Rollback + +Re-run the workflow with `image_tag` set to the previous-known-good sha. The +`nodes` input can target specific nodes if only some carry the bad image. + +## 7. What the workflow does NOT do (yet) + +- **No post-deploy health verification beyond tailnet reachability.** The + script itself blocks on `raftadmin` leadership transfer and health-gate + timeouts, but the workflow does not independently probe Prometheus or + Redis after the roll. Add this when we have a canonical post-deploy + assertion suite. +- **No auto-rollback on failure.** If the script exits non-zero mid-roll, + the cluster is left in whatever state the script reached. The operator + must inspect and either re-roll or roll back manually. 
+- **No Jepsen gate.** The deploy does not require a green Jepsen run on + `ref` before proceeding. +- **No image-signature check.** `cosign verify` on the image is a follow-up. + +## 8. Troubleshooting + +### Job pauses indefinitely at "Waiting for approval" +Expected for non-dry-run deploys — a reviewer from the `production` environment +must click Approve. Check the "Required reviewers" list in the environment +settings. + +### `tailscale ping` fails for a node +The node may not be running `tailscaled`, not tagged `tag:elastickv-node`, or +the tailnet ACL may have drifted. `tailscale status` on the node should show +the tag; the admin console should show the IP in the `tag:elastickv-node` +group. + +### `image ... not found on ghcr.io` +The verification step hit the ghcr manifest API and got a 404. Either the +image tag was not pushed (check the `Docker Image CI` workflow for `ref`) or +the tag is a moving tag (`latest`) that the verification step can't +distinguish from stale. Specify an immutable tag. + +### SSH `Host key verification failed` +`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan` against every node and +update the secret. diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md new file mode 100644 index 000000000..17610ab60 --- /dev/null +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -0,0 +1,191 @@ +# Deploy via Tailscale + GitHub Actions + +**Status:** Proposed +**Author:** bootjp +**Date:** 2026-04-24 + +--- + +## 1. Background + +Today the rolling-update flow is manual: an operator SSHes from their workstation, +exports the required env vars (`NODES`, `SSH_TARGETS`, image tag, etc.), +invokes `scripts/rolling-update.sh`, and watches it roll the cluster. + +Problems: + +- **No audit trail.** Who rolled what, when, and from which commit is only + visible in each operator's local shell history.
+- **Manual secret handling.** SSH keys, Tailscale auth, and S3 creds live on + operator workstations. Joining and leaving the ops rotation requires key + shuffling. +- **No approval gate.** The production cluster is rolled by whoever types the + command. A typo can take out the cluster before anyone else sees it. +- **No dry-run.** The script supports neither `--dry-run` nor a preview mode; + operators who want to verify targeting have to read the script. + +The 2026-04-24 incident compounds the risk: the cluster is fragile enough that +a rolling update executed against the wrong `NODES` list could cascade into an +election storm. + +## 2. Proposal + +Move rolling deploys to a GitHub Actions workflow that joins the Tailnet via +`tailscale/github-action`, SSHes into each node over Tailscale MagicDNS, and +invokes the existing `scripts/rolling-update.sh`. All secrets live in GitHub +environments; every deploy becomes a PR-linked, reviewable event. + +**Precondition (operator responsibility):** Tailscale is already installed and +logged in on every node, with SSH access enabled over the tailnet. 
+ +### 2.1 Workflow shape + +``` +name: Rolling update +on: workflow_dispatch: + inputs: + ref: # git sha/tag of the image to deploy + image_tag: # defaults to $ref; override only for rollbacks + nodes: # subset of raft IDs; empty = full roll + dry_run: # bool, default TRUE — renders plan but doesn't roll + +jobs: + deploy: + environment: production # requires approval + concurrency: + group: rolling-update + cancel-in-progress: false + runs-on: ubuntu-latest + steps: + - checkout + - join tailnet (tailscale/github-action, ephemeral) + - configure SSH (add DEPLOY_SSH_PRIVATE_KEY to agent) + - render NODES + SSH_TARGETS from repo config + - if dry_run: print the derived env and exit + - else: ./scripts/rolling-update.sh +``` + +### 2.2 Secrets and variables + +Stored in a GitHub `production` environment (not repo-wide): + +**Secrets:** +- `TS_OAUTH_CLIENT_ID`, `TS_OAUTH_SECRET` — Tailscale OAuth client scoped to + "devices:write" on a single tag (e.g., `tag:ci-deploy`). Ephemeral nodes; + cleaned up automatically after the job. +- `DEPLOY_SSH_PRIVATE_KEY` — SSH key authorized on every node. Restricted to + the `deploy` user (if we split it out) or `bootjp` (initial). +- `DEPLOY_KNOWN_HOSTS` — pre-populated `known_hosts` with the Tailnet MagicDNS + entries. Prevents the first-connect TOFU prompt. + +**Variables (non-secret):** +- `NODES_RAFT_MAP` — `n1=kv01:50051,n2=kv02:50051,...` (advertised addresses + as seen from inside the tailnet). +- `SSH_TARGETS_MAP` — `n1=kv01.tailnet.ts.net,...` (MagicDNS). +- `IMAGE_BASE` — `ghcr.io/bootjp/elastickv` (tag is appended from the input). +- `SSH_USER` — e.g., `bootjp`. + +### 2.3 Tailscale authentication + +Use OAuth ephemeral nodes (not a long-lived auth key): + +- Create an OAuth client in Tailscale admin console with scope + `devices:write` on tag `tag:ci-deploy`. +- Store client ID + secret in GitHub env secrets. 
+- `tailscale/github-action@v3` joins the tailnet for the duration of the job + as an ephemeral tagged node; disconnects automatically on job exit. + +ACLs on the Tailnet side should limit `tag:ci-deploy` to SSH (tcp/22) on +`tag:elastickv-node` only. No other ports, no other tags. + +### 2.4 SSH + +Two options: + +- **A. Tailscale SSH.** Lets CI SSH in without managing an SSH keypair: the + Tailnet ACL is the authorization model. Requires the nodes to have + `--ssh` flag on `tailscaled` (or `tailscale up --ssh`) and the Tailnet ACL + to grant `tag:ci-deploy` SSH access to node tag + user. No SSH keys in + GitHub at all. +- **B. Plain SSH over Tailscale.** CI brings an SSH key; nodes continue to + use `~/.ssh/authorized_keys`. Tailscale is just the network layer. + +**Recommendation for v1: B** (plain SSH). Nodes already have `authorized_keys` +for the current manual flow; nothing to change on the node side. Tailscale +SSH (A) can be a follow-up once the key-rotation story is written up. + +### 2.5 Dry-run semantics + +With `dry_run: true` (the default): + +- Everything up to script invocation runs (checkout, tailnet join, SSH agent + load, `NODES`/`SSH_TARGETS` render). +- The script is invoked with `--help` + the rendered env is printed as a + collapsed log group. +- `tailscale ping` is run against each SSH target to confirm reachability. +- The actual `docker stop/rm/run` loop does NOT execute. + +This catches the common failure modes (bad secret, bad env mapping, a node +unreachable over the tailnet) before touching any live container. + +### 2.6 Production environment approval + +Mark the `production` GitHub environment as requiring approval from a list of +reviewers. A non-dry-run deploy will pause until approved; the dry-run run +itself does not need approval (it only needs the tailnet join). + +Alternative: require approval unconditionally and treat the dry-run as a +"preview" that an approver must ack. Simpler policy, slightly more friction. 
+ +**Recommendation:** approval required for non-dry-run only. Dry-runs are +cheap and useful. + +### 2.7 Rollback + +Rolling back uses the same workflow with `image_tag: `. The +script already supports the rollout order env var (`ROLLING_ORDER`) so an +operator can force-roll only the affected nodes. + +**Gap:** there is no "stop mid-rollout" control today. If the workflow is +cancelled via GitHub UI during a roll, the in-flight node may be mid-recreate. +`rolling-update.sh` is supposed to be idempotent and crash-safe, but this +should be verified before we call the workflow production-ready. + +## 3. Open questions + +- **SSH user.** Continue using `bootjp` (personal) or provision a shared + `deploy` user on each node? v1 sticks with `bootjp` to keep scope tight; + follow-up can introduce `deploy` with a limited sudo rule for `docker`. +- **Secret scope.** Environment-scoped secrets (as proposed) vs. + repository-scoped. Environment-scoped wins on blast radius but requires + the GitHub environment to be pre-created. Assume pre-created. +- **Image availability check.** Should the workflow verify the image tag + exists on ghcr.io before starting the roll? Cheap to add (`docker manifest + inspect` in a pre-step) and prevents a half-rolled cluster when the tag is + typo'd. +- **Jepsen gating.** The existing `jepsen-test.yml` workflow exists. + Option: require a green Jepsen run on `ref` within the last N hours before + allowing deploy. Skipped for v1; worth revisiting before rolling this out + to high-traffic periods. + +## 4. Out of scope for v1 + +- Automatic deploys on merge to main (needs more test coverage before we'd + trust it). +- Blue-green or canary strategies (we don't have the traffic-routing layer + for it). +- Metrics-based rollback trigger (watch p99, auto-revert if it jumps). +- Tailscale SSH (option A above). +- A shared `deploy` user with restricted sudo. + +## 5. Implementation plan + +1. 
Write `.github/workflows/rolling-update.yml` implementing §2.1. +2. Document the secrets/variables setup in + `docs/deploy_via_tailscale_runbook.md` (new). +3. Run once with `dry_run: true` on a feature branch to validate secrets + wiring without touching prod. +4. Run once with `dry_run: false` targeting a single node (via the `nodes` + input) to prove the happy path. +5. Cut over: archive the operator-local rolling flow, document the new one + as the canonical path. From 6322748fe47e9cc59d39124bd56249cc9707acb1 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 00:10:01 +0900 Subject: [PATCH 02/12] fix(docs): align deploy-via-tailscale with script's NODES format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - NODES_RAFT_MAP examples no longer include ":50051". scripts/rolling-update.sh auto-appends RAFT_PORT to each entry, so a port in NODES produces host:port:port when the script constructs --address (Gemini HIGH). - OAuth client scope: design doc §2.3 said "devices:write" while the runbook §3 said "auth_keys (write)". Aligned both on "auth_keys (write)" which is what tailscale/github-action requires to mint the ephemeral join key. Added a note that newer action versions may additionally need "devices:core" (write) and to consult the action's README as the authoritative scope list (addresses the Gemini MEDIUM comment while hedging against action-version drift). --- docs/deploy_via_tailscale_runbook.md | 7 +++++-- .../design/2026_04_24_proposed_deploy_via_tailscale.md | 10 +++++++--- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index e1ddeb9d2..cdca043e3 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -50,7 +50,10 @@ deploy workflow only needs SSH.
Admin console → Settings → OAuth clients → New client: - Description: `elastickv GitHub Actions deploy` -- Scopes: `auth_keys` (write) +- Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions + may additionally require `devices:core` (write); enable that if the + join step fails with an authorization error. The action's README is + the definitive source for current scope requirements. - Tags: `tag:ci-deploy` Copy the client ID and secret; they go into GitHub in the next step. @@ -83,7 +86,7 @@ Regenerate on operator rotation. |------|-------|---------| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | `SSH_USER` | SSH login on every node | `bootjp` | -| `NODES_RAFT_MAP` | Comma-separated `raftId=host:port` | `n1=kv01:50051,n2=kv02:50051,n3=kv03:50051,n4=kv04:50051,n5=kv05:50051` | +| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | | `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01..ts.net,n2=kv02..ts.net,...` | ## 5. Running a deploy diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md index 17610ab60..d69216d3f 100644 --- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -79,8 +79,9 @@ Stored in a GitHub `production` environment (not repo-wide): entries. Prevents the first-connect TOFU prompt. **Variables (non-secret):** -- `NODES_RAFT_MAP` — `n1=kv01:50051,n2=kv02:50051,...` (advertised addresses - as seen from inside the tailnet). +- `NODES_RAFT_MAP` — `n1=kv01,n2=kv02,...` (advertised hostnames as seen + from inside the tailnet; the script appends `RAFT_PORT` automatically, + so do NOT include a port here). - `SSH_TARGETS_MAP` — `n1=kv01.tailnet.ts.net,...` (MagicDNS). - `IMAGE_BASE` — `ghcr.io/bootjp/elastickv` (tag is appended from the input). 
- `SSH_USER` — e.g., `bootjp`. @@ -90,7 +91,10 @@ Stored in a GitHub `production` environment (not repo-wide): Use OAuth ephemeral nodes (not a long-lived auth key): - Create an OAuth client in Tailscale admin console with scope - `devices:write` on tag `tag:ci-deploy`. + `auth_keys` (write) on tag `tag:ci-deploy`. (`tailscale/github-action` + uses the OAuth client to mint a short-lived auth key on each run; + recent action versions may also require `devices:core` — consult the + action's README for the current scope list.) - Store client ID + secret in GitHub env secrets. - `tailscale/github-action@v3` joins the tailnet for the duration of the job as an ephemeral tagged node; disconnects automatically on job exit. From ad00bdc0badd0e402be45a27712ec83b761abab2 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 00:34:08 +0900 Subject: [PATCH 03/12] fix(docs,workflow): address round-2 deploy-via-tailscale review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - workflow: add `packages: read` to the job permissions so the `Verify image exists on ghcr.io` step's `docker manifest inspect` call works against private ghcr.io images (Codex P1). - runbook §1: explain that `--ssh=false` disables Tailscale SSH and the workflow relies on the system sshd — operators who use Tailscale SSH elsewhere need to keep that in mind (Gemini Medium). - runbook §4: change `ssh-keyscan` example + troubleshooting to `ssh-keyscan -H` so known_hosts entries are hashed and the secret does not leak tailnet topology in plaintext (Gemini Security Medium). - runbook §4 variables: document that `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` are workflow-side names the render step maps to the script's `NODES` / `SSH_TARGETS`; manual invocation from a workstation must use the script-side names (Gemini Medium). 
Not addressed: Gemini HIGH claim that the workflow file is missing (line 187) — it IS included at `.github/workflows/rolling-update.yml` in this PR; the reviewer misread the file list. Not addressed: Gemini HIGH re native --dry-run flag + zero-downtime strategy (line 128) — dry-run is deliberately a workflow-level input, not a script-level flag, so the script stays invokable from a workstation without CI-specific options; zero-downtime cutover is outside the scope of a CI wrapper and is tracked in the resilience-roadmap follow-ups. --- .github/workflows/rolling-update.yml | 1 + docs/deploy_via_tailscale_runbook.md | 22 ++++++++++++++++++---- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index 25a1c992f..1e691fde5 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -30,6 +30,7 @@ on: permissions: contents: read id-token: write # required by tailscale/github-action OIDC flow + packages: read # required by `docker manifest inspect` on ghcr.io private images concurrency: group: rolling-update diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index cdca043e3..abdacece5 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -17,6 +17,13 @@ sudo tailscale up \ --accept-routes=false ``` +`--ssh=false` disables Tailscale SSH, so the node's regular system +sshd must be running and authorised to accept connections on the +tailnet interface. The workflow uses plain SSH over the tailnet +(Tailscale is only the network layer); if you rely on Tailscale SSH +for operator access elsewhere, drop this flag but keep in mind the +workflow still connects to the system sshd. + Verify the node is reachable by MagicDNS from another tailnet peer: ``` @@ -75,7 +82,7 @@ friction for previews). 
| `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 | | `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 | | `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user | -| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01..ts.net kv02..ts.net …` output (one host per line) | +| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan -H kv01..ts.net kv02..ts.net …` output. Use `-H` to hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised. | The SSH key should be ed25519, dedicated to CI (not a reused developer key). Regenerate on operator rotation. @@ -86,8 +93,15 @@ Regenerate on operator rotation. |------|-------|---------| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | `SSH_USER` | SSH login on every node | `bootjp` | -| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | -| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01..ts.net,n2=kv02..ts.net,...` | +| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | +| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | + +**Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` +in the `production` environment to keep the GitHub-side names +distinct from the script-side env var names it hands to +`rolling-update.sh`. If you run the script by hand from a workstation +you must export `NODES` and `SSH_TARGETS` directly — the workflow-side +names are only understood by the workflow's render step. ## 5. Running a deploy @@ -149,5 +163,5 @@ the tag is a moving tag (`latest`) that the verification step can't distinguish from stale. 
Specify an immutable tag. ### SSH `Host key verification failed` -`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan` against every node and +`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan -H` against every node and update the secret. From 894bce93d94678429c8182ae1a276d9484a5225a Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 00:54:23 +0900 Subject: [PATCH 04/12] fix(workflow,docs): address round-3 deploy-via-tailscale review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - workflow nodes filter (Codex P2): reject any raft ID in the `nodes` input that does not appear in NODES_RAFT_MAP. Previously a typo like `n1,n9` silently rolled n1 only; now the workflow fails fast with a list of known IDs so the operator sees the typo before touching prod. - runbook section 4 (Gemini Medium x2): GitHub's native environment protection rules cannot be made conditional on workflow inputs, so the previous "auto-approve dry-run" guidance was wrong. Documented the three workable options: accept the prompt for dry-runs too (v1 default), split into a second unprotected environment, or install a deployment-protection-rule GitHub App. - runbook section 4 NODES_RAFT_MAP example (Gemini Medium): use full MagicDNS FQDNs instead of short hostnames so every node can resolve its peers regardless of local DNS search domains. - runbook section 6 (Gemini Medium): added "If a running workflow is cancelled mid-rollout" recovery steps — find the in-flight node from logs, finish the recreate by hand, confirm leader, rerun scoped. Filed as a tracked gap to teach the workflow per-node start-markers in a follow-up. Not addressed: Gemini HIGH line 187 claiming the workflow file is missing — the file IS present at .github/workflows/rolling-update.yml and has been since the first push of this PR. 
Third time the bot has flagged this (same finding in rounds 1 and 2); leaving as-is since responding further would just be repeating the same correction. --- .github/workflows/rolling-update.yml | 30 +++++++++++++--- docs/deploy_via_tailscale_runbook.md | 51 ++++++++++++++++++++++++---- 2 files changed, 71 insertions(+), 10 deletions(-) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index 1e691fde5..852840a1f 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -111,15 +111,37 @@ jobs: fi if [[ -n "$NODES_FILTER" ]]; then # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset. + # Reject any filter ID that does not appear in the map: silently + # dropping unknown IDs would let a typo like "n1,n9" proceed as + # a one-node rollout of n1 alone, which is a staged-deploy + # footgun. + IFS=',' read -r -a wanted <<< "$NODES_FILTER" + IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP" + declare -a known_ids=() + for e in "${entries[@]}"; do + known_ids+=("${e%%=*}") + done + unknown="" + for w in "${wanted[@]}"; do + found=0 + for k in "${known_ids[@]}"; do + if [[ "$k" == "$w" ]]; then found=1; break; fi + done + if [[ $found -eq 0 ]]; then unknown+="${unknown:+, }$w"; fi + done + if [[ -n "$unknown" ]]; then + echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. 
Known IDs: ${known_ids[*]}" + exit 1 + fi filter_csv() { local all="$1" local filter="$2" local out="" - IFS=',' read -r -a entries <<< "$all" - IFS=',' read -r -a wanted <<< "$filter" - for e in "${entries[@]}"; do + IFS=',' read -r -a list_entries <<< "$all" + IFS=',' read -r -a list_wanted <<< "$filter" + for e in "${list_entries[@]}"; do key="${e%%=*}" - for w in "${wanted[@]}"; do + for w in "${list_wanted[@]}"; do if [[ "$key" == "$w" ]]; then out+="${e}," break diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index abdacece5..1a5fbefb5 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -70,10 +70,23 @@ Copy the client ID and secret; they go into GitHub in the next step. Repo → Settings → Environments → New environment: `production`. ### Required reviewers -Configure "Required reviewers" on the environment. Non-dry-run deploys will -pause until one of the reviewers approves. Configure "Deployment protection -rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts -friction for previews). +Configure "Required reviewers" on the environment. **Every run that targets +this environment pauses for approval** — including dry-runs, because +GitHub's native environment-protection rules cannot be made conditional on +workflow inputs. Three ways to handle the dry-run-approval friction: + +1. **Accept the prompt for dry-runs too.** A dry-run requires one approver + click before it proceeds; still cheap and keeps the policy simple. +2. **Add a second environment `production-dry-run` without required + reviewers** and change the workflow to pick the environment via + `environment: ${{ inputs.dry_run && 'production-dry-run' || 'production' }}`. + Cleanest but doubles the secrets/vars you must keep in sync. +3. **Install a deployment-protection-rule GitHub App** (custom or + marketplace) that approves runs whose inputs show `dry_run == true`. 
+ Most flexible; most setup. + +v1 ships with approach 1 (single environment, prompt on every run). +Approach 2 is the recommended upgrade once the friction becomes annoying. ### Environment secrets @@ -93,8 +106,8 @@ Regenerate on operator rotation. |------|-------|---------| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | `SSH_USER` | SSH login on every node | `bootjp` | -| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | -| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | +| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). Use full MagicDNS FQDNs so every node can resolve the advertised address regardless of local DNS search domains. The workflow renders this into the script's `NODES` env var. | `n1=kv01..ts.net,n2=kv02..ts.net,n3=kv03..ts.net,n4=kv04..ts.net,n5=kv05..ts.net` | +| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | **Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` in the `production` environment to keep the GitHub-side names @@ -129,6 +142,32 @@ Recommended first-run sequence: Re-run the workflow with `image_tag` set to the previous-known-good sha. The `nodes` input can target specific nodes if only some carry the bad image. +### If a running workflow is cancelled mid-rollout + +GitHub cancelling the job between node steps is the one operational +hazard that needs manual cleanup. + +1. 
**Look at the last log line from the `Roll cluster` step.** The script + logs `[rolling-update] rolling n: docker stop/rm/run ...` before + each node recreate. Whatever `n` appears last is the one in + flight when the cancel signal landed. +2. **SSH into that node** over Tailscale and run `docker ps`. If the + container is absent or `Exited`, finish the recreate by hand with the + docker run arguments the script emitted (which you can see in the + workflow log, step `Roll cluster`). +3. **Confirm the new leader via `raftadmin` or metrics** before re-running + the workflow with `nodes:` scoped to the remaining untouched IDs. Do + NOT re-run the full rollout if the partial one is still in flight — + it will stop the same node you are trying to recover. +4. **File a ticket** with the log excerpt so we can eventually teach the + workflow to set a start-marker on each node and fast-skip completed + nodes on re-run. + +The script is idempotent for the "container exists and is up" case, so +re-running the workflow with the same `ref` after confirming the +interrupted node is healthy is safe — the script will stop+recreate +each node in turn regardless of whether it was touched before. + ## 7. What the workflow does NOT do (yet) - **No post-deploy health verification beyond tailnet reachability.** The From 165116bee650aa70072bec829ed20c1875d3e114 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 01:11:04 +0900 Subject: [PATCH 05/12] docs(deploy): round-4 deploy-via-tailscale review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Gemini HIGH (design line 82): switch NODES_RAFT_MAP example to full MagicDNS FQDNs so it matches the runbook; bare hostnames resolve differently per node. - Gemini Medium (design line 45): fix YAML — on/workflow_dispatch/inputs must be nested, not on a single line, and the fence is labelled yaml. 
- Gemini Medium (runbook §3, design §2.3): retract devices:core — not a valid Tailscale OAuth scope; note devices:write as the standard one. - Gemini Medium (runbook §6, line 153-156): correct the cancelled-job log pattern to what the script actually emits (`==> [@] start`, scripts/rolling-update.sh:398), not the fictitious `[rolling-update] rolling n: ...`. - Gemini Medium (runbook §6, line 156-160): clarify that docker run stdout/stderr is redirected to /dev/null, so operators reconstruct the invocation from the step-level env log, not from the docker-run argv. - Codex P2 (runbook §8 approval troubleshooting): clarify that both dry-run and non-dry-run runs pause for approval in v1 because `environment: production` is unconditional; reference §4 for the second-environment upgrade path. --- docs/deploy_via_tailscale_runbook.md | 38 +++++++++++++------ ...026_04_24_proposed_deploy_via_tailscale.md | 33 +++++++++------- 2 files changed, 46 insertions(+), 25 deletions(-) diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index 1a5fbefb5..ca411ec05 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -58,9 +58,12 @@ Admin console → Settings → OAuth clients → New client: - Description: `elastickv GitHub Actions deploy` - Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions - may additionally require `devices:core` (write); enable that if the - join step fails with an authorization error. The action's README is - the definitive source for current scope requirements. + may additionally require `devices:write` (to register and clean up + the ephemeral node); enable that if the join step fails with an + authorization error. The action's README is the definitive source + for current scope requirements. `devices:core` is NOT a valid + Tailscale OAuth scope — earlier drafts of this runbook named it and + would have produced an auth failure. 
- Tags: `tag:ci-deploy` Copy the client ID and secret; they go into GitHub in the next step. @@ -147,14 +150,20 @@ Re-run the workflow with `image_tag` set to the previous-known-good sha. The GitHub cancelling the job between node steps is the one operational hazard that needs manual cleanup. -1. **Look at the last log line from the `Roll cluster` step.** The script - logs `[rolling-update] rolling n: docker stop/rm/run ...` before - each node recreate. Whatever `n` appears last is the one in +1. **Look at the last log line from the `Roll cluster` step.** The + script emits `==> [@] start` at the beginning of + each per-node recreate (see `scripts/rolling-update.sh:398`). + Whichever `` appears in the last such line is the one in flight when the cancel signal landed. 2. **SSH into that node** over Tailscale and run `docker ps`. If the - container is absent or `Exited`, finish the recreate by hand with the - docker run arguments the script emitted (which you can see in the - workflow log, step `Roll cluster`). + container is absent or `Exited`, finish the recreate by hand. The + `docker run` invocation itself is redirected to `/dev/null` by the + script, so the workflow log does NOT contain the full argv. Use + the resolved env instead: the step logs `NODES_RAFT_MAP`, + `EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and + `DATA_DIR` before invoking the script — those are sufficient to + reconstruct the same `docker run` you would see if you re-ran with + the same inputs. 3. **Confirm the new leader via `raftadmin` or metrics** before re-running the workflow with `nodes:` scoped to the remaining untouched IDs. Do NOT re-run the full rollout if the partial one is still in flight — @@ -185,9 +194,14 @@ each node in turn regardless of whether it was touched before. ## 8. Troubleshooting ### Job pauses indefinitely at "Waiting for approval" -Expected for non-dry-run deploys — a reviewer from the `production` environment -must click Approve. 
Check the "Required reviewers" list in the environment -settings. +Expected for **every** run in v1 — `.github/workflows/rolling-update.yml` +sets `environment: production` unconditionally, so both dry-run and +non-dry-run executions pause for approval. A reviewer from the +`production` environment must click Approve. Check the "Required +reviewers" list in the environment settings. See §4 "GitHub +environment" for the dry-run-approval alternatives (approach 2: add a +second `production-dry-run` environment without required reviewers) +if the friction becomes intolerable. ### `tailscale ping` fails for a node The node may not be running `tailscaled`, not tagged `tag:elastickv-node`, or diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md index d69216d3f..7e65d334d 100644 --- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -40,14 +40,15 @@ logged in on every node, with SSH access enabled over the tailnet. ### 2.1 Workflow shape -``` +```yaml name: Rolling update -on: workflow_dispatch: - inputs: - ref: # git sha/tag of the image to deploy - image_tag: # defaults to $ref; override only for rollbacks - nodes: # subset of raft IDs; empty = full roll - dry_run: # bool, default TRUE — renders plan but doesn't roll +on: + workflow_dispatch: + inputs: + ref: # git sha/tag of the image to deploy + image_tag: # defaults to $ref; override only for rollbacks + nodes: # subset of raft IDs; empty = full roll + dry_run: # bool, default TRUE — renders plan but doesn't roll jobs: deploy: @@ -79,10 +80,13 @@ Stored in a GitHub `production` environment (not repo-wide): entries. Prevents the first-connect TOFU prompt. **Variables (non-secret):** -- `NODES_RAFT_MAP` — `n1=kv01,n2=kv02,...` (advertised hostnames as seen - from inside the tailnet; the script appends `RAFT_PORT` automatically, - so do NOT include a port here). 
-- `SSH_TARGETS_MAP` — `n1=kv01.tailnet.ts.net,...` (MagicDNS). +- `NODES_RAFT_MAP` — `n1=kv01.tailnet.ts.net,n2=kv02.tailnet.ts.net,...` + (full MagicDNS FQDNs; bare short names can resolve differently + depending on each node's search-domain configuration). The script + appends `RAFT_PORT` automatically, so do NOT include a port here. + The runbook (`docs/deploy_via_tailscale_runbook.md`) carries the + same FQDN convention; keep the two in sync if either changes. +- `SSH_TARGETS_MAP` — `n1=kv01.tailnet.ts.net,...` (MagicDNS FQDN). - `IMAGE_BASE` — `ghcr.io/bootjp/elastickv` (tag is appended from the input). - `SSH_USER` — e.g., `bootjp`. @@ -93,8 +97,11 @@ Use OAuth ephemeral nodes (not a long-lived auth key): - Create an OAuth client in Tailscale admin console with scope `auth_keys` (write) on tag `tag:ci-deploy`. (`tailscale/github-action` uses the OAuth client to mint a short-lived auth key on each run; - recent action versions may also require `devices:core` — consult the - action's README for the current scope list.) + recent action versions may also require `devices:write` so the + ephemeral node can register and be cleaned up — consult the action's + README for the current scope list. Earlier drafts of this doc named + `devices:core`, which is not a supported Tailscale OAuth scope and + would fail authentication.) - Store client ID + secret in GitHub env secrets. - `tailscale/github-action@v3` joins the tailnet for the duration of the job as an ephemeral tagged node; disconnects automatically on job exit. From 24383b2db1460fb999d6c50efb91738ca18cd896 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 01:33:06 +0900 Subject: [PATCH 06/12] docs(deploy): round-5 deploy-via-tailscale review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Gemini Medium (design §2.6 line 146): the design doc contradicted the runbook by claiming dry-runs do NOT need approval. 
GitHub's environment-protection rules cannot be made conditional on workflow inputs, so `environment: production` pauses BOTH dry-run and non-dry-run executions in v1. Aligned the design-doc wording with the runbook and cross-referenced §4 for the second-environment upgrade path. - Gemini Medium (runbook §6 line 164 env list): the list of reconstruction vars was incomplete. Listed every env var the workflow actually exports (IMAGE, DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3, NODES, SSH_TARGETS, EXTRA_ENV) and called out the script-level defaults for anything not overridden, plus noted GOMEMLIMIT / CONTAINER_MEMORY_LIMIT are propagated via EXTRA_ENV once PR #617 lands. - Gemini Medium (runbook §6 line 178 idempotency): corrected the "stop+recreate every node regardless" claim. The script (scripts/rolling-update.sh:794-798) skips nodes whose running image id matches the target AND whose gRPC endpoint is healthy, so re-running after a partial roll is safe because already-rolled nodes are no-ops, not stops. Declining again Gemini HIGH "workflow file missing" — the file IS in this PR at .github/workflows/rolling-update.yml; this is the fourth round the bot has flagged its own misread. See prior rounds for rationale; no change. --- docs/deploy_via_tailscale_runbook.md | 30 ++++++++++++------- ...026_04_24_proposed_deploy_via_tailscale.md | 18 +++++++---- 2 files changed, 32 insertions(+), 16 deletions(-) diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index ca411ec05..82e6f0511 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -158,12 +158,19 @@ hazard that needs manual cleanup. 2. **SSH into that node** over Tailscale and run `docker ps`. If the container is absent or `Exited`, finish the recreate by hand. The `docker run` invocation itself is redirected to `/dev/null` by the - script, so the workflow log does NOT contain the full argv. 
Use - the resolved env instead: the step logs `NODES_RAFT_MAP`, - `EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and - `DATA_DIR` before invoking the script — those are sufficient to - reconstruct the same `docker run` you would see if you re-ran with - the same inputs. + script, so the workflow log does NOT contain the full argv. To + reconstruct it, read the `Roll cluster` step's rendered + environment — the workflow exports `IMAGE`, `DATA_DIR`, + `RAFT_PORT`, `REDIS_PORT`, `S3_PORT`, `ENABLE_S3`, `NODES`, + `SSH_TARGETS`, and the merged `EXTRA_ENV` before invoking the + script. Anything not explicitly set (e.g., `RAFT_PORT` in a + minimally-overridden deploy) falls back to the script's default + (`RAFT_PORT=50051`, `REDIS_PORT=6379`, `S3_PORT=9000`, + `ENABLE_S3=true`). GOMEMLIMIT / CONTAINER_MEMORY_LIMIT (PR #617) + are propagated via `EXTRA_ENV` once that PR lands. Together the + rendered env + the node's `deploy.env` is enough to reconstruct + the same `docker run` you would see if you re-ran with the same + inputs. 3. **Confirm the new leader via `raftadmin` or metrics** before re-running the workflow with `nodes:` scoped to the remaining untouched IDs. Do NOT re-run the full rollout if the partial one is still in flight — @@ -172,10 +179,13 @@ hazard that needs manual cleanup. workflow to set a start-marker on each node and fast-skip completed nodes on re-run. -The script is idempotent for the "container exists and is up" case, so -re-running the workflow with the same `ref` after confirming the -interrupted node is healthy is safe — the script will stop+recreate -each node in turn regardless of whether it was touched before. +The script is idempotent. `scripts/rolling-update.sh:794-798` skips a +node when its running image id equals the target image and its gRPC +endpoint is healthy — an already-rolled node is a no-op, not a +redundant stop/recreate. 
Re-running the workflow with the same +`ref` after confirming the interrupted node is healthy is therefore +safe: nodes that already match the target image are passed over, +and only the still-stale one gets recreated. ## 7. What the workflow does NOT do (yet) diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md index 7e65d334d..2eb0168f2 100644 --- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -141,15 +141,21 @@ unreachable over the tailnet) before touching any live container. ### 2.6 Production environment approval -Mark the `production` GitHub environment as requiring approval from a list of -reviewers. A non-dry-run deploy will pause until approved; the dry-run run -itself does not need approval (it only needs the tailnet join). +Mark the `production` GitHub environment as requiring approval from a list +of reviewers. GitHub's native environment-protection rules do NOT support +conditioning approval on workflow inputs, so **both** dry-run and non- +dry-run runs will pause for approval when `environment: production` is +declared unconditionally on the job. That is the v1 policy — simpler, +one environment, one approver list; see runbook §4 for the dry-run- +approval alternatives (a second `production-dry-run` environment without +required reviewers, or a deployment-protection-rule GitHub App). Alternative: require approval unconditionally and treat the dry-run as a -"preview" that an approver must ack. Simpler policy, slightly more friction. +"preview" that an approver must ack. This is the v1 shape by default. -**Recommendation:** approval required for non-dry-run only. Dry-runs are -cheap and useful. +**Recommendation:** approval required for every run in v1 (one +environment). Add the second environment only when the dry-run friction +becomes annoying. 
### 2.7 Rollback From 3ddcd42e71468a7d9fa1f6cb4e5e5507f2005bf5 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 01:53:21 +0900 Subject: [PATCH 07/12] fix(workflow): validate inputs.ref is default branch or tag Codex P1: the workflow hands the checked-out tree a Tailscale OAuth secret and an SSH key, then executes scripts/rolling-update.sh from that tree. Anyone who can dispatch runs could previously point inputs.ref at a branch containing a malicious script modification and exfiltrate the secrets. Mitigations: - New 'Validate ref is default branch or a tag' step rejects any ref that is not the default branch (by name or HEAD sha) or an existing tag. A sha reachable from elsewhere is still accepted (the subsequent checkout does its own verification) but non- default branches fail closed with an operator-visible error. - actions/checkout now pins persist-credentials: false so the GITHUB_TOKEN is not left in the runner's git config for the deploy script to harvest. The token is still explicitly exposed to the ghcr verification step via env:, which is the only place it needs to be readable. --- .github/workflows/rolling-update.yml | 34 ++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index 852840a1f..626044a80 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -48,10 +48,44 @@ jobs: timeout-minutes: 60 steps: + # The deploy script (scripts/rolling-update.sh) is executed from the + # checkout below, after the tailnet join and SSH key load. If `ref` + # were unvalidated, anyone with workflow_dispatch permission could + # point it at a fork commit containing a modified script that + # harvests the SSH key / Tailscale OAuth secret. Validate that + # `ref` resolves to (a) the repository's default branch, or (b) a + # tag on the repo, before we hand it any secret. 
Branches other + # than the default are rejected so review-gated default is the only + # entry point besides immutable tags. + - name: Validate ref is default branch or a tag + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + REF: ${{ inputs.ref }} + run: | + set -euo pipefail + default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch') + default_sha=$(gh api "repos/${{ github.repository }}/commits/$default_branch" --jq '.sha') + if [[ "$REF" == "$default_branch" || "$REF" == "$default_sha" ]]; then + echo "ref is the default branch ($default_branch / $default_sha)" + exit 0 + fi + if gh api "repos/${{ github.repository }}/git/refs/tags/$REF" >/dev/null 2>&1; then + echo "ref is a tag" + exit 0 + fi + # Also accept a sha that is reachable from the default branch's HEAD + # so historical default-branch commits remain deployable for rollback. + if git -c "http.https://github.com/.extraheader=" ls-remote "https://github.com/${{ github.repository }}.git" | grep -q "^$REF"; then + echo "::error::ref '$REF' is not the default branch or a tag. Branches other than '$default_branch' are disallowed to prevent arbitrary-code execution with production secrets." + exit 1 + fi + echo "ref '$REF' treated as a sha; checkout will fail if it is not reachable." 
+ - name: Checkout uses: actions/checkout@v6 with: ref: ${{ inputs.ref }} + persist-credentials: false - name: Install jq run: sudo apt-get install -y --no-install-recommends jq From b2598fc8e5c09e92f78544844756f31aaaffbadb Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 24 Jun 2026 02:51:57 +0900 Subject: [PATCH 08/12] ops: harden tailscale rollout workflow --- .github/workflows/rolling-update.yml | 215 ++++++++++++------ docs/deploy_via_tailscale_runbook.md | 15 +- ...026_04_24_proposed_deploy_via_tailscale.md | 46 +++- scripts/rolling-update.sh | 62 ++++- 4 files changed, 246 insertions(+), 92 deletions(-) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index 626044a80..4d94aaf63 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -8,7 +8,7 @@ on: workflow_dispatch: inputs: ref: - description: Git ref (tag or sha) to deploy. Also used as the image tag unless image_tag is set. + description: Image tag/ref to deploy. Workflow code is always checked out from the repository default branch. required: true type: string image_tag: @@ -48,43 +48,24 @@ jobs: timeout-minutes: 60 steps: - # The deploy script (scripts/rolling-update.sh) is executed from the - # checkout below, after the tailnet join and SSH key load. If `ref` - # were unvalidated, anyone with workflow_dispatch permission could - # point it at a fork commit containing a modified script that - # harvests the SSH key / Tailscale OAuth secret. Validate that - # `ref` resolves to (a) the repository's default branch, or (b) a - # tag on the repo, before we hand it any secret. Branches other - # than the default are rejected so review-gated default is the only - # entry point besides immutable tags. - - name: Validate ref is default branch or a tag + # The deploy script is executed after the tailnet join and SSH key load. 
+ # Always take that script from the review-gated default branch; the + # workflow input only selects the image tag/ref to deploy. + - name: Resolve trusted checkout ref + id: trusted-ref env: GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} REF: ${{ inputs.ref }} run: | set -euo pipefail default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch') - default_sha=$(gh api "repos/${{ github.repository }}/commits/$default_branch" --jq '.sha') - if [[ "$REF" == "$default_branch" || "$REF" == "$default_sha" ]]; then - echo "ref is the default branch ($default_branch / $default_sha)" - exit 0 - fi - if gh api "repos/${{ github.repository }}/git/refs/tags/$REF" >/dev/null 2>&1; then - echo "ref is a tag" - exit 0 - fi - # Also accept a sha that is reachable from the default branch's HEAD - # so historical default-branch commits remain deployable for rollback. - if git -c "http.https://github.com/.extraheader=" ls-remote "https://github.com/${{ github.repository }}.git" | grep -q "^$REF"; then - echo "::error::ref '$REF' is not the default branch or a tag. Branches other than '$default_branch' are disallowed to prevent arbitrary-code execution with production secrets." - exit 1 - fi - echo "ref '$REF' treated as a sha; checkout will fail if it is not reachable." 
+ echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT" + echo "deploy ref/image tag: $REF" - - name: Checkout + - name: Checkout trusted deploy script uses: actions/checkout@v6 with: - ref: ${{ inputs.ref }} + ref: ${{ steps.trusted-ref.outputs.checkout_ref }} persist-credentials: false - name: Install jq @@ -139,51 +120,147 @@ jobs: NODES_FILTER: ${{ inputs.nodes }} run: | set -euo pipefail - if [[ -z "$NODES_RAFT_MAP" || -z "$SSH_TARGETS_MAP" ]]; then - echo "::error::NODES_RAFT_MAP or SSH_TARGETS_MAP is not set in the production environment variables" + if [[ -z "$NODES_RAFT_MAP" ]]; then + echo "::error::NODES_RAFT_MAP is not set in the production environment variables" + exit 1 + fi + + normalize_csv_map() { + local all="$1" + local out="" + local e key value + if [[ -z "$all" ]]; then + printf '%s' "" + return 0 + fi + IFS=',' read -r -a entries <<< "$all" + for e in "${entries[@]}"; do + e="${e//[[:space:]]/}" + [[ -n "$e" ]] || continue + if [[ "$e" != *=* ]]; then + echo "::error::invalid map entry '$e' (expected raftId=value)" + exit 1 + fi + key="${e%%=*}" + value="${e#*=}" + if [[ -z "$key" || -z "$value" ]]; then + echo "::error::invalid map entry '$e' (empty raft ID or value)" + exit 1 + fi + out+="${out:+,}${key}=${value}" + done + printf '%s' "$out" + } + + lookup_map() { + local key="$1" + local all="$2" + local e entry_key entry_value + [[ -n "$all" ]] || return 1 + IFS=',' read -r -a entries <<< "$all" + for e in "${entries[@]}"; do + e="${e//[[:space:]]/}" + [[ -n "$e" ]] || continue + entry_key="${e%%=*}" + entry_value="${e#*=}" + if [[ "$entry_key" == "$key" ]]; then + printf '%s' "$entry_value" + return 0 + fi + done + return 1 + } + + filter_csv() { + local all="$1" + local filter="$2" + local out="" + local e key w + if [[ -z "$all" ]]; then + printf '%s' "" + return 0 + fi + IFS=',' read -r -a list_entries <<< "$all" + IFS=',' read -r -a list_wanted <<< "$filter" + for e in "${list_entries[@]}"; do + e="${e//[[:space:]]/}" + [[ -n "$e" 
]] || continue + key="${e%%=*}" + for w in "${list_wanted[@]}"; do + w="${w//[[:space:]]/}" + if [[ "$key" == "$w" ]]; then + out+="${out:+,}$e" + break + fi + done + done + printf '%s' "$out" + } + + known_ids_csv() { + local all="$1" + local out="" + local e key + IFS=',' read -r -a entries <<< "$all" + for e in "${entries[@]}"; do + e="${e//[[:space:]]/}" + [[ -n "$e" ]] || continue + key="${e%%=*}" + out+="${out:+,}$key" + done + printf '%s' "$out" + } + + materialize_ssh_targets() { + local nodes="$1" + local ssh_targets="$2" + local out="" + local e key host target + if [[ -z "$nodes" ]]; then + printf '%s' "" + return 0 + fi + IFS=',' read -r -a entries <<< "$nodes" + for e in "${entries[@]}"; do + e="${e//[[:space:]]/}" + [[ -n "$e" ]] || continue + key="${e%%=*}" + host="${e#*=}" + target="$(lookup_map "$key" "$ssh_targets" || true)" + if [[ -z "$target" ]]; then + target="$host" + fi + out+="${out:+,}${key}=${target}" + done + printf '%s' "$out" + } + + NODES_RAFT_MAP="$(normalize_csv_map "$NODES_RAFT_MAP")" + SSH_TARGETS_MAP="$(normalize_csv_map "$SSH_TARGETS_MAP")" + if [[ -z "$NODES_RAFT_MAP" ]]; then + echo "::error::NODES_RAFT_MAP did not contain any nodes" exit 1 fi + NODES_FILTER="${NODES_FILTER//[[:space:]]/}" + if [[ -n "$NODES_FILTER" ]]; then # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset. # Reject any filter ID that does not appear in the map: silently # dropping unknown IDs would let a typo like "n1,n9" proceed as # a one-node rollout of n1 alone, which is a staged-deploy # footgun. 
- IFS=',' read -r -a wanted <<< "$NODES_FILTER" - IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP" - declare -a known_ids=() - for e in "${entries[@]}"; do - known_ids+=("${e%%=*}") - done unknown="" + IFS=',' read -r -a wanted <<< "$NODES_FILTER" for w in "${wanted[@]}"; do - found=0 - for k in "${known_ids[@]}"; do - if [[ "$k" == "$w" ]]; then found=1; break; fi - done - if [[ $found -eq 0 ]]; then unknown+="${unknown:+, }$w"; fi + [[ -n "$w" ]] || continue + if ! lookup_map "$w" "$NODES_RAFT_MAP" >/dev/null; then + unknown+="${unknown:+, }$w" + fi done if [[ -n "$unknown" ]]; then - echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: ${known_ids[*]}" + echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: $(known_ids_csv "$NODES_RAFT_MAP")" exit 1 fi - filter_csv() { - local all="$1" - local filter="$2" - local out="" - IFS=',' read -r -a list_entries <<< "$all" - IFS=',' read -r -a list_wanted <<< "$filter" - for e in "${list_entries[@]}"; do - key="${e%%=*}" - for w in "${list_wanted[@]}"; do - if [[ "$key" == "$w" ]]; then - out+="${e}," - break - fi - done - done - echo "${out%,}" - } NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")" SSH_TARGETS_MAP="$(filter_csv "$SSH_TARGETS_MAP" "$NODES_FILTER")" if [[ -z "$NODES_RAFT_MAP" ]]; then @@ -191,6 +268,7 @@ jobs: exit 1 fi fi + SSH_TARGETS_MAP="$(materialize_ssh_targets "$NODES_RAFT_MAP" "$SSH_TARGETS_MAP")" { echo "NODES=$NODES_RAFT_MAP" echo "SSH_TARGETS=$SSH_TARGETS_MAP" @@ -228,20 +306,15 @@ jobs: env: NODES: ${{ steps.render.outputs.NODES }} SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} - IMAGE_BASE: ${{ vars.IMAGE_BASE }} - IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }} + IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} SSH_USER: ${{ vars.SSH_USER }} + DRY_RUN: "true" + REF: ${{ inputs.ref }} run: | set -euo pipefail - cat 
<.ts.net,n2=kv02..ts.net,n3=kv03..ts.net,n4=kv04..ts.net,n5=kv05..ts.net` | -| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | +| `SSH_TARGETS_MAP` | Optional comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. If the variable is empty or an ID is omitted, the workflow falls back to that ID's `NODES_RAFT_MAP` host so reachability checks still cover every rollout node. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | **Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` in the `production` environment to keep the GitHub-side names @@ -125,12 +125,14 @@ Actions tab → "Rolling update" → Run workflow. Inputs: -- `ref` — the git tag or sha to deploy (also used as the container image tag) +- `ref` — the image tag/ref to deploy. The workflow code itself is always + checked out from the repository default branch. - `image_tag` — override only for rollbacks (e.g., deploy tag `v1.2.3` of a commit that was also `v1.2.3`) - `nodes` — subset of raft IDs, e.g., `n1,n2`. Empty rolls all nodes. -- `dry_run` — default `true`. Renders the plan and checks reachability without - touching containers. +- `dry_run` — default `true`. Checks reachability and runs + `./scripts/rolling-update.sh --dry-run` with the rendered environment, + without touching containers. Recommended first-run sequence: @@ -152,9 +154,8 @@ hazard that needs manual cleanup. 1. **Look at the last log line from the `Roll cluster` step.** The script emits `==> [@] start` at the beginning of - each per-node recreate (see `scripts/rolling-update.sh:398`). - Whichever `` appears in the last such line is the one in - flight when the cancel signal landed. + each per-node recreate. 
Whichever `` appears in the last + such line is the one in flight when the cancel signal landed. 2. **SSH into that node** over Tailscale and run `docker ps`. If the container is absent or `Exited`, finish the recreate by hand. The `docker run` invocation itself is redirected to `/dev/null` by the diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md index 2eb0168f2..be9e3492d 100644 --- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -131,9 +131,11 @@ With `dry_run: true` (the default): - Everything up to script invocation runs (checkout, tailnet join, SSH agent load, `NODES`/`SSH_TARGETS` render). -- The script is invoked with `--help` + the rendered env is printed as a - collapsed log group. - `tailscale ping` is run against each SSH target to confirm reachability. +- The script is invoked as `DRY_RUN=true ./scripts/rolling-update.sh --dry-run` + with the same rendered env the live rollout would receive. It validates the + node maps, rollout order, derived service maps, image name, and per-node SSH + targets, then prints the plan. - The actual `docker stop/rm/run` loop does NOT execute. This catches the common failure modes (bad secret, bad env mapping, a node @@ -163,10 +165,35 @@ Rolling back uses the same workflow with `image_tag: `. The script already supports the rollout order env var (`ROLLING_ORDER`) so an operator can force-roll only the affected nodes. -**Gap:** there is no "stop mid-rollout" control today. If the workflow is -cancelled via GitHub UI during a roll, the in-flight node may be mid-recreate. -`rolling-update.sh` is supposed to be idempotent and crash-safe, but this -should be verified before we call the workflow production-ready. +The workflow sets `cancel-in-progress: false`, so a newer run cannot +automatically interrupt an active rollout. 
A human can still cancel the job in
+the GitHub UI, and cancellation can land while one node is between `docker rm`
+and the replacement container becoming healthy. The runbook therefore treats
+mid-rollout cancellation as a manual recovery case:
+
+- identify the last `==> [@] start` line in the workflow log,
+- inspect that node over Tailscale and restore/finish the container if needed,
+- run a dry-run with the same image tag after the node is healthy,
+- re-run the workflow against the remaining stale node IDs, or against the full
+  cluster once the interrupted node is confirmed healthy.
+
+The script is intentionally idempotent for a re-run: a node already running the
+target image and passing the gRPC health check is skipped instead of being
+stopped again.
+
+### 2.8 Live cutover and zero-downtime posture
+
+v1 is a controlled rolling restart of the existing Raft cluster, not a
+blue-green deploy. It reduces downtime risk by rolling one node at a time,
+transferring leadership before touching a leader when possible, waiting for
+gRPC health after each node, requiring image existence before the roll, and
+keeping dry-run/reachability checks on the same node map used by the live run.
+
+That is enough for compatible container updates where quorum remains available.
+It is not a universal zero-downtime migration mechanism. Changes that need
+dual-write, request shadowing, schema compatibility windows, or traffic
+switching must use a bridge/proxy or blue-green plan outside this workflow
+before production cutover.
 
 ## 3. Open questions
 
@@ -189,8 +216,9 @@ should be verified before we call the workflow production-ready.
 
 - Automatic deploys on merge to main (needs more test coverage before we'd
   trust it).
-- Blue-green or canary strategies (we don't have the traffic-routing layer
-  for it).
+- Blue-green or canary strategies in this workflow. They remain the recommended
+  mitigation for risky incompatible cutovers, but require a traffic-routing
+  layer outside v1.
- Metrics-based rollback trigger (watch p99, auto-revert if it jumps). - Tailscale SSH (option A above). - A shared `deploy` user with restricted sudo. @@ -199,7 +227,7 @@ should be verified before we call the workflow production-ready. 1. Write `.github/workflows/rolling-update.yml` implementing §2.1. 2. Document the secrets/variables setup in - `docs/operations/deploy_runbook.md` (new). + `docs/deploy_via_tailscale_runbook.md`. 3. Run once with `dry_run: true` on a feature branch to validate secrets wiring without touching prod. 4. Run once with `dry_run: false` targeting a single node (via the `nodes` diff --git a/scripts/rolling-update.sh b/scripts/rolling-update.sh index 97e23de69..74ba85dd3 100755 --- a/scripts/rolling-update.sh +++ b/scripts/rolling-update.sh @@ -7,7 +7,7 @@ REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)" usage() { cat <<'EOF' Usage: - NODES="n1=raft-1.internal,n2=raft-2.internal,n3=raft-3.internal" ./scripts/rolling-update.sh + NODES="n1=raft-1.internal,n2=raft-2.internal,n3=raft-3.internal" ./scripts/rolling-update.sh [--dry-run] Required environment: NODES @@ -17,6 +17,11 @@ Optional environment: ROLLING_UPDATE_ENV_FILE Shell env file to source before evaluating the rest of the settings. + DRY_RUN + Set to true, or pass --dry-run, to validate and print the rollout plan + without building helpers, copying files, SSHing to nodes, or touching + containers. + SSH_TARGETS Comma-separated SSH target map when SSH hosts differ from advertised hosts: "=,..." @@ -83,10 +88,24 @@ Notes: EOF } -if [[ "${1:-}" == "--help" || "${1:-}" == "-h" ]]; then - usage - exit 0 -fi +DRY_RUN_ARG=false +while [[ $# -gt 0 ]]; do + case "$1" in + --help|-h) + usage + exit 0 + ;; + --dry-run) + DRY_RUN_ARG=true + shift + ;; + *) + echo "unknown argument: $1" >&2 + usage >&2 + exit 1 + ;; + esac +done if [[ -n "${ROLLING_UPDATE_ENV_FILE:-}" ]]; then if [[ ! 
-f "$ROLLING_UPDATE_ENV_FILE" ]]; then @@ -125,10 +144,19 @@ SSH_TARGETS="${SSH_TARGETS:-}" ROLLING_ORDER="${ROLLING_ORDER:-}" RAFT_TO_REDIS_MAP="${RAFT_TO_REDIS_MAP:-}" RAFT_TO_S3_MAP="${RAFT_TO_S3_MAP:-}" +DRY_RUN="${DRY_RUN:-false}" +if [[ "$DRY_RUN_ARG" == "true" ]]; then + DRY_RUN=true +fi # Container OOM defenses. See usage() for rationale. Empty string disables. DEFAULT_EXTRA_ENV="${DEFAULT_EXTRA_ENV-GOMEMLIMIT=1800MiB}" CONTAINER_MEMORY_LIMIT="${CONTAINER_MEMORY_LIMIT-2500m}" +if [[ "$DRY_RUN" != "true" && "$DRY_RUN" != "false" ]]; then + echo "DRY_RUN must be true or false" >&2 + exit 1 +fi + if [[ -z "$NODES" ]]; then echo "NODES is required" >&2 usage >&2 @@ -309,6 +337,25 @@ derive_raft_to_s3_map() { ) } +print_dry_run_plan() { + local node_id node_host ssh_target + + echo "[rolling-update] dry run: no remote commands will be executed" + echo "[rolling-update] target image: $IMAGE" + echo "[rolling-update] container: $CONTAINER_NAME" + echo "[rolling-update] raft engine: $RAFT_ENGINE" + echo "[rolling-update] nodes:" + for node_id in "${ROLLING_NODE_IDS[@]}"; do + node_host="$(node_host_by_id "$node_id")" + ssh_target="$(ssh_target_by_id "$node_id")" + echo " - raft_id=$node_id host=$node_host ssh_target=$ssh_target" + done + echo "[rolling-update] RAFT_TO_REDIS_MAP=$RAFT_TO_REDIS_MAP" + if [[ "${ENABLE_S3}" == "true" ]]; then + echo "[rolling-update] RAFT_TO_S3_MAP=$RAFT_TO_S3_MAP" + fi +} + ensure_local_raftadmin() { if [[ -n "$RAFTADMIN_LOCAL_BIN" ]]; then if [[ ! 
-x "$RAFTADMIN_LOCAL_BIN" ]]; then @@ -873,6 +920,11 @@ if [[ "${ENABLE_S3}" == "true" && -z "$RAFT_TO_S3_MAP" ]]; then RAFT_TO_S3_MAP="$(derive_raft_to_s3_map)" fi +if [[ "$DRY_RUN" == "true" ]]; then + print_dry_run_plan + exit 0 +fi + ensure_local_raftadmin ensure_remote_raftadmin_binaries From ee2213ad688a293c741472b03388b956f071d711 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 24 Jun 2026 04:20:58 +0900 Subject: [PATCH 09/12] ops: preserve cluster map for subset rollouts --- .github/workflows/docker-image.yml | 4 +- .github/workflows/rolling-update.yml | 59 ++++++++++++++++------------ docs/deploy_via_tailscale_runbook.md | 39 ++++++++++++++---- 3 files changed, 67 insertions(+), 35 deletions(-) diff --git a/.github/workflows/docker-image.yml b/.github/workflows/docker-image.yml index 635f1943b..a50c2d444 100644 --- a/.github/workflows/docker-image.yml +++ b/.github/workflows/docker-image.yml @@ -40,6 +40,8 @@ jobs: platforms: linux/amd64 # platforms: linux/amd64,linux/arm64 push: ${{ github.event_name != 'pull_request' }} - tags: ghcr.io/${{ github.REPOSITORY }}:latest + tags: | + ghcr.io/${{ github.REPOSITORY }}:latest + ghcr.io/${{ github.REPOSITORY }}:${{ github.sha }} # cache-from: type=gha # cache-to: type=gha,mode=max diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index 4d94aaf63..b7bb1ac01 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -175,24 +175,19 @@ jobs: local all="$1" local filter="$2" local out="" - local e key w + local w value if [[ -z "$all" ]]; then printf '%s' "" return 0 fi - IFS=',' read -r -a list_entries <<< "$all" IFS=',' read -r -a list_wanted <<< "$filter" - for e in "${list_entries[@]}"; do - e="${e//[[:space:]]/}" - [[ -n "$e" ]] || continue - key="${e%%=*}" - for w in "${list_wanted[@]}"; do - w="${w//[[:space:]]/}" - if [[ "$key" == "$w" ]]; then - out+="${out:+,}$e" - break - fi - done + for w in "${list_wanted[@]}"; 
do + w="${w//[[:space:]]/}" + [[ -n "$w" ]] || continue + value="$(lookup_map "$w" "$all" || true)" + if [[ -n "$value" ]]; then + out+="${out:+,}${w}=${value}" + fi done printf '%s' "$out" } @@ -243,8 +238,13 @@ jobs: fi NODES_FILTER="${NODES_FILTER//[[:space:]]/}" + ROLLING_ORDER="$(known_ids_csv "$NODES_RAFT_MAP")" if [[ -n "$NODES_FILTER" ]]; then - # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset. + # Keep NODES_RAFT_MAP as the full cluster map. rolling-update.sh + # derives RAFT_TO_REDIS_MAP / RAFT_TO_S3_MAP and transfer + # candidates from NODES, so filtering it for a staged rollout would + # start the target node with an incomplete view of the cluster. + # The requested subset is passed separately as ROLLING_ORDER. # Reject any filter ID that does not appear in the map: silently # dropping unknown IDs would let a typo like "n1,n9" proceed as # a one-node rollout of n1 alone, which is a staged-deploy @@ -261,39 +261,44 @@ jobs: echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. 
Known IDs: $(known_ids_csv "$NODES_RAFT_MAP")" exit 1 fi - NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")" - SSH_TARGETS_MAP="$(filter_csv "$SSH_TARGETS_MAP" "$NODES_FILTER")" - if [[ -z "$NODES_RAFT_MAP" ]]; then + ROLLING_ORDER="$(known_ids_csv "$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")")" + if [[ -z "$ROLLING_ORDER" ]]; then echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP" exit 1 fi fi SSH_TARGETS_MAP="$(materialize_ssh_targets "$NODES_RAFT_MAP" "$SSH_TARGETS_MAP")" + ROLLING_SSH_TARGETS="$(filter_csv "$SSH_TARGETS_MAP" "$ROLLING_ORDER")" { echo "NODES=$NODES_RAFT_MAP" echo "SSH_TARGETS=$SSH_TARGETS_MAP" + echo "ROLLING_ORDER=$ROLLING_ORDER" + echo "ROLLING_SSH_TARGETS=$ROLLING_SSH_TARGETS" } >> "$GITHUB_OUTPUT" echo "::group::Deploy plan" echo "NODES=$NODES_RAFT_MAP" echo "SSH_TARGETS=$SSH_TARGETS_MAP" + echo "ROLLING_ORDER=$ROLLING_ORDER" + echo "ROLLING_SSH_TARGETS=$ROLLING_SSH_TARGETS" echo "::endgroup::" - - name: Tailscale reachability check + - name: SSH reachability check env: - SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + SSH_TARGETS: ${{ steps.render.outputs.ROLLING_SSH_TARGETS }} + SSH_USER: ${{ vars.SSH_USER }} run: | set -euo pipefail IFS=',' read -r -a entries <<< "$SSH_TARGETS" failed=0 for e in "${entries[@]}"; do - host="${e##*=}" - host="${host%%:*}" - # strip user@ if present - host="${host##*@}" - if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then - echo " ok $host" + target="${e##*=}" + if [[ "$target" != *@* ]]; then + target="${SSH_USER:-$(id -un)}@$target" + fi + if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then + echo " ok $target" else - echo "::error::$host not reachable over tailnet" + echo "::error::$target not reachable by SSH over tailnet" failed=1 fi done @@ -306,6 +311,7 @@ jobs: env: NODES: ${{ steps.render.outputs.NODES }} SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + ROLLING_ORDER: ${{ 
steps.render.outputs.ROLLING_ORDER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} SSH_USER: ${{ vars.SSH_USER }} DRY_RUN: "true" @@ -321,6 +327,7 @@ jobs: env: NODES: ${{ steps.render.outputs.NODES }} SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} SSH_USER: ${{ vars.SSH_USER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} SSH_STRICT_HOST_KEY_CHECKING: "yes" diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index 3c27b78b8..e10d42dc2 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -46,11 +46,24 @@ In the Tailscale admin console, add the deploy rule to the tailnet ACL: "src": ["tag:ci-deploy"], "dst": ["tag:elastickv-node:22"], }, + { + "action": "accept", + "src": ["tag:elastickv-node"], + "dst": [ + "tag:elastickv-node:50051", // Raft / raftadmin + "tag:elastickv-node:6379", // Redis adapter, if enabled + "tag:elastickv-node:9000", // S3 adapter, if enabled + ], + }, ], ``` `tag:ci-deploy` must NOT have access to any other port on the tailnet. The -deploy workflow only needs SSH. +deploy workflow only needs SSH. Node-to-node access is separate: every +`tag:elastickv-node` must be able to reach the cluster ports advertised in +`NODES_RAFT_MAP` / derived adapter maps, otherwise a restarted node can come +back with peer addresses it cannot dial and leader-transfer probes can fail +mid-roll. ## 3. Tailscale OAuth client @@ -109,15 +122,16 @@ Regenerate on operator rotation. |------|-------|---------| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | `SSH_USER` | SSH login on every node | `bootjp` | -| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). Use full MagicDNS FQDNs so every node can resolve the advertised address regardless of local DNS search domains. 
The workflow renders this into the script's `NODES` env var. | `n1=kv01..ts.net,n2=kv02..ts.net,n3=kv03..ts.net,n4=kv04..ts.net,n5=kv05..ts.net` | +| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). Use full MagicDNS FQDNs so every node can resolve the advertised address regardless of local DNS search domains. The workflow always renders the full map into the script's `NODES` env var, even for subset rollouts; the `nodes` input becomes `ROLLING_ORDER` so the script still derives full-cluster peer maps. | `n1=kv01..ts.net,n2=kv02..ts.net,n3=kv03..ts.net,n4=kv04..ts.net,n5=kv05..ts.net` | | `SSH_TARGETS_MAP` | Optional comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. If the variable is empty or an ID is omitted, the workflow falls back to that ID's `NODES_RAFT_MAP` host so reachability checks still cover every rollout node. | `n1=kv01..ts.net,n2=kv02..ts.net,...` | **Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` in the `production` environment to keep the GitHub-side names distinct from the script-side env var names it hands to `rolling-update.sh`. If you run the script by hand from a workstation -you must export `NODES` and `SSH_TARGETS` directly — the workflow-side -names are only understood by the workflow's render step. +you must export `NODES` and `SSH_TARGETS` directly, plus `ROLLING_ORDER` +when you want a subset rollout — the workflow-side names are only +understood by the workflow's render step. ## 5. Running a deploy @@ -145,8 +159,16 @@ Recommended first-run sequence: ## 6. Rollback Re-run the workflow with `image_tag` set to the previous-known-good sha. The +Docker image workflow publishes both `latest` and the immutable commit-SHA tag +for each main-branch build, so SHA rollback works without a manual retag. 
The `nodes` input can target specific nodes if only some carry the bad image. +For private GHCR packages, each node must already be logged in to `ghcr.io` +with a deploy-scoped read token, or the remote `docker pull` will fail even +though the workflow runner's manifest check succeeded. Keep that credential +rotation outside this workflow for v1; the workflow only verifies that the tag +exists from the runner side. + ### If a running workflow is cancelled mid-rollout GitHub cancelling the job between node steps is the one operational @@ -190,7 +212,7 @@ and only the still-stale one gets recreated. ## 7. What the workflow does NOT do (yet) -- **No post-deploy health verification beyond tailnet reachability.** The +- **No post-deploy health verification beyond SSH reachability.** The script itself blocks on `raftadmin` leadership transfer and health-gate timeouts, but the workflow does not independently probe Prometheus or Redis after the roll. Add this when we have a canonical post-deploy @@ -214,10 +236,11 @@ environment" for the dry-run-approval alternatives (approach 2: add a second `production-dry-run` environment without required reviewers) if the friction becomes intolerable. -### `tailscale ping` fails for a node +### SSH reachability fails for a node The node may not be running `tailscaled`, not tagged `tag:elastickv-node`, or -the tailnet ACL may have drifted. `tailscale status` on the node should show -the tag; the admin console should show the IP in the `tag:elastickv-node` +the system `sshd` may not be reachable over the tailnet ACL. `tailscale status` +on the node should show the tag; the admin console should show the IP in the +`tag:elastickv-node` group. ### `image ... 
not found on ghcr.io` From e96ac64ca776b3783f0fa730cc63c508ea2e0c9f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 24 Jun 2026 04:53:29 +0900 Subject: [PATCH 10/12] ops: harden rolling update dispatch settings --- .github/workflows/rolling-update.yml | 21 ++++++++++++++++++++- docs/deploy_via_tailscale_runbook.md | 12 +++++++++++- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index b7bb1ac01..cae682dfa 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -8,7 +8,7 @@ on: workflow_dispatch: inputs: ref: - description: Image tag/ref to deploy. Workflow code is always checked out from the repository default branch. + description: Image tag/ref to deploy. Start this workflow from the repository default branch. required: true type: string image_tag: @@ -56,9 +56,16 @@ jobs: env: GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} REF: ${{ inputs.ref }} + RUN_REF_NAME: ${{ github.ref_name }} + RUN_REF_TYPE: ${{ github.ref_type }} run: | set -euo pipefail default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch') + if [[ "$RUN_REF_TYPE" != "branch" || "$RUN_REF_NAME" != "$default_branch" ]]; then + echo "::error::rolling-update must be dispatched from the trusted default branch '$default_branch' (got ${RUN_REF_TYPE}:${RUN_REF_NAME})" + echo "::error::configure the production environment to allow deployments only from the default branch" + exit 1 + fi echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT" echo "deploy ref/image tag: $REF" @@ -314,10 +321,16 @@ jobs: ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} SSH_USER: ${{ vars.SSH_USER }} + ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }} + S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} DRY_RUN: "true" REF: ${{ inputs.ref }} run: | set -euo pipefail + if [[ "$ENABLE_S3" == "true" 
&& -z "$S3_CREDENTIALS_FILE" ]]; then + echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" + exit 1 + fi ./scripts/rolling-update.sh --dry-run echo "ref: $REF" echo "Re-run with dry_run=false to apply." @@ -330,7 +343,13 @@ jobs: ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} SSH_USER: ${{ vars.SSH_USER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} + ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }} + S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} SSH_STRICT_HOST_KEY_CHECKING: "yes" run: | set -euo pipefail + if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then + echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" + exit 1 + fi ./scripts/rolling-update.sh diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index e10d42dc2..b06bc7ce8 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -104,6 +104,14 @@ workflow inputs. Three ways to handle the dry-run-approval friction: v1 ships with approach 1 (single environment, prompt on every run). Approach 2 is the recommended upgrade once the friction becomes annoying. +### Deployment branch policy + +Restrict the `production` environment to deployments from the repository +default branch only. The workflow also has an early guard that fails when a +manual dispatch is started from any other branch, but the environment policy is +the trust boundary because GitHub executes workflow YAML from the selected +dispatch ref before checkout. + ### Environment secrets | Name | Value | @@ -124,6 +132,8 @@ Regenerate on operator rotation. | `SSH_USER` | SSH login on every node | `bootjp` | | `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). Use full MagicDNS FQDNs so every node can resolve the advertised address regardless of local DNS search domains. 
The workflow always renders the full map into the script's `NODES` env var, even for subset rollouts; the `nodes` input becomes `ROLLING_ORDER` so the script still derives full-cluster peer maps. | `n1=kv01..ts.net,n2=kv02..ts.net,n3=kv03..ts.net,n4=kv04..ts.net,n5=kv05..ts.net` |
 | `SSH_TARGETS_MAP` | Optional comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. If the variable is empty or an ID is omitted, the workflow falls back to that ID's `NODES_RAFT_MAP` host so reachability checks still cover every rollout node. | `n1=kv01..ts.net,n2=kv02..ts.net,...` |
+| `ENABLE_S3` | `true` to start the S3 adapter, `false` to keep it disabled. The workflow defaults missing values to `false` rather than the script's local default. | `true` |
+| `S3_CREDENTIALS_FILE` | Node-local path to the SigV4 credentials file. Required when `ENABLE_S3=true`; the workflow fails before rollout if it is missing. | `/etc/elastickv/s3-credentials.json` |
 
 **Why two names?** The workflow uses `NODES_RAFT_MAP` /
 `SSH_TARGETS_MAP` in the `production` environment to keep the GitHub-side names
@@ -140,7 +150,7 @@ Actions tab → "Rolling update" → Run workflow.
 Inputs:
 
 - `ref` — the image tag/ref to deploy. The workflow code itself is always
-  checked out from the repository default branch.
+  dispatched and checked out from the repository default branch.
 - `image_tag` — override only for rollbacks (e.g., deploy tag `v1.2.3` of a
   commit that was also `v1.2.3`)
 - `nodes` — subset of raft IDs, e.g., `n1,n2`. Empty rolls all nodes.
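
The relationship between the `nodes` input and the full cluster map can be sketched roughly as follows. This is a minimal illustration, not the workflow's code: `lookup_map` and `known_ids_csv` are named after helpers the render step calls, but their bodies here are assumptions, `resolve_rolling_order` is a hypothetical wrapper, and the `example` tailnet hostnames are placeholders. The point it shows is that `NODES` stays the full map while only `ROLLING_ORDER` narrows to the requested subset.

```shell
#!/usr/bin/env bash
# Sketch only: helper bodies are assumed, not copied from the workflow.

# Print the host for raft ID $1 from an "id=host,..." map; non-zero if absent.
lookup_map() {
  local want="$1" map="$2" e
  local -a entries=()
  [[ -n "$map" ]] || return 1
  IFS=',' read -r -a entries <<< "$map"
  for e in "${entries[@]}"; do
    e="${e//[[:space:]]/}"
    [[ -n "$e" ]] || continue
    if [[ "${e%%=*}" == "$want" ]]; then
      printf '%s' "${e#*=}"
      return 0
    fi
  done
  return 1
}

# Print the comma-separated raft IDs of an "id=host,..." map, in map order.
known_ids_csv() {
  local map="$1" out="" e
  local -a entries=()
  [[ -n "$map" ]] || return 0
  IFS=',' read -r -a entries <<< "$map"
  for e in "${entries[@]}"; do
    e="${e//[[:space:]]/}"
    [[ -n "$e" ]] || continue
    out+="${out:+,}${e%%=*}"
  done
  printf '%s' "$out"
}

# Hypothetical wrapper: all IDs when the filter is empty, otherwise the
# requested IDs in the order given. Unknown IDs fail instead of being dropped.
resolve_rolling_order() {
  local map="$1" filter="${2//[[:space:]]/}" out="" w
  local -a wanted=()
  if [[ -z "$filter" ]]; then
    known_ids_csv "$map"
    return 0
  fi
  IFS=',' read -r -a wanted <<< "$filter"
  for w in "${wanted[@]}"; do
    [[ -n "$w" ]] || continue
    if ! lookup_map "$w" "$map" >/dev/null; then
      echo "unknown raft ID: $w" >&2
      return 1
    fi
    out+="${out:+,}$w"
  done
  printf '%s' "$out"
}

# Placeholder map; the full map is always what the script sees as NODES.
NODES_RAFT_MAP="n1=kv01.example.ts.net,n2=kv02.example.ts.net,n3=kv03.example.ts.net"
echo "NODES=$NODES_RAFT_MAP"
echo "ROLLING_ORDER=$(resolve_rolling_order "$NODES_RAFT_MAP" "n1, n3")"
```

Rejecting unknown IDs rather than skipping them matches the intent stated in the workflow comments: a typo such as `n1,n9` should stop the run instead of silently becoming a one-node rollout.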
From 8fcaac654b2359246c5aaea17c894e708e8dd9f4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 24 Jun 2026 05:06:30 +0900 Subject: [PATCH 11/12] ops: harden deploy workflow review findings --- .github/workflows/docker-image.yml | 7 +++-- .github/workflows/rolling-update.yml | 26 ++++++++++++------- docs/deploy_via_tailscale_runbook.md | 22 +++++++++------- ...026_04_24_proposed_deploy_via_tailscale.md | 9 +++++-- 4 files changed, 42 insertions(+), 22 deletions(-) diff --git a/.github/workflows/docker-image.yml b/.github/workflows/docker-image.yml index a50c2d444..3e31e2c05 100644 --- a/.github/workflows/docker-image.yml +++ b/.github/workflows/docker-image.yml @@ -32,6 +32,9 @@ jobs: registry: ghcr.io username: ${{ github.repository_owner }} password: ${{ secrets.GITHUB_TOKEN }} + - name: Derive image name + id: image + run: echo "name=ghcr.io/${GITHUB_REPOSITORY,,}" >> "$GITHUB_OUTPUT" - name: Build and push uses: docker/build-push-action@v7 with: @@ -41,7 +44,7 @@ jobs: # platforms: linux/amd64,linux/arm64 push: ${{ github.event_name != 'pull_request' }} tags: | - ghcr.io/${{ github.REPOSITORY }}:latest - ghcr.io/${{ github.REPOSITORY }}:${{ github.sha }} + ${{ steps.image.outputs.name }}:latest + ${{ steps.image.outputs.name }}:${{ github.sha }} # cache-from: type=gha # cache-to: type=gha,mode=max diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index cae682dfa..b1658973c 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -70,19 +70,17 @@ jobs: echo "deploy ref/image tag: $REF" - name: Checkout trusted deploy script - uses: actions/checkout@v6 + uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6 with: ref: ${{ steps.trusted-ref.outputs.checkout_ref }} persist-credentials: false - - name: Install jq - run: sudo apt-get install -y --no-install-recommends jq - - name: Verify image exists on ghcr.io env: IMAGE_BASE: ${{ vars.IMAGE_BASE }} 
IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }} GHCR_TOKEN: ${{ secrets.GITHUB_TOKEN }} + ACTOR: ${{ github.actor }} run: | set -euo pipefail if [[ -z "$IMAGE_BASE" ]]; then @@ -90,14 +88,14 @@ jobs: exit 1 fi echo "Checking $IMAGE_BASE:$IMAGE_TAG" - echo "$GHCR_TOKEN" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin >/dev/null + echo "$GHCR_TOKEN" | docker login ghcr.io -u "$ACTOR" --password-stdin >/dev/null if ! docker manifest inspect "$IMAGE_BASE:$IMAGE_TAG" >/dev/null; then echo "::error::image $IMAGE_BASE:$IMAGE_TAG not found on ghcr.io" exit 1 fi - name: Join Tailnet (ephemeral) - uses: tailscale/github-action@v3 + uses: tailscale/github-action@6cae46e2d796f265265cfcf628b72a32b4d7cade # v3 with: oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }} oauth-secret: ${{ secrets.TS_OAUTH_SECRET }} @@ -302,9 +300,19 @@ jobs: if [[ "$target" != *@* ]]; then target="${SSH_USER:-$(id -un)}@$target" fi - if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then - echo " ok $target" - else + ok=0 + for attempt in 1 2 3 4 5 6; do + if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then + echo " ok $target" + ok=1 + break + fi + if [[ "$attempt" -lt 6 ]]; then + echo " wait $target (attempt $attempt failed; retrying)" + sleep 10 + fi + done + if [[ "$ok" -ne 1 ]]; then echo "::error::$target not reachable by SSH over tailnet" failed=1 fi diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index b06bc7ce8..85c174a8e 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -9,7 +9,7 @@ runbook is for operators: what to configure on GitHub and Tailscale so the Each cluster node must have `tailscale` installed, logged into the tailnet, and tagged so the CI runner's ACL can reach it. 
-``` +```bash # on each kv0X node sudo tailscale up \ --ssh=false \ @@ -18,7 +18,7 @@ sudo tailscale up \ ``` `--ssh=false` disables Tailscale SSH, so the node's regular system -sshd must be running and authorised to accept connections on the +sshd must be running and authorized to accept connections on the tailnet interface. The workflow uses plain SSH over the tailnet (Tailscale is only the network layer); if you rely on Tailscale SSH for operator access elsewhere, drop this flag but keep in mind the @@ -26,7 +26,7 @@ workflow still connects to the system sshd. Verify the node is reachable by MagicDNS from another tailnet peer: -``` +```bash tailscale status | grep kv0X ping kv0X..ts.net ``` @@ -52,6 +52,7 @@ In the Tailscale admin console, add the deploy rule to the tailnet ACL: "dst": [ "tag:elastickv-node:50051", // Raft / raftadmin "tag:elastickv-node:6379", // Redis adapter, if enabled + "tag:elastickv-node:8000", // DynamoDB adapter, if enabled "tag:elastickv-node:9000", // S3 adapter, if enabled ], }, @@ -119,7 +120,7 @@ dispatch ref before checkout. | `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 | | `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 | | `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user | -| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan -H kv01..ts.net kv02..ts.net …` output. Use `-H` to hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised. | +| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan -H kv01..ts.net kv02..ts.net …` output. Use `-H` to hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised. Regenerate this secret when the node list changes or if SSH reports `Host key verification failed`. | The SSH key should be ed25519, dedicated to CI (not a reused developer key). Regenerate on operator rotation. @@ -135,6 +136,11 @@ Regenerate on operator rotation. 
 | `ENABLE_S3` | `true` to start the S3 adapter, `false` to keep it disabled. The workflow defaults missing values to `false` rather than the script's local default. | `true` |
 | `S3_CREDENTIALS_FILE` | Node-local path to the SigV4 credentials file. Required when `ENABLE_S3=true`; the workflow fails before rollout if it is missing. | `/etc/elastickv/s3-credentials.json` |
 
+For private GHCR packages, log every node in to `ghcr.io` with a
+deploy-scoped read token before the first rollout. The workflow's manifest check
+proves the runner can see the image tag; it does not install Docker credentials
+on the remote nodes that execute `docker pull`.
+
 **Why two names?** The workflow uses `NODES_RAFT_MAP` /
 `SSH_TARGETS_MAP` in the `production` environment to keep the GitHub-side names
 distinct from the script-side env var names it hands to
@@ -173,11 +179,9 @@ Docker image workflow publishes both `latest` and the immutable commit-SHA tag
 for each main-branch build, so SHA rollback works without a manual retag. The
 `nodes` input can target specific nodes if only some carry the bad image.
 
-For private GHCR packages, each node must already be logged in to `ghcr.io`
-with a deploy-scoped read token, or the remote `docker pull` will fail even
-though the workflow runner's manifest check succeeded. Keep that credential
-rotation outside this workflow for v1; the workflow only verifies that the tag
-exists from the runner side.
+For private GHCR packages, keep the node-level Docker credential rotation
+outside this workflow for v1; the workflow only verifies that the tag exists
+from the runner side.
 ### If a running workflow is cancelled mid-rollout
 
diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md
index be9e3492d..069a819fd 100644
--- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md
+++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md
@@ -131,12 +131,17 @@ With `dry_run: true` (the default):
 
 - Everything up to script invocation runs (checkout, tailnet join, SSH agent
   load, `NODES`/`SSH_TARGETS` render).
-- `tailscale ping` is run against each SSH target to confirm reachability.
+- An SSH reachability pre-check (`ssh -o BatchMode=yes -o ConnectTimeout=10
+  true`) is retried against each SSH target. This confirms that
+  network routing, host-key trust, the system sshd, and deploy-key
+  authorization all work, which a network-layer ping cannot prove.
 - The script is invoked as `DRY_RUN=true ./scripts/rolling-update.sh --dry-run`
   with the same rendered env the live rollout would receive. It validates the
   node maps, rollout order, derived service maps, image name, and per-node SSH
   targets, then prints the plan.
-- The actual `docker stop/rm/run` loop does NOT execute.
+- The actual `docker stop/rm/run` loop does NOT execute. Dry-run also skips
+  remote image pulls and any other container side effects; only validation and
+  plan rendering run.
 
 This catches the common failure modes (bad secret, bad env mapping, a node
 unreachable over the tailnet) before touching any live container.
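
The retried SSH pre-check described in the design text above can be sketched roughly as follows. `retry_probe` and `check_target` are illustrative names, not functions from the workflow; the attempt count (6) and delay (10 seconds) mirror the values used in the workflow step, and any target passed to `check_target` is assumed to already be in `user@host` form.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Run the command in "$@" up to $1 times, sleeping $2 seconds between
# failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_probe() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      echo "  wait (attempt $i failed; retrying)" >&2
      sleep "$delay"
    fi
  done
  return 1
}

# Probe one SSH target. A successful `true` over SSH exercises routing,
# host-key trust, the system sshd, and deploy-key authorization together.
check_target() {
  local target="$1"
  retry_probe 6 10 ssh -o BatchMode=yes -o ConnectTimeout=10 \
    -o StrictHostKeyChecking=yes "$target" true
}
```

Keeping the retry loop separate from the SSH invocation makes the loop testable with a stub command; a caller would run something like `check_target "$target" || failed=1` for each rollout node and fail the step at the end.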
From fdb65de92fe294ae8399e5429225b65de74ee3cf Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 24 Jun 2026 05:51:53 +0900 Subject: [PATCH 12/12] ops: harden deploy workflow preflights --- .github/workflows/rolling-update.yml | 72 +++++++++++++++++-- docs/deploy_via_tailscale_runbook.md | 38 +++++----- ...026_04_24_proposed_deploy_via_tailscale.md | 12 ++-- 3 files changed, 92 insertions(+), 30 deletions(-) diff --git a/.github/workflows/rolling-update.yml b/.github/workflows/rolling-update.yml index b1658973c..4c4b0c88b 100644 --- a/.github/workflows/rolling-update.yml +++ b/.github/workflows/rolling-update.yml @@ -46,6 +46,22 @@ jobs: # protection rules"). environment: production timeout-minutes: 60 + env: + CONTAINER_NAME: ${{ vars.CONTAINER_NAME || 'elastickv' }} + DATA_DIR: ${{ vars.DATA_DIR || '/var/lib/elastickv' }} + SERVER_ENTRYPOINT: ${{ vars.SERVER_ENTRYPOINT || '/app' }} + RAFT_ENGINE: ${{ vars.RAFT_ENGINE || 'etcd' }} + RAFT_PORT: ${{ vars.RAFT_PORT || '50051' }} + REDIS_PORT: ${{ vars.REDIS_PORT || '6379' }} + DYNAMO_PORT: ${{ vars.DYNAMO_PORT || '8000' }} + S3_PORT: ${{ vars.S3_PORT || '9000' }} + ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }} + S3_REGION: ${{ vars.S3_REGION || 'us-east-1' }} + S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} + S3_PATH_STYLE_ONLY: ${{ vars.S3_PATH_STYLE_ONLY || 'true' }} + DEFAULT_EXTRA_ENV: ${{ vars.DEFAULT_EXTRA_ENV || 'GOMEMLIMIT=1800MiB' }} + EXTRA_ENV: ${{ vars.EXTRA_ENV }} + CONTAINER_MEMORY_LIMIT: ${{ vars.CONTAINER_MEMORY_LIMIT || '2500m' }} steps: # The deploy script is executed after the tailnet join and SSH key load. 
@@ -58,6 +74,7 @@ jobs: REF: ${{ inputs.ref }} RUN_REF_NAME: ${{ github.ref_name }} RUN_REF_TYPE: ${{ github.ref_type }} + RUN_SHA: ${{ github.sha }} run: | set -euo pipefail default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch') @@ -66,8 +83,9 @@ jobs: echo "::error::configure the production environment to allow deployments only from the default branch" exit 1 fi - echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT" + echo "checkout_ref=$RUN_SHA" >> "$GITHUB_OUTPUT" echo "deploy ref/image tag: $REF" + echo "trusted workflow sha: $RUN_SHA" - name: Checkout trusted deploy script uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6 @@ -321,6 +339,54 @@ jobs: exit 1 fi + - name: Remote S3 credentials preflight + if: ${{ env.ENABLE_S3 == 'true' }} + env: + SSH_TARGETS: ${{ steps.render.outputs.ROLLING_SSH_TARGETS }} + SSH_USER: ${{ vars.SSH_USER }} + run: | + set -euo pipefail + if [[ -z "$S3_CREDENTIALS_FILE" ]]; then + echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" + exit 1 + fi + printf -v remote_path '%q' "$S3_CREDENTIALS_FILE" + IFS=',' read -r -a entries <<< "$SSH_TARGETS" + failed=0 + for e in "${entries[@]}"; do + target="${e##*=}" + if [[ "$target" != *@* ]]; then + target="${SSH_USER:-$(id -un)}@$target" + fi + if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" "test -r $remote_path"; then + echo " ok $target:$S3_CREDENTIALS_FILE" + else + echo "::error::$target cannot read S3_CREDENTIALS_FILE=$S3_CREDENTIALS_FILE" + failed=1 + fi + done + if [[ "$failed" -ne 0 ]]; then + exit 1 + fi + + - name: Log rollout configuration + env: + NODES: ${{ steps.render.outputs.NODES }} + SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} + ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} + IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} + run: | + set -euo pipefail + echo "::group::Rollout runtime configuration" + for name in 
\ + IMAGE CONTAINER_NAME DATA_DIR SERVER_ENTRYPOINT RAFT_ENGINE \ + RAFT_PORT REDIS_PORT DYNAMO_PORT S3_PORT ENABLE_S3 S3_REGION \ + S3_CREDENTIALS_FILE S3_PATH_STYLE_ONLY DEFAULT_EXTRA_ENV EXTRA_ENV \ + CONTAINER_MEMORY_LIMIT NODES SSH_TARGETS ROLLING_ORDER; do + printf '%s=%s\n' "$name" "${!name-}" + done + echo "::endgroup::" + - name: Dry-run summary if: ${{ inputs.dry_run }} env: @@ -329,8 +395,6 @@ jobs: ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} SSH_USER: ${{ vars.SSH_USER }} - ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }} - S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} DRY_RUN: "true" REF: ${{ inputs.ref }} run: | @@ -351,8 +415,6 @@ jobs: ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} SSH_USER: ${{ vars.SSH_USER }} IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} - ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }} - S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} SSH_STRICT_HOST_KEY_CHECKING: "yes" run: | set -euo pipefail diff --git a/docs/deploy_via_tailscale_runbook.md b/docs/deploy_via_tailscale_runbook.md index 85c174a8e..132539734 100644 --- a/docs/deploy_via_tailscale_runbook.md +++ b/docs/deploy_via_tailscale_runbook.md @@ -71,13 +71,12 @@ mid-roll. Admin console → Settings → OAuth clients → New client: - Description: `elastickv GitHub Actions deploy` -- Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions - may additionally require `devices:write` (to register and clean up - the ephemeral node); enable that if the join step fails with an - authorization error. The action's README is the definitive source - for current scope requirements. `devices:core` is NOT a valid - Tailscale OAuth scope — earlier drafts of this runbook named it and - would have produced an auth failure. +- Scopes: `auth_keys` (write). The pinned `tailscale/github-action` + version uses this OAuth client to mint the ephemeral auth key. 
If the + join step fails with a 403 during device registration or cleanup, + add the exact Devices scope named by the action README and the + Tailscale OAuth UI for that action version; do not guess from older + drafts of this runbook. - Tags: `tag:ci-deploy` Copy the client ID and secret; they go into GitHub in the next step. @@ -196,18 +195,19 @@ hazard that needs manual cleanup. container is absent or `Exited`, finish the recreate by hand. The `docker run` invocation itself is redirected to `/dev/null` by the script, so the workflow log does NOT contain the full argv. To - reconstruct it, read the `Roll cluster` step's rendered - environment — the workflow exports `IMAGE`, `DATA_DIR`, - `RAFT_PORT`, `REDIS_PORT`, `S3_PORT`, `ENABLE_S3`, `NODES`, - `SSH_TARGETS`, and the merged `EXTRA_ENV` before invoking the - script. Anything not explicitly set (e.g., `RAFT_PORT` in a - minimally-overridden deploy) falls back to the script's default - (`RAFT_PORT=50051`, `REDIS_PORT=6379`, `S3_PORT=9000`, - `ENABLE_S3=true`). GOMEMLIMIT / CONTAINER_MEMORY_LIMIT (PR #617) - are propagated via `EXTRA_ENV` once that PR lands. Together the - rendered env + the node's `deploy.env` is enough to reconstruct - the same `docker run` you would see if you re-ran with the same - inputs. + reconstruct it, open the `Log rollout configuration` step: it emits + the actual `IMAGE`, `CONTAINER_NAME`, `DATA_DIR`, `SERVER_ENTRYPOINT`, + `RAFT_ENGINE`, `RAFT_PORT`, `REDIS_PORT`, `DYNAMO_PORT`, `S3_PORT`, + `ENABLE_S3`, `S3_REGION`, `S3_CREDENTIALS_FILE`, + `S3_PATH_STYLE_ONLY`, `DEFAULT_EXTRA_ENV`, `EXTRA_ENV`, + `CONTAINER_MEMORY_LIMIT`, `NODES`, `SSH_TARGETS`, and + `ROLLING_ORDER` values used by the following `Roll cluster` step. + The workflow sets the same defaults as `scripts/rolling-update.sh` + except for `ENABLE_S3`, which defaults to `false` in the workflow + unless the `production` environment variable explicitly enables it. 
+ Together those logged values plus the node's current Docker state + are enough to reconstruct the same `docker run` you would get from + re-running the workflow with the same inputs. 3. **Confirm the new leader via `raftadmin` or metrics** before re-running the workflow with `nodes:` scoped to the remaining untouched IDs. Do NOT re-run the full rollout if the partial one is still in flight — diff --git a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md index 069a819fd..c7dc3acd8 100644 --- a/docs/design/2026_04_24_proposed_deploy_via_tailscale.md +++ b/docs/design/2026_04_24_proposed_deploy_via_tailscale.md @@ -96,12 +96,12 @@ Use OAuth ephemeral nodes (not a long-lived auth key): - Create an OAuth client in Tailscale admin console with scope `auth_keys` (write) on tag `tag:ci-deploy`. (`tailscale/github-action` - uses the OAuth client to mint a short-lived auth key on each run; - recent action versions may also require `devices:write` so the - ephemeral node can register and be cleaned up — consult the action's - README for the current scope list. Earlier drafts of this doc named - `devices:core`, which is not a supported Tailscale OAuth scope and - would fail authentication.) + uses the OAuth client to mint a short-lived auth key on each run.) + If the pinned action starts returning 403 during device registration + or cleanup, add the exact Devices scope named by the action README + and the Tailscale OAuth UI for that action version; keep this doc as + the rollout contract, not the authority for future Tailscale scope + names. - Store client ID + secret in GitHub env secrets. - `tailscale/github-action@v3` joins the tailnet for the duration of the job as an ephemeral tagged node; disconnects automatically on job exit.