ops(deploy): rolling-update via GitHub Actions over Tailscale#616
ops(deploy): rolling-update via GitHub Actions over Tailscale#616bootjp wants to merge 15 commits into
Conversation
Today's rolling-update flow is manual: operators SSH from a workstation with required env vars and invoke scripts/rolling-update.sh. That has no audit trail, no approval gate, no dry-run, and relies on per-operator secret handling. This change adds a workflow_dispatch workflow that joins the Tailnet via tailscale/github-action (ephemeral OAuth node, tag:ci-deploy), SSHes into each cluster node over MagicDNS, and invokes the existing scripts/rolling-update.sh — unchanged. Nodes stay as they are; the script's env-var contract is the integration boundary. Highlights: - workflow_dispatch only (no auto-deploy); production GitHub environment gates non-dry-run runs on required reviewers. - inputs: ref (required), image_tag (rollback override), nodes (subset filter), dry_run (default true). - Dry-run does everything up to the container touch: renders NODES + SSH_TARGETS from env variables, verifies the image exists on ghcr.io, and tailscale-pings every target. Catches typo'd inputs before any production effect. - Concurrency group "rolling-update", cancel-in-progress: false, so parallel invocations queue rather than stomp. - No node-side changes required: nodes are assumed to already run tailscaled and expose authorized_keys for the deploy user. Plain SSH over Tailscale; Tailscale SSH (keyless) is called out as a follow-up. Design doc: docs/design/2026_04_24_proposed_deploy_via_tailscale.md Operator runbook: docs/deploy_via_tailscale_runbook.md Validation: - actionlint passes (0 errors) - YAML parses - No changes to scripts/rolling-update.sh; the workflow calls it with the same env var contract already documented in the script's usage(). Out of scope (follow-ups): post-deploy health verification, auto- rollback on script failure, Jepsen-gating, image-signature verification, Tailscale SSH (keyless), shared deploy user.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
💤 Files with no reviewable changes (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughA new manual rolling-deploy workflow is added for Tailscale-based production rollouts, with trusted checkout, image verification, rollout planning, reachability checks, and dry-run/live execution paths. The Docker workflow now emits SHA tags, and the script and docs add dry-run and operator guidance. ChangesRolling Deploy via Tailscale + GitHub Actions
Sequence Diagram(s)sequenceDiagram
participant Operator
participant GitHubActions as GitHub Actions
participant GHCR
participant TailscaleNetwork as Tailscale Tailnet
participant ClusterNodes as Cluster Nodes
Operator->>GitHubActions: workflow_dispatch (ref, image_tag, nodes, dry_run)
GitHubActions->>GitHubActions: enforce default-branch ref via gh api
GitHubActions->>GHCR: docker manifest inspect IMAGE_BASE:TAG
GHCR-->>GitHubActions: manifest present / fail-fast on 404
GitHubActions->>TailscaleNetwork: join tailnet
TailscaleNetwork-->>GitHubActions: connected
GitHubActions->>GitHubActions: validate maps, compute rollout order and targets
loop each SSH target
GitHubActions->>ClusterNodes: ssh reachability check
ClusterNodes-->>GitHubActions: OK / fail
end
alt dry_run == true
GitHubActions->>GitHubActions: rolling-update.sh --dry-run
else dry_run == false
GitHubActions->>ClusterNodes: rolling-update.sh live rollout
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal and an operational runbook for migrating the manual rolling update process to an automated GitHub Actions workflow utilizing Tailscale for secure connectivity. The review feedback identifies critical configuration errors in the documentation examples, specifically regarding the inclusion of ports in the NODES_RAFT_MAP which would cause deployment script failures, and a mismatch in the required Tailscale OAuth scopes.
| entries. Prevents the first-connect TOFU prompt. | ||
|
|
||
| **Variables (non-secret):** | ||
| - `NODES_RAFT_MAP` — `n1=kv01:50051,n2=kv02:50051,...` (advertised addresses |
There was a problem hiding this comment.
The example for NODES_RAFT_MAP includes the port, which is incompatible with the scripts/rolling-update.sh script as it appends the RAFT_PORT automatically. It is better to use the hostnames only, preferably the MagicDNS FQDNs for reliable resolution. Furthermore, as this design document outlines a production-grade deployment and migration, it must detail strategies for live or zero-downtime cutovers and provide mitigation strategies for operational risks, such as potential failures during rolling upgrades.
| - `NODES_RAFT_MAP` — `n1=kv01:50051,n2=kv02:50051,...` (advertised addresses | |
| - NODES_RAFT_MAP — n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,... (advertised addresses |
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
| Admin console → Settings → OAuth clients → New client: | ||
|
|
||
| - Description: `elastickv GitHub Actions deploy` | ||
| - Scopes: `auth_keys` (write) |
There was a problem hiding this comment.
The OAuth client scope should be devices (write) instead of auth_keys (write). The tailscale/github-action requires the devices scope to register the ephemeral node in the tailnet using OAuth. This is correctly identified in the design document (§2.3) but misstated here.
| - Scopes: `auth_keys` (write) | |
| - Scopes: devices (write) |
| |------|-------|---------| | ||
| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | ||
| | `SSH_USER` | SSH login on every node | `bootjp` | | ||
| | `NODES_RAFT_MAP` | Comma-separated `raftId=host:port` | `n1=kv01:50051,n2=kv02:50051,n3=kv03:50051,n4=kv04:50051,n5=kv05:50051` | |
There was a problem hiding this comment.
The example for NODES_RAFT_MAP includes the port (e.g., :50051), but the scripts/rolling-update.sh script automatically appends the RAFT_PORT to the host value. Providing the port here will result in an invalid address like host:50051:50051 when the script constructs the --address flag. The value should only contain the hostnames or IPs. Additionally, using the MagicDNS FQDN is recommended for consistency with the SSH_TARGETS_MAP example.
- NODES_RAFT_MAP examples no longer include ":50051". scripts/rolling-update.sh auto-appends RAFT_PORT to each entry, so a port in NODES produces host:port:port when the script constructs --address (Gemini HIGH). - OAuth client scope: design doc §2.3 said "devices:write" while the runbook §3 said "auth_keys (write)". Aligned both on "auth_keys (write)" which is what tailscale/github-action requires to mint the ephemeral join key. Added a note that newer action versions may additionally need "devices:core" (write) and to consult the action's README as the authoritative scope list (addresses the Gemini MEDIUM comment while hedging against action-version drift).
|
Addressed review feedback in 6322748:
Note on the CI failure (test job, ubuntu-latest): the failing tests are /gemini review |
There was a problem hiding this comment.
Code Review
This pull request proposes a design and runbook for automating rolling updates via GitHub Actions and Tailscale. The feedback highlights a missing workflow file, a contradiction in dry-run implementation, and the need for more detailed mitigation and cutover strategies. Additionally, improvements are suggested for the runbook regarding SSH configuration clarity, security hygiene for host scanning, and environment variable consistency.
| - The script is invoked with `--help` + the rendered env is printed as a | ||
| collapsed log group. |
There was a problem hiding this comment.
There is a contradiction regarding dry-run support; using --help is insufficient for validation. A native --dry-run flag should be implemented. Furthermore, as this design document covers a production-grade migration and rolling upgrades, it must include mitigation strategies for operational risks (e.g., 'bridge' or 'proxy' mode) and detail zero-downtime cutover strategies (e.g., blue/green or dual-write proxy) to comply with repository requirements.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
|
|
||
| ## 5. Implementation plan | ||
|
|
||
| 1. Write `.github/workflows/rolling-update.yml` implementing §2.1. |
There was a problem hiding this comment.
The implementation plan lists the creation of .github/workflows/rolling-update.yml as the first step, and the PR description indicates that the workflow has been verified with actionlint. However, the workflow file itself is missing from this pull request. Please include the workflow file to ensure the proposed deployment mechanism can be reviewed and tested.
| sudo tailscale up \ | ||
| --ssh=false \ | ||
| --advertise-tags=tag:elastickv-node \ | ||
| --accept-routes=false |
There was a problem hiding this comment.
The instructions specify --ssh=false, which disables Tailscale SSH. This approach requires the standard system SSH daemon (sshd) to be running and correctly configured to accept connections on the Tailscale interface. It would be safer to explicitly mention this requirement to prevent operators from being locked out of the deployment flow if they rely solely on Tailscale SSH for access.
| | `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 | | ||
| | `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 | | ||
| | `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user | | ||
| | `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net …` output (one host per line) | |
There was a problem hiding this comment.
| | `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | | ||
| | `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` | |
There was a problem hiding this comment.
The environment variable names NODES_RAFT_MAP and SSH_TARGETS_MAP differ from the names expected by the scripts/rolling-update.sh script (NODES and SSH_TARGETS). While the workflow likely performs a mapping, this discrepancy can lead to confusion if an operator attempts to run the script manually using the variables defined in this runbook. Aligning the names or explicitly noting the mapping would improve maintainability and reduce friction during manual troubleshooting.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6322748fe4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| permissions: | ||
| contents: read | ||
| id-token: write # required by tailscale/github-action OIDC flow |
There was a problem hiding this comment.
Add packages:read for GHCR manifest check
This workflow narrows GITHUB_TOKEN to contents and id-token, which implicitly removes package scope, but the Verify image exists on ghcr.io step authenticates to GHCR and inspects a manifest with that token. In environments where the image is private (or package auth is otherwise required), docker login/docker manifest inspect will fail with authorization errors before rollout begins, so deploys are blocked even when the image exists.
Useful? React with 👍 / 👎.
- workflow: add `packages: read` to the job permissions so the `Verify image exists on ghcr.io` step's `docker manifest inspect` call works against private ghcr.io images (Codex P1). - runbook §1: explain that `--ssh=false` disables Tailscale SSH and the workflow relies on the system sshd — operators who use Tailscale SSH elsewhere need to keep that in mind (Gemini Medium). - runbook §4: change `ssh-keyscan` example + troubleshooting to `ssh-keyscan -H` so known_hosts entries are hashed and the secret does not leak tailnet topology in plaintext (Gemini Security Medium). - runbook §4 variables: document that `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` are workflow-side names the render step maps to the script's `NODES` / `SSH_TARGETS`; manual invocation from a workstation must use the script-side names (Gemini Medium). Not addressed: Gemini HIGH claim that the workflow file is missing (line 187) — it IS included at `.github/workflows/rolling-update.yml` in this PR; the reviewer misread the file list. Not addressed: Gemini HIGH re native --dry-run flag + zero-downtime strategy (line 128) — dry-run is deliberately a workflow-level input, not a script-level flag, so the script stays invokable from a workstation without CI-specific options; zero-downtime cutover is outside the scope of a CI wrapper and is tracked in the resilience-roadmap follow-ups.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad00bdc0ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [[ -z "$NODES_RAFT_MAP" ]]; then | ||
| echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP" | ||
| exit 1 |
There was a problem hiding this comment.
Reject partially invalid node filters
The filter logic silently drops unknown raft IDs and only errors when no IDs match, so an input like nodes: n1,n9 proceeds as a rollout of n1 only. That can leave operators believing multiple nodes were upgraded when one was skipped, which is risky for staged deploys/rollbacks and can leave the cluster in an unintended mixed-version state. Treat inputs.nodes as an exact set and fail if any requested ID is missing from the configured map(s).
Useful? React with 👍 / 👎.
|
Addressed round-2 in ad00bdc:
Not addressed: HIGH claim that the workflow file is missing — it IS in this PR at .github/workflows/rolling-update.yml (reviewer misread file list). HIGH re native --dry-run flag + zero-downtime — dry-run is deliberately a workflow-level input to keep the script clean for workstation use; zero-downtime cutover is outside the CI wrapper's scope. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal and an operational runbook for migrating manual rolling updates to a GitHub Actions workflow using Tailscale. The feedback identifies that the workflow file itself is missing from the PR and points out technical issues regarding GitHub environment protection rules, DNS resolution for Raft nodes, and OAuth scope requirements. Additionally, it suggests adding mitigation strategies for risks associated with workflow cancellation.
|
|
||
| ## 5. Implementation plan | ||
|
|
||
| 1. Write `.github/workflows/rolling-update.yml` implementing §2.1. |
There was a problem hiding this comment.
The pull request title, description, and test plan indicate that the GitHub Actions workflow has been implemented and verified (e.g., actionlint passes). However, the .github/workflows/rolling-update.yml file is missing from the changes in this pull request. If this PR is intended to implement the deployment flow, please include the workflow file.
| reviewers. A non-dry-run deploy will pause until approved; the dry-run run | ||
| itself does not need approval (it only needs the tailnet join). |
There was a problem hiding this comment.
There is a contradiction between the workflow definition (line 54) and the dry-run semantics described here. If the job is associated with the production environment, GitHub will enforce the environment's protection rules (such as required reviewers) for every run, including dry-runs. To allow dry-runs to proceed without approval, the workflow should dynamically set the environment or use a separate job for the dry-run/preview phase.
| **Gap:** there is no "stop mid-rollout" control today. If the workflow is | ||
| cancelled via GitHub UI during a roll, the in-flight node may be mid-recreate. | ||
| `rolling-update.sh` is supposed to be idempotent and crash-safe, but this | ||
| should be verified before we call the workflow production-ready. |
There was a problem hiding this comment.
The design identifies a significant operational risk regarding workflow cancellation during a rollout. Per the general rules, such risks should be accompanied by detailed mitigation strategies. Consider adding a section to the runbook for manual recovery or updating the deployment script to handle termination signals gracefully to ensure the cluster is not left in an inconsistent state.
References
- When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.
| |------|-------|---------| | ||
| | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` | | ||
| | `SSH_USER` | SSH login on every node | `bootjp` | | ||
| | `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` | |
There was a problem hiding this comment.
The example for NODES_RAFT_MAP uses short hostnames (e.g., kv01). For reliable cluster communication over Tailscale, it is recommended to use the full MagicDNS names (e.g., kv01.<tailnet>.ts.net). Short names may not be resolvable by the nodes themselves depending on their local DNS search domains, which could prevent the Raft cluster from forming correctly.
Suggested change:
| NODES_RAFT_MAP | Comma-separated raftId=host (no port — the script appends RAFT_PORT). The workflow renders this into the script's NODES env var. | n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,n3=kv03.<tailnet>.ts.net,n4=kv04.<tailnet>.ts.net,n5=kv05.<tailnet>.ts.net |
| - Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions | ||
| may additionally require `devices:core` (write); enable that if the | ||
| join step fails with an authorization error. The action's README is | ||
| the definitive source for current scope requirements. |
There was a problem hiding this comment.
When using an OAuth client with the tailscale/github-action, the devices scope with write access is generally required to allow the action to register the ephemeral runner as a node. I recommend updating the instructions to include this scope as a requirement to avoid authorization errors during the tailnet join step.
Suggested change:
- Scopes: auth_keys (write) and devices (write). The devices scope (or devices:core in newer API versions) is required for the action to register the ephemeral runner as a node in your tailnet.
| rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts | ||
| friction for previews). |
There was a problem hiding this comment.
GitHub's native environment protection rules do not support conditional auto-approval based on workflow inputs. Implementing this would require a custom GitHub Action for Deployment Protection Rules. It would be more accurate to state that dry-runs will still require manual approval if they target the production environment, or suggest using a separate, unprotected environment for previews.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad00bdc0ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [[ -z "$NODES_RAFT_MAP" ]]; then | ||
| echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP" | ||
| exit 1 |
There was a problem hiding this comment.
Fail when requested node IDs are only partially matched
The nodes filter currently errors only when zero IDs match, so a typo like n1,n9 (or whitespace variant n1, n2) silently drops unmatched IDs and proceeds with a partial rollout. That means operators can believe they updated a specific subset while one or more intended nodes were skipped, which is risky during security or incident-driven deploys. Please validate that every requested ID is present in NODES_RAFT_MAP and fail if any are missing.
Useful? React with 👍 / 👎.
| IFS=',' read -r -a entries <<< "$SSH_TARGETS" | ||
| failed=0 | ||
| for e in "${entries[@]}"; do |
There was a problem hiding this comment.
Check reachability for all rollout nodes, not only SSH map entries
The reachability step iterates only over SSH_TARGETS, but rolling-update.sh resolves missing SSH mappings by falling back to each node host from NODES (ssh_target_by_id). If SSH_TARGETS_MAP is incomplete, dry-run can report success without probing some actual rollout targets, and the job can then fail mid-roll when it reaches an unvalidated node. Preflight should derive targets from NODES + SSH_TARGETS using the same fallback semantics or enforce one-to-one mapping coverage first.
Useful? React with 👍 / 👎.
- workflow nodes filter (Codex P2): reject any raft ID in the `nodes` input that does not appear in NODES_RAFT_MAP. Previously a typo like `n1,n9` silently rolled n1 only; now the workflow fails fast with a list of known IDs so the operator sees the typo before touching prod. - runbook section 4 (Gemini Medium x2): GitHub's native environment protection rules cannot be made conditional on workflow inputs, so the previous "auto-approve dry-run" guidance was wrong. Documented the three workable options: accept the prompt for dry-runs too (v1 default), split into a second unprotected environment, or install a deployment-protection-rule GitHub App. - runbook section 4 NODES_RAFT_MAP example (Gemini Medium): use full MagicDNS FQDNs instead of short hostnames so every node can resolve its peers regardless of local DNS search domains. - runbook section 6 (Gemini Medium): added "If a running workflow is cancelled mid-rollout" recovery steps — find the in-flight node from logs, finish the recreate by hand, confirm leader, rerun scoped. Filed as a tracked gap to teach the workflow per-node start-markers in a follow-up. Not addressed: Gemini HIGH line 187 claiming the workflow file is missing — the file IS present at .github/workflows/rolling-update.yml and has been since the first push of this PR. Third time the bot has flagged this (same finding in rounds 1 and 2); leaving as-is since responding further would just be repeating the same correction.
|
Round 3 in 894bce9:
Not addressed: Gemini HIGH line 187 (workflow file missing) — the file IS in this PR; third round the bot has flagged it despite the response; leaving as-is. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a design document and a comprehensive runbook for migrating the manual rolling-update process to an automated GitHub Actions workflow using Tailscale for secure connectivity. The documentation covers setup, security configurations, and operational procedures. Feedback focuses on correcting technical inaccuracies, such as fixing a YAML syntax error in the workflow snippet, updating Tailscale OAuth scope names, and ensuring that log patterns and command visibility descriptions in the runbook accurately reflect the behavior of the underlying deployment scripts. Additionally, an inconsistency regarding the use of FQDNs for node mapping was identified between the design and operational docs.
| entries. Prevents the first-connect TOFU prompt. | ||
|
|
||
| **Variables (non-secret):** | ||
| - `NODES_RAFT_MAP` — `n1=kv01,n2=kv02,...` (advertised hostnames as seen |
There was a problem hiding this comment.
There is an inconsistency between the design document and the runbook. The runbook correctly recommends using full MagicDNS FQDNs for NODES_RAFT_MAP to ensure reliable resolution across nodes with different search domains, whereas this example uses short hostnames. Furthermore, as this design document proposes a production-grade system migration, it must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
| - `NODES_RAFT_MAP` — `n1=kv01,n2=kv02,...` (advertised hostnames as seen | |
| - NODES_RAFT_MAP — n1=kv01.tailnet.ts.net,n2=kv02.tailnet.ts.net,... (advertised hostnames as seen |
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
|
|
||
| - Description: `elastickv GitHub Actions deploy` | ||
| - Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions | ||
| may additionally require `devices:core` (write); enable that if the |
There was a problem hiding this comment.
The scope devices:core is not a standard Tailscale OAuth scope. For the tailscale/github-action to successfully join the tailnet and manage ephemeral nodes, it typically requires the devices:write scope (along with auth_keys:write). Using an incorrect scope name will cause the authentication step to fail.
| may additionally require `devices:core` (write); enable that if the | |
| may additionally require devices (write); enable that if the |
| hazard that needs manual cleanup. | ||
|
|
||
| 1. **Look at the last log line from the `Roll cluster` step.** The script | ||
| logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before |
There was a problem hiding this comment.
The runbook describes a log line that does not exist in the current version of scripts/rolling-update.sh. The script actually logs ==> [$node_id@$node_host] start at the beginning of each node's update process. Providing the correct log pattern is important for operators to accurately identify the state of a cancelled rollout.
| logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before | |
| logs ==> [n<id>@<host>] start before each node recreate. Whatever |
| flight when the cancel signal landed. | ||
| 2. **SSH into that node** over Tailscale and run `docker ps`. If the | ||
| container is absent or `Exited`, finish the recreate by hand with the | ||
| docker run arguments the script emitted (which you can see in the |
There was a problem hiding this comment.
The script scripts/rolling-update.sh does not emit the full docker run command to the logs; it redirects the output of the docker run command to /dev/null. Operators will not be able to see the arguments in the workflow log for the Roll cluster step. They should instead refer to the environment variables rendered in the preceding workflow step to reconstruct the command.
| docker run arguments the script emitted (which you can see in the | |
| docker run arguments (which you can reconstruct from the rendered |
|
|
||
| ``` | ||
| name: Rolling update | ||
| on: workflow_dispatch: |
There was a problem hiding this comment.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 894bce93d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Expected for non-dry-run deploys — a reviewer from the `production` environment | ||
| must click Approve. Check the "Required reviewers" list in the environment |
There was a problem hiding this comment.
Fix approval troubleshooting text for dry-run workflows
This troubleshooting guidance is incorrect for the workflow as implemented: .github/workflows/rolling-update.yml sets environment: production unconditionally, so dry-run and non-dry-run executions both pause for environment approval in v1. Telling operators this is only expected for non-dry-runs can send them down the wrong debugging path when a dry-run is legitimately waiting for approval.
Useful? React with 👍 / 👎.
- Gemini HIGH (design line 82): switch NODES_RAFT_MAP example to full MagicDNS FQDNs so it matches the runbook; bare hostnames resolve differently per node. - Gemini Medium (design line 45): fix YAML — on/workflow_dispatch/inputs must be nested, not on a single line, and the fence is labelled yaml. - Gemini Medium (runbook §3, design §2.3): retract devices:core — not a valid Tailscale OAuth scope; note devices:write as the standard one. - Gemini Medium (runbook §6, line 153-156): correct the cancelled-job log pattern to what the script actually emits (`==> [<id>@<host>] start`, scripts/rolling-update.sh:398), not the fictitious `[rolling-update] rolling n<id>: ...`. - Gemini Medium (runbook §6, line 156-160): clarify that docker run stdout/stderr is redirected to /dev/null, so operators reconstruct the invocation from the step-level env log, not from the docker-run argv. - Codex P2 (runbook §8 approval troubleshooting): clarify that both dry-run and non-dry-run runs pause for approval in v1 because `environment: production` is unconditional; reference §4 for the second-environment upgrade path.
|
Round 4 in 165116b:
Doc-only PR; no build needed. The No other actionable feedback this round. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a design document and a runbook for automating rolling updates via Tailscale and GitHub Actions. The feedback highlights that the actual workflow file is missing from the PR, identifies a contradiction regarding dry-run approvals in the environment settings, and corrects inaccuracies in the documentation concerning manual recovery variables and script idempotency.
|
|
||
| ## 5. Implementation plan | ||
|
|
||
| 1. Write `.github/workflows/rolling-update.yml` implementing §2.1. |
| reviewers. A non-dry-run deploy will pause until approved; the dry-run run | ||
| itself does not need approval (it only needs the tailnet join). |
There was a problem hiding this comment.
There is a contradiction between this design document and the runbook regarding dry-run approvals. The runbook (Line 76) correctly notes that GitHub environment protections are not conditional and will pause dry-runs for approval. This section should be updated to reflect that limitation.
| reviewers. A non-dry-run deploy will pause until approved; the dry-run run | |
| itself does not need approval (it only needs the tailnet join). | |
| reviewers. Note that in v1, both dry-run and non-dry-run executions will pause | |
| for approval because GitHub's environment protection rules are not conditional. |
| the resolved env instead: the step logs `NODES_RAFT_MAP`, | ||
| `EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and | ||
| `DATA_DIR` before invoking the script — those are sufficient to |
There was a problem hiding this comment.
The list of variables mentioned here is insufficient for manually reconstructing the docker run command. It misses critical networking and S3 configuration (e.g., RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3) required by the script. Also, GOMEMLIMIT and CONTAINER_MEMORY_LIMIT are mentioned but are not directly used by scripts/rolling-update.sh.
| the resolved env instead: the step logs `NODES_RAFT_MAP`, | |
| `EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and | |
| `DATA_DIR` before invoking the script — those are sufficient to | |
| the resolved env instead: the step logs the rendered NODES, | |
| SSH_TARGETS, IMAGE, DATA_DIR, EXTRA_ENV, and all PORT | |
| settings before invoking the script — those are sufficient to |
| The script is idempotent for the "container exists and is up" case, so | ||
| re-running the workflow with the same `ref` after confirming the | ||
| interrupted node is healthy is safe — the script will stop+recreate | ||
| each node in turn regardless of whether it was touched before. |
There was a problem hiding this comment.
The description of the script's behavior on re-run is inaccurate. scripts/rolling-update.sh (lines 794-798) skips nodes where the image ID is unchanged and the gRPC endpoint is healthy. It does not "stop+recreate each node in turn regardless".
| The script is idempotent for the "container exists and is up" case, so | |
| re-running the workflow with the same `ref` after confirming the | |
| interrupted node is healthy is safe — the script will stop+recreate | |
| each node in turn regardless of whether it was touched before. | |
| The script is idempotent: if the image ID is unchanged and the gRPC | |
| endpoint is healthy, it skips the node. Re-running the workflow with | |
| the same ref after confirming the interrupted node is healthy is | |
| safe — the script will skip completed nodes and only process those | |
| requiring an update. |
|
Codex Review: Didn't find any major issues. Delightful! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…#617) ## Summary Add two OOM-defense defaults to `scripts/rolling-update.sh`: - `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the existing `EXTRA_ENV` plumbing) - `--memory=2500m` on the remote `docker run` (via new `CONTAINER_MEMORY_LIMIT`) Both are env-var-controlled with empty-string opt-out (`${VAR-default}` so unset uses the default, but an explicit empty string disables it). ## Motivation 2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169 times in 24h under a traffic spike. Each kill risked WAL-tail truncation and triggered election storms, cascading into p99 GET spikes to 6-8s. The runtime defense was applied by hand during the incident; this PR makes it the script default so future rollouts inherit it. - `GOMEMLIMIT` — Go runtime GCs aggressively as heap approaches the limit, keeping RSS below the container ceiling. - `--memory` (cgroup hard limit) — if Go can't keep up (e.g. non-heap growth), the kill is scoped to the container, not host processes like `qemu-guest-agent` or `systemd`. ## Behavior changes | Variable | Default | Opt-out | |----------|---------|---------| | `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` | | `CONTAINER_MEMORY_LIMIT`| `2500m` | `CONTAINER_MEMORY_LIMIT=""` | Operator-supplied `EXTRA_ENV` keys override matching keys in `DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over the default). ## Related Companion PRs (defense-in-depth): - #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL corruption in the first place) - #613 WAL auto-repair — recovers on startup when the above fails - #616 rolling-update via GitHub Actions over Tailscale — consumes this script ## Test plan - [x] `bash -n scripts/rolling-update.sh` passes - [x] Deployed equivalents manually on all 4 live nodes during the incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since - [ ] Next rolling-update invocation should produce `docker run ... --memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node ## Design doc reference `docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 — capacity/runtime defenses).
- Gemini Medium (design §2.6 line 146): the design doc contradicted the runbook by claiming dry-runs do NOT need approval. GitHub's environment-protection rules cannot be made conditional on workflow inputs, so `environment: production` pauses BOTH dry-run and non-dry-run executions in v1. Aligned the design-doc wording with the runbook and cross-referenced §4 for the second-environment upgrade path. - Gemini Medium (runbook §6 line 164 env list): the list of reconstruction vars was incomplete. Listed every env var the workflow actually exports (IMAGE, DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3, NODES, SSH_TARGETS, EXTRA_ENV) and called out the script-level defaults for anything not overridden, plus noted GOMEMLIMIT / CONTAINER_MEMORY_LIMIT are propagated via EXTRA_ENV once PR #617 lands. - Gemini Medium (runbook §6 line 178 idempotency): corrected the "stop+recreate every node regardless" claim. The script (scripts/rolling-update.sh:794-798) skips nodes whose running image id matches the target AND whose gRPC endpoint is healthy, so re-running after a partial roll is safe because already-rolled nodes are no-ops, not stops. Declining again Gemini HIGH "workflow file missing" — the file IS in this PR at .github/workflows/rolling-update.yml; this is the fourth round the bot has flagged its own misread. See prior rounds for rationale; no change.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24383b2db1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| uses: actions/checkout@v6 | ||
| with: | ||
| ref: ${{ inputs.ref }} |
There was a problem hiding this comment.
Restrict deploy script checkout to trusted refs
The workflow checks out ${{ inputs.ref }} and later executes ./scripts/rolling-update.sh after loading production credentials (SSH key, Tailscale OAuth secret) and joining the tailnet, so ref is effectively arbitrary code execution, not just a deployment selector. If someone can dispatch runs and supply a branch/tag they control, a modified script in that ref can exfiltrate secrets or perform unintended actions once the environment is approved. Keep execution on a trusted protected revision (or strictly validate allowed refs) and use the input only for image selection.
Useful? React with 👍 / 👎.
Codex P1: the workflow hands the checked-out tree a Tailscale OAuth secret and an SSH key, then executes scripts/rolling-update.sh from that tree. Anyone who can dispatch runs could previously point inputs.ref at a branch containing a malicious script modification and exfiltrate the secrets. Mitigations: - New 'Validate ref is default branch or a tag' step rejects any ref that is not the default branch (by name or HEAD sha) or an existing tag. A sha reachable from elsewhere is still accepted (the subsequent checkout does its own verification) but non- default branches fail closed with an operator-visible error. - actions/checkout now pins persist-credentials: false so the GITHUB_TOKEN is not left in the runner's git config for the deploy script to harvest. The token is still explicitly exposed to the ghcr verification step via env:, which is the only place it needs to be readable.
|
Addressed in 3ddcd42: Codex P1 (arbitrary code execution via Gemini HIGH / Medium comments already addressed in previous rounds (bot is reading stale state):
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2598fc8e5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Verify image exists on ghcr.io | ||
| env: | ||
| IMAGE_BASE: ${{ vars.IMAGE_BASE }} | ||
| IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }} |
There was a problem hiding this comment.
Publish immutable image tags before deploying refs
When operators follow the documented default and enter a commit SHA/tag in ref, this resolves the deployment image tag from that value, but I checked the only image publishing workflow and it pushes only ghcr.io/${{ github.REPOSITORY }}:latest (.github/workflows/docker-image.yml:43). As a result docker manifest inspect fails for SHA/tag deploys before any rollout, and the documented rollback path using a previous SHA cannot work unless someone manually creates those tags; publish immutable tags in the build workflow or stop defaulting the deploy tag to ref.
Useful? React with 👍 / 👎.
| host="${host%%:*}" | ||
| # strip user@ if present | ||
| host="${host##*@}" | ||
| if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then |
There was a problem hiding this comment.
Preflight SSH instead of Tailscale ping
This preflight only runs tailscale ping, but Tailscale documents default ping as DISCO path discovery, while tailscale nc <hostname-or-IP> <port> is the command that actually connects to a port (https://tailscale.com/docs/reference/ping-types#disco, https://tailscale.com/docs/reference/tailscale-cli#nc). When sshd, TCP/22 ACLs, host keys, or authorized keys are wrong, dry-run can still report ok here and a live rollout can fail only after earlier nodes have already been updated; preflight the rendered SSH targets with the same SSH options the script will use, or at least check TCP/22.
Useful? React with 👍 / 👎.
| echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: $(known_ids_csv "$NODES_RAFT_MAP")" | ||
| exit 1 | ||
| fi | ||
| NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")" |
There was a problem hiding this comment.
Keep full cluster map when targeting a subset
When nodes is set for a staged rollout, this replaces NODES with only the requested subset, but rolling-update.sh treats NODES as the full cluster definition: it derives RAFT_TO_REDIS_MAP/RAFT_TO_S3_MAP from it and uses the full host list to find leadership-transfer candidates. A documented nodes: n1 run on a multi-node cluster therefore starts n1 with leader-routing maps that omit the rest of the cluster, and if n1 is currently leader the script has no healthy peer candidate and aborts before rolling it. Keep the full NODES_RAFT_MAP and pass the requested subset as ROLLING_ORDER/a separate rollout target instead of filtering the cluster map itself.
Useful? React with 👍 / 👎.
| { | ||
| "action": "accept", | ||
| "src": ["tag:ci-deploy"], | ||
| "dst": ["tag:elastickv-node:22"], |
There was a problem hiding this comment.
Allow node-to-node tailnet traffic for Raft
This ACL only grants the CI runner SSH access to the nodes, but the same runbook tells operators to put MagicDNS FQDNs in NODES_RAFT_MAP, and rolling-update.sh uses those hosts as each container's --address and to derive RAFT_TO_REDIS_MAP/RAFT_TO_S3_MAP. In a least-privilege tailnet following this snippet, tag:elastickv-node has no access to tag:elastickv-node on the Raft/adapter ports, so leader-transfer probes and the restarted node's cluster traffic can fail once the first node is rolled; add explicit node-to-node grants for the cluster ports while keeping tag:ci-deploy limited to SSH.
Useful? React with 👍 / 👎.
| NODES: ${{ steps.render.outputs.NODES }} | ||
| SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} | ||
| SSH_USER: ${{ vars.SSH_USER }} | ||
| IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }} |
There was a problem hiding this comment.
Provide registry auth to remote pulls
The workflow authenticates only the Actions runner to GHCR for manifest inspection, then the live rollout passes just IMAGE to rolling-update.sh; the actual docker pull runs later on each remote host. For private GHCR packages, which this workflow explicitly accounts for with packages: read, nodes without preexisting Docker credentials will fail on the first target even though verification passed, blocking approved deploys; either document/enforce node-side registry login or perform a remote login with a deploy-scoped token before invoking the script.
Useful? React with 👍 / 👎.
|
@codex review 対応しました。
検証:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee2213ad68
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Checkout trusted deploy script | ||
| uses: actions/checkout@v6 | ||
| with: | ||
| ref: ${{ steps.trusted-ref.outputs.checkout_ref }} | ||
| persist-credentials: false |
There was a problem hiding this comment.
Require dispatches to run from the trusted branch
When a manual workflow_dispatch run is started from a non-default branch (via the UI branch selector or gh workflow run --ref), the workflow YAML from that branch is already what GitHub executes; this checkout only changes the workspace contents afterward. Under the documented production environment setup, an unreviewed branch could modify later steps to use DEPLOY_SSH_PRIVATE_KEY/Tailscale secrets after approval, so checking out the default branch here does not actually keep the deploy workflow trusted. Add an early guard that fails unless the dispatch ref is the default branch, or require the production environment to allow deployments only from that branch.
Useful? React with 👍 / 👎.
| env: | ||
| NODES: ${{ steps.render.outputs.NODES }} | ||
| SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} | ||
| ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} | ||
| SSH_USER: ${{ vars.SSH_USER }} |
There was a problem hiding this comment.
Forward S3 security settings into the rollout
For deployments that currently set ENABLE_S3=false or S3_CREDENTIALS_FILE in the manual rollout environment, this workflow does not forward those script settings into rolling-update.sh. The script defaults ENABLE_S3=true, and when S3_CREDENTIALS_FILE is empty it omits --s3CredentialsFile, causing a restarted node to expose the S3 adapter without the configured SigV4 credential file; forward these vars from the production environment (or fail closed) before making this workflow the canonical deploy path.
Useful? React with 👍 / 👎.
| if [[ "$target" != *@* ]]; then | ||
| target="${SSH_USER:-$(id -un)}@$target" | ||
| fi | ||
| if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then |
There was a problem hiding this comment.
Retry SSH reachability after joining Tailnet
In tailnets where the ephemeral GitHub runner has not propagated immediately after the Tailscale join step, this single SSH attempt can fail the entire dry run or live rollout even though the node becomes reachable seconds later; Tailscale's action docs call out this eventual consistency and recommend connectivity checks that wait (their ping option waits up to 3 minutes). Wrap the ssh ... true probe in a retry/backoff instead of one ConnectTimeout=10 attempt.
Useful? React with 👍 / 👎.
| "dst": [ | ||
| "tag:elastickv-node:50051", // Raft / raftadmin | ||
| "tag:elastickv-node:6379", // Redis adapter, if enabled | ||
| "tag:elastickv-node:9000", // S3 adapter, if enabled |
There was a problem hiding this comment.
Allow node-to-node DynamoDB traffic
When the DynamoDB-compatible adapter is in use, the rollout script starts it on DYNAMO_PORT (default 8000) and follower requests are proxied to the current leader, but this ACL example only permits Raft, Redis, and S3 between tag:elastickv-node peers. Operators who copy this rule will block node-to-node DynamoDB leader proxying after the workflow rolls nodes, so include the DynamoDB port or call out that DynamoDB deployments need an additional ACL entry.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current P1 deployment findings:
Validation: actionlint .github/workflows/rolling-update.yml; bash -n scripts/rolling-update.sh; git diff --check; commit hook golangci-lint --config=.golangci.yaml run --fix. |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (6)
docs/deploy_via_tailscale_runbook.md (4)
12-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd language specifier to fenced code block (MD040).
The shell script block starting at line 12 is missing a language identifier. Add
bashorshellfor syntax highlighting and consistency:-``` +```bash # on each kv0X node sudo tailscale up \🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deploy_via_tailscale_runbook.md` around lines 12 - 18, The fenced code block containing the tailscale command sequence is missing a language identifier, which is required for markdown linting (MD040). Add the language specifier `bash` immediately after the opening triple backticks on the line that currently reads just ``` before the comment about the kv0X node, so it becomes ```bash to enable proper syntax highlighting and pass the linting check.
29-32: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd language specifier to fenced code block (MD040).
The shell command block starting at line 29 is missing a language identifier. Add
bashorshell:-``` +```bash tailscale status | grep kv0X ping kv0X.<tailnet>.ts.net -``` +```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deploy_via_tailscale_runbook.md` around lines 29 - 32, The fenced code block containing the tailscale commands (tailscale status and ping) is missing a language specifier, which violates markdown linting rules. Add the bash language identifier by changing the opening triple backticks from ``` to ```bash on the line before the tailscale status command.
176-180: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valueEmphasize the node-level GHCR credential requirement for private images.
Line 176–180 correctly documents that each node must be logged into
ghcr.iowith a deploy-scoped read token. This is critical and easy to overlook. Consider making it more prominent in the "Rollback" section or in a dedicated "Prerequisites" subsection, as it is a silent failure mode: the workflow's manifest check succeeds (runner is logged in) but the remotedocker pullon the node fails (node is not logged in).[optional]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deploy_via_tailscale_runbook.md` around lines 176 - 180, The documentation about node-level GHCR credential requirements for private images is currently in the main text flow and could be easily missed by readers, especially since it represents a silent failure mode where the workflow's manifest check succeeds but the actual docker pull on the node fails. Move or reorganize the content discussing the requirement that each node must be logged into ghcr.io with a deploy-scoped read token to make it more prominent. Consider adding a dedicated "Prerequisites" subsection before the deployment steps, or reorganizing the "Rollback" section to highlight this requirement with emphasis formatting to ensure operators understand this critical configuration step and the silent failure behavior it prevents.
115-122: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low valueClarify DEPLOY_KNOWN_HOSTS hashing and update guidance.
Line 122 correctly notes:
Use
-Hto hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised.This is good security guidance. However, consider also noting that operators should regenerate this secret whenever the node list changes (e.g., adding a new node, decommissioning a node) or when known-hosts entries are out of sync with current Tailscale topology. A brief note in the runbook like "Regenerate when the node list changes or if SSH
Host key verification failed" (which is already in troubleshooting §8) would help operators maintain this over time.[optional]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deploy_via_tailscale_runbook.md` around lines 115 - 122, The DEPLOY_KNOWN_HOSTS environment secret documentation in the table correctly advises using the `-H` flag for security, but lacks maintenance guidance for operators. Add clarification to the DEPLOY_KNOWN_HOSTS value description noting that this secret should be regenerated whenever the node list changes (such as adding or removing nodes from the Tailscale topology) or if SSH Host key verification failed errors are encountered during deployment. This guidance helps operators maintain consistency between the known_hosts entries and the actual current Tailscale infrastructure.docs/design/2026_04_24_proposed_deploy_via_tailscale.md (1)
128-142: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueClarify the scope of the dry-run and its limitations.
Section 2.5 describes the dry-run flow well, but consider adding a note that the script's dry-run mode validates node maps, rollout order, and service maps without executing any container operations. The design already states "The actual
docker stop/rm/runloop does NOT execute," which is clear, but operators may benefit from knowing that the script also skips pulling the image and other side effects.[optional]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md` around lines 128 - 142, The dry-run semantics section in 2.5 clearly states that the docker stop/rm/run loop does not execute, but it lacks explicit clarification about other skipped operations. Add a note after the existing docker operation statement that explicitly lists other side effects that are skipped during dry-run mode (such as image pulling and other non-validation operations), so operators understand the full scope of what is and is not executed when running with dry_run: true..github/workflows/docker-image.yml (1)
43-45: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueOptional: normalize the image path to lowercase.
ghcr.iorejects image references containing uppercase characters.github.repositoryechoes the owner/repo as registered, so if the owner or repo name ever contains uppercase letters thedocker pushfails. This is safe for the currentbootjp/elastickvslug, but lowercasing makes it robust to org/repo renames. Thegithub.REPOSITORYcasing is also non-idiomatic (expression context keys are case-insensitive, so it works, butgithub.repositoryis the convention).Consider deriving a lowercased image name in an earlier step (e.g. via
${IMAGE,,}in arunstep writing to$GITHUB_OUTPUT) and referencing it here. SHA tagging itself looks correct.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/docker-image.yml around lines 43 - 45, The tags section in the docker-image.yml workflow uses the raw github.REPOSITORY variable which may contain uppercase characters that ghcr.io rejects. Add a preceding step that normalizes the repository name to lowercase using bash parameter expansion (${IMAGE,,} or similar) and writes it to GITHUB_OUTPUT, then update both tag references in the tags section to use this new lowercased output variable instead of github.REPOSITORY while keeping the github.sha reference unchanged. Also use the conventional casing of github.repository instead of github.REPOSITORY for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/rolling-update.yml:
- Around line 81-97: The docker login command in the run script directly
interpolates `${{ github.actor }}` which creates a template injection
vulnerability. Move the github.actor value to an environment variable by adding
`GITHUB_ACTOR: ${{ github.actor }}` to the env section at the top of the step,
then in the docker login command replace the template expansion `${{
github.actor }}` with the quoted shell variable `"$GITHUB_ACTOR"` to safely
reference it.
- Around line 78-79: The "Install jq" step at lines 78-79 appears to be unused
since the file only uses GitHub CLI's built-in jq processor (gh api --jq) and
contains no other jq invocations. Either remove this step entirely as
unnecessary, or if it is genuinely needed for future use cases, add an apt-get
update command before the apt-get install command to prevent intermittent
failures from stale package indexes that would abort the deployment.
- Around line 72-73: The actions checkout and tailscale/github-action are
referenced by version tags rather than immutable commit SHAs, creating a
security vulnerability in this privileged deployment workflow that accesses
production secrets. Replace the version tag references with their corresponding
commit SHAs: change actions/checkout@v6 to use commit
df4cb1c069e1874edd31b4311f1884172cec0e10 and tailscale/github-action@v3 to use
commit 6cae46e2d796f265265cfcf628b72a32b4d7cade. This pins the actions to
immutable commits and prevents tag-moving attacks that could inject malicious
code into the deployment process.
In `@docs/deploy_via_tailscale_runbook.md`:
- Line 25: The runbook contains inconsistent spelling of the word "authorize"
using both UK English "authorised" (line 25) and US English "authorized" (line
121). Replace all instances of "authorised" with "authorized" throughout the
entire runbook to maintain consistent US English spelling convention that
matches the codebase. Specifically, update the phrase on line 25 that currently
reads "authorised the system sshd" to use "authorized" instead, and verify that
the remaining occurrences throughout the document (such as the reference on line
121 and any others) all use the "authorized" spelling consistently.
In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md`:
- Around line 128-142: In section 2.5 Dry-run semantics, the reachability check
description is outdated. Replace the bullet point that states "tailscale ping is
run against each SSH target to confirm reachability" with an accurate
description of the SSH-based reachability check, clarifying that it uses SSH
connectivity validation (such as ssh -o BatchMode=yes -o ConnectTimeout=10
<target> true) rather than a network-layer ping, and explain that this approach
catches SSH authentication and sshd availability failures that ICMP ping cannot
detect.
---
Nitpick comments:
In @.github/workflows/docker-image.yml:
- Around line 43-45: The tags section in the docker-image.yml workflow uses the
raw github.REPOSITORY variable which may contain uppercase characters that
ghcr.io rejects. Add a preceding step that normalizes the repository name to
lowercase using bash parameter expansion (${IMAGE,,} or similar) and writes it
to GITHUB_OUTPUT, then update both tag references in the tags section to use
this new lowercased output variable instead of github.REPOSITORY while keeping
the github.sha reference unchanged. Also use the conventional casing of
github.repository instead of github.REPOSITORY for consistency.
In `@docs/deploy_via_tailscale_runbook.md`:
- Around line 12-18: The fenced code block containing the tailscale command
sequence is missing a language identifier, which is required for markdown
linting (MD040). Add the language specifier `bash` immediately after the opening
triple backticks on the line that currently reads just ``` before the comment
about the kv0X node, so it becomes ```bash to enable proper syntax highlighting
and pass the linting check.
- Around line 29-32: The fenced code block containing the tailscale commands
(tailscale status and ping) is missing a language specifier, which violates
markdown linting rules. Add the bash language identifier by changing the opening
triple backticks from ``` to ```bash on the line before the tailscale status
command.
- Around line 176-180: The documentation about node-level GHCR credential
requirements for private images is currently in the main text flow and could be
easily missed by readers, especially since it represents a silent failure mode
where the workflow's manifest check succeeds but the actual docker pull on the
node fails. Move or reorganize the content discussing the requirement that each
node must be logged into ghcr.io with a deploy-scoped read token to make it more
prominent. Consider adding a dedicated "Prerequisites" subsection before the
deployment steps, or reorganizing the "Rollback" section to highlight this
requirement with emphasis formatting to ensure operators understand this
critical configuration step and the silent failure behavior it prevents.
- Around line 115-122: The DEPLOY_KNOWN_HOSTS environment secret documentation
in the table correctly advises using the `-H` flag for security, but lacks
maintenance guidance for operators. Add clarification to the DEPLOY_KNOWN_HOSTS
value description noting that this secret should be regenerated whenever the
node list changes (such as adding or removing nodes from the Tailscale topology)
or if SSH Host key verification failed errors are encountered during deployment.
This guidance helps operators maintain consistency between the known_hosts
entries and the actual current Tailscale infrastructure.
In `@docs/design/2026_04_24_proposed_deploy_via_tailscale.md`:
- Around line 128-142: The dry-run semantics section in 2.5 clearly states that
the docker stop/rm/run loop does not execute, but it lacks explicit
clarification about other skipped operations. Add a note after the existing
docker operation statement that explicitly lists other side effects that are
skipped during dry-run mode (such as image pulling and other non-validation
operations), so operators understand the full scope of what is and is not
executed when running with dry_run: true.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2c29d949-a9e4-4ab0-8c53-1f5ca4952ae4
📒 Files selected for processing (5)
.github/workflows/docker-image.yml.github/workflows/rolling-update.ymldocs/deploy_via_tailscale_runbook.mddocs/design/2026_04_24_proposed_deploy_via_tailscale.mdscripts/rolling-update.sh
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e96ac64ca7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| env: | ||
| NODES: ${{ steps.render.outputs.NODES }} | ||
| SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }} | ||
| ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }} | ||
| SSH_USER: ${{ vars.SSH_USER }} |
There was a problem hiding this comment.
Forward non-default rollout settings
In deployments that rely on any non-default rolling-update.sh settings (for example RAFT_PORT, REDIS_PORT, DYNAMO_PORT, DATA_DIR, RAFT_ENGINE, S3_REGION, or EXTRA_ENV), this env block drops those overrides and the script falls back to its built-in defaults when recreating the container. That can make a node advertise the wrong ports or use a different data directory/engine than the current manual rollout configuration, so the workflow should either forward the existing script contract from environment vars or fail closed when unsupported overrides are required.
Useful? React with 👍 / 👎.
| if [[ "$DRY_RUN" == "true" ]]; then | ||
| print_dry_run_plan | ||
| exit 0 |
There was a problem hiding this comment.
Validate extra env before dry-run exits
When operators use the new --dry-run with DEFAULT_EXTRA_ENV or EXTRA_ENV set, this early exit skips the normalization/validation that the live path performs later. A malformed default can pass dry-run and fail only when the real rollout starts, and malformed user EXTRA_ENV can still reach the remote run_container path after the old container is removed, so the preview can give false confidence for exactly the deploy configuration it is meant to validate.
Useful? React with 👍 / 👎.
| if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then | ||
| echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" | ||
| exit 1 |
There was a problem hiding this comment.
Check remote S3 credential files before rollout
When ENABLE_S3=true but the configured S3_CREDENTIALS_FILE is absent or unreadable on one target, this only checks that the path string is non-empty. The script's actual readability check happens inside run_container after stop_container, so a live rollout can remove a healthy container and then abort before starting its replacement; use the SSH preflight to run test -r on each rollout target (or move the script check before stopping).
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest CodeRabbit review on the current head:
Validation: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fcaac654b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then | ||
| echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" | ||
| exit 1 |
There was a problem hiding this comment.
Preflight remote S3 credential files
When ENABLE_S3=true and S3_CREDENTIALS_FILE is set but the file is missing or unreadable on a node, this check passes and the workflow proceeds to scripts/rolling-update.sh. That script validates the file inside run_container only after it has already removed the existing container, so a stale path in GitHub vars can leave the first targeted node down before the rollout aborts. Since this workflow already renders ROLLING_SSH_TARGETS, add a preflight SSH test -r "$S3_CREDENTIALS_FILE" for each target before invoking the script.
Useful? React with 👍 / 👎.
| echo "::error::configure the production environment to allow deployments only from the default branch" | ||
| exit 1 | ||
| fi | ||
| echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT" |
There was a problem hiding this comment.
Pin checkout to the dispatch SHA
When a production run waits for environment approval, the default branch can advance before these steps execute; because this output is the branch name, actions/checkout later fetches whatever is at main at that moment rather than the workflow run's github.sha. That can make an approved/audited rollout execute a different scripts/rolling-update.sh than the one associated with the dispatched workflow, so resolve and checkout the immutable dispatch SHA instead of the moving branch ref.
Useful? React with 👍 / 👎.
| environment — the workflow exports `IMAGE`, `DATA_DIR`, | ||
| `RAFT_PORT`, `REDIS_PORT`, `S3_PORT`, `ENABLE_S3`, `NODES`, | ||
| `SSH_TARGETS`, and the merged `EXTRA_ENV` before invoking the |
There was a problem hiding this comment.
Point recovery at values the workflow actually emits
When a rollout is canceled mid-node, this recovery step tells operators to reconstruct the docker run from the Roll cluster step's rendered environment, but that workflow step does not export or log DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, or the merged EXTRA_ENV it names here. In the cancellation scenario described here, especially if production uses non-default ports or directories, following the runbook can leave the operator without the values needed to restart the removed container; either emit those values in the workflow or point the instructions to the actual defaults/config source.
Useful? React with 👍 / 👎.
| for current scope requirements. `devices:core` is NOT a valid | ||
| Tailscale OAuth scope — earlier drafts of this runbook named it and | ||
| would have produced an auth failure. |
There was a problem hiding this comment.
Correct the Tailscale OAuth scope guidance
The Tailscale OAuth docs list devices:core as a valid Devices Core scope, so this warning can send operators away from a real permission while suggesting devices:write, which may not match the current UI. If the join step needs device permissions beyond auth_keys, following these lines can leave tailscale/github-action failing with 403 during first setup; remove the "not valid" warning or name the exact scope required by the pinned action docs.
Useful? React with 👍 / 👎.
|
@codex review Updated in fdb65de:
|
|
Codex Review: Didn't find any major issues. What shall we delve into next? Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
# Conflicts: # scripts/rolling-update.sh
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bee82be34e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| S3_REGION: ${{ vars.S3_REGION || 'us-east-1' }} | ||
| S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }} | ||
| S3_PATH_STYLE_ONLY: ${{ vars.S3_PATH_STYLE_ONLY || 'true' }} | ||
| DEFAULT_EXTRA_ENV: ${{ vars.DEFAULT_EXTRA_ENV || 'GOMEMLIMIT=1800MiB' }} |
There was a problem hiding this comment.
Forward opt-in service settings
Fresh evidence in the current diff is that the job-level environment stops after the S3 settings and never maps vars.ENABLE_SQS/SQS_* (nor the other script-supported opt-in service knobs) into scripts/rolling-update.sh. In a production environment that already runs the SQS adapter via the manual rollout env, this Actions path will restart each node with the script default ENABLE_SQS=false, so the SQS listener and its leader-proxy map disappear during an otherwise routine roll; add those vars here or fail closed when they are configured but unsupported.
Useful? React with 👍 / 👎.
| "tag:elastickv-node:50051", // Raft / raftadmin | ||
| "tag:elastickv-node:6379", // Redis adapter, if enabled | ||
| "tag:elastickv-node:8000", // DynamoDB adapter, if enabled | ||
| "tag:elastickv-node:9000", // S3 adapter, if enabled |
There was a problem hiding this comment.
Allow the SQS peer port in the Tailnet ACL
When the SQS adapter is enabled, rolling-update.sh starts it on SQS_PORT (default 9324) and derives RAFT_TO_SQS_MAP for follower-to-leader proxying, but this node-to-node ACL example only permits Raft, Redis, DynamoDB, and S3 ports. Operators copying this runbook for an SQS deployment will block node-to-node SQS proxy traffic even though the restarted containers advertise those peer endpoints, so add the SQS port here or call out that SQS needs an extra ACL entry.
Useful? React with 👍 / 👎.
| echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment" | ||
| exit 1 | ||
| fi | ||
| ./scripts/rolling-update.sh |
There was a problem hiding this comment.
Set up Go before invoking the live rollout
When dry_run=false, this step invokes rolling-update.sh; I checked the script's live path and it builds ./cmd/raftadmin with go build before rolling nodes, while the dry-run path exits before that build. This workflow never runs actions/setup-go for the go.mod toolchain (go1.26.4), unlike the repo's test workflows, so an approved live deploy can fail on a fresh ubuntu-latest runner because the required Go toolchain is missing or must be fetched uncached even though the dry-run passed; install Go before this step or provide a prebuilt RAFTADMIN_BIN.
Useful? React with 👍 / 👎.
Summary
Move rolling deploys from "operator SSHes from workstation" to a GitHub-Actions
workflow_dispatchthat joins the Tailnet and invokes the existingscripts/rolling-update.shunchanged. Adds audit trail, approval gate, dry-run preview, and centralized secret handling.Precondition (operator responsibility): Tailscale already installed + logged in on every cluster node with
tag:elastickv-node. Runbook documents this in §1.Motivation
NODEStakes out the cluster before anyone else sees it.What the workflow does
workflow_dispatchonly (no auto-deploy on push/tag).ref(required),image_tag(rollback override),nodes(raft-id subset filter),dry_run(bool, defaulttrue).tailscale/github-action@v3(ephemeral OAuth,tag:ci-deploy).NODES+SSH_TARGETSfromproductionenvironment vars; filters to requested subset.tailscale pingon every target.scripts/rolling-update.sh.environment: production→ required reviewers on non-dry-run.concurrency: rolling-updatewithcancel-in-progress: false.Design and operator docs
docs/design/2026_04_24_proposed_deploy_via_tailscale.mddocs/deploy_via_tailscale_runbook.md— ACL config, OAuth client, required GitHub secrets/variables, first-run sequence, rollback, troubleshooting.What's NOT changed
scripts/rolling-update.sh. The script's env-var contract is the integration boundary; keeping the script stable means operators can still invoke it manually from a workstation during incidents.authorized_keysentries. v1 is plain SSH over Tailscale; Tailscale SSH (keyless) is called out as a follow-up (§2.4 of design).Test plan
actionlintpasses (0 errors)dry_run: true,nodes: n1— confirms tailnet join, SSH config, image availability, mapping render without touching proddry_run: false,nodes: n1— single-node roll, verify cluster stays healthy and correct image landsdry_run: false,nodes:(empty) — full rollFollow-ups (out of scope for v1, tracked in the design doc)
cosign verifyon the imagedeployuser with restricted sudoBefore merge
Operator setup is in the runbook (§3, §4). PR doesn't unblock the workflow by itself — it's dead until the secrets/variables in the
productionenvironment are populated. That's a deliberate choice so merging is harmless.Summary by CodeRabbit
--dry-runsupport for the rolling update script to preview rollout order and targets without performing remote container changes.latestand commit-specific tags to GHCR.