96 changes: 96 additions & 0 deletions .claude/skills/ci/SKILL.md
---
name: ci
description: |
  Monitor GitHub Actions CI runs for pgxntool and/or pgxntool-test after a push.
  Reports which branches are under test, per-job pass/fail, and failure details.
  Uses shell scripts for all heavy work to minimize context consumption.

  Use when: "monitor CI", "watch CI", "check CI", "/ci"
allowed-tools: Bash(bash .claude/skills/ci/scripts/*), Read
---

# CI Monitor Skill

Monitor GitHub Actions CI across both repos after a push. Always run in background.

## Usage

- `/ci` — monitor the most recent CI run on both repos for the current branch
- `/ci pgxntool-test` — monitor pgxntool-test only
- `/ci pgxntool` — monitor pgxntool only
- `/ci <branch> <pgxntool-test-sha> <pgxntool-sha>` — monitor specific push SHAs (most reliable; SHA order matches the script's positional arguments)

## Workflow

### 1. Start Monitor (Background)

After every `git push`, immediately launch:

```bash
bash .claude/skills/ci/scripts/monitor-ci.sh [repos] [branch] [sha1] [sha2]
```

Arguments:
- `repos`: `both` (default), `pgxntool-test`, or `pgxntool`
- `branch`: the branch just pushed (default: current git branch)
- `sha1`: SHA pushed to pgxntool-test (optional but recommended)
- `sha2`: SHA pushed to pgxntool (optional but recommended)

When pushing to both repos, always pass the SHAs to avoid a race condition where
`--branch` might pick up a different concurrent push on the same branch.

> **Race condition note**: `gh run list --branch` returns the most recent run on
> that branch — if two pushes happen close together (e.g. two sessions pushing
> in parallel), it may pick up the wrong run. Passing `--commit SHA` targets the
> exact push and avoids this. When SHA is unavailable, always verify the
> `=== BRANCHES: ===` line in the output matches the code you pushed.
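The selection logic described in the note can be sketched as a small helper. This is illustrative only (`pick_run_query` is a hypothetical name, not part of the skill); it just shows which `gh run list` query gets used when:

```shell
# Hypothetical helper: prefer an exact-SHA query; fall back to the
# branch query only when no SHA is available.
pick_run_query() {
  local sha="$1" branch="$2"
  if [ -n "$sha" ]; then
    printf 'gh run list --commit %s --json databaseId\n' "$sha"
  else
    printf 'gh run list --branch %s --event pull_request --limit 1 --json databaseId\n' "$branch"
  fi
}

pick_run_query "abc1234" "feature/foo"   # exact-SHA query (race-free)
pick_run_query ""        "feature/foo"   # branch fallback (race-prone)
```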

**Always use `run_in_background: true`.**

### 2. Read Results

When the background task completes, read the output. The script emits:

```
[pgxntool-test] Run 12345678 found
[pgxntool-test] === BRANCHES: pgxntool-test=feature/foo pgxntool=feature/foo ===
[pgxntool-test] Polling... (running: 🐘 PostgreSQL 13, 🐘 PostgreSQL 15)
[pgxntool-test] PASS 🐘 PostgreSQL 12
[pgxntool-test] PASS 🐘 PostgreSQL 15
[pgxntool-test] FAILURE 🐘 PostgreSQL 13
[pgxntool-test] Run 12345678 completed: FAILURE
[pgxntool-test] === FAILURE: 🐘 PostgreSQL 13 ===
... failure log lines ...
OVERALL: FAIL
```

The **last line is always `OVERALL: <STATUS>`**. Check this first:

| OVERALL | Exit code | Meaning |
|---------|-----------|---------|
| `ALL_PASS` | 0 | All jobs green — safe to proceed |
| `FAIL` | 1 | One or more jobs failed — stop and report |
| `TIMEOUT` | 2 | Run(s) did not complete within timeout |
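Pulling the status out of a captured transcript is a one-liner. A minimal sketch (the transcript content and variable names are illustrative):

```shell
# Extract the final OVERALL status from a saved monitor transcript.
transcript='[pgxntool-test] PASS 🐘 PostgreSQL 15
[pgxntool-test] FAILURE 🐘 PostgreSQL 13
OVERALL: FAIL'

# The last line is always "OVERALL: <STATUS>"; take the second field.
overall=$(printf '%s\n' "$transcript" | tail -n 1 | cut -d' ' -f2)
echo "$overall"   # FAIL
```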

**Always verify the `=== BRANCHES ===` line** matches the code you just pushed —
this is your primary safeguard against the `--branch` race condition. If the
branches don't match, cancel the run (`gh run cancel <id> --repo <repo>`) and
re-trigger it by re-pushing, or re-run it with `gh run rerun <id> --repo <repo>`.
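The branch check itself can be done mechanically. A hedged sketch (the sample line and `expected` value are illustrative) that confirms both repos are testing the branch you pushed:

```shell
# Verify both branch assignments in a BRANCHES line match the expected branch.
branches_line='=== BRANCHES: pgxntool-test=feature/foo pgxntool=feature/foo ==='
expected='feature/foo'

# The leading space in " pgxntool=" keeps it from also matching "pgxntool-test=".
if [[ "$branches_line" == *"pgxntool-test=$expected"* \
   && "$branches_line" == *" pgxntool=$expected"* ]]; then
  echo "branches ok"
else
  echo "branch mismatch" >&2
fi
```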

### 3. Enforce Results

**CRITICAL RULES:**

1. Any CI failure must be **reported to the user immediately**. Do not continue with other work.
2. Diagnose from the **first** `not ok` line — ignore cascading failures below it.
3. Failures in our workflow files (dependency installs, git config, etc.) are our problem to fix.
4. Failures in test code (`not ok` lines from BATS) may be pre-existing — report to user and ask before touching test files.
5. Never rationalize failures as "pre-existing" or "unrelated" without explicitly telling the user.
6. If CI is taking longer than expected on pgxntool, it may be waiting up to 5 min for a pgxntool-test PR — that is normal.
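Rule 2 (diagnose from the first `not ok` line) can be applied with a single `grep`. A sketch, using an illustrative sample log:

```shell
# Pull only the first failing test from a BATS log; later "not ok" lines
# are often cascading failures caused by the first one.
bats_log='ok 1 setup works
not ok 2 make install fails
not ok 3 teardown (cascading)'

first_failure=$(printf '%s\n' "$bats_log" | grep -m1 '^not ok')
echo "$first_failure"   # not ok 2 make install fails
```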

## Key rules

1. **ALWAYS** monitor CI after every push — use this skill, never `gh run watch` directly
2. When pushing to both repos, start two background monitors simultaneously (one per repo)
3. Pass the exact push SHA when available — `--branch` has a race condition on rapid pushes
4. The `=== BRANCHES ===` line in the output confirms which code is under test — always verify it matches your intent
209 changes: 209 additions & 0 deletions .claude/skills/ci/scripts/monitor-ci.sh
#!/usr/bin/env bash
# monitor-ci.sh [repos] [branch] [sha_pgxntool_test] [sha_pgxntool]
#
# Monitor GitHub Actions CI runs for pgxntool-test and/or pgxntool.
# Designed to be run in background by Claude after every git push.
#
# Arguments:
#   repos            : "both" (default), "pgxntool-test", or "pgxntool"
#   branch           : branch name (default: current git branch)
#   sha_pgxntool_test: exact SHA pushed to pgxntool-test (optional)
#   sha_pgxntool     : exact SHA pushed to pgxntool (optional)
#
# Exit codes:
#   0 : ALL_PASS — all jobs succeeded
#   1 : FAIL — one or more jobs failed, or no CI run was found in time
#   2 : TIMEOUT — run(s) did not complete within the timeout
#
# Requires: gh CLI authenticated with repo access.

set -euo pipefail

REPOS="${1:-both}"
BRANCH="${2:-$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")}"
SHA_TEST="${3:-}"
SHA_PGXN="${4:-}"

REPO_TEST="Postgres-Extensions/pgxntool-test"
REPO_PGXN="Postgres-Extensions/pgxntool"

# pgxntool runs can take up to 10 min: 5 min waiting for a test PR + test time.
# pgxntool-test runs typically take 2-3 min.
TIMEOUT_TEST=300 # 5 minutes
TIMEOUT_PGXN=600 # 10 minutes
POLL_INTERVAL=10 # seconds between status polls

# ─── Helper: wait for a run to appear, then poll until done ──────────────────
monitor_one() {
  local repo="$1"
  local branch="$2"
  local sha="$3"
  local timeout="$4"
  local label="[$repo]"
  local elapsed=0

  # Step 1: find the run ID.
  local run_id=""
  echo "$label Waiting for CI run on branch '$branch'..."
  while [[ -z "$run_id" ]]; do
    if [[ -n "$sha" ]]; then
      # Prefer exact SHA match — avoids the race condition when multiple
      # pushes land close together on the same branch.
      run_id=$(gh run list --repo "$repo" --commit "$sha" \
        --json databaseId --jq '.[0].databaseId // empty' 2>/dev/null || true)
    elif [[ -n "$branch" ]]; then
      # Fallback, used only when no SHA was given: most recent pull_request
      # run on the branch. NOTE: this can pick up a different run if two
      # pushes happen rapidly.
      run_id=$(gh run list --repo "$repo" --branch "$branch" \
        --event pull_request --limit 1 \
        --json databaseId --jq '.[0].databaseId // empty' 2>/dev/null || true)
    fi
    if [[ -z "$run_id" ]]; then
      sleep 5
      elapsed=$((elapsed + 5))
      if [[ $elapsed -ge $timeout ]]; then
        echo "$label ERROR: no CI run found after ${timeout}s" >&2
        return 1
      fi
    fi
  done
  echo "$label Run $run_id found"

  # Step 2: extract the BRANCHES line as soon as the first job starts.
  # We use the direct jobs API (fast, ~1s) rather than the zip-download log
  # path (slow, 3-10s). One job is enough — all jobs emit the same BRANCHES
  # line.
  local branches_line=""
  local attempts=0
  while [[ -z "$branches_line" && $elapsed -lt $timeout ]]; do
    local first_job_id
    first_job_id=$(gh run view "$run_id" --repo "$repo" \
      --json jobs --jq '[.jobs[].databaseId][0] // empty' 2>/dev/null || true)

    if [[ -n "$first_job_id" ]]; then
      # grep may return non-zero if the line isn't present yet — that's fine.
      # Match unanchored: API log lines carry a leading timestamp.
      branches_line=$(gh api "repos/${repo}/actions/jobs/${first_job_id}/logs" \
        2>/dev/null | grep "=== BRANCHES:" | tail -1 || true)
    fi

    if [[ -z "$branches_line" ]]; then
      attempts=$((attempts + 1))
      if [[ $attempts -ge 3 ]]; then
        # Give up waiting for the BRANCHES line and move on to polling.
        echo "$label (BRANCHES line not yet available; proceeding to poll)"
        break
      fi
      sleep "$POLL_INTERVAL"
      elapsed=$((elapsed + POLL_INTERVAL))
    fi
  done
  if [[ -n "$branches_line" ]]; then
    echo "$label $branches_line"
  fi

  # Step 3: poll until all jobs complete.
  local status="in_progress"
  local result=""
  while [[ "$status" != "completed" && $elapsed -lt $timeout ]]; do
    result=$(gh run view "$run_id" --repo "$repo" \
      --json status,conclusion,jobs \
      --jq '{status: .status, conclusion: .conclusion,
             jobs: [.jobs[] | {name: .name, status: .status, conclusion: .conclusion}]}' \
      2>/dev/null || true)

    if [[ -z "$result" ]]; then
      sleep "$POLL_INTERVAL"
      elapsed=$((elapsed + POLL_INTERVAL))
      continue
    fi

    status=$(echo "$result" | jq -r '.status')

    if [[ "$status" != "completed" ]]; then
      local running
      running=$(echo "$result" | jq -r \
        '[.jobs[] | select(.status == "in_progress") | .name] | join(", ")' || true)
      if [[ -n "$running" ]]; then
        echo "$label Polling... (running: $running)"
      fi
      sleep "$POLL_INTERVAL"
      elapsed=$((elapsed + POLL_INTERVAL))
    fi
  done

  # Time out only if the run never reached "completed" — elapsed may exceed
  # the timeout on the final, successful poll.
  if [[ "$status" != "completed" ]]; then
    echo "$label ERROR: timed out after ${timeout}s" >&2
    return 2
  fi

  # Step 4: report per-job outcomes.
  local conclusion
  conclusion=$(echo "$result" | jq -r '.conclusion')
  echo "$label Run $run_id completed: $(echo "$conclusion" | tr '[:lower:]' '[:upper:]')"
  echo "$result" | jq -r \
    '.jobs[] | "\(if .conclusion == "success" then "PASS"
                  elif .conclusion == null then .status
                  else (.conclusion | ascii_upcase) end) \(.name)"' \
    | sed "s|^|$label |"

  # Step 5: for failed jobs, print the failure log (last 60 lines per job).
  if [[ "$conclusion" != "success" ]]; then
    local failed_job_ids
    failed_job_ids=$(gh run view "$run_id" --repo "$repo" \
      --json jobs \
      --jq '[.jobs[] | select(.conclusion == "failure") | .databaseId] | .[]' \
      2>/dev/null || true)

    for job_id in $failed_job_ids; do
      local job_name
      # gh's --jq flag takes only a jq expression (there is no --argjson), so
      # interpolate the numeric job ID directly into the filter.
      job_name=$(gh run view "$run_id" --repo "$repo" \
        --json jobs \
        --jq "[.jobs[] | select(.databaseId == ${job_id}) | .name][0]" \
        2>/dev/null || true)
      echo ""
      echo "$label === FAILURE: ${job_name:-job $job_id} ==="
      # Use --log-failed to get only the failed step output, keeping it compact.
      gh run view --repo "$repo" --job "$job_id" --log-failed 2>&1 \
        | grep -v "^$" | tail -60 || true
    done

    return 1
  fi

  return 0
}

# ─── Main: run monitors in parallel or series ─────────────────────────────────
exit_code=0
pid_test=""
pid_pgxn=""

case "$REPOS" in
  pgxntool-test)
    monitor_one "$REPO_TEST" "$BRANCH" "$SHA_TEST" "$TIMEOUT_TEST" || exit_code=$?
    ;;
  pgxntool)
    monitor_one "$REPO_PGXN" "$BRANCH" "$SHA_PGXN" "$TIMEOUT_PGXN" || exit_code=$?
    ;;
  both|*)
    # Run both in parallel. Each writes to stdout (interleaved, but prefixed
    # with the repo name for readability). Capture both PIDs and wait for both.
    monitor_one "$REPO_TEST" "$BRANCH" "$SHA_TEST" "$TIMEOUT_TEST" &
    pid_test=$!
    monitor_one "$REPO_PGXN" "$BRANCH" "$SHA_PGXN" "$TIMEOUT_PGXN" &
    pid_pgxn=$!

    # The if-statements keep the || groups exiting zero, so `set -e` does not
    # abort before the second wait or the summary line.
    wait "$pid_test" || {
      r=$?
      echo "[both] pgxntool-test CI FAILED"
      if [[ $r -gt $exit_code ]]; then exit_code=$r; fi
    }
    wait "$pid_pgxn" || {
      r=$?
      echo "[both] pgxntool CI FAILED"
      if [[ $r -gt $exit_code ]]; then exit_code=$r; fi
    }
    ;;
esac

# Emit a parseable summary line. Claude should check this line rather than
# parsing the full output. Convention matches the test skill's STATUS line.
if [[ $exit_code -eq 0 ]]; then
  echo "OVERALL: ALL_PASS"
elif [[ $exit_code -eq 2 ]]; then
  echo "OVERALL: TIMEOUT"
else
  echo "OVERALL: FAIL"
fi

exit $exit_code