fix(ci): flag only event warnings still firing after the settle, not transients by devantler · Pull Request #1677 · devantler-tech/platform

devantler · 2026-06-01T21:29:32Z

Root cause

The check-event-warnings deploy gate is what failed the recent #1659 reconcile:

New Warning events since 2026-06-01T21:09:00Z (4)
  [flux-system] Pod/helm-controller-…      Unhealthy: Readiness probe failed: …:9440/readyz: connection refused
  [flux-system] Pod/kustomize-controller-… Unhealthy: Readiness probe failed: …:9440/readyz: connection refused
  [flux-system] Kustomization/infrastructure-controllers  HealthCheckFailed: … context canceled
  [flux-system] Pod/notification-controller-…  Unhealthy: Readiness probe failed: …:9440/readyz: connection refused

The action's docstring says it ignores transient one-shot warnings. But the implementation records the marker before the settle and flags any Warning with lastTimestamp >= marker. So a one-shot warning that fires during the settle window and then stops is flagged anyway — it only actually ignores transients that fired before the deploy started.

That false-positives on any deploy that rolls a workload during the settle. Concretely: applying the Flux-controller topologySpreadConstraints (#1659) — or any controller config change — restarts the controllers. While a controller restarts, its /readyz returns connection refused for a few seconds (leader-election handoff), and kustomize-controller restarting cancels the in-flight infrastructure-controllers health check (context canceled). All four events are transient and self-heal within seconds — the controllers are 1/1 Running immediately after — yet the gate failed the deploy.

Fix

Record the marker after the settle, then observe for a short window (new observe-seconds input, default 60):

settle (90s, grace — rollouts converge)  →  marker  →  observe (60s)  →  sample

Only warnings whose most-recent occurrence is at/after the post-settle marker are flagged — i.e. still firing once the cluster has settled, which is the documented "steady state" intent. Persistent problems (crash loops, FailedAttachVolume, missing PriorityClass — all of which fire continuously) keep firing into the observe window and are still caught; one-shot rollout blips that drain during the settle are not.

This is a correctness fix, not a loosening: it makes the gate do what its docstring already claims.

Validation

action.yaml parses; embedded bash passes bash -n.
Fed the gate's jq filter the exact 21:09:17 readiness-blip event (before a post-settle marker) plus a synthetic BackOff ×42 crash-loop (after it): only the crash-loop is flagged. ✅
No caller changes needed — all three call sites (ci.yaml system-test + merge_group, cd.yaml) use defaults; observe-seconds defaults to 60.

Relationship to other PRs

This unblocks #1659 (Flux controller spread) and any future controller-config deploy. Note #1661 ("require workloads to spread across nodes") merged recently — worth confirming whether #1659 is still needed or is now redundant with that policy.

…transients The check-event-warnings gate records its marker BEFORE the settle, then flags any Warning whose lastTimestamp is at/after the marker. So a one-shot warning that fires DURING the settle window and then stops is flagged — contradicting the action's own docstring ("transient one-shot warnings ... are ignored"). In practice it only ignores transients that fired before the marker, i.e. before the deploy even started. This false-positives on any deploy that rolls a workload during the settle. Concretely: applying the Flux-controller topologySpread (#1659) restarts the controllers; their /readyz returns "connection refused" for a few seconds during the restart, and kustomize-controller restarting cancels the infrastructure-controllers health check. All transient and self-healed — yet the gate failed the deploy on those four events. Fix: record the marker AFTER the settle, then observe for a short window (new observe-seconds input, default 60s). Only warnings still firing at/after the post-settle marker are flagged — the documented "still firing at steady state" intent. Persistent problems (crash loops, FailedAttachVolume, missing PriorityClass) keep firing into the observe window and are still caught; one-shot rollout blips that drain during the settle are not. Verified: the jq filter, fed the exact 21:09:17 readiness-blip event plus a synthetic BackOff crash-loop, now flags only the crash-loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This pull request fixes the check-event-warnings composite action so it only fails deployments when the cluster continues emitting Warning events after the settle period (steady state), rather than flagging transient warnings that occur during rollout convergence.

Changes:

Moves the warning “marker” timestamp to after the settle sleep, aligning behavior with the documented “steady state” intent.
Adds an observe-seconds input (default 60) to create a short post-settle observation window before sampling events.
Updates action documentation/messages to reflect the new settle → marker → observe → sample flow.

devantler · 2026-06-01T21:42:35Z

Superseded by #1678: rather than fix the event-warnings gate's marker timing, we're removing the gate entirely and relying on Flux reconcile status (ksail workload reconcile already waits per-Kustomization; all Kustomizations are wait: true so Ready encodes Flux's health checks). Per maintainer preference for trusting Flux/ksail over bespoke CI heuristics.

Copilot AI review requested due to automatic review settings June 1, 2026 21:29

github-project-automation Bot added this to 🌊 Project Board Jun 1, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 1, 2026

Copilot started reviewing on behalf of devantler June 1, 2026 21:29 View session

Copilot AI reviewed Jun 1, 2026

View reviewed changes

devantler mentioned this pull request Jun 1, 2026

ci: drop fragile event-warnings gate; rely on Flux reconcile status #1678

Open

devantler closed this Jun 1, 2026

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): flag only event warnings still firing after the settle, not transients#1677

fix(ci): flag only event warnings still firing after the settle, not transients#1677
devantler wants to merge 1 commit into
mainfrom
fix/event-warnings-gate-transient

devantler commented Jun 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

devantler commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

devantler commented Jun 1, 2026

Root cause

Fix

Validation

Relationship to other PRs

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

devantler commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants