fix(ci): flag only event warnings still firing after the settle, not transients#1677
Closed
devantler wants to merge 1 commit into
Closed
fix(ci): flag only event warnings still firing after the settle, not transients#1677devantler wants to merge 1 commit into
devantler wants to merge 1 commit into
Conversation
…transients
The check-event-warnings gate records its marker BEFORE the settle, then
flags any Warning whose lastTimestamp is at/after the marker. So a one-shot
warning that fires DURING the settle window and then stops is flagged —
contradicting the action's own docstring ("transient one-shot warnings ...
are ignored"). In practice it only ignores transients that fired before the
marker, i.e. before the deploy even started.
This false-positives on any deploy that rolls a workload during the settle.
Concretely: applying the Flux-controller topologySpread (#1659) restarts the
controllers; their /readyz returns "connection refused" for a few seconds
during the restart, and kustomize-controller restarting cancels the
infrastructure-controllers health check. All transient and self-healed — yet
the gate failed the deploy on those four events.
Fix: record the marker AFTER the settle, then observe for a short window
(new observe-seconds input, default 60s). Only warnings still firing
at/after the post-settle marker are flagged — the documented "still firing
at steady state" intent. Persistent problems (crash loops, FailedAttachVolume,
missing PriorityClass) keep firing into the observe window and are still
caught; one-shot rollout blips that drain during the settle are not.
Verified: the jq filter, fed the exact 21:09:17 readiness-blip event plus a
synthetic BackOff crash-loop, now flags only the crash-loop.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request fixes the check-event-warnings composite action so it only fails deployments when the cluster continues emitting Warning events after the settle period (steady state), rather than flagging transient warnings that occur during rollout convergence.
Changes:
- Moves the warning “marker” timestamp to after the settle sleep, aligning behavior with the documented “steady state” intent.
- Adds an
observe-secondsinput (default60) to create a short post-settle observation window before sampling events. - Updates action documentation/messages to reflect the new settle → marker → observe → sample flow.
Contributor
Author
|
Superseded by #1678: rather than fix the event-warnings gate's marker timing, we're removing the gate entirely and relying on Flux reconcile status ( |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
The
check-event-warningsdeploy gate is what failed the recent #1659 reconcile:The action's docstring says it ignores transient one-shot warnings. But the implementation records the marker before the settle and flags any Warning with
lastTimestamp >= marker. So a one-shot warning that fires during the settle window and then stops is flagged anyway — it only actually ignores transients that fired before the deploy started.That false-positives on any deploy that rolls a workload during the settle. Concretely: applying the Flux-controller
topologySpreadConstraints(#1659) — or any controller config change — restarts the controllers. While a controller restarts, its/readyzreturnsconnection refusedfor a few seconds (leader-election handoff), andkustomize-controllerrestarting cancels the in-flightinfrastructure-controllershealth check (context canceled). All four events are transient and self-heal within seconds — the controllers are1/1 Runningimmediately after — yet the gate failed the deploy.Fix
Record the marker after the settle, then observe for a short window (new
observe-secondsinput, default60):Only warnings whose most-recent occurrence is at/after the post-settle marker are flagged — i.e. still firing once the cluster has settled, which is the documented "steady state" intent. Persistent problems (crash loops,
FailedAttachVolume, missing PriorityClass — all of which fire continuously) keep firing into the observe window and are still caught; one-shot rollout blips that drain during the settle are not.This is a correctness fix, not a loosening: it makes the gate do what its docstring already claims.
Validation
action.yamlparses; embedded bash passesbash -n.jqfilter the exact21:09:17readiness-blip event (before a post-settle marker) plus a syntheticBackOff ×42crash-loop (after it): only the crash-loop is flagged. ✅ci.yamlsystem-test + merge_group,cd.yaml) use defaults;observe-secondsdefaults to 60.Relationship to other PRs
This unblocks #1659 (Flux controller spread) and any future controller-config deploy. Note #1661 ("require workloads to spread across nodes") merged recently — worth confirming whether #1659 is still needed or is now redundant with that policy.