fix(ci): harden Submit docs URLs to IndexNow step #833
Conversation
The existing step had three functional problems that made it silently broken since PR #827 landed:

1. The hardcoded fallback key `9ded60b43b8b40578588fc42b04c3e5d` was the stale key deleted from `landing/public/` in keploy/landing#186 because Bing was rejecting it with `HTTP 403 UserForbiddedToAccessSite`. Every push-run since PR #827 has been hitting 403 in silence.
2. `continue-on-error: true` combined with a plain `echo` on failure (no `::error::` annotation, no `exit 1`) meant the failing step reported green in the Actions UI — no signal, no alerting.
3. No `keyLocation` in the payload, no `CURL_EXIT` capture, and no filter for legacy versioned docs or auto-generated tag pages (so ~800 URLs were submitted, mostly low-value `/tags/` and `/docs/{1,2,3}.0.0/` pages that robots.txt already disallows).

Changes:

- Default `INDEXNOW_KEY` to `feeeddbe661a4686ba6e4d5cb663b2cb` (the key actually served from `keploy.io/<key>.txt`). The secret override still works for future rotations.
- Add `keyLocation` to the payload.
- Filter sitemap URLs: drop `/tags/` pages and legacy `/docs/{1,2,3}.0.0/` paths before submitting; this reduces the list from ~800 to ~206 URLs, all on supported paths. Verified locally with a real POST, HTTP 200.
- `set -euo pipefail`, capture `CURL_EXIT` separately from `HTTP_CODE`, exit non-zero with `::error::` on real failure, and print the response body plus a troubleshooting block with a `keyLocation` check and a ready-to-paste reproducer. 2xx (including 202 accepted-pending-verification) is an explicit `exit 0`.
- Keep `continue-on-error: true` so the IndexNow step never blocks the deploy, but the `::error::` annotation now surfaces failures in the UI instead of hiding them.

A local dry run against the live sitemap-derived URL list returned HTTP 200 with 20 real docs URLs.

Signed-off-by: nehagup <15074229+nehagup@users.noreply.github.com>
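The success/failure classification the description outlines (curl transport failures and non-2xx HTTP codes are real errors; any 2xx, including 202, is success) can be sketched as a small helper. The function name and echoed labels here are illustrative, not taken from the workflow file:

```shell
# classify_indexnow_result CURL_EXIT HTTP_CODE
# Mirrors the step's logic: a curl transport failure or a non-2xx status
# is a real error; any 2xx (including 202 accepted-pending-verification)
# counts as success.
classify_indexnow_result() {
  curl_exit="$1"
  http_code="$2"
  if [ "$curl_exit" -ne 0 ]; then
    echo "curl-error"   # transport-level failure (DNS, TLS, timeout, ...)
  else
    case "$http_code" in
      2??) echo "ok" ;;          # 200 OK, 202 accepted-pending-verification
      *)   echo "http-error" ;;  # e.g. 403 from a stale key
    esac
  fi
}
```

In the step itself, `curl-error` and `http-error` paths would emit a `::error::` annotation and `exit 1`, while `ok` exits 0.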
Pull request overview
Hardens the GitHub Actions “Submit docs URLs to IndexNow” step so IndexNow failures are visible (without blocking deploys) and submissions are limited to the desired subset of docs URLs.
Changes:
- Defaults `INDEXNOW_KEY` to a new key (still overridable via `secrets.INDEXNOW_KEY`) and adds `keyLocation` to the IndexNow payload.
- Filters sitemap-derived URLs to exclude `/tags/` pages and legacy versioned docs paths before submitting.
- Improves shell robustness and failure reporting (pipefail, curl exit vs HTTP code, `::error::` annotations, troubleshooting output).
```shell
URLS=$(grep -oP '<loc>\K[^<]+' "$SITEMAP" \
  | grep -v '/tags/' \
  | grep -vE '/docs/(1|2|3)\.[0-9]+\.[0-9]+/' \
  | sort -u)
```
With set -euo pipefail, this URLS=$(grep ... | grep ... | ...) pipeline will cause the script to exit immediately if grep finds no matches (exit status 1), so the subsequent if [ -z "$URLS" ] skip logic may never run. Make the pipeline non-fatal (e.g., append || true to the command substitution, or temporarily disable -e around the parse) so an empty sitemap/filter result is handled as an intentional skip rather than a step failure.
```diff
-  | sort -u)
+  | sort -u || true)
```
Fixed in 514e127: appended `|| true` to the command substitution so the filter pipeline can return a non-zero exit status without aborting the step. Verified locally with a fixture where every URL in the sitemap matched the drop filters — the `[ -z "$URLS" ]` skip branch is now reachable and fires cleanly.
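The failure mode behind this fix is easy to reproduce in a few lines; the sample URL below is illustrative:

```shell
# Reproduces the failure mode: under `set -euo pipefail` a filter pipeline
# that matches nothing exits 1 (grep's status survives via pipefail), and
# the variable assignment inherits that status, aborting the script.
set -euo pipefail

SITEMAP_URLS="https://keploy.io/docs/tags/foo/"   # everything matches the drop filter

# Without the trailing `|| true`, this assignment would terminate the script here.
URLS=$(printf '%s\n' "$SITEMAP_URLS" | grep -v '/tags/' | sort -u || true)

if [ -z "$URLS" ]; then
  RESULT="skipped"      # the [ -z "$URLS" ] skip branch is now reachable
else
  RESULT="submitting"
fi
```

Dropping the `|| true` from the assignment makes the script exit before the `if` is ever reached.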
```diff
 if [ ! -f "$SITEMAP" ]; then
-  echo "Sitemap not found at $SITEMAP, skipping IndexNow"
+  echo "::warning::Sitemap not found at $SITEMAP, skipping IndexNow submission"
   exit 0
 fi

-URLS=$(grep -oP '<loc>\K[^<]+' "$SITEMAP" || true)
+# Filter to the current (unversioned) docs only:
+# - drop auto-generated /tags/ pages (low-value for search indexing)
+# - drop legacy versioned docs (/docs/1.0.0/, 2.0.0/, 3.0.0/) since
+#   robots.txt disallows them and they're marked noindex anyway
+URLS=$(grep -oP '<loc>\K[^<]+' "$SITEMAP" \
+  | grep -v '/tags/' \
+  | grep -vE '/docs/(1|2|3)\.[0-9]+\.[0-9]+/' \
+  | sort -u)

 if [ -z "$URLS" ]; then
-  echo "No URLs found in sitemap, skipping"
+  echo "::warning::No current-version docs URLs parsed from $SITEMAP, skipping"
   exit 0
```
The workflow annotation uses ::warning:: for non-error conditions (missing sitemap / no URLs). This repo’s logging guidelines disallow adding warnings; consider using a plain echo or ::notice:: instead so the Actions UI doesn’t surface these as warnings while still providing visibility.
Fixed in 514e127: changed both non-error skip paths (missing sitemap, no URLs after filter) from ::warning:: to ::notice::. The ::error:: annotations on actual IndexNow failures (curl-exit / non-2xx) are retained — those are still real errors that should surface.
```shell
# - drop legacy versioned docs (/docs/1.0.0/, 2.0.0/, 3.0.0/) since
#   robots.txt disallows them and they're marked noindex anyway
```
This comment says legacy versioned docs are excluded because robots.txt disallows them and they are noindex, but the repository’s static/robots.txt currently allows all paths (except /cgi-bin/) and there’s no matching noindex rule found. Please update the comment to reflect the real reason for filtering (e.g., keeping submissions focused on preferred doc paths) so future maintainers aren’t misled.
```diff
-# - drop legacy versioned docs (/docs/1.0.0/, 2.0.0/, 3.0.0/) since
-#   robots.txt disallows them and they're marked noindex anyway
+# - drop legacy versioned docs (/docs/1.0.0/, 2.0.0/, 3.0.0/) to keep
+#   submissions focused on the preferred current doc paths
```
Fixed in 514e127: replaced the misleading "robots.txt disallows / noindex" rationale with the real one ("keep submissions focused on the preferred current doc paths"). The old comment was written assuming #832's nuanced robots.txt would land first, but that's a separate PR and this file has to stand on its own.
```diff
 HTTP_CODE=$(curl -sS -o /tmp/indexnow-resp.txt -w "%{http_code}" \
   -X POST "https://api.indexnow.org/indexnow" \
   -H "Content-Type: application/json" \
-  -d "{\"host\":\"keploy.io\",\"key\":\"$KEY\",\"urlList\":$JSON_URLS}" || echo "000")
+  -d "{\"host\":\"${INDEXNOW_HOST}\",\"key\":\"${INDEXNOW_KEY}\",\"keyLocation\":\"${INDEXNOW_KEY_LOC}\",\"urlList\":${JSON_URLS}}") || CURL_EXIT=$?
```
The response body is written to a fixed path (/tmp/indexnow-resp.txt) and never cleaned up. Using a unique temp file (e.g., mktemp) and a trap cleanup makes the step more robust and avoids any chance of collisions or stale data being read later in the script.
Fixed in 514e127: introduced RESPONSE_FILE=$(mktemp) at the top of the run script plus trap 'rm -f "$RESPONSE_FILE"' EXIT so every exit path (happy path, skip, curl failure, non-2xx) cleans up. Verified the trap fires on EXIT with a local dry-run. Also swapped the two call sites (curl -o and the non-2xx cat) to use $RESPONSE_FILE — no more hardcoded /tmp/indexnow-resp.txt.
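A minimal sketch of the mktemp-plus-trap pattern described in this fix. The subshell stands in for the workflow step's shell, so the trap's cleanup can be observed from outside:

```shell
# Unique temp file + EXIT trap: cleanup runs on every exit path (success,
# skip, failure), and parallel runs can't collide on a fixed /tmp path.
TMP_PATH=$(
  RESPONSE_FILE=$(mktemp)
  trap 'rm -f "$RESPONSE_FILE"' EXIT
  echo "indexnow response body" > "$RESPONSE_FILE"   # stand-in for the curl -o write
  echo "$RESPONSE_FILE"                              # report the path to the parent shell
)
# Once the subshell has exited, its EXIT trap has already deleted the file.
```

After the command substitution returns, `$TMP_PATH` names a file that no longer exists, which is exactly the guarantee the workflow step relies on.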
…ect filter rationale, mktemp+trap

Four fixes from the Copilot review on PR #833:

1. `|| true` on the URL extraction pipeline. Under `set -euo pipefail`, if the filter chain (`grep -oP | grep -v | grep -vE | sort -u`) produced no matches, grep's exit 1 would abort the step before the `[ -z "$URLS" ]` skip branch could run. A trailing `|| true` makes the empty-result path intentionally non-fatal, so the skip branch is now reachable.
2. `::warning::` → `::notice::` for the two non-error skip paths (missing sitemap, no URLs after filter). Per repo guidelines, the Actions UI shouldn't surface these as warnings; they're expected no-op conditions.
3. Corrected the filter comment. The previous wording claimed legacy versioned docs were excluded "because robots.txt disallows them and they're marked noindex anyway", but the current static/robots.txt on main doesn't have that disallow and there's no noindex rule for those paths. Replaced with the real reason: keeping submissions focused on the preferred current doc paths.
4. Replaced the fixed `/tmp/indexnow-resp.txt` path with a `mktemp`-generated RESPONSE_FILE and a `trap ... EXIT` cleanup, so repeat runs can't collide and stale data can't leak into later reads.

Dry-ran all three edge cases locally (all-filtered-out, normal URLs, missing sitemap) and verified the trap fires on EXIT.

Signed-off-by: nehagup <15074229+nehagup@users.noreply.github.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```shell
JSON_URLS=$(printf '%s\n' "$URLS" | jq -R -s 'split("\n") | map(select(length > 0))')
```
With set -euo pipefail, if jq is missing or fails to parse/build JSON_URLS, the step will exit immediately without the ::error:: annotation / troubleshooting output that you provide for curl/HTTP failures. Consider explicitly checking jq availability and/or handling a non-zero exit from the jq command to emit a ::error:: with a next step (e.g., install jq / verify sitemap format) before exiting non-zero.
```diff
-JSON_URLS=$(printf '%s\n' "$URLS" | jq -R -s 'split("\n") | map(select(length > 0))')
+if ! command -v jq >/dev/null 2>&1; then
+  echo "::error::jq is required to build the IndexNow request payload. Install jq or use a runner image that includes jq, then rerun this workflow."
+  exit 1
+fi
+if ! JSON_URLS=$(printf '%s\n' "$URLS" | jq -R -s 'split("\n") | map(select(length > 0))'); then
+  echo "::error::Failed to build the IndexNow JSON payload from sitemap URLs. Verify the sitemap-derived URL list is valid and rerun the workflow."
+  exit 1
+fi
```
Fixed in 8033df7: added explicit command -v jq check and wrapped the JSON_URLS=$(... jq ...) assignment in if ! ... then ... exit 1; fi so both failure modes (missing jq, jq parse error) emit ::error:: annotations with actionable next steps before exiting, instead of aborting silently under set -euo pipefail.
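A standalone sketch of that guard, assuming the same jq invocation as the step (the two sample URLs are illustrative):

```shell
# Guarding the payload build under `set -euo pipefail`: both failure modes
# (jq missing, jq failing) get an actionable ::error:: annotation instead
# of an unannotated abort.
set -euo pipefail

URLS='https://keploy.io/docs/
https://keploy.io/docs/running-keploy/'

if ! command -v jq >/dev/null 2>&1; then
  echo "::error::jq is required to build the IndexNow payload."
  exit 1
fi
if ! JSON_URLS=$(printf '%s\n' "$URLS" | jq -R -s 'split("\n") | map(select(length > 0))'); then
  echo "::error::Failed to build the IndexNow JSON payload."
  exit 1
fi
URL_COUNT=$(printf '%s' "$JSON_URLS" | jq 'length')   # here: 2 URLs
```

The `if !` wrappers matter: under `set -e`, a bare failing assignment would kill the script before either `echo` could run.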
```shell
cat "$RESPONSE_FILE" 2>/dev/null || true
echo ""
echo "Troubleshooting:"
echo "  1. Verify key file is publicly reachable: curl -sS $INDEXNOW_KEY_LOC"
```
The troubleshooting line prints curl -sS $INDEXNOW_KEY_LOC without quoting. Quoting the URL in the echoed command makes the copy/paste reproducer more robust (and avoids surprises if the value ever contains shell-special characters).
```diff
-echo "  1. Verify key file is publicly reachable: curl -sS $INDEXNOW_KEY_LOC"
+echo "  1. Verify key file is publicly reachable: curl -sS \"$INDEXNOW_KEY_LOC\""
```
Fixed in 8033df7: quoted $INDEXNOW_KEY_LOC (and the IndexNow API URL on the next line) in the printed troubleshooting reproducer so a copy/paste from the CI log stays shell-safe even if the key or host ever contains a shell-special character.
Two Copilot review follow-ups:

1. Guard the payload build under `set -euo pipefail`: explicitly check `command -v jq` and the `jq` command's exit status, emitting a `::error::` annotation on either failure before `exit 1`. Previously a missing or broken jq would abort the step immediately without the actionable error message, leaving an unannotated "process exited 1" in the Actions UI.
2. Quote `$INDEXNOW_KEY_LOC` and the IndexNow API URL in the printed troubleshooting reproducer so a copy/paste of the line from the CI log remains shell-safe even if the key or host ever contains a shell-special character.

Signed-off-by: nehagup <15074229+nehagup@users.noreply.github.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```diff
-echo "Failed to build JSON URL list, skipping"
-exit 0
+if ! JSON_URLS=$(printf '%s\n' "$URLS" | jq -R -s 'split("\n") | map(select(length > 0))'); then
+  echo "::error::Failed to build the IndexNow JSON payload from the sitemap URL list. Inspect the URLs above for unexpected characters and rerun the workflow."
```
This ::error:: message says “Inspect the URLs above…”, but the script never prints the URL list (only the count). On a jq failure, the log won’t actually contain the URLs to inspect. Either print a small sample of the parsed URLs before running jq (or dump them to the log on failure), or adjust the error text to point to the sitemap/path filtering logic instead of “URLs above”.
```diff
-echo "::error::Failed to build the IndexNow JSON payload from the sitemap URL list. Inspect the URLs above for unexpected characters and rerun the workflow."
+echo "::error::Failed to build the IndexNow JSON payload from the sitemap URL list. Review the sitemap contents and the /tags/ and versioned /docs/ filtering logic in this step for unexpected input, then rerun the workflow."
```
Fixed in 91c9c66: added a First 5 URLs: sample log immediately after Submitting $URL_COUNT docs URLs to IndexNow (printed via printf '%s\n' "$URLS" | head -5 | sed 's/^/ /') so every run — including jq failures — leaves actionable URL context in the Actions log. The jq-failure error message now references that sample ("See the 'First 5 URLs' sample logged above") instead of the non-existent "URLs above".
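That sample-logging pattern, sketched standalone with illustrative URLs:

```shell
# Log a small indented sample of the parsed URLs so a later jq failure
# leaves actionable context in the Actions log.
URLS='https://keploy.io/docs/a/
https://keploy.io/docs/b/
https://keploy.io/docs/c/
https://keploy.io/docs/d/
https://keploy.io/docs/e/
https://keploy.io/docs/f/'

echo "First 5 URLs:"
SAMPLE=$(printf '%s\n' "$URLS" | head -5 | sed 's/^/  /')   # indent each line by two spaces
echo "$SAMPLE"
```

With six input URLs, the sample holds exactly the first five, each prefixed with two spaces.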
```shell
# Filter to the current (unversioned) docs only:
# - drop auto-generated /tags/ pages (low-value for search indexing)
# - drop legacy versioned docs (/docs/1.0.0/, 2.0.0/, 3.0.0/) to keep
#   submissions focused on the preferred current doc paths
```
The comment says this filters to “current (unversioned) docs only”, but the actual filter only excludes versioned paths for majors 1–3; it will still include any versioned docs like /docs/4.0.0/ (and higher) if they appear in the sitemap. Please either (a) update the comment to match the actual intent (e.g., exclude only unmaintained versions 1–3), or (b) broaden the regex to exclude all versioned doc paths if you truly want unversioned-only URLs.
Fixed in 91c9c66 — went with option (b): broadened the regex from /docs/(1|2|3)\.[0-9]+\.[0-9]+/ to /docs/[0-9]+\.[0-9]+\.[0-9]+/, and updated the inline comment to say "drop any versioned doc path like /docs/N.M.P/ so only the current (unversioned) docs are submitted". This keeps the rule correct when a future v5/v6 archive lands alongside today's v1-v3 legacy versions without requiring a follow-up edit. Verified locally against four fixtures (/docs/about/, /docs/1.0.0/tags/foo/, /docs/4.0.0/running-keploy/, /docs/12.3.45/test/) — only the unversioned URL is kept.
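The broadened filter can be exercised locally against fixtures like those; only the unversioned URL should survive:

```shell
# Fixtures mirroring the ones described in the fix: one current docs page,
# one legacy tag page, and two versioned docs paths (current and future majors).
FIXTURES='https://keploy.io/docs/about/
https://keploy.io/docs/1.0.0/tags/foo/
https://keploy.io/docs/4.0.0/running-keploy/
https://keploy.io/docs/12.3.45/test/'

# Same filter chain as the step, with the broadened /docs/N.M.P/ regex.
KEPT=$(printf '%s\n' "$FIXTURES" \
  | grep -v '/tags/' \
  | grep -vE '/docs/[0-9]+\.[0-9]+\.[0-9]+/' \
  | sort -u || true)
```

Because the version pattern is `[0-9]+` in each position, a future `/docs/5.0.0/` or `/docs/12.3.45/` archive is dropped without any follow-up edit to the workflow.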
Two review follow-ups:

1. Broaden the versioned-docs filter from `(1|2|3)` to `[0-9]+` so the regex stays correct when a future v5/v6 archive lands alongside today's v1-v3 legacy versions. Verified against fixtures including /docs/1.0.0/, /docs/4.0.0/, /docs/12.3.45/ — all dropped.
2. Print the first 5 parsed URLs before the jq step so a jq parse failure leaves something actionable in the Actions log. Updated the jq-failure error message to reference the "First 5 URLs" sample instead of the non-existent "URLs above" artifact.

Signed-off-by: nehagup <15074229+nehagup@users.noreply.github.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Summary
The existing `Submit docs URLs to IndexNow` step in `main.yml` has been silently broken since #827 landed, due to three functional bugs:

1. The hardcoded fallback key `${INDEXNOW_KEY:-9ded60b43b8b40578588fc42b04c3e5d}`. That key was deleted from `landing/public/` in keploy/landing#186 because Bing was rejecting it with `HTTP 403 UserForbiddedToAccessSite`. Every push-run of `main.yml` since "feat: combined SEO/GEO/AEO + IndexNow auto-submission + build fix" #827 has been hitting 403 in silence. Reproduced just now.
2. `continue-on-error: true` + no `::error::` annotation + no `exit 1` meant the failure was completely invisible. The step went green, no alerting.
3. The submission included auto-generated `/tags/` pages and 184 legacy `/docs/{1,2,3}.0.0/` pages (which `robots.txt` already disallows). Only ~206 are current-version docs worth submitting.

Changes

- Defaults `INDEXNOW_KEY` to `feeeddbe661a4686ba6e4d5cb663b2cb` — the key actually served from `keploy.io/<key>.txt`. The `secrets.INDEXNOW_KEY` override still takes priority for rotations.
- Adds `keyLocation` to the payload.
- Filters sitemap URLs: drops `/tags/` pages and legacy `/docs/{1,2,3}.0.0/` paths before submitting. Reduces submission from ~800 → ~206, all on currently-supported paths.
- `set -euo pipefail`, captures `CURL_EXIT` separately from `HTTP_CODE`, exits non-zero with `::error::` on real failure (curl exit != 0 or HTTP non-2xx), prints the response body plus a Troubleshooting block containing a `keyLocation` check and a ready-to-paste POST reproducer. `2xx` (including `202` accepted-pending-verification) is an explicit `exit 0`.
- Keeps `continue-on-error: true` so a failure never blocks the deploy, but `::error::` annotations now surface failures in the Actions UI instead of hiding them.

Also mirrors the hardening pattern already shipped in keploy/blog-website#366.
Test plan
- Parsed `docs/sitemap.xml` (after gunzip) → 206 current-version URLs, 0 tag pages, 0 legacy version pages.
- Real POST to `https://api.indexnow.org/indexnow` with the new key + `keyLocation` → HTTP 200.
- `INDEXNOW_KEY` secret already set on `keploy/docs` (separate, so Option A from the diagnosis is in effect right now — the next `main.yml` run should hit HTTP 200 even before this PR merges).
- After merge: `main.yml` logs `Submitted <N> URLs to IndexNow (HTTP 200)` instead of silently 403'ing.

🤖 Generated with Claude Code