stackrox · robbycochran · May 21, 2026
@@ -0,0 +1,115 @@
+---
+name: analyze-test-failures
+description: Analyze test failure artifacts and generate root cause analysis report
+---
+
+# Test Failure Analysis
+
+Analyze test failures from CI artifacts and generate a concise root cause analysis for the oncall team.
+
+## Usage
+
+```
+/analyze-test-failures <artifacts-dir> <workflow-name> <failed-jobs>
+```
+
+**Arguments:**
+- `artifacts-dir`: Directory containing test artifacts (default: test-artifacts/)
+- `workflow-name`: Name of the workflow that failed (e.g., "Integration Tests")
+- `failed-jobs`: Comma-separated list of failed job names
+
+**Example:**
+```
+/analyze-test-failures test-artifacts/ "Integration Tests" "amd64-integration-tests,arm64-integration-tests"
+```
+
+## What This Does
+
+1. **Find test reports**: Searches for JUnit XML files (integration-test-report-*.xml, junit.xml)
+2. **Parse failures**: Extracts test names, error messages, stack traces
+3. **Investigate code**: Reads failing test source and implementation code
+4. **Check git history**: Looks for recent changes that may have caused failures
+5. **Identify patterns**: Detects platform-specific issues (arch/OS)
+6. **Generate report**: Creates analysis-report.md with findings
+
+## Report Format
+
+The generated `analysis-report.md` contains:
+
+```markdown
+**🤖 AI Analysis**
+
+**Root Cause**: [1-2 sentence summary with file:line references]
+
+**Evidence**:
+• [Specific code observations]
+• [Patterns across failures]
+• [Recent changes correlation]
+
+**Affected Platforms**: [Architectures/OS if pattern found]
+
+**Recommendations**:
+• [Specific file:line to fix with suggested change]
+• [Additional investigation needed]
+• [Prevention strategy]
+
+---
+**Statistics**
+• Total Failures: [count]
+• Total Errors: [count]
+• Failed Jobs: [list]
+```
+
+## Implementation
+
+Start by finding and parsing test reports:
+
+```bash
+# Find all XML test reports
+find <artifacts-dir> -name "*.xml" -type f
+```
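Once the reports are located, the counts for the Statistics section can be tallied directly from the XML. The following is a minimal sketch, assuming the report file names listed under "What This Does" and standard JUnit `<failure>` and `<error>` elements. `count_junit_elements` is a hypothetical helper and is not part of this PR.

```shell
# Hypothetical helper (not part of this PR): count JUnit <failure> or <error>
# elements in the report files under an artifacts directory.
#   $1 = artifacts directory
#   $2 = element name ("failure" or "error")
count_junit_elements() {
  # Only the report names mentioned in the skill are considered.
  find "$1" -type f \( -name 'integration-test-report-*.xml' -o -name 'junit.xml' \) \
    -exec cat {} + |
    grep -o "<$2[ >/]" |   # opening tags only; attributes like failures="2" do not match
    wc -l |
    tr -d ' '              # some wc implementations pad the count with spaces
}
```

For example, `count_junit_elements test-artifacts/ failure` prints the number of `<failure>` elements found. A text match like this is only a rough count; a real XML parser would be more robust against unusual formatting.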
+
+For each failure:
+- Read the test source code to understand intent
+- Examine the implementation being tested
+- Check `git log --oneline -20` for recent changes
+- Look for patterns across different platforms
+
+Generate the report focusing on **actionable insights** for the oncall engineer:
+- File paths and line numbers for fixes
+- Platform-specific patterns (endianness, timing, etc.)
+- Links to similar past failures if found
+
+Keep the analysis **under 500 words** and emphasize:
+- What broke
+- Why it broke
+- How to fix it
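The word budget above can be checked mechanically after the report is written. This is a rough sketch under the assumption that `wc -w` is an acceptable approximation (it counts Markdown markup and bullet characters as words); `report_within_limit` is a hypothetical name, not something this PR defines.

```shell
# Hypothetical check (not part of this PR): succeed when the report has
# fewer words than the limit.
#   $1 = report path
#   $2 = word limit (defaults to 500)
report_within_limit() {
  words=$(wc -w < "$1" | tr -d ' ')
  [ "$words" -lt "${2:-500}" ]
}
```

Usage: `report_within_limit analysis-report.md || echo "report exceeds the word limit"`.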
+
+## CRITICAL: File Creation Step
+
+You MUST execute this bash command to create the report file:
+
+```bash
+cat > analysis-report.md <<'EOF'
+**🤖 AI Analysis**
+
+**Root Cause**: [your analysis here]
+
+**Evidence**:
+• [your findings]
+
+**Affected Platforms**: [platforms]
+
+**Recommendations**:
+• [actionable fixes]
+
+---
+**Statistics**
+• Total Failures: [count]
+• Failed Jobs: [jobs]
+EOF
+```
+
+DO NOT just summarize your findings - you MUST create the actual file using the bash command above.
+
+This is a required step. The workflow depends on analysis-report.md existing.
@@ -0,0 +1,232 @@
+# Test Failure Analysis with Claude
+
+Automatically analyzes integration test failures with Claude and includes the resulting root cause analysis in the oncall Slack notification.
+
+## Architecture
+
+```
+Integration Tests Run
+  ├── amd64-integration-tests (may fail)
+  ├── arm64-integration-tests (may fail)
+  ├── s390x-integration-tests (may fail)
+  ├── ppc64le-integration-tests (may fail)
+  │
+  ├── collect-failures
+  │    └── Determine which jobs failed
+  │
+  └── analyze-and-notify (reusable workflow)
+       ├── analyze-failures
+       │    ├── Download test artifacts
+       │    ├── Execute /analyze-test-failures skill
+       │    └── Upload analysis-report.md
+       │
+       └── notify
+            ├── Download analysis-report.md
+            └── Post to Slack with AI insights
+```
+
+## How It Works
+
+### 1. Test Failures
+Analysis is triggered when any integration test job fails (e.g., `rhcos-arm64`, `cos-logs`).
+
+### 2. Collect Failures
+The `collect-failures` job identifies which jobs failed and outputs their names as a comma-separated list.
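The workflow YAML for this job is not shown here, so the following is only a sketch of the idea. It assumes job results are passed in as `job=result` pairs, using the values GitHub Actions reports for `needs.<job>.result` (`success`, `failure`, `cancelled`, `skipped`). `collect_failed_jobs` is a hypothetical helper.

```shell
# Hypothetical helper (not part of this PR): print the names of failed jobs
# as a comma-separated list.
#   arguments: one "job=result" pair per job
collect_failed_jobs() {
  failed=""
  for pair in "$@"; do
    job=${pair%%=*}      # text before the first "="
    result=${pair#*=}    # text after the first "="
    if [ "$result" = "failure" ]; then
      # Prepend a comma only when the list already has an entry.
      failed="${failed:+$failed,}$job"
    fi
  done
  printf '%s\n' "$failed"
}
```

Usage: `collect_failed_jobs amd64-integration-tests=failure arm64-integration-tests=success` prints `amd64-integration-tests`. The output has the same shape as the `failed-jobs` input of the reusable workflow.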
+
+### 3. Analyze Failures (Claude Skill)
+The `analyze-failures` job uses `claude-code-base-action` to execute the `/analyze-test-failures` skill:
+
+**The skill (`.claude/commands/analyze-test-failures.md`):**
+- Finds and parses JUnit XML test reports
+- Reads failing test source code
+- Examines implementation code being tested
+- Checks git log for recent changes
+- Identifies platform-specific patterns (arch/OS)
+- Creates `analysis-report.md` with actionable insights
+
+**Claude has access to:**
+- `Skill` - Load and execute the analysis skill
+- `Read` - View source files
+- `Grep` - Search codebase
+- `Glob` - Find files
+- `Bash` - Execute git commands, create reports
+
+### 4. Notify
+The `notify` job posts to Slack (#team-acs-collector-oncall) with:
+- AI-generated root cause analysis
+- Evidence from code and logs
+- Platform-specific patterns detected
+- Actionable recommendations with file:line references
+
+If the analysis fails, the job falls back to a simple notification.
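The fallback can be pictured as a choice between two message bodies. This is a minimal sketch, assuming a missing or empty `analysis-report.md` is what signals a failed analysis; the function name and the fallback wording are hypothetical and not taken from the workflow.

```shell
# Hypothetical helper (not part of this PR): choose the Slack message body.
#   $1 = report path
#   $2 = workflow name
#   $3 = comma-separated failed jobs
slack_message_text() {
  if [ -s "$1" ]; then
    # Report exists and is non-empty: use the AI analysis as the body.
    cat "$1"
  else
    # Report missing or empty: fall back to a plain notification.
    printf '%s failed. Failed jobs: %s\n' "$2" "$3"
  fi
}
```

Usage: `slack_message_text analysis-report.md "Integration Tests" "rhcos-arm64,cos-arm64"`.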
+
+## Files
+
+### Workflows
+- `.github/workflows/integration-tests.yml` - Main integration test workflow
+- `.github/workflows/analyze-and-notify.yml` - Reusable analysis workflow
+
+### Skill
+- `.claude/commands/analyze-test-failures.md` - Claude skill defining analysis logic
+
+## Example Output
+
+**Slack message with AI analysis:**
+```
+@acs-collector-oncall
+
+🤖 AI Analysis
+
+**Root Cause**: NetworkSignalHandler.cpp:245 missing ntohs() call 
+causing UDP checksum failures on ARM64 platforms.
+
+**Evidence**:
+• UDP test failures isolated to arm64 runners (rhcos-arm64, cos-arm64)
+• Checksum comparison uses direct equality without byte order conversion
+• Recent commit abc123f modified network packet handling
+• Tests pass on amd64 where byte order matches
+
+**Affected Platforms**: arm64 (rhcos-arm64, cos-arm64)
+
+**Recommendations**:
+• Fix collector/lib/NetworkSignalHandler.cpp:245 - add ntohs() call
+• Add endianness test to integration suite
+• Review other protocol handlers for similar issues
+
+---
+**Statistics**
+• Total Failures: 2
+• Failed Jobs: rhcos-arm64, cos-arm64
+```
+
+## How It's Different from Manual Analysis
+
+**Before:** Generic notification
+```
+@acs-collector-oncall
+Integration tests failed.
+```
+
+**After:** Actionable analysis with Claude
+- Specific file and line number to fix
+- Root cause explanation based on code analysis
+- Platform/architecture pattern detection
+- Correlation of recent git changes with failures
+- Concrete next steps
+
+## Testing
+
+### Test on a PR
+
+Add the label `test-oncall-workflow` to any PR to trigger the workflow.
+
+**What happens:**
+- Workflow runs with empty test artifacts
+- Claude analyzes and generates a report
+- Report is uploaded as artifact
+- **Slack notification is skipped** (only runs on actual test failures)
+
+**Use case:** Verify Claude analysis executes without spamming Slack.
+
+**To verify it worked:**
+1. Check the workflow run in Actions tab
+2. Download the `failure-analysis` artifact to see the generated report
+
+### Test with Real Failures
+
+The best test is observing the workflow on actual test failures:
+1. Wait for integration tests to fail naturally
+2. Check #team-acs-collector-oncall for the AI analysis
+3. Verify the analysis is helpful and actionable
+
+## Configuration
+
+### Vertex AI Region
+Set in `.github/workflows/analyze-and-notify.yml`:
+```yaml
+env:
+  CLOUD_ML_REGION: us-east5
+```
+
+### Required Secrets
+
+Already configured:
+- `GCP_CLAUDE_SERVICE_ACCOUNT_KEY` - Service account JSON for Vertex AI
+- `GCP_CLAUDE_PROJECT_ID` - GCP project ID
+- `SLACK_COLLECTOR_ONCALL_WEBHOOK` - Slack webhook URL
+
+### Allowed Tools
+
+Claude has access to these tools for investigation:
+```yaml
+allowed_tools: "Skill,Read,Grep,Glob,Bash"
+```
+
+### Reusable Workflow Inputs
+
+The `analyze-and-notify.yml` workflow accepts:
+- `failed-jobs` - Comma-separated list of failed job names
+- `workflow-name` - Name of the workflow that failed
+
+## Troubleshooting
+
+### No Analysis Report Generated
+
+**Check:**
+1. The Claude action step logs: did the step execute successfully?
+2. The "Check if analysis report was created" step: does the file exist?
+3. The skill file exists at `.claude/commands/analyze-test-failures.md`
+4. The `Skill` tool is listed in `allowed_tools`
+
+### Vertex AI Errors
+
+**Common issues:**
+- Model not available in configured region
+- Service account lacks `roles/aiplatform.user` permission
+- `GCP_CLAUDE_PROJECT_ID` secret not set correctly
+
+**Solution:**
+Check Claude action logs for specific error details.
+
+### No Slack Notification
+
+**Check:**
+1. `SLACK_COLLECTOR_ONCALL_WEBHOOK` secret is set
+2. Notify job logs show download step succeeded
+3. Webhook URL is valid
+
+### Analysis Quality Issues
+
+**If Claude's analysis is not helpful:**
+1. Check that test artifacts are being uploaded correctly
+2. Verify JUnit XML format is valid
+3. Update skill instructions in `.claude/commands/analyze-test-failures.md`
+4. The skill can be iterated on independently of the workflow
+
+## Local Development
+
+### Test the Skill Locally
+
+```bash
+# Requires Claude CLI installed
+claude /analyze-test-failures test-artifacts/ "Integration Tests" "rhcos-arm64,cos"
+```
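A local run needs at least one report to analyze. The sketch below writes a minimal JUnit file with a single failing test case, using the `junit.xml` file name the skill searches for. The helper name and the XML contents are made up for illustration.

```shell
# Hypothetical fixture (not part of this PR): write a minimal JUnit report
# with one failing test case into the given directory.
#   $1 = artifacts directory (created if it does not exist)
make_sample_report() {
  mkdir -p "$1"
  cat > "$1/junit.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="sample" tests="1" failures="1" errors="0">
  <testcase classname="sample.Suite" name="TestExample">
    <failure message="expected 1, got 2">sample stack trace</failure>
  </testcase>
</testsuite>
EOF
}
```

Usage: run `make_sample_report test-artifacts` before invoking the skill so it has a failure to parse.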
+
+### Update the Skill
+
+Edit `.claude/commands/analyze-test-failures.md` to:
+- Change analysis instructions
+- Update report format
+- Add new investigation steps
+- Modify recommendations structure
+
+Changes take effect on the next workflow run - no workflow YAML changes needed.
+
+## Future Enhancements
+
+- [ ] Correlate failures with specific PR/commit
+- [ ] Track failure patterns over time
+- [ ] Link to similar historical failures
+- [ ] Auto-create issues for recurring failures
+- [ ] Support for other test frameworks beyond JUnit XML
+- [ ] Integration with test retries/flakiness detection