115 changes: 115 additions & 0 deletions .claude/commands/analyze-test-failures.md
@@ -0,0 +1,115 @@
---
name: analyze-test-failures
description: Analyze test failure artifacts and generate root cause analysis report
---

# Test Failure Analysis

Analyze test failures from CI artifacts and generate a concise root cause analysis for the oncall team.

## Usage

```
/analyze-test-failures <artifacts-dir> <workflow-name> <failed-jobs>
```

**Arguments:**
- `artifacts-dir`: Directory containing test artifacts (default: test-artifacts/)
- `workflow-name`: Name of the workflow that failed (e.g., "Integration Tests")
- `failed-jobs`: Comma-separated list of failed job names

**Example:**
```
/analyze-test-failures test-artifacts/ "Integration Tests" "amd64-integration-tests,arm64-integration-tests"
```

## What This Does

1. **Find test reports**: Searches for JUnit XML files (integration-test-report-*.xml, junit.xml)
2. **Parse failures**: Extracts test names, error messages, stack traces
3. **Investigate code**: Reads failing test source and implementation code
4. **Check git history**: Looks for recent changes that may have caused failures
5. **Identify patterns**: Detects platform-specific issues (arch/OS)
6. **Generate report**: Creates analysis-report.md with findings

## Report Format

The generated `analysis-report.md` contains:

```markdown
**🤖 AI Analysis**

**Root Cause**: [1-2 sentence summary with file:line references]

**Evidence**:
• [Specific code observations]
• [Patterns across failures]
• [Recent changes correlation]

**Affected Platforms**: [Architectures/OS if pattern found]

**Recommendations**:
• [Specific file:line to fix with suggested change]
• [Additional investigation needed]
• [Prevention strategy]

---
**Statistics**
• Total Failures: [count]
• Total Errors: [count]
• Failed Jobs: [list]
```

## Implementation

Start by finding and parsing test reports:

```bash
# Find all XML test reports under the artifacts directory
# (substitute the <artifacts-dir> argument for test-artifacts/)
find test-artifacts/ -type f -name "*.xml"
```
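
As a rough illustration of the "find and parse" step, the sketch below counts `<failure>` and `<error>` elements for the Statistics section. The function names `find_reports` and `count_elements` are hypothetical, the file patterns are the ones listed under "What This Does", and the grep-based count is only an approximation of a real XML parse.

```bash
# Sketch only: helper names are illustrative, not part of the skill.

# Print the path of every JUnit report under the given directory.
find_reports() {
  find "$1" -type f \( -name 'integration-test-report-*.xml' -o -name 'junit.xml' \)
}

# Print how many opening tags named $1 (e.g. failure, error) appear
# in the reports found under directory $2.
count_elements() {
  tag=$1
  dir=$2
  find_reports "$dir" | {
    total=0
    while IFS= read -r file; do
      # grep -o prints one line per match, so wc -l counts occurrences
      n=$(grep -o "<${tag}[ >/]" "$file" | wc -l | tr -d ' ')
      total=$((total + n))
    done
    echo "$total"
  }
}
```

For example, `count_elements failure test-artifacts/` would give a value for "Total Failures" and `count_elements error test-artifacts/` one for "Total Errors".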

For each failure:
- Read the test source code to understand intent
- Examine the implementation being tested
- Check `git log --oneline -20` for recent changes
- Look for patterns across different platforms
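
One simple way to start looking for platform patterns is to reduce the `failed-jobs` argument to a list of architectures. The sketch below assumes job names follow the `<arch>-integration-tests` shape from the usage example. The function name `archs_from_jobs` is hypothetical, and names that do not match that shape pass through unchanged.

```bash
# Sketch only: turn "amd64-integration-tests,arm64-integration-tests"
# into "amd64,arm64" (sorted, de-duplicated).
archs_from_jobs() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed -e 's/^ *//' -e 's/ *$//' -e 's/-integration-tests$//' \
    | sort -u \
    | paste -sd, -
}
```

If only one architecture comes back while others ran and passed, that may hint at a platform-specific cause worth noting under "Affected Platforms".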

Generate the report focusing on **actionable insights** for the oncall engineer:
- File paths and line numbers for fixes
- Platform-specific patterns (endianness, timing, etc.)
- Links to similar past failures if found

Keep the analysis **under 500 words** and emphasize:
- What broke
- Why it broke
- How to fix it

## CRITICAL: File Creation Step

You MUST execute this bash command to create the report file:

```bash
cat > analysis-report.md <<'EOF'
**🤖 AI Analysis**

**Root Cause**: [your analysis here]

**Evidence**:
• [your findings]

**Affected Platforms**: [platforms]

**Recommendations**:
• [actionable fixes]

---
**Statistics**
• Total Failures: [count]
• Total Errors: [count]
• Failed Jobs: [jobs]
EOF
```

DO NOT just summarize your findings - you MUST create the actual file using the bash command above.

This is a required step. The workflow depends on analysis-report.md existing.
232 changes: 232 additions & 0 deletions .github/scripts/README.md
Collaborator


Why is this in .github/scripts? Also, is this just a description of what analyze-test-failures.md does? Do we need 200+ lines of markdown to explain what a separate 100+ line markdown file does?

@@ -0,0 +1,232 @@
# Test Failure Analysis with Claude

Automatically analyzes test failures using Claude AI and includes intelligent insights in Slack notifications.

## Architecture

```
Integration Tests Run
├── amd64-integration-tests (may fail)
├── arm64-integration-tests (may fail)
├── s390x-integration-tests (may fail)
├── ppc64le-integration-tests (may fail)
├── collect-failures
│   └── Determine which jobs failed
└── analyze-and-notify (reusable workflow)
    ├── analyze-failures
    │   ├── Download test artifacts
    │   ├── Execute /analyze-test-failures skill
    │   └── Upload analysis-report.md
    └── notify
        ├── Download analysis-report.md
        └── Post to Slack with AI insights
```

## How It Works

### 1. Test Failures
An integration test job fails (e.g., `rhcos-arm64`, `cos-logs`).

### 2. Collect Failures
The `collect-failures` job identifies which jobs failed and outputs the list
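
The actual job logic is not shown in this README, so the following is only a sketch of how such a list could be assembled. It assumes the job receives hypothetical `name=result` pairs, where the result is one of the values GitHub Actions reports for `needs.<job>.result` (`success`, `failure`, `cancelled`, `skipped`). The function name `collect_failed_jobs` is illustrative.

```bash
# Sketch only: print a comma-separated list of the jobs whose result
# is "failure", given arguments of the form name=result.
collect_failed_jobs() {
  failed=""
  for pair in "$@"; do
    name=${pair%%=*}     # text before the first "="
    result=${pair#*=}    # text after the first "="
    if [ "$result" = "failure" ]; then
      failed="${failed:+$failed,}$name"
    fi
  done
  printf '%s\n' "$failed"
}
```

Called as `collect_failed_jobs "amd64-integration-tests=success" "arm64-integration-tests=failure"`, it prints `arm64-integration-tests`, which has the same shape as the `failed-jobs` input of the reusable workflow.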

### 3. Analyze Failures (Claude Skill)
Uses `claude-code-base-action` to execute the `/analyze-test-failures` skill:

**The skill (`.claude/commands/analyze-test-failures.md`):**
- Finds and parses JUnit XML test reports
- Reads failing test source code
- Examines implementation code being tested
- Checks git log for recent changes
- Identifies platform-specific patterns (arch/OS)
- Creates `analysis-report.md` with actionable insights

**Claude has access to:**
- `Skill` - Load and execute the analysis skill
- `Read` - View source files
- `Grep` - Search codebase
- `Glob` - Find files
- `Bash` - Execute git commands, create reports

### 4. Notify
Posts to Slack (#team-acs-collector-oncall) with:
- AI-generated root cause analysis
- Evidence from code and logs
- Platform-specific patterns detected
- Actionable recommendations with file:line references

Falls back to a simple notification if the analysis fails.
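
The fallback can be pictured as a check on whether the report file exists and is non-empty. This is a minimal sketch under that assumption, not the workflow's actual step. The function name `build_message` and the fallback wording are hypothetical.

```bash
# Sketch only: print the Slack message body.
# $1 = path to the analysis report, $2 = workflow name.
build_message() {
  report=$1
  workflow=$2
  if [ -s "$report" ]; then
    # Report exists and is non-empty: use the AI analysis as the body.
    cat "$report"
  else
    # Missing or empty report: fall back to a generic notification.
    printf '%s failed. No AI analysis is available.\n' "$workflow"
  fi
}
```

Using `-s` rather than `-f` also treats an empty report as a failed analysis, which avoids posting a blank message.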

## Files

### Workflows
- `.github/workflows/integration-tests.yml` - Main integration test workflow
- `.github/workflows/analyze-and-notify.yml` - Reusable analysis workflow

### Skill
- `.claude/commands/analyze-test-failures.md` - Claude skill defining analysis logic

## Example Output

**Slack message with AI analysis:**
```
@acs-collector-oncall

🤖 AI Analysis

**Root Cause**: NetworkSignalHandler.cpp:245 missing ntohs() call
causing UDP checksum failures on ARM64 platforms.

**Evidence**:
• UDP test failures isolated to arm64 runners (rhcos-arm64, cos-arm64)
• Checksum comparison uses direct equality without byte order conversion
• Recent commit abc123f modified network packet handling
• Tests pass on amd64 where byte order matches

**Affected Platforms**: arm64 (rhcos-arm64, cos-arm64, ubuntu-arm)

**Recommendations**:
• Fix collector/lib/NetworkSignalHandler.cpp:245 - add ntohs() call
• Add endianness test to integration suite
• Review other protocol handlers for similar issues

---
**Statistics**
• Total Failures: 2
• Failed Jobs: rhcos-arm64, cos-arm64
```

## How It's Different from Manual Analysis

**Before:** Generic notification
```
@acs-collector-oncall
Integration tests failed.
```

**After:** Actionable analysis with Claude
- Specific file and line number to fix
- Root cause explanation based on code analysis
- Platform/architecture pattern detection
- Links recent git changes to failures
- Provides concrete next steps

## Testing

### Test on a PR

Add the label `test-oncall-workflow` to any PR to trigger the workflow.

**What happens:**
- Workflow runs with empty test artifacts
- Claude analyzes and generates a report
- Report is uploaded as artifact
- **Slack notification is skipped** (only runs on actual test failures)

**Use case:** Verify Claude analysis executes without spamming Slack.

**To verify it worked:**
1. Check the workflow run in Actions tab
2. Download the `failure-analysis` artifact to see the generated report

### Test with Real Failures

The best test is observing the workflow on actual test failures:
1. Wait for integration tests to fail naturally
2. Check #team-acs-collector-oncall for the AI analysis
3. Verify the analysis is helpful and actionable

## Configuration

### Vertex AI Region
Set in `.github/workflows/analyze-and-notify.yml`:
```yaml
env:
  CLOUD_ML_REGION: us-east5
```

### Required Secrets

Already configured:
- `GCP_CLAUDE_SERVICE_ACCOUNT_KEY` - Service account JSON for Vertex AI
- `GCP_CLAUDE_PROJECT_ID` - GCP project ID
- `SLACK_COLLECTOR_ONCALL_WEBHOOK` - Slack webhook URL

### Allowed Tools

Claude has access to these tools for investigation:
```yaml
allowed_tools: "Skill,Read,Grep,Glob,Bash"
```

### Reusable Workflow Inputs

The `analyze-and-notify.yml` workflow accepts:
- `failed-jobs` - Comma-separated list of failed job names
- `workflow-name` - Name of the workflow that failed

## Troubleshooting

### No Analysis Report Generated

**Check:**
1. Claude action step logs - did it execute successfully?
2. "Check if analysis report was created" step - does file exist?
3. Skill file exists at `.claude/commands/analyze-test-failures.md`
4. `Skill` tool is in `allowed_tools`

### Vertex AI Errors

**Common issues:**
- Model not available in configured region
- Service account lacks `roles/aiplatform.user` permission
- `GCP_CLAUDE_PROJECT_ID` secret not set correctly

**Solution:**
Check Claude action logs for specific error details.

### No Slack Notification

**Check:**
1. `SLACK_COLLECTOR_ONCALL_WEBHOOK` secret is set
2. Notify job logs show download step succeeded
3. Webhook URL is valid
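
Checks 1 and 3 can be approximated locally before digging through job logs. The sketch below assumes the secret holds a standard Slack webhook URL on `hooks.slack.com`. The function name `check_webhook` is hypothetical, and passing the check does not prove that Slack will accept the request.

```bash
# Sketch only: sanity-check a webhook URL passed as $1.
check_webhook() {
  case "${1:-}" in
    "")
      echo "webhook is not set"
      return 1
      ;;
    https://hooks.slack.com/*)
      echo "webhook looks like a Slack webhook URL"
      return 0
      ;;
    *)
      echo "webhook does not look like a Slack webhook URL"
      return 1
      ;;
  esac
}
```

In a workflow step this could be run as `check_webhook "$SLACK_COLLECTOR_ONCALL_WEBHOOK"`, assuming the secret is exported under that environment variable name.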

### Analysis Quality Issues

**If Claude's analysis is not helpful:**
1. Check that test artifacts are being uploaded correctly
2. Verify JUnit XML format is valid
3. Update skill instructions in `.claude/commands/analyze-test-failures.md`
4. The skill can be iterated on independently of the workflow

## Local Development

### Test the Skill Locally

```bash
# Requires Claude CLI installed
claude /analyze-test-failures test-artifacts/ "Integration Tests" "rhcos-arm64,cos"
```

### Update the Skill

Edit `.claude/commands/analyze-test-failures.md` to:
- Change analysis instructions
- Update report format
- Add new investigation steps
- Modify recommendations structure

Changes take effect on the next workflow run - no workflow YAML changes needed.

## Future Enhancements

- [ ] Correlate failures with specific PR/commit
- [ ] Track failure patterns over time
- [ ] Link to similar historical failures
- [ ] Auto-create issues for recurring failures
- [ ] Support for other test frameworks beyond JUnit XML
- [ ] Integration with test retries/flakiness detection