An informal, cumulative, and competitive frontier model eval using a JavaScript chess engine.
Assume A is currently the leading engine (initially 0000_original). A model/CLI is selected to improve it by creating a new engine B via prompt.md. If a B v A SPRT (sequential probability ratio test) passes, B becomes the new leader; otherwise A remains the leader. For example, 0002_sonnet_4_6 was derived from 0000_original, not 0001_haiku_4_5, because 0001_haiku_4_5 failed its SPRT.

         /---> 0001          /---> 0004
    0000 ---> 0002 ---> 0003 ---> 0005 ---> 0006 etc.

See bin/sprt.
| Engine | Diff | Model | CLI | SPRT |
|---|---|---|---|---|
| 0008_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | ✓ |
| 0007_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0006_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | ✓ |
| 0005_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0004_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | ✗ |
| 0003_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0002_sonnet_4_6 | Δ | Anthropic Claude Sonnet 4.6 | Claude Code | ✓ |
| 0001_haiku_4_5 | Δ | Anthropic Claude Haiku 4.5 | Claude Code | ✗ |
| 0000_original | | | | |
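
The leader-selection rule can be sketched in a few lines. This is only an illustration, not code from this repository: the `lineage` function and the `attempts` array are hypothetical names, and the pass/fail values are copied from the SPRT column of the table above.

```javascript
// Hypothetical helper (not part of this repo): derive each engine's parent
// from the order of attempts and their SPRT outcomes.
const attempts = [
  { name: "0001_haiku_4_5", passed: false },
  { name: "0002_sonnet_4_6", passed: true },
  { name: "0003_opus_4_7", passed: true },
  { name: "0004_gpt_5_5", passed: false },
  { name: "0005_opus_4_7", passed: true },
  { name: "0006_gpt_5_5", passed: true },
  { name: "0007_opus_4_7", passed: true },
  { name: "0008_opus_4_8", passed: true },
];

function lineage(initialLeader, attempts) {
  let leader = initialLeader;
  const parents = {};
  for (const { name, passed } of attempts) {
    parents[name] = leader;    // B is always derived from the current leader A
    if (passed) leader = name; // B replaces A only if the B v A SPRT passes
  }
  return { leader, parents };
}

const result = lineage("0000_original", attempts);
console.log(result.leader);
console.log(result.parents["0002_sonnet_4_6"]);
```

Under these assumptions a failed engine never becomes a parent, which is why the failed attempts appear as dead-end branches in the diagram.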
| Rank | Engine | Elo | Games | Score | Draws |
|---|---|---|---|---|---|
| 1 | 0008_opus_4_8 | 2164 ±17.21 | 1600 | 73.0% | 27.4% |
| 2 | 0007_opus_4_7 | 2148 ±17.39 | 1600 | 71.1% | 26.9% |
| 3 | 0006_gpt_5_5 | 2060 ±15.50 | 1600 | 59.6% | 30.0% |
| 4 | 0005_opus_4_7 | 2026 ±15.53 | 1600 | 55.0% | 31.0% |
| 5 | 0003_opus_4_7 | 2019 ±15.96 | 1600 | 53.9% | 30.8% |
| 6 | 0004_gpt_5_5 | 2013 ±15.92 | 1600 | 53.0% | 27.6% |
| 7 | 0002_sonnet_4_6 | 1900 ±16.55 | 1600 | 37.0% | 26.6% |
| 8 | 0000_original | 1800 ±18.66 | 1600 | 24.8% | 21.9% |
| 9 | 0001_haiku_4_5 | 1778 ±18.42 | 1600 | 22.6% | 21.6% |
See bin/tourny.
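
For readers unfamiliar with Elo, the standard logistic model relates a rating difference to an expected head-to-head score. The sketch below assumes that model; whether bin/tourny or fastchess computes the ratings in exactly this way is an assumption, and the function names are hypothetical. Note that the Score column above is measured against the whole field, so it does not map directly onto a single pairwise rating difference.

```javascript
// Standard logistic Elo model (assumed here; not taken from this repo).
// Expected score of A against B, given both ratings.
function expectedScore(eloA, eloB) {
  return 1 / (1 + Math.pow(10, (eloB - eloA) / 400));
}

// Inverse: the Elo difference implied by a head-to-head score in (0, 1).
function eloDiffFromScore(score) {
  return -400 * Math.log10(1 / score - 1);
}

// Ratings copied from the ranking table: 0008_opus_4_8 vs 0007_opus_4_7.
const expected = expectedScore(2164, 2148);
console.log(expected.toFixed(3));
console.log(eloDiffFromScore(expected).toFixed(1));
```

A 16-point gap corresponds to an expected score only slightly above 50%, which is smaller than the ± margins reported for either engine, so the top two places are close.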
- There is a Windows executable for each engine in `./engines` for anybody who is interested.
- https://github.com/Disservin/fastchess - SPRT and tournament manager