op12no2/patchwork

Patchwork

An informal, cumulative, and competitive eval of frontier models using a JavaScript chess engine.

Procedure

Assume A is the current leading engine (initially 0000_original). A model/CLI pair is selected to improve A by creating a new engine B via prompt.md. If B passes an SPRT against A, B becomes the new leader; otherwise A stays the leader and the next candidate is again derived from A. For example, 0002_sonnet_4_6 was derived from 0000_original, not from 0001_haiku_4_5.

    /---> 0001          /---> 0004
0000 ---> 0002 ---> 0003 ---> 0005 ---> 0006 etc.
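
The branching in the diagram can be sketched as a small loop: each candidate is derived from whoever leads at the time, and only a passing SPRT moves the lead. This is an illustrative sketch, not code from the repository; `buildLineage` and the `passed` flags are hypothetical, and the pass/fail values are only inferred from the diagram (0001 and 0004 have no descendants).

```javascript
// Hypothetical helper: derive each candidate's parent from SPRT outcomes.
// `results` lists candidates in creation order; `passed` says whether the
// candidate's SPRT against the then-current leader passed.
function buildLineage(initialLeader, results) {
  let leader = initialLeader;
  const lineage = [];
  for (const { engine, passed } of results) {
    // Every candidate is derived from the leader at the time it is created.
    lineage.push({ engine, parent: leader, promoted: passed });
    // Only a passing SPRT promotes the candidate to leader.
    if (passed) leader = engine;
  }
  return { leader, lineage };
}

// Outcomes inferred from the diagram (assumption, not taken from SPRT logs).
const { leader, lineage } = buildLineage("0000", [
  { engine: "0001", passed: false },
  { engine: "0002", passed: true },
  { engine: "0003", passed: true },
  { engine: "0004", passed: false },
  { engine: "0005", passed: true },
]);

for (const { engine, parent } of lineage) {
  console.log(`${engine} derived from ${parent}`);
}
console.log(`leader: ${leader}`);
```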

See bin/sprt.
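
For readers unfamiliar with SPRT (sequential probability ratio test), here is a minimal sketch of one common formulation used in engine testing: a log-likelihood ratio computed from win/draw/loss counts with a normal approximation, compared against bounds derived from the error rates. The function names, the `elo0`/`elo1` hypotheses, and `alpha`/`beta` are all assumptions for illustration; bin/sprt may use different settings or a different formulation.

```javascript
// Expected score for an Elo difference under the logistic model.
function eloToScore(elo) {
  return 1 / (1 + Math.pow(10, -elo / 400));
}

// Log-likelihood ratio of H1 (true difference is elo1) against H0 (elo0),
// using a normal approximation to the per-game score distribution.
function sprtLlr(wins, draws, losses, elo0, elo1) {
  const n = wins + draws + losses;
  if (n === 0) return 0;
  const w = wins / n;
  const d = draws / n;
  const mean = w + d / 2;                   // observed score per game
  const variance = w + d / 4 - mean * mean; // per-game score variance
  // Degenerate sample (every game had the same result): the approximation
  // is undefined, so report no evidence either way.
  if (variance <= 0) return 0;
  const p0 = eloToScore(elo0);
  const p1 = eloToScore(elo1);
  return ((p1 - p0) * (2 * mean - p0 - p1)) / ((2 * variance) / n);
}

// Compare the LLR against Wald's bounds for error rates alpha and beta.
function sprtDecision(llr, alpha, beta) {
  const lower = Math.log(beta / (1 - alpha));
  const upper = Math.log((1 - beta) / alpha);
  if (llr >= upper) return "pass";
  if (llr <= lower) return "fail";
  return "continue";
}

// Made-up counts for illustration only; these are not results from this repo.
const exampleLlr = sprtLlr(300, 100, 100, 0, 10);
console.log(exampleLlr.toFixed(2), sprtDecision(exampleLlr, 0.05, 0.05));
```

The test is sequential: games are played in batches and the decision is re-evaluated after each batch, stopping as soon as it is no longer "continue".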

Progress

| Engine | Diff | Model | CLI | SPRT |
|---|---|---|---|---|
| 0008_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | |
| 0007_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0006_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | |
| 0005_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0004_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | |
| 0003_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0002_sonnet_4_6 | Δ | Anthropic Claude Sonnet 4.6 | Claude Code | |
| 0001_haiku_4_5 | Δ | Anthropic Claude Haiku 4.5 | Claude Code | |
| 0000_original | | | | |

Tournament

| Rank | Engine | Elo | Games | Score | Draws |
|---|---|---|---|---|---|
| 1 | 0008_opus_4_8 | 2164 ±17.21 | 1600 | 73.0% | 27.4% |
| 2 | 0007_opus_4_7 | 2148 ±17.39 | 1600 | 71.1% | 26.9% |
| 3 | 0006_gpt_5_5 | 2060 ±15.50 | 1600 | 59.6% | 30.0% |
| 4 | 0005_opus_4_7 | 2026 ±15.53 | 1600 | 55.0% | 31.0% |
| 5 | 0003_opus_4_7 | 2019 ±15.96 | 1600 | 53.9% | 30.8% |
| 6 | 0004_gpt_5_5 | 2013 ±15.92 | 1600 | 53.0% | 27.6% |
| 7 | 0002_sonnet_4_6 | 1900 ±16.55 | 1600 | 37.0% | 26.6% |
| 8 | 0000_original | 1800 ±18.66 | 1600 | 24.8% | 21.9% |
| 9 | 0001_haiku_4_5 | 1778 ±18.42 | 1600 | 22.6% | 21.6% |

See bin/tourny.
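
As a reading aid for the Score and Draws columns, the sketch below splits a score into win/draw/loss fractions and converts a score into an Elo difference under the logistic model. It assumes Score means (wins + draws / 2) / games, which the table does not state, and the helper names are hypothetical. It does not reproduce how bin/tourny computes the Elo column.

```javascript
// Elo difference implied by a score fraction under the logistic model.
function scoreToEloDiff(score) {
  return -400 * Math.log10(1 / score - 1);
}

// Split a score into win/draw/loss fractions,
// assuming score = wins + draws / 2 (assumption, see lead-in).
function splitScore(score, draws) {
  const wins = score - draws / 2;
  const losses = 1 - draws - wins;
  return { wins, draws, losses };
}

// Top row of the table: 73.0% score, 27.4% draws.
const topRow = splitScore(0.730, 0.274);
console.log(
  `wins ${(topRow.wins * 100).toFixed(1)}%,`,
  `draws ${(topRow.draws * 100).toFixed(1)}%,`,
  `losses ${(topRow.losses * 100).toFixed(1)}%`
);
console.log(`implied Elo difference: ${scoreToEloDiff(0.730).toFixed(1)}`);
```

Under that assumption, the top engine's 73.0% score with 27.4% draws corresponds to roughly 59.3% wins and 13.3% losses.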

Notes

  • A Windows executable for each engine is available in ./engines.

Acknowledgements
