Save arena battles and ELO ratings as artifacts #63
Open
kargibora wants to merge 1 commit into
Problem
The arena ELO pipeline (estimate_elo_ratings.main) computed Bradley-Terry ratings but saved nothing usable. The ratings were only printed and returned (the CLI discards the return value), and the per-battle data lived only in the internal cache/elo/*.csv.zip recompute-skipping layer. There was no battle file and no ratings file, so a run could not be visualized or its ELO recomputed without rerunning the whole GPU job.
The pairwise pipeline already writes a results folder; the arena pipeline did not.
Solution
Persist the run as a small set of artifacts, built around the battle as the atomic unit: ELO is a pure function of a list of battles, so saving the battles (plus the bootstrap ratings) is enough to reconstruct or re-analyse a run.
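To illustrate why a saved battle list is sufficient, here is a minimal sketch that derives Elo-scale Bradley-Terry ratings from nothing but battle records. It is not the repository's compute_bradley_terry: the function name fit_bradley_terry, the iterative update, the 400/1000 scaling, and the choice to count every tie label as half a win for each side are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, n_iter=200, scale=400.0, base=1000.0):
    """Hypothetical helper: Elo-scale Bradley-Terry ratings from battle dicts.

    Each battle needs "model_a", "model_b" and "winner" keys. Assumes two
    distinct models per battle.
    """
    wins = defaultdict(float)   # model -> (fractional) number of wins
    games = defaultdict(float)  # (model, model) sorted pair -> battles played
    for rec in battles:
        a, b = rec["model_a"], rec["model_b"]
        games[(a, b) if a < b else (b, a)] += 1.0
        if rec["winner"] == "model_a":
            wins[a] += 1.0
        elif rec["winner"] == "model_b":
            wins[b] += 1.0
        else:
            # Assumption: any other label is a tie, worth half a win each.
            wins[a] += 0.5
            wins[b] += 0.5

    models = sorted({m for pair in games for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        # Minorise-maximise update: p_i = W_i / sum_j n_ij / (p_i + p_j).
        denom = defaultdict(float)
        for (a, b), n in games.items():
            d = n / (strength[a] + strength[b])
            denom[a] += d
            denom[b] += d
        # Floor the win count so a winless model keeps a finite rating.
        new = {m: max(wins[m], 1e-9) / denom[m] for m in models}
        # Fix the scale by normalising the geometric mean to 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}

    return {m: base + scale * math.log10(strength[m]) for m in models}
```

Because the function depends only on its input list, calling it again on the same saved battles gives the same ratings, which is the property the artifacts rely on.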
New module judgearena/battles.py:
- Battle: one outcome (model_a, model_b, winner, source, question_id?, judge_model?). from_dict ignores unknown keys, so old files keep loading after the schema grows.
- write_battles / read_battles / battles_to_frame: JSONL round-trip plus the bridge to the existing compute_bradley_terry.
- RatingEntry / EloReport: the leaderboard (mean + bootstrap CI per model) and run metadata.
estimate_elo_ratings.main now writes a results folder that includes battles.jsonl and elo_ratings.json.
The ELO math is unchanged; the combined battle frame just gained source / question_id / judge_model columns that compute_bradley_terry ignores.
Design notes
- battles.jsonl inlines the human arena battles (not just the judged ones) so the file is self-contained and ELO is recomputable from it alone. This is why it is large (~16 MB / ~87k rows for LMArena-100k).
- source vocabularies: Battle.source is llm-judge | human (who decided the outcome); RatingEntry.source is evaluated | human (the model under test vs. every other model on the board).
Example output
Run: Qwen2.5-1.5B-Instruct placed into LMArena-100k, judged by Qwen3-8B, 100 battles, 100 bootstraps.
Leaderboard (printed and saved):
battles.jsonl (87,331 rows = 100 llm-judge + 87,231 human):
{"model_a": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "model_b": "gpt-3.5-turbo-0125", "winner": "model_a", "source": "llm-judge", "question_id": "4c6978df...", "judge_model": "VLLM/Qwen/Qwen3-8B"}
{"model_a": "claude-3-5-sonnet-20240620", "model_b": "gpt-3.5-turbo-0125", "winner": "tie (bothbad)", "source": "human", "question_id": "1a2b3c4d...", "judge_model": null}
elo_ratings.json:
{
  "arena": "LMArena-100k",
  "model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
  "judge_model": "VLLM/Qwen/Qwen3-8B",
  "n_bootstraps": 100,
  "seed": 0,
  "ratings": [
    {"model": "chatgpt-4o-latest", "rating": 1114.3, "ci_low": 1105.9, "ci_high": 1121.8, "n_battles": 4697, "source": "human"},
    {"model": "VLLM/Qwen/Qwen2.5-1.5B-Instruct", "rating": 686.2, "ci_low": 578.8, "ci_high": 780.4, "n_battles": 100, "source": "evaluated"}
  ]
}
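For reviewers who want a feel for the record format, the following is a rough sketch of how a Battle record with a forward-compatible from_dict and a JSONL round-trip could look. The names Battle, from_dict, write_battles and read_battles come from this PR, but the bodies below are assumptions written for illustration and may differ from judgearena/battles.py (field defaults, argument order and encoding in particular are guesses).

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass
class Battle:
    """One battle outcome; the optional fields default to None (assumed)."""

    model_a: str
    model_b: str
    winner: str
    source: str
    question_id: Optional[str] = None
    judge_model: Optional[str] = None

    @classmethod
    def from_dict(cls, raw):
        # Drop keys this version does not know, so files written by a
        # newer schema still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


def write_battles(path, battles):
    """Write one JSON object per line (argument order is an assumption)."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for battle in battles:
            fh.write(json.dumps(asdict(battle)) + "\n")


def read_battles(path):
    """Read battles back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [Battle.from_dict(json.loads(line)) for line in fh if line.strip()]
```

Filtering on the dataclass field names keeps the reader tolerant of extra columns without needing a schema version field; the trade-off is that a misspelled key is silently ignored instead of raising.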