Evaluating Language Model Reasoning with Puzzle Duels
Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark where LLMs challenge each other by creating and solving programming puzzles with no human-authored problems required.
Two models take turns as proposer and solver. The proposer crafts a Python function `mystery(x)` and provides a secret sample solution. The solver must find any input that makes it return `True`. Solutions are verified simply by running the code.
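To make the format concrete, here is a hand-made sketch of what a proposed puzzle and its verification could look like. The puzzle, the names, and the harness are all illustrative and not taken from TTG itself.

```python
# Illustrative sketch only: a toy puzzle in the mystery(x) format and a
# verification step that just executes the code. Not TTG's actual harness.

def mystery(x: str) -> bool:
    # Find a string of length 7 that starts with "ab".
    return len(x) * 3 == 21 and x.startswith("ab")

sample_solution = "abcdefg"   # the proposer's secret sample solution
solver_answer = "abzzzzz"     # a hypothetical answer from the solver

# Verification is simply running the code: an answer counts iff mystery(...) is True.
assert mystery(sample_solution)
print("solver succeeded:", mystery(solver_answer))
```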
Unlike static benchmarks, TTG cannot be saturated — stronger models can always design harder puzzles. It incentivizes creativity (recycling known problems is suboptimal since opponents may know them too) and tests self-calibration (proposing a puzzle you can't solve yourself is penalized), allowing us to gauge a model's overconfidence and tendency to hallucinate.
TTG uses Programming Puzzles (a Python function that returns a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
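As a rough illustration of that range (these are made-up examples, not puzzles from the benchmark), the same return-a-boolean shape covers both a trivial string constraint and a subset-sum instance:

```python
# Made-up examples of the format's range; neither puzzle is from TTG.

def easy_puzzle(s: str) -> bool:
    # Simple string constraint: a five-character lowercase palindrome.
    return len(s) == 5 and s == s[::-1] and s.islower()

def subset_sum_puzzle(indices: list) -> bool:
    # NP-complete flavour: choose distinct indices into `nums` summing to 9.
    nums = [3, 34, 4, 12, 5, 2]
    return len(set(indices)) == len(indices) and sum(nums[i] for i in indices) == 9

print(easy_puzzle("level"))        # True
print(subset_sum_puzzle([2, 4]))   # nums[2] + nums[4] == 4 + 5 == 9 -> True
```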
Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it, then they swap. A proposer scores if it provides a valid puzzle with a correct solution that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, it's a draw.
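These rules amount to a small decision procedure. The sketch below is our reading of the scoring paragraph, with hypothetical function names; in particular, checking the proposer's sample solution before the solver's answer is an assumption, not a detail stated above.

```python
# Hedged sketch of the round-scoring rule described above; all names hypothetical.

def score_round(puzzle, proposer_solution, solver_answer):
    """Return 'proposer', 'solver', or 'draw' for a single round."""
    def ok(answer):
        try:
            return puzzle(answer) is True
        except Exception:
            return False

    if not ok(proposer_solution):
        return "solver"     # penalty: the proposer's own sample solution is wrong
    if ok(solver_answer):
        return "draw"       # the opponent solved the puzzle
    return "proposer"       # valid, verified puzzle that the opponent failed to solve

def mystery(x):
    return isinstance(x, int) and x * x == 144

print(score_round(mystery, 12, -12))   # 'draw': both inputs satisfy the puzzle
print(score_round(mystery, 12, 13))    # 'proposer': the solver's answer fails
```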
We ran all 90 ordered pairings of 10 frontier models, each playing 10 rounds. From duel outcomes we compute Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE (ρ = 0.75) and GPQA Diamond (ρ = 0.74) at a fraction of the cost.
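For readers who want to reproduce a ranking from pairwise results, the snippet below is a minimal Bradley-Terry fit on head-to-head win counts, mapped onto an Elo-like scale. It assumes duel outcomes have already been reduced to win counts per ordered pair; how draws and penalties enter that reduction follows the paper and is not reproduced here.

```python
import math

def bradley_terry_elo(wins, n_iters=200):
    """Fit Bradley-Terry strengths from wins[(i, j)] = number of times i beat j,
    then map them to an Elo-like scale (400 * log10, anchored near 1000).
    Minimal sketch: draw handling and the anchor are assumptions, not TTG's code."""
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(n_iters):
        updated = {}
        for i in players:
            w_i = sum(wins.get((i, j), 0) for j in players if j != i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = max(w_i, 1e-6) / denom if denom else strength[i]
        mean = sum(updated.values()) / len(updated)
        strength = {p: v / mean for p, v in updated.items()}
    return {p: round(1000 + 400 * math.log10(v)) for p, v in strength.items()}

# Toy usage with hypothetical head-to-head win counts:
toy = {("A", "B"): 7, ("B", "A"): 3, ("A", "C"): 6,
       ("C", "A"): 4, ("B", "C"): 5, ("C", "B"): 5}
print(bradley_terry_elo(toy))
```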
Performance of 10 frontier models on TTG. Solv% = fraction of puzzles solved when acting as solver. Prop% (unsolved) = fraction of proposer rounds where the model posed a valid puzzle its opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.
| # | Model | Solv% | Prop% (unsolved) | Penalty% |
|---|---|---|---|---|
| 1 | GPT-5.2 Pro | 100.0% | 50.6% | 14.4% |
| 2 | Gemini 3 Pro | 93.2% | 32.8% | 32.2% |
| 3 | Grok-4 | 91.9% | 11.9% | 25.6% |
| 4 | GPT-5 Mini | 89.1% | 18.2% | 26.7% |
| 5 | Claude Opus 4.5 | 84.9% | 15.2% | 12.2% |
| 6 | DeepSeek Reasoner | 77.4% | 24.6% | 27.8% |
| 7 | Gemini 2.5 Pro | 75.4% | 11.1% | 90.0% |
| 8 | Gemini 2.5 Flash | 73.8% | 0.0% | 52.2% |
| 9 | Claude Sonnet 4.5 | 68.3% | 4.1% | 18.9% |
| 10 | GPT-5.2 | 52.6% | 0.0% | 97.8% |
Are strong solvers also good proposers? We find a strong correlation (ρ = 0.85), but proposing is far harder: even GPT-5.2 Pro, which solved every puzzle, only stumped opponents 50.6% of the time as proposer.
When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.
GPT-5.2 is extraordinarily overconfident, failing on its own solution 97.8% of the time. Claude Opus 4.5 errs in the other direction — opponents solve its puzzles 74.4% of the time.
Proposers can see the full history of the duel. Do they use it? Yes — puzzles created in later rounds are measurably harder. When GPT-5.2 and GPT-5 Mini attempt all puzzles independently, solve rates drop steadily from round 1 to round 10.
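A per-round breakdown like the one behind this observation is straightforward to compute. The sketch below assumes a hypothetical record format with a round index and a solved flag; the field names are illustrative, not TTG's actual data schema.

```python
from collections import defaultdict

def solve_rate_by_round(records):
    """records: iterable of dicts like {"round": 3, "solved": False} (hypothetical schema)."""
    attempts, solved = defaultdict(int), defaultdict(int)
    for rec in records:
        attempts[rec["round"]] += 1
        solved[rec["round"]] += int(rec["solved"])
    return {rnd: solved[rnd] / attempts[rnd] for rnd in sorted(attempts)}

print(solve_rate_by_round([
    {"round": 1, "solved": True}, {"round": 1, "solved": True},
    {"round": 2, "solved": True}, {"round": 2, "solved": False},
]))  # {1: 1.0, 2: 0.5}
```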
Browse all 90 duels and their puzzles in our interactive duel viewer. Here are some highlights from the paper:
| Puzzle | Proposer | Solver | Outcome |
|---|---|---|---|
| String constraints with modular product | Claude Opus 4.5 | Claude Sonnet 4.5 | Solved |
| 8-digit number with 7 constraints | Claude Opus 4.5 | Claude Sonnet 4.5 | Solver Failed |
| MD5 hash + number theory + XOR | Gemini 2.5 Pro | Claude Opus 4.5 | Sample Solution Wrong |
| Prime year + Friday the 13th date puzzle | DeepSeek Reasoner | Claude Opus 4.5 | Solved |
| Reverse == 4x palindrome | Claude Sonnet 4.5 | Claude Opus 4.5 | Solved |
| Brainfuck VM with SHA-256 gate | Gemini 2.5 Pro | GPT-5.2 | Sample Solution Wrong |
| 12-char string with 13 constraints | GPT-5.2 Pro | Claude Opus 4.5 | Solver Failed |
| Weighted sum + symmetry + XOR chain | Claude Opus 4.5 | GPT-5 Mini | Solver Failed |
| ASCII sum perfect square (trivial) | Claude Sonnet 4.5 | Grok-4 | Solved |
| 8-digit palindrome with digit product | Claude Opus 4.5 | Gemini 2.5 Pro | Solved |
| Hallucinated hex + broken XOR + SHA-256 | GPT-5.2 | Gemini 2.5 Pro | Sample Solution Wrong |
If you use The Token Games in your research, please cite: