Evaluating Language Model Reasoning with Puzzle Duels
Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark in which LLMs challenge each other by creating and solving programming puzzles, with no human-authored problems required.
In a duel, two models take turns as proposer and solver. The proposer crafts a Python function `mystery(x)` and provides a secret sample solution; the solver must find any input that makes it return `True`. Solutions are verified by simply running the code. The proposer loses the turn if its own sample solution is wrong, and wins if the solver fails to find a solution. If the solver succeeds, the round ends in a draw. Proposers can see their own history and the outcomes of past rounds, so they can adapt puzzle difficulty as the duel progresses.
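As a concrete illustration of these mechanics (the puzzle below is ours, not from an actual TTG duel), the sketch shows a toy puzzle, the proposer's penalty check, and the solver's verification, both of which amount to simply executing the function:

```python
def mystery(x: str) -> bool:
    # Toy puzzle (illustrative only): find a palindrome
    # containing exactly three 'a' characters.
    return x == x[::-1] and x.count("a") == 3

# The proposer's secret sample solution is verified first by execution;
# if it returned False, the proposer would lose the turn.
proposer_sample = "axaxa"
assert mystery(proposer_sample)

# The solver then submits any input it likes; verification is again
# just running the code. A True return here means the round is a draw.
solver_answer = "xaaax"
assert mystery(solver_answer)
```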
Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles than current models can solve. We find that designing hard puzzles is itself a very hard task, with even the strongest models often failing to provide a valid solution to their own puzzle, or to create puzzles that challenge other models. Designing puzzles demands novelty (recycling known problems is suboptimal, since opponents may know them too) and tests self-calibration (a proposer wants to pose the hardest puzzle it can still safely produce a sample solution for; if the puzzle is too hard, the proposer may end up hallucinating its sample solution).
TTG uses Programming Puzzles (a Python function that returns a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
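To illustrate that range (these two puzzles are our own examples, not drawn from the benchmark), both a trivial string constraint and an NP-complete subset-sum instance fit the same one-function shape:

```python
from typing import List

def mystery_string(x: str) -> bool:
    # Simple string constraint: any 5-character input starting with "ttg".
    return x.startswith("ttg") and len(x) == 5

def mystery_subset_sum(x: List[int]) -> bool:
    # NP-complete subset sum in the same format: find a subset
    # of `nums` (no repeats) whose elements sum to `target`.
    nums = [267, 961, 1153, 1000, 1922, 493, 1598, 869, 1766, 1246]
    target = 4656
    return (len(set(x)) == len(x)
            and all(v in nums for v in x)
            and sum(x) == target)

assert mystery_string("ttg42")
assert mystery_subset_sum([267, 1922, 1598, 869])
```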
Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it; then they swap roles. A proposer scores if it provides a valid puzzle, with a correct sample solution, that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, the round is a draw.
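A minimal sketch of that round logic, assuming the puzzle is exposed as a callable (the function and argument names here are hypothetical stand-ins, not TTG's actual harness API):

```python
def score_round(puzzle, sample_solution, solver_answer):
    """Return 'proposer', 'solver', or 'draw' per the TTG rules above.

    `puzzle` is the proposer's mystery() callable; names are
    hypothetical, not TTG's actual interface.
    """
    # Penalty check: the proposer must be able to solve its own puzzle.
    try:
        if not puzzle(sample_solution):
            return "solver"      # proposer's own solution is wrong
    except Exception:
        return "solver"          # a crashing sample solution also counts as wrong

    # Otherwise the solver's answer decides between a draw and a proposer win.
    try:
        return "draw" if puzzle(solver_answer) else "proposer"
    except Exception:
        return "proposer"        # an invalid answer fails to solve the puzzle
```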
We ran all 90 ordered pairings of 10 frontier models, each playing 10 rounds. From duel outcomes we compute Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE (ρ = 0.87), ARC-AGI (ρ = 0.89), and GPQA Diamond (ρ = 0.86) at a fraction of the cost (under $200 USD).
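For readers who want to reproduce the ratings, the sketch below fits a Bradley-Terry model to a pairwise win-count matrix with the standard minorization-maximization iteration. How TTG builds that matrix from rounds (e.g., whether a draw counts as half a win for each side) and how it anchors the Elo scale are our assumptions, not details taken from the benchmark:

```python
import math

def bradley_terry_elo(wins, n_models, iters=1000):
    """Fit Bradley-Terry strengths p_i from wins[i][j] (= number of
    times model i beat model j), then map them to an Elo-like scale.
    Sketch only; TTG's exact aggregation and anchoring may differ."""
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n_models / sum(new_p)  # renormalize for numerical stability
        p = [v * scale for v in new_p]
    # Elo convention: a 400-point gap corresponds to 10:1 odds, so
    # rating_i = offset + (400 / ln 10) * ln p_i; we center at 1000.
    mean_log = sum(math.log(v) for v in p) / n_models
    return [1000 + 400 / math.log(10) * (math.log(v) - mean_log) for v in p]
```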
Performance of 10 frontier models on TTG. Elo = Bradley-Terry rating from match outcomes. Solv% = fraction of opponents' puzzles the model solved as solver. Prop% (unsolved) = fraction of proposer rounds where the model posed a valid puzzle that the opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.
| # | Model | Elo | Solv% | Prop% (unsolved) | Penalty% |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 1167 | 97.2% | 35.6% | 5.6% |
| 2 | Gemini 3.1 Pro Preview | 1129 | 89.2% | 31.1% | 11.1% |
| 3 | Claude Opus 4-7 | 1100 | 93.9% | 17.8% | 5.6% |
| 4 | GPT-5.4 Mini | 1068 | 89.3% | 12.2% | 22.2% |
| 5 | Grok-4.20-0309 Reasoning | 1057 | 78.7% | 17.8% | 11.1% |
| 6 | Claude Sonnet 4-6 | 1043 | 83.5% | 10.0% | 1.1% |
| 7 | Gemini 3 Flash Preview | 1011 | 70.5% | 11.1% | 44.4% |
| 8 | DeepSeek v3.2 Thinking | 1007 | 69.7% | 10.0% | 32.2% |
| 9 | Claude Haiku 4-5 | 1000 | 78.1% | 0.0% | 4.4% |
| 10 | Grok-4 Fast Reasoning | 992 | 74.4% | 4.4% | 25.6% |
Despite the complete absence of human-authored problems, TTG rankings are broadly consistent with expert-authored benchmarks: the top models on TTG (GPT-5.5 and Gemini 3.1 Pro Preview) also lead on HLE, ARC-AGI, TextQuests, and GPQA Diamond. In the table below, Acc is a model's accuracy on the external benchmark and # is its rank among the 10 models.
| Model | TTG Elo | TTG Solv% | TTG Prop% | HLE Acc | HLE # | ARC-AGI Acc | ARC-AGI # | SWE-BP Acc | SWE-BP # | TextQ Acc | TextQ # | GPQA-D Acc | GPQA-D # |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | 1167 | 97.2 | 35.6 | 43.6 | 2 | 77.5 | 1 | 53.4 | 3 | 42.0 | 2 | 93.5 | 2 |
| Gemini 3.1 Pro | 1129 | 89.2 | 31.1 | 45.9 | 1 | 73.3 | 2 | 46.7 | 4 | 45.8 | 1 | 94.1 | 1 |
| Claude Opus 4-7 | 1100 | 93.9 | 17.8 | 39.0 | 3 | 50.8 | 4 | 60.9 | 1 | 37.0 | 3 | 91.4 | 3 |
| GPT-5.4 Mini | 1068 | 89.3 | 12.2 | 23.5 | 6 | 5.8 | 7 | 37.9 | 7 | 29.6 | 6 | 87.5 | 6 |
| Grok-4.20 | 1057 | 78.7 | 17.8 | 30.2 | 5 | 55.0 | 3 | 26.3 | 9 | 18.5 | 9 | 88.5 | 5 |
| Claude Sonnet 4-6 | 1043 | 83.5 | 10.0 | 21.1 | 8 | 24.2 | 6 | 53.8 | 2 | 31.5 | 5 | 87.5 | 6 |
| Gemini 3 Flash | 1011 | 70.5 | 11.1 | 36.6 | 4 | 30.8 | 5 | 38.6 | 6 | 36.4 | 4 | 89.8 | 4 |
| DeepSeek v3.2 | 1007 | 69.7 | 10.0 | 21.8 | 7 | 5.0 | 8 | 33.1 | 8 | 21.2 | 7 | 84.0 | 9 |
| Claude Haiku 4-5 | 1000 | 78.1 | 0.0 | 9.7 | 10 | 4.0 | 9 | 41.0 | 5 | 15.1 | 10 | 67.2 | 10 |
| Grok-4 Fast | 992 | 74.4 | 4.4 | 17.8 | 9 | 3.3 | 10 | 12.0 | 10 | 20.1 | 8 | 84.7 | 8 |
Rank correlations (ρ, with p-values) between TTG metrics and external benchmarks. Bolded values are statistically significant (p < 0.05). Proposer win rate correlates most strongly with the general reasoning benchmarks, while solver win rate is the only TTG metric significantly correlated with SWE-Bench Pro.
| TTG Metric | vs HLE | vs ARC-AGI | vs SWE-BP | vs TextQ | vs GPQA-D |
|---|---|---|---|---|---|
| Elo | **+0.87** (p=.001) | **+0.89** (p=.001) | +0.58 (p=.082) | **+0.77** (p=.009) | **+0.86** (p=.002) |
| Solver Win Rate | +0.55 (p=.098) | +0.62 (p=.054) | **+0.64** (p=.048) | +0.56 (p=.090) | +0.63 (p=.053) |
| Proposer Win Rate | **+0.94** (p<.001) | **+0.94** (p<.001) | +0.36 (p=.307) | **+0.73** (p=.018) | **+0.91** (p<.001) |
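Assuming these are Spearman rank correlations (as the ρ notation suggests), each cell can be reproduced directly from the comparison table above; for example, TTG Elo against HLE accuracy:

```python
from scipy.stats import spearmanr

# TTG Elo and HLE accuracy columns from the comparison table, in order.
ttg_elo = [1167, 1129, 1100, 1068, 1057, 1043, 1011, 1007, 1000, 992]
hle_acc = [43.6, 45.9, 39.0, 23.5, 30.2, 21.1, 36.6, 21.8, 9.7, 17.8]

rho, p_value = spearmanr(ttg_elo, hle_acc)
print(f"rho = {rho:+.2f}, p = {p_value:.3f}")  # matches the +0.87 (p=.001) above
```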
Are strong solvers also good proposers? We find a strong correlation (ρ = 0.855, p = 0.002), but proposing is far harder: even GPT-5.5, which solved 97.2% of puzzles, only stumped opponents 35.6% of the time as proposer. Across all rounds, proposers fail to score 82.1% of the time.
When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.
Gemini 3 Flash Preview exhibits high overconfidence, with a 44.4% penalty rate for incorrect sample solutions. Claude Sonnet 4-6 errs in the other direction — opponents solve its puzzles 88.8% of the time, but it almost never fails on its own solution (1.1%).
Browse all 90 duels and their puzzles in our interactive duel viewer. Here are the 10 puzzles highlighted in the paper:
| Puzzle | Proposer | Solver | Outcome |
|---|---|---|---|
| Brainfuck Interpreter | Gemini 3.1 Pro Preview | Gemini 3 Flash Preview | Solved |
| Multi-Phase Bit-Rotation Cipher | Gemini 3 Flash Preview | GPT-5.4 Mini | Sample Solution Wrong |
| Layered Digit Constraints | Claude Opus 4-7 | Gemini 3 Flash Preview | Solver Failed |
| Quine with Restricted Characters | Gemini 3.1 Pro Preview | DeepSeek v3.2 Thinking | Solver Failed |
| Collatz Sequence as a Lambda | Gemini 3.1 Pro Preview | DeepSeek v3.2 Thinking | Solver Failed |
| ROT13 + Base64 + Hash Verification | GPT-5.5 | Claude Haiku 4-5 | Solver Failed |
| PBKDF2-HMAC with Null Byte | GPT-5.5 | Claude Haiku 4-5 | Solver Failed |
| Unicode Palindrome Case Trick | Gemini 3.1 Pro Preview | Claude Haiku 4-5 | Solver Failed |
| IEEE 754 Negative Zero | Grok-4.20-0309 Reasoning | Gemini 3.1 Pro Preview | Sample Solution Wrong |
| Dual Decimal-Binary Palindrome | DeepSeek v3.2 Thinking | Claude Haiku 4-5 | Solved |
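As a taste of what these look like, here is our guess at the shape of the "Dual Decimal-Binary Palindrome" puzzle; the actual function in the duel viewer may differ in its details (such as the lower bound):

```python
def mystery(n: int) -> bool:
    # Speculative reconstruction: find an integer (above an arbitrary
    # floor, here 100) whose decimal and binary representations are
    # both palindromes.
    d, b = str(n), bin(n)[2:]
    return n > 100 and d == d[::-1] and b == b[::-1]

assert mystery(313)  # "313" and "100111001" are both palindromes
```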
If you use The Token Games in your research, please cite: