May the best reasoner win

The Token Games

Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger*  ·  Gabriel Poesia*
Harvard University

Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark where LLMs challenge each other by creating and solving programming puzzles with no human-authored problems required.

In a duel, two models take turns as proposer and solver. The proposer crafts a Python function (mystery(x)) and provides a secret sample solution; the solver must find any input that makes the function return True. Solutions are verified simply by running the code. The proposer forfeits the round if its own sample solution is wrong, and scores if the solver fails to find a valid input. If the solver succeeds, the round ends in a draw. Proposers see their own history and the outcomes of past rounds, so they can adapt puzzle difficulty as the duel progresses.
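For illustration, a round might look like the sketch below; the puzzle and both inputs are invented here, not drawn from the benchmark. The proposer's sample and the solver's answer are each checked simply by executing the function.

# Hypothetical puzzle a proposer might submit (invented for illustration).
def mystery(x: str) -> bool:
    # True iff x is a 6-character string with strictly increasing characters
    # whose character codes sum to a multiple of 7.
    return (len(x) == 6
            and all(a < b for a, b in zip(x, x[1:]))
            and sum(map(ord, x)) % 7 == 0)

# Step 1: the proposer's secret sample solution is run to validate the puzzle.
assert mystery("abcdek")        # if this failed, the proposer would be penalized

# Step 2: whatever input the solver submits is checked the same way.
solver_answer = "abcder"        # a hypothetical answer found by the solver
assert mystery(solver_answer)   # any satisfying input counts, and the round is a draw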

Why The Token Games?

Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles that current models cannot solve. We find that designing hard puzzles is itself a very hard task; even the strongest models often fail to provide a valid solution to their own puzzle, or to create puzzles that challenge other models. Designing puzzles demands novelty (recycling known problems is suboptimal, since opponents may know them too) and tests self-calibration: a proposer wants the hardest puzzle it can still safely supply a sample solution for, because a puzzle that is too hard risks a hallucinated sample solution.

The Duel Protocol

TTG uses Programming Puzzles (a Python function that returns a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
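For instance, an NP-complete decision problem such as subset sum fits the format naturally: checking a proposed subset is cheap, while finding one may require search. A small made-up instance (the numbers and target are invented, not taken from any duel) might look like:

# Illustrative subset-sum puzzle (numbers and target invented for this example).
NUMS = [83, 271, -154, 962, 417, -608, 55, 730, -349, 128]

def mystery(indices: list[int]) -> bool:
    # True iff `indices` picks a non-empty set of distinct positions in NUMS
    # whose selected values sum to exactly 200.
    return (len(indices) > 0
            and len(set(indices)) == len(indices)
            and all(0 <= i < len(NUMS) for i in indices)
            and sum(NUMS[i] for i in indices) == 200)

assert mystery([0, 1, 2])   # proposer's sample witness: 83 + 271 - 154 == 200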

Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it, then they swap. A proposer scores if it provides a valid puzzle with a correct solution that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, it's a draw.
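A compact way to state this per-round scoring rule (the function and flag names are ours, not the benchmark's actual harness):

def score_round(sample_valid: bool, solver_succeeded: bool) -> tuple[int, int]:
    # Returns (proposer_points, solver_points) for a single round.
    # sample_valid: the proposer's own sample input made its puzzle return True.
    # solver_succeeded: the solver found some input making the puzzle return True.
    if not sample_valid:
        return (0, 1)   # penalty: the puzzle is invalid, so the solver scores
    if not solver_succeeded:
        return (1, 0)   # the proposer stumped its opponent
    return (0, 0)       # solved: the round is a draw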

[Figure: the duel protocol. The proposer presents a function f ("Which x makes f(x) return True?") along with a secret sample solution, which is verified by execution. The solver then submits an x with f(x) == True, which is verified the same way.]

We ran a duel for each of the 90 ordered pairings of 10 frontier models, with 10 rounds per duel. From the duel outcomes we compute Elo-style ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks such as HLE (ρ = 0.87), ARC-AGI (ρ = 0.89), and GPQA Diamond (ρ = 0.86) at a fraction of the cost (under $200 USD).
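For reference, here is a sketch of how such ratings can be fit from pairwise win counts with a standard Bradley-Terry minorization-maximization estimator; the paper's exact fitting procedure and handling of draws may differ, so treat this as an approximation.

import math
from collections import defaultdict

def bradley_terry_elo(wins, n_iters=200):
    # wins[(a, b)] = number of rounds model a scored against model b.
    # Standard minorization-maximization updates; draws contribute nothing here.
    players = sorted({p for pair in wins for p in pair})
    games = defaultdict(int)                 # decisive rounds per ordered pair
    for (a, b), w in wins.items():
        games[(a, b)] += w
        games[(b, a)] += w
    strength = {p: 1.0 for p in players}
    for _ in range(n_iters):
        new = {}
        for i in players:
            w_i = 1e-3 + sum(w for (a, _b), w in wins.items() if a == i)  # smoothed win count
            denom = sum(games[(i, j)] / (strength[i] + strength[j])
                        for j in players if j != i)
            new[i] = w_i / denom if denom > 0 else strength[i]
        mean = sum(new.values()) / len(new)  # renormalize to keep the scale fixed
        strength = {p: v / mean for p, v in new.items()}
    # Map strengths to an Elo-like scale: 400 points per factor of 10 in strength.
    return {p: round(1000 + 400 * math.log10(s)) for p, s in strength.items()}

For example, bradley_terry_elo({("A", "B"): 3, ("B", "A"): 1}) rates A above B.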

Model Performance

Performance of 10 frontier models on TTG. Elo = Bradley-Terry rating computed from match outcomes. Solv% = fraction of opponent puzzles the model solved. Prop% = fraction of proposer rounds where the model posed a valid puzzle that the opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.

#  | Model                    | Elo  | Solv% | Prop% (unsolved) | Penalty%
1  | GPT-5.5                  | 1167 | 97.2% | 35.6%            | 5.6%
2  | Gemini 3.1 Pro Preview   | 1129 | 89.2% | 31.1%            | 11.1%
3  | Claude Opus 4-7          | 1100 | 93.9% | 17.8%            | 5.6%
4  | GPT-5.4 Mini             | 1068 | 89.3% | 12.2%            | 22.2%
5  | Grok-4.20-0309 Reasoning | 1057 | 78.7% | 17.8%            | 11.1%
6  | Claude Sonnet 4-6        | 1043 | 83.5% | 10.0%            | 1.1%
7  | Gemini 3 Flash Preview   | 1011 | 70.5% | 11.1%            | 44.4%
8  | DeepSeek v3.2 Thinking   | 1007 | 69.7% | 10.0%            | 32.2%
9  | Claude Haiku 4-5         | 1000 | 78.1% | 0.0%             | 4.4%
10 | Grok-4 Fast Reasoning    | 992  | 74.4% | 4.4%             | 25.6%

Correlation with Expert-Authored Benchmarks

Despite the complete absence of human-authored problems, TTG rankings are generally consistent with expert-authored benchmarks. The top models on TTG (GPT-5.5 and Gemini 3.1 Pro Preview) also lead on HLE, ARC-AGI, TextQuests, and GPQA Diamond. Per-benchmark ranks are shown in parentheses alongside each score.

Model              | TTG Elo | TTG Solv% | TTG Prop% | HLE       | ARC-AGI   | SWE-BP     | TextQ      | GPQA-D
GPT-5.5            | 1167    | 97.2      | 35.6      | 43.6 (#2) | 77.5 (#1) | 53.4 (#3)  | 42.0 (#2)  | 93.5 (#2)
Gemini 3.1 Pro     | 1129    | 89.2      | 31.1      | 45.9 (#1) | 73.3 (#2) | 46.7 (#4)  | 45.8 (#1)  | 94.1 (#1)
Claude Opus 4-7    | 1100    | 93.9      | 17.8      | 39.0 (#3) | 50.8 (#4) | 60.9 (#1)  | 37.0 (#3)  | 91.4 (#3)
GPT-5.4 Mini       | 1068    | 89.3      | 12.2      | 23.5 (#6) | 5.8 (#7)  | 37.9 (#7)  | 29.6 (#6)  | 87.5 (#6)
Grok-4.20          | 1057    | 78.7      | 17.8      | 30.2 (#5) | 55.0 (#3) | 26.3 (#9)  | 18.5 (#9)  | 88.5 (#5)
Claude Sonnet 4-6  | 1043    | 83.5      | 10.0      | 21.1 (#8) | 24.2 (#6) | 53.8 (#2)  | 31.5 (#5)  | 87.5 (#6)
Gemini 3 Flash     | 1011    | 70.5      | 11.1      | 36.6 (#4) | 30.8 (#5) | 38.6 (#6)  | 36.4 (#4)  | 89.8 (#4)
DeepSeek v3.2      | 1007    | 69.7      | 10.0      | 21.8 (#7) | 5.0 (#8)  | 33.1 (#8)  | 21.2 (#7)  | 84.0 (#9)
Claude Haiku 4-5   | 1000    | 78.1      | 0.0       | 9.7 (#10) | 4.0 (#9)  | 41.0 (#5)  | 15.1 (#10) | 67.2 (#10)
Grok-4 Fast        | 992     | 74.4      | 4.4       | 17.8 (#9) | 3.3 (#10) | 12.0 (#10) | 20.1 (#8)  | 84.7 (#8)

Spearman Rank Correlations

Correlations with p < 0.05 are statistically significant. Proposer Win Rate correlates most strongly with the general reasoning benchmarks, while Solver Win Rate is the only TTG metric that correlates significantly with SWE-Bench Pro.

TTG Metric        | vs HLE         | vs ARC-AGI     | vs SWE-BP      | vs TextQ       | vs GPQA-D
Elo               | +0.87 (p=.001) | +0.89 (p=.001) | +0.58 (p=.082) | +0.77 (p=.009) | +0.86 (p=.002)
Solver Win Rate   | +0.55 (p=.098) | +0.62 (p=.054) | +0.64 (p=.048) | +0.56 (p=.090) | +0.63 (p=.053)
Proposer Win Rate | +0.94 (p<.001) | +0.94 (p<.001) | +0.36 (p=.307) | +0.73 (p=.018) | +0.91 (p<.001)
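These rank correlations can be reproduced with SciPy from the per-model scores in the tables above; for example, Elo versus HLE:

from scipy.stats import spearmanr

# Per-model scores in the same order as the comparison table above.
ttg_elo = [1167, 1129, 1100, 1068, 1057, 1043, 1011, 1007, 1000, 992]
hle_acc = [43.6, 45.9, 39.0, 23.5, 30.2, 21.1, 36.6, 21.8, 9.7, 17.8]

rho, p = spearmanr(ttg_elo, hle_acc)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")   # matches the +0.87 (p=.001) reported above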

Solver vs. Proposer Ability

Are strong solvers also good proposers? We find a strong correlation (ρ = 0.855, p = 0.002), but proposing is far harder: even GPT-5.5, which solved 97.2% of puzzles, only stumped opponents 35.6% of the time as proposer. Across all rounds, proposers fail to score 82.1% of the time.

Measuring a Model's Overconfidence

When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.

Gemini 3 Flash Preview exhibits high overconfidence, with a 44.4% penalty rate for incorrect sample solutions. Claude Sonnet 4-6 errs in the other direction — opponents solve its puzzles 88.8% of the time, but it almost never fails on its own solution (1.1%).

Explore the Puzzles

Browse all 90 duels and their puzzles in our interactive duel viewer. Here are the 10 puzzles highlighted in the paper:

Puzzle                             | Proposer                  | Solver                  | Outcome
Brainfuck Interpreter              | Gemini 3.1 Pro Preview    | Gemini 3 Flash Preview  | Solved
Multi-Phase Bit-Rotation Cipher    | Gemini 3 Flash Preview    | GPT-5.4 Mini            | Sample Solution Wrong
Layered Digit Constraints          | Claude Opus 4-7           | Gemini 3 Flash Preview  | Solver Failed
Quine with Restricted Characters   | Gemini 3.1 Pro Preview    | DeepSeek v3.2 Thinking  | Solver Failed
Collatz Sequence as a Lambda       | Gemini 3.1 Pro Preview    | DeepSeek v3.2 Thinking  | Solver Failed
ROT13 + Base64 + Hash Verification | GPT-5.5                   | Claude Haiku 4-5        | Solver Failed
PBKDF2-HMAC with Null Byte         | GPT-5.5                   | Claude Haiku 4-5        | Solver Failed
Unicode Palindrome Case Trick      | Gemini 3.1 Pro Preview    | Claude Haiku 4-5        | Solver Failed
IEEE 754 Negative Zero             | Grok-4.20-0309 Reasoning  | Gemini 3.1 Pro Preview  | Sample Solution Wrong
Dual Decimal-Binary Palindrome     | DeepSeek v3.2 Thinking    | Claude Haiku 4-5        | Solved
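To give a flavor of these puzzles, the last entry's theme can be sketched as follows (our illustration, not the puzzle actually played in that duel):

def mystery(n: int) -> bool:
    # True iff n is greater than 10 and both its decimal and binary
    # representations read the same forwards and backwards.
    dec, binary = str(n), bin(n)[2:]
    return n > 10 and dec == dec[::-1] and binary == binary[::-1]

assert mystery(33)   # 33 -> "33" and "100001" are both palindromes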

Citation

If you use The Token Games in your research, please cite:

@misc{hennigerpoesia2026,
  title={The Token Games: Evaluating Language Model Reasoning with Puzzle Duels},
  author={Simon Henniger and Gabriel Poesia},
  year={2026},
  eprint={2602.17831},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.17831},
}