Evaluating Language Model Reasoning with Puzzle Duels
Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark in which LLMs challenge each other by creating and solving programming puzzles, with no human-authored problems required.
In a duel, two models take turns as proposer and solver. The proposer crafts a Python function `mystery(x)` and provides a secret sample solution; the solver must find any input that makes it return `True`. Solutions are verified by simply running the code. The proposer loses the turn if its own sample solution is wrong, and wins if the solver fails to find a solution. If the solver succeeds, the round ends in a draw. Proposers can see their own history and the outcomes of past rounds, so they can adapt puzzle difficulty as the duel progresses.
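As a concrete illustration of these mechanics (the puzzle below is ours, not from an actual TTG duel), the sketch shows a toy puzzle, the proposer's penalty check, and the solver's verification, both of which amount to simply executing the function:

```python
def mystery(x: str) -> bool:
    # Toy puzzle (illustrative only): find a palindrome
    # containing exactly three 'a' characters.
    return x == x[::-1] and x.count("a") == 3

# The proposer's secret sample solution is verified first by execution;
# if it returned False, the proposer would lose the turn.
proposer_sample = "axaxa"
assert mystery(proposer_sample)

# The solver then submits any input it likes; verification is again
# just running the code. A True return here means the round is a draw.
solver_answer = "xaaax"
assert mystery(solver_answer)
```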
Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles than current models can solve. We find that designing hard puzzles is itself a very hard task, with even the strongest models often failing to provide a valid solution to their own puzzle, or to create puzzles that challenge other models. Designing puzzles demands novelty (recycling known problems is suboptimal, since opponents may know them too) and tests self-calibration (a proposer wants to pose the hardest puzzle it can still safely produce a sample solution for; if the puzzle is too hard, the proposer may end up hallucinating its sample solution).
TTG uses Programming Puzzles (a Python function that returns a boolean) as a universal format for encoding reasoning challenges. This format is flexible enough to represent everything from simple string constraints to NP-complete problems.
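To illustrate that range (these two puzzles are our own examples, not drawn from the benchmark), both a trivial string constraint and an NP-complete subset-sum instance fit the same one-function shape:

```python
from typing import List

def mystery_string(x: str) -> bool:
    # Simple string constraint: any 5-character input starting with "ttg".
    return x.startswith("ttg") and len(x) == 5

def mystery_subset_sum(x: List[int]) -> bool:
    # NP-complete subset sum in the same format: find a subset
    # of `nums` (no repeats) whose elements sum to `target`.
    nums = [267, 961, 1153, 1000, 1922, 493, 1598, 869, 1766, 1246]
    target = 4656
    return (len(set(x)) == len(x)
            and all(v in nums for v in x)
            and sum(x) == target)

assert mystery_string("ttg42")
assert mystery_subset_sum([267, 1922, 1598, 869])
```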
Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it; then they swap roles. A proposer scores if it provides a valid puzzle, with a correct sample solution, that the opponent fails to solve. If the proposer's own solution is wrong, the solver scores instead. If the solver succeeds, the round is a draw.
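A minimal sketch of that round logic, assuming the puzzle is exposed as a callable (the function and argument names here are hypothetical stand-ins, not TTG's actual harness API):

```python
def score_round(puzzle, sample_solution, solver_answer):
    """Return 'proposer', 'solver', or 'draw' per the TTG rules above.

    `puzzle` is the proposer's mystery() callable; names are
    hypothetical, not TTG's actual interface.
    """
    # Penalty check: the proposer must be able to solve its own puzzle.
    try:
        if not puzzle(sample_solution):
            return "solver"      # proposer's own solution is wrong
    except Exception:
        return "solver"          # a crashing sample solution also counts as wrong

    # Otherwise the solver's answer decides between a draw and a proposer win.
    try:
        return "draw" if puzzle(solver_answer) else "proposer"
    except Exception:
        return "proposer"        # an invalid answer fails to solve the puzzle
```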
We ran all 90 ordered pairings of 10 frontier models, each playing 10 rounds. From duel outcomes we compute Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE (ρ = 0.87), ARC-AGI (ρ = 0.89), and GPQA Diamond (ρ = 0.86) at a fraction of the cost (under $200 USD).
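For readers who want to reproduce the ratings, the sketch below fits a Bradley-Terry model to a pairwise win-count matrix with the standard minorization-maximization iteration. How TTG builds that matrix from rounds (e.g., whether a draw counts as half a win for each side) and how it anchors the Elo scale are our assumptions, not details taken from the benchmark:

```python
import math

def bradley_terry_elo(wins, n_models, iters=1000):
    """Fit Bradley-Terry strengths p_i from wins[i][j] (= number of
    times model i beat model j), then map them to an Elo-like scale.
    Sketch only; TTG's exact aggregation and anchoring may differ."""
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n_models / sum(new_p)  # renormalize for numerical stability
        p = [v * scale for v in new_p]
    # Elo convention: a 400-point gap corresponds to 10:1 odds, so
    # rating_i = offset + (400 / ln 10) * ln p_i; we center at 1000.
    mean_log = sum(math.log(v) for v in p) / n_models
    return [1000 + 400 / math.log(10) * (math.log(v) - mean_log) for v in p]
```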
Performance of 10 frontier models on TTG. Elo = Bradley-Terry rating from match outcomes. Solv% = fraction of opponents' puzzles the model solved as solver. Prop% (unsolved) = fraction of proposer rounds where the model posed a valid puzzle that the opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.
| # | Model | Elo | Solv% | Prop% (unsolved) | Penalty% |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 1167 | 97.2% | 35.6% | 5.6% |
| 2 | Gemini 3.1 Pro Preview | 1129 | 89.2% | 31.1% | 11.1% |
| 3 | Claude Opus 4-7 | 1100 | 93.9% | 17.8% | 5.6% |
| 4 | GPT-5.4 Mini | 1068 | 89.3% | 12.2% | 22.2% |
| 5 | Grok-4.20-0309 Reasoning | 1057 | 78.7% | 17.8% | 11.1% |
| 6 | Claude Sonnet 4-6 | 1043 | 83.5% | 10.0% | 1.1% |
| 7 | Gemini 3 Flash Preview | 1011 | 70.5% | 11.1% | 44.4% |
| 8 | DeepSeek v3.2 Thinking | 1007 | 69.7% | 10.0% | 32.2% |
| 9 | Claude Haiku 4-5 | 1000 | 78.1% | 0.0% | 4.4% |
| 10 | Grok-4 Fast Reasoning | 992 | 74.4% | 4.4% | 25.6% |
Despite the complete absence of human-authored problems, TTG rankings are broadly consistent with expert-authored benchmarks: the top models on TTG (GPT-5.5 and Gemini 3.1 Pro Preview) also lead on HLE, ARC-AGI, TextQuests, and GPQA Diamond. In the table below, Acc is a model's accuracy on the external benchmark and # is its rank among the 10 models.
| Model | TTG Elo | TTG Solv% | TTG Prop% | HLE Acc | HLE # | ARC-AGI Acc | ARC-AGI # | SWE-BP Acc | SWE-BP # | TextQ Acc | TextQ # | GPQA-D Acc | GPQA-D # |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | 1167 | 97.2 | 35.6 | 43.6 | 2 | 77.5 | 1 | 53.4 | 3 | 42.0 | 2 | 93.5 | 2 |
| Gemini 3.1 Pro | 1129 | 89.2 | 31.1 | 45.9 | 1 | 73.3 | 2 | 46.7 | 4 | 45.8 | 1 | 94.1 | 1 |
| Claude Opus 4-7 | 1100 | 93.9 | 17.8 | 39.0 | 3 | 50.8 | 4 | 60.9 | 1 | 37.0 | 3 | 91.4 | 3 |
| GPT-5.4 Mini | 1068 | 89.3 | 12.2 | 23.5 | 6 | 5.8 | 7 | 37.9 | 7 | 29.6 | 6 | 87.5 | 6 |
| Grok-4.20 | 1057 | 78.7 | 17.8 | 30.2 | 5 | 55.0 | 3 | 26.3 | 9 | 18.5 | 9 | 88.5 | 5 |
| Claude Sonnet 4-6 | 1043 | 83.5 | 10.0 | 21.1 | 8 | 24.2 | 6 | 53.8 | 2 | 31.5 | 5 | 87.5 | 6 |
| Gemini 3 Flash | 1011 | 70.5 | 11.1 | 36.6 | 4 | 30.8 | 5 | 38.6 | 6 | 36.4 | 4 | 89.8 | 4 |
| DeepSeek v3.2 | 1007 | 69.7 | 10.0 | 21.8 | 7 | 5.0 | 8 | 33.1 | 8 | 21.2 | 7 | 84.0 | 9 |
| Claude Haiku 4-5 | 1000 | 78.1 | 0.0 | 9.7 | 10 | 4.0 | 9 | 41.0 | 5 | 15.1 | 10 | 67.2 | 10 |
| Grok-4 Fast | 992 | 74.4 | 4.4 | 17.8 | 9 | 3.3 | 10 | 12.0 | 10 | 20.1 | 8 | 84.7 | 8 |
Rank correlations (ρ, with p-values) between TTG metrics and external benchmarks. Bolded values are statistically significant (p < 0.05). Proposer win rate correlates most strongly with the general reasoning benchmarks, while solver win rate is the only TTG metric significantly correlated with SWE-Bench Pro.
| TTG Metric | vs HLE | vs ARC-AGI | vs SWE-BP | vs TextQ | vs GPQA-D |
|---|---|---|---|---|---|
| Elo | **+0.87** (p=.001) | **+0.89** (p=.001) | +0.58 (p=.082) | **+0.77** (p=.009) | **+0.86** (p=.002) |
| Solver Win Rate | +0.55 (p=.098) | +0.62 (p=.054) | **+0.64** (p=.048) | +0.56 (p=.090) | +0.63 (p=.053) |
| Proposer Win Rate | **+0.94** (p<.001) | **+0.94** (p<.001) | +0.36 (p=.307) | **+0.73** (p=.018) | **+0.91** (p<.001) |
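Assuming these are Spearman rank correlations (as the ρ notation suggests), each cell can be reproduced directly from the comparison table above; for example, TTG Elo against HLE accuracy:

```python
from scipy.stats import spearmanr

# TTG Elo and HLE accuracy columns from the comparison table, in order.
ttg_elo = [1167, 1129, 1100, 1068, 1057, 1043, 1011, 1007, 1000, 992]
hle_acc = [43.6, 45.9, 39.0, 23.5, 30.2, 21.1, 36.6, 21.8, 9.7, 17.8]

rho, p_value = spearmanr(ttg_elo, hle_acc)
print(f"rho = {rho:+.2f}, p = {p_value:.3f}")  # matches the +0.87 (p=.001) above
```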
Are strong solvers also good proposers? We find a strong correlation (ρ = 0.855, p = 0.002), but proposing is far harder: even GPT-5.5, which solved 97.2% of puzzles, only stumped opponents 35.6% of the time as proposer. Across all rounds, proposers fail to score 82.1% of the time.
When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.
Gemini 3 Flash Preview exhibits high overconfidence, with a 44.4% penalty rate for incorrect sample solutions. Claude Sonnet 4-6 errs in the other direction — opponents solve its puzzles 88.8% of the time, but it almost never fails on its own solution (1.1%).
Browse all 90 duels and their puzzles in our interactive duel viewer. Here are the 10 puzzles highlighted in the paper:
| Puzzle | Proposer | Solver | Outcome |
|---|---|---|---|
| Brainfuck Interpreter | Gemini 3.1 Pro Preview | Gemini 3 Flash Preview | Solved |
| Multi-Phase Bit-Rotation Cipher | Gemini 3 Flash Preview | GPT-5.4 Mini | Sample Solution Wrong |
| Layered Digit Constraints | Claude Opus 4-7 | Gemini 3 Flash Preview | Solver Failed |
| Quine with Restricted Characters | Gemini 3.1 Pro Preview | DeepSeek v3.2 Thinking | Solver Failed |
| Collatz Sequence as a Lambda | Gemini 3.1 Pro Preview | DeepSeek v3.2 Thinking | Solver Failed |
| ROT13 + Base64 + Hash Verification | GPT-5.5 | Claude Haiku 4-5 | Solver Failed |
| PBKDF2-HMAC with Null Byte | GPT-5.5 | Claude Haiku 4-5 | Solver Failed |
| Unicode Palindrome Case Trick | Gemini 3.1 Pro Preview | Claude Haiku 4-5 | Solver Failed |
| IEEE 754 Negative Zero | Grok-4.20-0309 Reasoning | Gemini 3.1 Pro Preview | Sample Solution Wrong |
| Dual Decimal-Binary Palindrome | DeepSeek v3.2 Thinking | Claude Haiku 4-5 | Solved |
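As a taste of what these look like, here is our guess at the shape of the "Dual Decimal-Binary Palindrome" puzzle; the actual function in the duel viewer may differ in its details (such as the lower bound):

```python
def mystery(n: int) -> bool:
    # Speculative reconstruction: find an integer (above an arbitrary
    # floor, here 100) whose decimal and binary representations are
    # both palindromes.
    d, b = str(n), bin(n)[2:]
    return n > 100 and d == d[::-1] and b == b[::-1]

assert mystery(313)  # "313" and "100111001" are both palindromes
```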
If you use The Token Games in your research, please cite: