Sudoku Night: 79 Models, One Puzzle, and My Wife

Another Friday Night and Another Round of Logic Puzzles

December 5, 2025 – 79 models, 1 puzzle, and a human baseline that puts them all to shame

TL;DR

I tested 79 language models on today’s NYT “Easy” Sudoku puzzle using pure reasoning—no code, no tools, no escape hatches. The results were brutal: 8.9% success rate. Only 7 models solved it correctly.

Before the AI gauntlet, I handed the puzzle to my wife with a pencil and timer. Result: 59 seconds. Correct. She used standard Sudoku strategies—scanning rows, columns, and boxes, finding naked singles, working through constraint propagation in her head. No guessing, no backtracking. Just clean, methodical deduction. I should mention: my wife is genuinely good at these puzzles. She’s been doing Sudoku for years, has that brain that just sees the patterns, and frankly makes me feel inadequate every time we do puzzles together. (I married up. Way up.)

The human brain runs on about 20 watts. Add in the rest of the body and you get roughly 100 watts total. My wife used approximately 5.9 kilojoules to solve this puzzle perfectly. The fastest correct AI (gpt-oss-safeguard:120b) used 31.2 kilojoules—more than 5x the energy—to solve it 3 seconds slower. Most models burned through their energy budget and still got it wrong.

The winners: Qwen3-30B models dominated, with the instruct variant solving it in 66.8 seconds. Both GPT-OSS 120B variants solved it too. Grok 4.1 Fast and Gemini 3 Pro succeeded but took 100+ seconds. Claude Opus 4.5? Failed in 22.9 seconds—fast, confident, and wrong.

Key patterns emerged. Speed and confidence correlate negatively with accuracy—models answering in under 30 seconds were pattern-matching, not reasoning. The 30B parameter range hits a sweet spot where complex logical reasoning becomes possible. Dedicated “reasoning” models mostly timed out at 5 minutes, stuck in loops reconsidering the same cells.

The central paradox: these models can write a perfect Sudoku solver in 15 seconds and execute it in 100 milliseconds. But ask them to follow that same algorithm in their own reasoning? 8.9% success. They know how to solve problems they cannot solve.

Depending on your viewpoint, evolution had 600 million years or God had 6 days to optimize the human brain. Either way, it shows. My wife’s neural network is not only more accurate than anything I tested, it’s also elegant, efficient, and—unlike my GPU rack—gives me good advice and makes excellent coffee.

Humanity: 1. Silicon: 0.


The Saga Continues

If you’ve been following along, you know I have a habit of ruining my Friday nights by forcing language models to solve logic puzzles they have no business attempting through pure reasoning.

Two weeks ago, I threw 35 models at a 3-digit Mastermind puzzle. The verdict: Claude Code with a 15-line Python script solved it in 1.3 milliseconds while 30B models were still warming up.

Last week, I scaled up to 620+ runs across 73 local models and 10 frontier APIs on a harder 4-digit puzzle. The result: $2.40 and 42 minutes of frontier inference for a collective 0% accuracy, while a brute-force Python script solved it in 31 milliseconds.

This week I decided to get cruel. No code allowed. Just pure reasoning on a standard Sudoku puzzle. And I added a baseline that really puts things in perspective: my wife.

The Puzzle

NYT Easy Sudoku for December 5, 2025:

. . 4 | 2 . 7 | 8 3 .
. 5 . | 4 6 . | 1 . .
9 7 3 | . . . | 4 . .
------+-------+------
. . . | . 2 6 | . 1 3
. 8 . | 5 7 1 | . . .
. . 6 | . 4 . | 2 5 .
------+-------+------
8 9 . | . . 2 | . . 7
4 . . | . 1 8 | 5 6 .
6 . 5 | . . 4 | . . 1

Standard 9x9 Sudoku. 38 given clues. One unique solution. The prompt was simple: solve it, think through your solution carefully, provide the completed grid. If you can’t solve it, admit it and stop.
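
If you want to follow along in code, here’s the same grid as a Python list of lists, with 0 marking an empty cell; the sketches later in this post assume this representation:

# NYT Easy Sudoku for December 5, 2025 (0 = empty cell)
PUZZLE = [
    [0, 0, 4,  2, 0, 7,  8, 3, 0],
    [0, 5, 0,  4, 6, 0,  1, 0, 0],
    [9, 7, 3,  0, 0, 0,  4, 0, 0],
    [0, 0, 0,  0, 2, 6,  0, 1, 3],
    [0, 8, 0,  5, 7, 1,  0, 0, 0],
    [0, 0, 6,  0, 4, 0,  2, 5, 0],
    [8, 9, 0,  0, 0, 2,  0, 0, 7],
    [4, 0, 0,  0, 1, 8,  5, 6, 0],
    [6, 0, 5,  0, 0, 4,  0, 0, 1],
]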

The Results: A Bloodbath

79 models tested. 7 correct answers. 8.9% success rate.

Outcome          Count  Percentage
Correct              7        8.9%
Incorrect           50       63.3%
No Answer            8       10.1%
Timeout (5 min)     14       17.7%

If these models were taking a multiple-choice test with four options, random guessing would give them 25%. On a problem with a definite correct answer that they can verify step by step, they managed 8.9%.

The Winners

Rank  Model                        Size          Time (s)  Energy (kJ)
 0    My Wife                      ~86B neurons     59.0         5.9
 1    gpt-oss-safeguard:120b       116.8B           62.3        31.2
 2    qwen3:30b-a3b-instruct       30.5B            66.8        33.4
 3    gpt-oss:120b                 116.8B           95.7        47.9
 4    Grok 4.1 Fast                ~1.8T           102.4        30.7
 5    qwen3:30b-a3b-thinking       30.5B           116.6        58.3
 6    Gemini 3 Pro                 ~1.2T           158.4        47.5
 7    qwen3:30b-a3b-thinking-fp16  30.5B           171.2        85.6

The Losers (Highlights)

Gemma family: Answered fast and confidently. All wrong. They produced grids quickly without sufficient verification—the AI equivalent of my wife’s cousin who finishes the crossword in 10 minutes but has three wrong answers.

Llama 4 (108.6B MoE): Generated 656 tokens in 20 seconds. Wrong. Barely enough output to show work on a single row.

Mathstral (7B): A model specifically fine-tuned for mathematical reasoning produced 90 tokens in 2.7 seconds and failed. That’s not reasoning; that’s autocomplete.

DeepSeek R1 (14B): Generated nearly 20,000 tokens of reasoning. Still wrong. That’s a lot of confident incorrectness.

Claude Opus 4.5: Fastest frontier model at 22.9 seconds. Also wrong. Showed explicit step-by-step reasoning that went off the rails partway through and never recovered.

Dedicated reasoning models (QwQ, EXAONE Deep, OpenThinker): Mostly timed out at 5 minutes, stuck in loops reconsidering the same cells over and over.

Why Claude Failed

Claude’s failure illustrates exactly how these models go wrong. The response started promisingly—identifying constraints, narrowing possibilities, propagating logically. Classic Sudoku strategy.

But somewhere around row 3, Claude made a small error. And then, unlike my wife who has an almost supernatural ability to sense when something’s off and immediately backtrack, Claude just kept going. Each subsequent deduction was built on the flawed foundation. By the final grid, it was confidently wrong in about 15 cells.

This is the fundamental problem with pure LLM reasoning: no error correction. Humans constantly sanity-check their work. LLMs don’t. They commit to each step and march forward. The models that succeeded either got lucky or generated enough tokens to verify their work as they went.
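
To make “sanity check” concrete, here’s the kind of validity test a careful solver would run on a finished grid. A minimal sketch of mine, not anything from the eval harness:

def grid_ok(grid):
    """Return True if every row, column, and 3x3 box of a completed
    9x9 grid contains the digits 1-9 exactly once."""
    units = [list(row) for row in grid]                          # 9 rows
    units += [[grid[r][c] for r in range(9)] for c in range(9)]  # 9 columns
    units += [[grid[r0 + i][c0 + j] for i in range(3) for j in range(3)]
              for r0 in (0, 3, 6) for c0 in (0, 3, 6)]           # 9 boxes
    return all(sorted(unit) == list(range(1, 10)) for unit in units)

Twenty-seven cheap checks. My wife runs a fuzzy version of this continuously as she fills cells; the failing models effectively never did.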

What This Tells Us

Sudoku is harder than Mastermind. Last week’s puzzle had 10,000 possibilities. Sudoku’s search space is approximately 6.7 × 10²¹ possible completed grids. You need actual constraint propagation—you can’t brute-force it. (A sketch of what that propagation looks like follows these takeaways.)

The human brain is remarkably efficient. My wife: ~100 W × 59 s ≈ 5.9 kJ, correct. Best AI: 31.2 kJ, correct. Average AI: ~45 kJ, wrong. Evolution had 600 million years to optimize that neural architecture.

Speed and confidence don’t equal accuracy. The fastest responses were almost always wrong. Models answering in under 30 seconds were pattern-matching, not reasoning.

The 30B sweet spot persists. Qwen3-30B models punched above their weight again. There’s a capability threshold somewhere between 14B and 30B where complex logical reasoning emerges.

More thinking isn’t always better thinking. The “thinking” model variants succeeded, but so did the instruct variant—and it was faster. Dedicated reasoning models mostly timed out.
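
As promised above, here’s what that constraint propagation looks like in code: repeatedly fill any empty cell whose candidate set has collapsed to a single digit (the “naked singles” my wife scans for). On an easy puzzle, this loop alone is often enough to finish the grid. This is my illustration of the technique, not the eval code:

def candidates(grid, r, c):
    """Digits that can legally go in empty cell (r, c)."""
    seen = set(grid[r])                                   # same row
    seen |= {grid[i][c] for i in range(9)}                # same column
    r0, c0 = 3 * (r // 3), 3 * (c // 3)
    seen |= {grid[r0 + i][c0 + j] for i in range(3) for j in range(3)}  # box
    return set(range(1, 10)) - seen

def fill_naked_singles(grid):
    """Place every digit forced by its row/column/box until the grid
    stops changing. Returns True if that alone completes the puzzle."""
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    cands = candidates(grid, r, c)
                    if len(cands) == 1:
                        grid[r][c] = cands.pop()
                        changed = True
    return all(0 not in row for row in grid)

No guessing, no backtracking, exactly as my wife played it. When propagation stalls, harder puzzles need actual search, which is where the next section’s solver comes in.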

The Screwdriver Solution

I know what you’re thinking: why not let them write a solver? I did—separately. Claude Code generates a working Sudoku backtracking solver in 15 seconds and executes it in under 100 milliseconds.
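
For scale, here’s roughly what such a solver looks like. A minimal backtracking sketch of my own, reusing the candidates helper from the previous section, not Claude’s actual output:

def solve(grid):
    """Depth-first backtracking: find an empty cell, try each legal
    digit, recurse, and undo the move if it leads to a dead end."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in sorted(candidates(grid, r, c)):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0   # backtrack
                return False         # no digit fits here
    return True                      # no empty cells left: solved

solve(PUZZLE)   # fills PUZZLE in place, in milliseconds on this grid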

But that wasn’t the point. The point was to see whether pure LLM reasoning has improved enough to handle classic constraint-satisfaction problems without tool use.

The answer: not yet.

These models can describe the algorithm, explain the constraints, and generate code that solves the problem instantly. But ask them to actually follow that algorithm in their own reasoning? 8.9% success rate.

Conclusion

Three Friday nights. Three logic puzzles. Same conclusion.

Pure LLM reasoning on constraint-satisfaction problems remains unreliable. Models that can write perfect solvers cannot be trusted to reason through the problems themselves. And a human with a pencil and a cup of coffee still beats almost everything.

The capability is emerging—Qwen3-30B, GPT-OSS-120B, Gemini, and Grok all proved it’s possible. But it’s emerging at 8.9% accuracy and 5-10x the energy cost of a human brain.

Next Friday? Maybe something harder. Or maybe I’ll just give my wife the puzzles and save the electricity. She’s better at them anyway.


Evaluation conducted December 5, 2025. Local models via Ollama on RTX 6000. Frontier models via OpenRouter. Puzzle: NYT Easy Sudoku. Total API cost: $0.53. Total electricity: ~$0.15. My wife: priceless.