The Reasoning Models Race Is Getting Weird
o1, Claude's extended thinking, Gemini's reasoning mode. Everyone's building chain-of-thought into the model. Here's what's actually going on.
A year ago, chain-of-thought was a prompting technique. Now it's a product category.
OpenAI has o1. Anthropic has extended thinking. Google has... whatever they're calling it this week. Everyone's racing to build models that "think" before they answer.
I've been using all of them. Here's what I've noticed.
What reasoning models actually do
They generate intermediate steps before giving you an answer. Sometimes you see those steps. Sometimes you don't.
The idea is simple: harder problems need more thinking. A model that works through the problem step by step makes fewer errors than one that jumps straight to the answer, because each intermediate step is easier to get right than the whole leap at once. This works. On math, coding, and logic benchmarks, reasoning models beat their base versions by wide margins.
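To make "intermediate steps" concrete, here's a minimal sketch of the difference between a direct prompt and a step-by-step one. The function names are illustrative, not any vendor's real API; reasoning models bake this behavior into training rather than relying on the prompt, but the shape of the output is the same.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer with no intermediate steps."""
    return question

def reasoning_prompt(question: str) -> str:
    """Ask the model to show its work before answering."""
    return (
        f"{question}\n\n"
        "Work through this step by step. Show each intermediate step, "
        "then give the final answer on its own line prefixed with 'Answer:'."
    )

q = "A train travels 120 km in 1.5 hours. What is its average speed?"
print(reasoning_prompt(q))
```

The whole product category is, roughly, this prompt pattern moved inside the model and scaled up.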
The weird part is how much variation there is in execution.
The approaches diverge
OpenAI hides the chain of thought. You get the answer. The reasoning happens somewhere you can't see. They say this is for safety. Maybe. It also means you can't debug it when something goes wrong.
Anthropic shows you the thinking, at least partially. You can watch the model work through the problem. When it makes a mistake, you can see where.
Google changes their approach every few months, so I've stopped trying to keep track.
Where it matters
Reasoning models shine on problems with clear right answers. Math proofs. Code that needs to compile. Logic puzzles.
They're less obviously better on fuzzy problems. "Write me a marketing email" doesn't have intermediate steps the same way "solve this equation" does.
I've found them most useful for code review. Point a reasoning model at a pull request and ask what could go wrong. It catches things I miss. The extended thinking time is worth it.
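Here's a sketch of what that workflow looks like. The `build_review_prompt` helper is hypothetical, and the final model call is left as a placeholder since every vendor's SDK differs; the point is just how little framing the task needs.

```python
def build_review_prompt(diff: str) -> str:
    """Wrap a PR diff in an open-ended 'what could go wrong' question."""
    return (
        "Review this pull request diff. List everything that could go "
        "wrong: bugs, edge cases, race conditions, missing error handling.\n\n"
        f"```diff\n{diff}\n```"
    )

diff = """\
-    timeout = 30
+    timeout = None
     resp = requests.get(url, timeout=timeout)
"""

prompt = build_review_prompt(diff)
print(prompt)
# Send `prompt` to a reasoning model via your vendor's SDK (not shown).
```

The open-ended question is deliberate: the extended thinking budget is what lets the model enumerate failure modes instead of rubber-stamping the change.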
The cost question
Reasoning takes tokens. Lots of them. The thinking isn't free.
For one-off complex problems, this is fine. For high-volume production use, you're paying 5-10x more per request. That changes the math on a lot of applications.
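The multiplier comes straight from the token counts. A back-of-envelope sketch, with a made-up price (the $/token figure and token counts below are placeholders, not any vendor's real pricing):

```python
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # hypothetical price, USD

def request_cost(output_tokens: int, thinking_tokens: int = 0) -> float:
    """Thinking tokens are billed like output tokens, so they add up fast."""
    total = output_tokens + thinking_tokens
    return total / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS

base = request_cost(output_tokens=500)
reasoning = request_cost(output_tokens=500, thinking_tokens=4000)
print(f"base: ${base:.4f}, reasoning: ${reasoning:.4f}, "
      f"ratio: {reasoning / base:.1f}x")
# → base: $0.0075, reasoning: $0.0675, ratio: 9.0x
```

At one request a day, nobody notices. At a million requests a day, that ratio is the whole business case.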
What I expect next
Reasoning will get faster and cheaper. It always does.
The more interesting question is whether the thinking becomes something you can inspect and trust, or stays a black box. Right now we're heading toward black boxes. I hope that changes.