Discussion about this post

Two Dollar Bill

Really great read, thanks for sharing.

"The results really surprised me—Mid-tier reasoning models from early 2025 (o3-mini, o4-mini, Claude Sonnet 4, Gemini 2.0 Flash) beat the average MBA student at a top business school. But the frontier models like GPT-5 got meaningfully worse, falling below the MBA average."

I'm actually not that surprised to read this. GPT-5 was an instant and obvious downgrade for my purposes when it was released, and that was the moment I pivoted the bulk of my usage away from OpenAI. Part of me has always felt that while advances in the models since last summer have definitely brought improvements in, say, agentic coding, they are tangibly worse at a lot of the things that models like 4.1, o3, o1-pro, and even the original 3.5 Sonnet excelled at. This is particularly true for OpenAI's models; Anthropic's have not seen nearly as much of a drop-off.

The reason you cited (included below) makes a lot of sense to me. But how much of this would you speculate is due to the need to manage cost and compute demand? Is it possible the older models were genuinely more capable in some ways but too expensive to run at scale, given current demand levels relative to hardware capacity? o1-pro was fantastic at helping me reason through generalist problems (with real-world results to back it up), but it is literally 100x more expensive than GPT-5.

"Strategy is about committing resources to uncertain futures with delayed, noisy feedback, and the biggest models default to backward-looking, risk-averse, protect-what’s-working behavior because that is what works in their primary use cases in coding and marketing."

Add to that, there's probably no real business incentive to fix this. These companies make their money when enterprises adopt the models for high-volume, measurable work (code, marketing copy, customer support): tasks where you can show a productivity gain in a quarterly deck. Nobody is going to optimize a foundation model for "makes good bets under uncertainty" when they could optimize for "writes better React components," because one of those shows up in procurement conversations and the other doesn't. So this gap might get worse, not better, as the industry matures. And the CEOs reading the output are generally not the people who are going to catch it.

Alan

It's an interesting result the researchers got in the study you mentioned. I use it all the time for long-term planning, but I realize it's going to tell me what I want to hear. There are a few techniques I've found that help: have it take on the persona of a famous person with a distinctive view that's well documented on the internet, and force it into real trade-off situations. I've suggested to a few writers with a distinctive or contrarian style that they should offer an MCP connector to their work. You get insight into how people use it, and I get to plug your thinking into my process. Win-win.

