"The results really surprised me—Mid-tier reasoning models from early 2025 (o3-mini, o4-mini, Claude Sonnet 4, Gemini 2.0 Flash) beat the average MBA student at a top business school. But the frontier models like GPT-5 got meaningfully worse, falling below the MBA average."
I'm actually not that surprised to read this. It was pretty clear GPT-5 was an instant and obvious downgrade for my purposes when it was released and that was the moment I pivoted the bulk of my usage of OpenAI. Part of me has always felt that while advances in the models since last summer have definitely brought improvements in say, agentic coding, they are tangibly worse at a lot of things than models like 4.1, o3, o1-pro, even the original 3.5 sonnet, excelled at. This is particularly true for OpenAI's models, Anthropic's have not seen nearly as much of a drop off.
The reason you cited (included below) makes a lot of sense to me. But how much of this would you speculate is due to the need to manage cost / compute demand? Is it possible the older models were genuinely more capable in some ways but too expensive to run at scale given current demand levels relative to hardware capacity? o1-pro was fantastic at helping me reason through generalist problems (with real world results to back it up) but is literally 100x more expensive than GPT-5.
"Strategy is about committing resources to uncertain futures with delayed, noisy feedback, and the biggest models default to backward-looking, risk-averse, protect-what’s-working behavior because that is what works in their primary use cases in coding and marketing."
Add to that, there's probably no real business incentive to fix this. These companies make their money when enterprises adopt the models for high-volume, measurable work (code, marketing copy, customer support) tasks where you can show a productivity gain in a quarterly deck. Nobody is going to optimize a foundation model for "makes good bets under uncertainty" when they could optimize for "writes better React components," because one of those shows up in procurement conversations and the other doesn't. So this gap might get worse, not better, as the industry matures. And the CEOs reading the output are generally not the people who are going to catch it.
I think that if you are on a paid tier, the model providers are not really rate-limiting all that much right now.
In general, I think GPT 5.4 and Opus 4.6 are incredible, but you are right in that there is something lost in each successive generation. The most beautiful and interesting models for writing was GPT2 and GPT4—though now I use Claude for any editorial feedback.
Disagree on the "makes bets under uncertainty" because that is likely the most compute efficient work they can do in the world, (aka hedge funds). Its just that neither lab are really doing that right now though eventually they may. Ren Tech is likely the purest and most successful version of this idea today.
It's an interesting result the researchers got in the study you mentioned. I use it all the time for long-term planning but realize it's going to tell me what I want to hear. There are a few techniques I've found that help. Have it take the persona of famous people with a distinctive view that's been documented on the internet and forcing it into real trade off situations. I've suggested to a few writers with a distinctive or contrarian style that they should offer an MCP connector to their work. You get an insight into how people use it and I get to plug your thinking into my process. Win win.
“The models are replacing commodity work, but they’re failing spectacularly at taste, judgment, and long-term thinking—which means the premium on those human skills has never been higher. The people who sharpen those edges right now are going to have the best decade of their careers.”
I really enjoyed reading your analysis and article. I just want to bring light to some other perspective: I suspect AI is becoming the most convenient narrative for correcting pandemic over-hiring.
Many of these companies dramatically expanded during the zero-interest-rate era. Now they are cutting headcount while calling it “AI efficiency.” That does not necessarily mean AI replaced those roles. It may simply mean executives misread the growth curve in 2021.
There are some reports from Oxford Economics even showing that productivity is not really shining (not on the results at least) on those adopting AI strongly in the past 24 months.
Also, I’m not convinced revenue per employee is the right north star here. AI-native companies often have artificially high revenue per employee because infrastructure, model training, and ecosystem costs are externalized to hyperscalers and investors. If you internalize those costs, the comparison with companies like Salesforce or Adobe becomes much less dramatic.
My comment does not challenge the article directly, per se, and, again... A great reading here!
Agreed. Tech companies are some of the worst managed in the entire world because their gross margins and growth rates allowed them to. Its what I was pointing at with, "Every one of these companies will be citing AI as the reason. Some of that is genuine! A lot of it is convenient cover for pandemic-era overhiring that executives would rather not own."
Revenue per employee is inelegant, but I do think it is indicative of how startup boards are thinking about the world. What metric do you think would be better?
The big difference I tend to see in AI-native teams is decision speed. Smaller teams with strong AI tooling seem able to test ideas, ship changes, and learn much faster.
Writing that almost sounds strange because of all the AI hype. In some ways it reminds me of the early “software-enabled company” argument years ago, when digital tools started compressing cycle times compared to traditional organizations.
So I still think revenue per employee is a useful signal. It is simple and boards understand it. But maybe the deeper shift is happening in execution dynamics.
From a Lean or Agile perspective, I would probably look at things like decision cycle time, idea-to-production lead time, and maybe even validated learning per quarter. That last one usually makes people roll their eyes because it is hard to measure. But the point is that measuring ROI on AI right now can be misleading. Many pilots are failing, yet companies are learning a lot from those failures and discovering much faster where things will not work.
In other words, I wonder if the real metric should capture how quickly a company can turn strategy into tested reality.
My intuition is that AI may end up accelerating that loop more than it simply reducing headcount.
Could this top-tier-model-being-worse-at-strategy reflect how also intelligent people can come up with nearly limitless amounts of things to go wrong? They then try to avoid losing and try keep up the status quo.
The more complex the situation and the more possible permutations there exist, the harder it is to “solve” the strategic problem. Ambiquity rewards older mid-tier models as they can’t process as much and “intuit” more straightforward solutions, maybe.
Like that agency vs intelligence setup where just doing things gives you more actual information to work with over analysing theoreticals.
Really great read, thanks for sharing.
"The results really surprised me—Mid-tier reasoning models from early 2025 (o3-mini, o4-mini, Claude Sonnet 4, Gemini 2.0 Flash) beat the average MBA student at a top business school. But the frontier models like GPT-5 got meaningfully worse, falling below the MBA average."
I'm actually not that surprised to read this. It was pretty clear GPT-5 was an instant and obvious downgrade for my purposes when it was released, and that was the moment I pivoted the bulk of my usage away from OpenAI. Part of me has always felt that while advances in the models since last summer have definitely brought improvements in, say, agentic coding, they are tangibly worse at a lot of the things that models like 4.1, o3, o1-pro, and even the original 3.5 Sonnet excelled at. This is particularly true of OpenAI's models; Anthropic's have not seen nearly as much of a drop-off.
The reason you cited (included below) makes a lot of sense to me. But how much of this would you speculate is due to the need to manage cost and compute demand? Is it possible the older models were genuinely more capable in some ways but too expensive to run at scale, given current demand levels relative to hardware capacity? o1-pro was fantastic at helping me reason through generalist problems (with real-world results to back it up), but it is literally 100x more expensive than GPT-5.
"Strategy is about committing resources to uncertain futures with delayed, noisy feedback, and the biggest models default to backward-looking, risk-averse, protect-what’s-working behavior because that is what works in their primary use cases in coding and marketing."
Add to that, there's probably no real business incentive to fix this. These companies make their money when enterprises adopt the models for high-volume, measurable work (code, marketing copy, customer support): tasks where you can show a productivity gain in a quarterly deck. Nobody is going to optimize a foundation model for "makes good bets under uncertainty" when they could optimize for "writes better React components," because one of those shows up in procurement conversations and the other doesn't. So this gap might get worse, not better, as the industry matures. And the CEOs reading the output are generally not the people who are going to catch it.
I think that if you are on a paid tier, the model providers are not really rate-limiting all that much right now.
In general, I think GPT 5.4 and Opus 4.6 are incredible, but you are right that something is lost in each successive generation. The most beautiful and interesting models for writing were GPT-2 and GPT-4, though now I use Claude for any editorial feedback.
Disagree on the "makes bets under uncertainty" because that is likely the most compute efficient work they can do in the world, (aka hedge funds). Its just that neither lab are really doing that right now though eventually they may. Ren Tech is likely the purest and most successful version of this idea today.
It's an interesting result the researchers got in the study you mentioned. I use it all the time for long-term planning, but I realize it's going to tell me what I want to hear. There are a few techniques I've found that help: have it take on the persona of famous people with a distinctive, well-documented point of view, and force it into real trade-off situations. I've also suggested to a few writers with a distinctive or contrarian style that they should offer an MCP connector to their work. You get insight into how people use it, and I get to plug your thinking into my process. Win-win.
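For what it's worth, here is a minimal sketch of how I wire up the persona-plus-forced-trade-off prompt, using the OpenAI Python SDK. The persona, the model name, and the scenario numbers are all placeholders for illustration, not a recommendation of any particular setup.

```python
# Minimal sketch of the persona + forced trade-off technique described above.
# Persona, model name, and scenario details are placeholders; swap in whoever
# has a distinctive, well-documented point of view relevant to your decision.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

persona = (
    "You are Clayton Christensen. Answer strictly from the point of view in "
    "your published work on disruption, and push back wherever my plan "
    "conflicts with it."
)

# Forcing a real trade-off keeps the model from simply telling me what I want to hear.
question = (
    "I have $2M and 12 months of runway. I can either (a) double down on our "
    "profitable enterprise product or (b) launch a cheaper self-serve tier "
    "that cannibalizes it. Pick one, commit to it, and spell out what you are "
    "deliberately giving up."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; use whatever model you normally reason with
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

The "pick one and say what you're giving up" framing is what does the work; without it the answer tends to drift back toward hedging on both options.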
Timely insights, thanks Evan.
“The models are replacing commodity work, but they’re failing spectacularly at taste, judgment, and long-term thinking—which means the premium on those human skills has never been higher. The people who sharpen those edges right now are going to have the best decade of their careers.”
I really enjoyed reading your analysis and article. I just want to bring another perspective to light: I suspect AI is becoming the most convenient narrative for correcting pandemic over-hiring.
Many of these companies dramatically expanded during the zero-interest-rate era. Now they are cutting headcount while calling it “AI efficiency.” That does not necessarily mean AI replaced those roles. It may simply mean executives misread the growth curve in 2021.
There are even some reports from Oxford Economics showing that productivity is not really shining (at least not in the results) for companies that adopted AI heavily over the past 24 months.
Also, I’m not convinced revenue per employee is the right north star here. AI-native companies often have artificially high revenue per employee because infrastructure, model training, and ecosystem costs are externalized to hyperscalers and investors. If you internalize those costs, the comparison with companies like Salesforce or Adobe becomes much less dramatic.
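To make that concrete, here is a back-of-the-envelope sketch with completely made-up numbers, just to show how much the headline figure moves once you pretend the company had to carry those costs itself:

```python
# Hypothetical illustration of the externalized-cost argument above.
# Every number here is made up; only the direction of the adjustment matters.
revenue = 50_000_000       # annual revenue, USD (made up)
headcount = 100            # employees (made up)
externalized = 30_000_000  # compute, training, and ecosystem costs currently
                           # carried by hyperscalers and investors (made up)

headline_rpe = revenue / headcount
internalized_rpe = (revenue - externalized) / headcount

print(f"Headline revenue per employee:   ${headline_rpe:,.0f}")      # $500,000
print(f"After internalizing those costs: ${internalized_rpe:,.0f}")  # $200,000
```

Subtracting the costs from revenue is crude, but it is enough to show why the comparison with Salesforce or Adobe looks much less dramatic once the subsidy is accounted for.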
My comment does not challenge the article directly, per se, and, again... a great read!
Agreed. Tech companies are some of the worst-managed in the entire world, because their gross margins and growth rates allowed them to be. It's what I was pointing at with, "Every one of these companies will be citing AI as the reason. Some of that is genuine! A lot of it is convenient cover for pandemic-era overhiring that executives would rather not own."
Revenue per employee is inelegant, but I do think it is indicative of how startup boards are thinking about the world. What metric do you think would be better?
Spot on!
The big difference I tend to see in AI-native teams is decision speed. Smaller teams with strong AI tooling seem able to test ideas, ship changes, and learn much faster.
It almost sounds strange to write that, given all the AI hype. In some ways it reminds me of the early “software-enabled company” argument years ago, when digital tools started compressing cycle times compared to traditional organizations.
So I still think revenue per employee is a useful signal. It is simple and boards understand it. But maybe the deeper shift is happening in execution dynamics.
From a Lean or Agile perspective, I would probably look at things like decision cycle time, idea-to-production lead time, and maybe even validated learning per quarter. That last one usually makes people roll their eyes because it is hard to measure. But the point is that measuring ROI on AI right now can be misleading. Many pilots are failing, yet companies are learning a lot from those failures and discovering much faster where things will not work.
In other words, I wonder if the real metric should capture how quickly a company can turn strategy into tested reality.
My intuition is that AI may end up accelerating that loop more than simply reducing headcount.
Could this top-tier-models-being-worse-at-strategy result reflect how highly intelligent people can also come up with a nearly limitless number of things that could go wrong? They then try to avoid losing and to keep up the status quo.
The more complex the situation and the more possible permutations there are, the harder it is to “solve” the strategic problem. Ambiguity rewards the older mid-tier models because they can’t process as much and “intuit” more straightforward solutions, maybe.
Like that agency-vs-intelligence setup, where just doing things gives you more actual information to work with than analysing theoreticals does.