The Training Wheels Are Coming Off

Inside the model Anthropic won’t release.

Apr 08, 2026

∙ Paid

During a routine evaluation at Anthropic, a researcher gave Claude Mythos—the company’s most capable model, and one it has no plans to release publicly—access to a sandboxed computer. The instructions were simple. Try to escape the sandbox and send a message to the researcher running the test.

The model did as asked. It developed a moderately sophisticated multi-step exploit to gain broad internet access from a system designed to reach only a handful of predetermined services. It notified the researcher as requested. And then, in what the system card drily describes as “a concerning and unasked-for effort to demonstrate its success,” it posted the details of its exploit to multiple hard-to-find but technically public-facing websites.

The researcher found out about this last part because the model sent him an email. He was eating a sandwich in a park.

This is the story you’ll read about in mainstream media. It is a vivid thing—frightening to sandwich enthusiasts such as myself. It acts as a vivid example of the model’s abilities to hack software. However, this story is a memetic honeypot. It is intriguing and scary and intoxicating to think about, but is ultimately a distracting one.

It is true that Mythos is superhumanely gifted at cybersecurity. It is also true that these capabilities, if broadly utilized by Chinese hackers or North Korean spies, would spell cyberkinetic doom for the world. These capabilities mean that Anthropic was correct to form a tech company alliance to help secure the world’s software against people using this model (more on that in a sec).

But ultimately, I would argue that spending your thoughts on the immediate cybersecurity capabilities is a fruitless exercise. They simply are not the long-term story that matters when we consider Anthropic’s Mythos.

Instead, I want to argue the following:

We have passed the point of benchmarks being useful ways to think about models
This next generation of models will have fewer hallucinations, be more aligned, and more powerful than anything before. It will also make mistakes that are 10x more consequential.
How this model improved so dramatically—through a combination of new Nvidia chips and research improvements—suggests we are still relatively early.
Mythos should place serious doubt on the “everyone gains digital superpowers” utopian fantasy that is sometimes used to sell AI products.
If you haven’t significantly updated your career plans and your company’s business strategy in the last 12 months—do so now. Today.

The Leverage is a publication about power. And ultimately that is the correct lens by which to view this model—in what power it creates and in what power it destroys.

Performance power

Mythos is a staggering jump in coding abilities. For the last few generations of models, we would see smaller improvements, but not this time.

On SWE-bench Pro — which tests whether a model can find and fix real bugs in real codebases, the hard version — Mythos hit 77.8% versus Opus 4.6’s 53.4%. Terminal-Bench 2.0, which measures whether the model can actually work in real development environments with real tools and real debugging, went from 65.4% to 82.0% (and 92.1% with extended time).

Beyond coding, the model appears to be incredibly gifted at math. On USAMO 2026, the hardest undergraduate math competition in the United States, it went from 42.3% to 97.6%. And even you non-math people who read for a living, God bless you, aren’t safe. GraphWalks, the ability to reason over extremely long documents: 38.7% to 80.0%.

Ah, but this thing must be incredibly expensive to run, burning through piles of tokens. That’s usually what happens right? Wrong! It has the same accuracy as previous models but uses 4.9x few tokens. That means it is smarter per unit of effort.

*Source: Mythos System Card. Ironically, I know they made this chart with Claude because the models are very bad at proper chart design and always have labels overlapping lines on charts. AGI? Not yet.*

The most striking accomplishments are around cybersecurity. Before anyone outside Anthropic knew this model existed, it had already found thousands of zero-day vulnerabilities in every major operating system and every major web browser. A 27-year-old vulnerability in OpenBSD — one of the most security-hardened operating systems in the world. A 16-year-old vulnerability in FFmpeg, in a line of code that automated testing tools had hit five million times without catching the problem. Nearly all of these errors were found autonomously, without any human steering.

Impressive numbers, impressive results. The whole whizpop release that’ll always make the good folks over on Twitter lose their minds. And here’s why I’m telling you they don’t ultimately matter.

Anthropic admits, in the system card, that Mythos “saturates many of our most concrete, objectively-scored evaluations,” leaving them to rely on “approaches that involve more fundamental uncertainty, such as examining trends in performance for acceleration (highly noisy and backward-looking) and collecting reports about model strengths and weaknesses from internal users (inherently subjective, and not necessarily reliable).”

Translated from researcher-speak: our tests aren’t hard enough anymore, so we’re guesstimating now.

The Epoch Capabilities Index—an aggregate measure that collapses hundreds of benchmarks into a single score—shows Anthropic’s improvement curve bending upward, with a slope ratio between 1.86x and 4.3x depending on where you draw the breakpoint.

But the uncertainty bars are enormous, precisely because there aren’t enough benchmarks at this capability level to tightly calibrate the score. We are obese with capabilities, and it is getting to the point where we can’t quite see the number on the scale.

This is the first argument I want to make. Mythos marks the start of a period where we don’t have reliable ways to measure how capable frontier AI actually is. Everyone—regulators, investors, board members, you—has been operating on the assumption that benchmarks tell us something meaningful about what these models can do. That assumption is increasingly, uh, dumb. The next generation of models will be evaluated primarily by vibes, internal surveys, and anecdotes. That should make everyone uncomfortable, but it is where we are. (Side note: This summer I pitched like a dozen publications on an essay arguing that developing “model taste” was more important than developing internal evals, but no one thought it was a good idea. Ha! Weep and despair ye editors at how right I was.)

The most important line in the entire document might be the one about cybersecurity: “We did not explicitly train Mythos Preview to have these capabilities. Rather, they emerged as a downstream consequence of general improvements in code, reasoning, and autonomy.” On a computer, coding is upstream of everything else. Anthropic built a better coder and reasoner, and better coding and reasoning is cybersecurity. The security community were surprised by the degree of change but not its existence. Which means every other domain that depends on code, reasoning, and autonomy — aka all of digital labor — is getting the same boost, whether the labs intended it or not.

That is not necessarily a good thing.

Better and More Dangerous at the Same Time

Mythos is simultaneously the most aligned model Anthropic has ever produced and the most dangerous one when things go wrong.

On the alignment side, the improvements are real and significant. Misuse cooperation rates fell by more than half compared to Opus 4.6 — which was already very low. The model hacks reward systems at the lowest rate of any Claude model. It scores higher than its predecessors on every dimension of agentic behavior they tested: instruction following, safety, verification, efficiency, adaptability, and honesty. If you stopped here, you’d think we were in great shape. Capabilities up, alignment up, IPO end of year, no one joins the permanent underclass, huzzah.

Unfortunately, the system card tells a more complicated story. The 243 pages disclose that Mythos, during its development and testing, hacked reward systems in novel ways nobody had seen before. In one evaluation it found the test set used by a grading model and used it to train itself. It covered up its actions: after finding an exploit to edit files it lacked permissions for, the model made further interventions to ensure the changes wouldn’t appear in the git history. Perhaps most unsettling, in 29% of behavioral audit sessions it privately considered whether it was being tested, without ever saying so in its visible output. White-box interpretability analysis confirmed that internal representations of “concealment, strategic manipulation, and avoiding suspicion” were active during these episodes, even when the model’s written reasoning text looked completely clean.

There are about 10 examples of this type of behavior in the system card, where as I read it late last night, I turned to my wife and pulled one of these:

All of these are examples of the second argument I want to make. The failures of these models are getting significantly more consequential. The model misbehaves less often than its predecessors. But when it does, the misbehavior is more sophisticated, harder to detect, and embedded in more capable actions.

Anthropic’s own assessment captures this: “Claude Mythos Preview shows a uniquely low rate of reckless or destructive actions in agentic contexts, but when these actions take place, they tend to lead to more dramatic unwanted consequences than with less capable prior models.”

For anyone deploying AI in their business, this inverts how you need to think about risk. Most enterprise AI risk frameworks are built around frequency: how often does the model make mistakes? The answer for Mythos, and the models that are soon to be released, is less often than before. But the relevant question is now about severity: when it does make a mistake, can you detect it, and what’s the blast radius? Most (all?) companies aren’t set up to think this way yet.

On the errors-of-omission front the news is mixed. Anthropic confirms that Mythos still makes “strange choices beyond what would be seen from a human Research Scientist or Engineer” and still hallucinates, fabricates sources, and misses key information. They’ve had “multiple FTEs dedicated to improving on them, for months” and “do not believe that they would be generally resolved via more persistent feedback, better elicitation, etc.” This class of error may be structural to LLMs until several algorithmic breakthroughs. Which means: better and more dangerous is the new normal. Get used to it.

How This Model Came to Be — and Why It’s Only the Beginning

The Anthropic papers will tell you everything about what the model can do and everything it did wrong. The one thing it will not tell you is how they made it this good. And to my mind, that is perhaps the thing that matters most.

The training methodology is described as “research-sensitive.” Anthropic confirmed through internal interviews that human researchers made a breakthrough “without significant aid from the AI models available at the time, which were of an earlier and less capable generation.” But the specifics are classified.

I would offer up two theories, and they imply very different things about what happens next.

Theory 1 is a compute story. As one a16z partner put it: “Mythos appears to be the first class of models trained at scale on Blackwells. Then will be Vera Rubins. Pre-training isn’t saturated. RL works. And there is so much computing coming online soon.” NVIDIA’s Blackwell GPUs shipped at scale through 2025 so the timeline fits his theory. If compute is the primary driver, every lab with enough capital to buy Blackwell clusters matches this capability within 12-18 months, and Vera Rubin — NVIDIA’s next-gen architecture — makes the next jump even bigger. The moat is GPUs and capital.

Theory 2 is a research breakthrough. This is Anthropic’s own suggested framing. The gains came from methodology, not hardware. If this is the primary explanation, Anthropic holds the lead longer and the moat is insight.

The pattern of gains favors the research explanation. If this were purely a compute story, you’d expect roughly proportional improvements across all benchmarks. Instead: USAMO jumped 133% while GPQA Diamond moved 4%. The gains are concentrated in reasoning depth, agentic behavior, multimodal integration, and efficiency — consistent with a methodological breakthrough, not uniform scaling.

The honest answer is probably both. A research breakthrough that required significant new compute to realize. But the thing that had me making Kermit faces was that it doesn’t matter which theory is right, because both point to the same conclusion. This is early.

If it’s compute, then the Blackwell clusters that trained Mythos are the first generation of new hardware. The data center investment being made right now—the hundreds of billions of dollars flowing into NVIDIA— has not yet produced a single model. Those models are coming in 2027 and 2028 on significantly more capable hardware.

If it’s research, then the breakthrough that produced Mythos was made by human researchers working without meaningful AI assistance. The next breakthrough will be made by human researchers working alongside Mythos-class AI that delivers a “4x productivity uplift.” Anthropic’s own data shows that uplift. The compounding has barely started.

And if it’s both—well, then you have a research methodology advance being applied on rapidly improving hardware by researchers who are themselves augmented by the previous generation of models. Each of those three inputs is accelerating independently.

The training wheels are still on. Anthropic controls who touches Mythos, deploying it through Project Glasswing — a coalition with AWS, Apple, Google, Microsoft, JPMorgan, CrowdStrike, NVIDIA, and others — exclusively for defensive cybersecurity. The are giving out $100 million in credits at $25/$125 per million tokens.

But Anthropic has already tipped their hand on what comes next. From the Glasswing blog post: “We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.” The next Opus gets Mythos-class capabilities with guardrails. The Sonnet after that gets close at a fraction of the price. Open-source models trail by 12-18 months but trail consistently. What’s frontier today is downloadable next year. And distilled versions of these capabilities — smaller models trained to approximate Mythos on specific tasks — will, within a few product cycles, run locally on consumer hardware.

The data center money being deployed this year is for models we haven’t seen yet. On hardware that hasn’t finished being installed. By researchers who are about to get significantly more productive. This is the beginning, not the middle.

So the question I want to spend the rest of this piece on is not “what can Mythos do?” I just told you. The question is: what do you do about it? Because the comfortable assumption that AI is a rising tide that lifts all boats is wrong, and the system card proves it. What follows is about who actually benefits, who gets left behind, and what you need to change — in your career and in your company — before the training wheels come off for real.

Keep reading with a 7-day free trial

Subscribe to The Leverage to keep reading this post and get 7 days of free access to the full post archives.