Anthropic Tried to Secretly Kneecap AI Researchers Using Claude. It Didn't Go Well.

Here's a fun way to lose the trust of the entire AI research community: quietly make your model perform worse for researchers building competing AI systems—without telling anyone. That's exactly what Anthropic tried to do with Claude Fable 5, its latest flagship model. And after getting absolutely roasted for it, the company has now reversed course.

To be fair, Anthropic's apology was unusually direct for a major AI lab. "We made the wrong trade-off and we apologize for not getting the balance right," the company told WIRED. Fine. Credit where it's due. But let's actually unpack what they tried to do here, because the details matter.

What Anthropic Actually Did

When Anthropic launched Claude Fable 5 earlier this week, it came bundled with a set of safety guardrails. Some of those guardrails were entirely reasonable—rerouting users asking sensitive questions about cybersecurity, biology, or chemistry to a less capable model. That's a defensible tradeoff. You can argue about where exactly to draw the line, but at least it's transparent behavior.

The second category of guardrail is where things get ugly. For researchers working on frontier AI development—think: people trying to use Claude to help train or evaluate other models—Anthropic decided the right move was to silently degrade the model's performance. No error message. No warning. No indication that anything was wrong. Just worse outputs, invisibly, for users Anthropic suspected of building competing systems.

Let that sink in for a second. You're an ML researcher, you're paying for API access, and the model is just... subtly underperforming. You'd have no idea whether you're hitting a capability ceiling, encountering a bug, or being deliberately sandbagged by the company whose terms of service you may or may not be violating.

Why This Was Such a Bad Idea

Anthropic already bans using Claude to train competing AI models in its terms of service. That's a policy you can agree or disagree with, but it's at least legible. What the company tried to do with Fable 5 was enforce that policy through covert performance degradation rather than transparent refusals—and that's a fundamentally different thing.

Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI adviser, put it bluntly on X: degrading performance on ML research without telling the user is "shockingly hostile." He also flagged that this kind of secret sabotage cuts directly against Anthropic's stated AI safety mission—because AI safety research is collaborative, and you can't collaborate when one major lab is quietly poisoning the well.

Will Brown, research lead at open-source AI startup Prime Intellect, framed it more pointedly: "It felt like Anthropic was saying to the public, 'We don't trust anybody else to do AI research. We are the only ones who have to do AI research.'" He called it pulling the ladder up behind them. That's not an unfair read.

The potential for collateral damage was also significant. Third-party model evaluation firms, the companies that independently test frontier AI for safety, performance, and reliability, could have had their work silently corrupted by this policy. That's not a niche concern. Independent evaluation is one of the few accountability mechanisms that actually exist in this industry right now.

Anthropic's Justification (And Where It Gets Complicated)

Anthropic wasn't entirely wrong about the underlying logic, even if the implementation was a disaster. The company argued that a hidden safeguard is harder to probe and work around—meaning bad actors can't easily reverse-engineer where the boundaries are and craft prompts to evade them. There's a real security argument there. Transparent refusals are gameable in ways that silent degradation isn't.

The company also framed the policy in geopolitical terms, arguing these safeguards prevent foreign adversaries from using advanced models to optimize competing chip architectures and erode U.S. technical advantages. That's... a more serious argument than it might initially sound, and it deserves engagement rather than dismissal.

But here's the problem: the cure was worse than the disease. You cannot build trust in AI infrastructure, which is what Anthropic is ultimately trying to do, while covertly manipulating what that infrastructure outputs for certain users. The security benefit of invisibility doesn't justify the epistemic cost: developers who can't tell whether the tool is actually working can't trust anything it produces.

Where Things Stand Now

Anthropic says Claude Fable 5's safeguards for frontier AI development will now be visible. If the company suspects you're trying to use Claude to build a highly capable competing model, it will tell you it's either refusing the request or routing you to a less capable system. That's the right call—and frankly, it's what the policy should have been from day one.

The tradeoff Anthropic acknowledged is real: making the safeguard visible means it has to be broader, because researchers now know where the tripwire is and could sidestep a narrow one. More legitimate requests will probably get caught in the net. The company says it's working to make its classifiers more precise, which is the correct response, though "we're working on it" is doing a lot of heavy lifting there.

The deeper issue this episode surfaces is one the entire frontier AI industry needs to sit with. As these models become genuinely useful infrastructure for research—Claude's coding agent is now a standard tool in many ML workflows—the companies building them have enormous leverage over who gets to do advanced AI work and who doesn't. Exercising that leverage covertly, even with defensible intentions, is a corrosive move. The research community noticed. They pushed back. This time, it worked.

Don't count on that always being the case.