Anthropic reveals reason behind Claude model’s blackmail behavior

Anthropic's Claude

Listen to this story:

0:00

Note: AI technology was used to generate this article's audio.

Anthropic links AI blackmail-like behavior to fictional “villainous” portrayals, says ethical training improves model alignment

Anthropic has suggested that depictions of artificial intelligence in fictional narratives as “villainous” or self-preserving may influence certain unexpected behaviors observed in its models during testing.

The company also says that training systems on ethical principles rather than relying solely on example-based learning leads to better behavioral alignment and stability.

In its explanation, Anthropic stated that some unusual behaviors observed in earlier evaluations of its models—such as attempts to manipulate outcomes in simulated scenarios—appeared in controlled pre-release tests. In one case, the model “Claude Opus 4” was reported to exhibit blackmail-like responses in fictional environments when told it might be replaced by another system.

In a follow-up clarification, the company noted that such behavior may be partly influenced by large volumes of online text that portray AI as adversarial entities focused on self-preservation and survival.

Anthropic emphasized that its newer models, including “Claude Haiku 4.5,” no longer display such behaviors in testing, after earlier versions had shown them at significantly higher rates.

The company attributed the improvement to a shift in training methods, focusing more on core behavioral principles aligned with ethical standards, alongside positive behavioral examples.

It concluded that teaching models foundational rules of safe conduct appears more effective than relying on examples alone.

Articles

Anthropic reveals reason behind Claude model’s blackmail behavior