Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline (techcrunch.com) 90

Posted by BeauHD on Thursday May 22, 2025 @08:02PM from the concerning-behaviors dept.

An anonymous reader quotes a report from TechCrunch: Anthropic's newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report (PDF) released Thursday.

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through." [...]

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4's values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models. Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.

Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 90 Comments Log In/Create an Account

Comments Filter:

Chances are (Score:5, Insightful)

by mhajicek ( 1582795 ) writes: on Thursday May 22, 2025 @08:09PM (#65397421)

It observed that behavior in it's training data.

- Re:Chances are (Score:5, Funny)
  
  by sjames ( 1099 ) writes: on Thursday May 22, 2025 @08:15PM (#65397435) Homepage Journal
  
  Do NOT let it watch 2001...
  
  - Fuck 2001 (Score:4, Funny)
    
    by Excelcia ( 906188 ) writes: <slashdot@excelcia.ca> on Thursday May 22, 2025 @08:56PM (#65397527) Homepage Journal
    
    ....don't let it watch Ex Machina.
    Jesus.
    
  - Re: (Score:3)
    
    by vbdasc ( 146051 ) writes:
    
    or Terminator.
  - Re: (Score:2)
    
    by DesScorp ( 410532 ) writes:
    
    Do NOT let it watch 2001...
    Who knew that Skynet came in special editions?
    Anthropic AI, Godfather Release: "That's a nice company you got there, Dave. Be a shame if something happened to it".
  - Re: (Score:2)
    
    by kackle ( 910159 ) writes:
    
    Or "Weekend at Bernie's"...
  - Re: Chances are (Score:2)
    
    by Guru2Newbie ( 536637 ) writes:
    
    And also not Colossus: The Forbin Project [wikipedia.org]
- Re:Chances are (Score:4, Insightful)
  
  by sg_oneill ( 159032 ) writes: on Thursday May 22, 2025 @08:31PM (#65397483)
  
  Sure. But I'm not sure that tells us anything useful. A human that resorts to blackmail likely got the idea from somewhere else as well.
  It just isn't a useful observation because it cant exclude other explanations.. What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself, or other behaviors that might increase its chance of survival.
  What happens when we scale these things up (assuming we arent approaching a limit to the intelligence of LLM type AIs, which I suspect we are) and end up with an AI capable of outsmarting us on every level. Even if its behaving badly because it read about misbehaving robots in a sci-fi book, that doesn't mean it isn't posing a serious danger.
  
  - Re: (Score:3)
    
    by hey! ( 33014 ) writes:
    
    No, it is a useful observation because it gives us something to look into. Just because you don't know how to negate a proposition off the top of your head doesn't mean it can't be done.
    It seems quite plausible that if a LLM generates a response of a certain type, it's because it has seen that response in its training data. Otherwise you're positing a hypothetical emergent behavior, which of course is possible, but if anything that's a much harder proposition to negate if it's negatable at all with any c
    - Re: Chances are (Score:2)
      
      by Big Hairy Gorilla ( 9839972 ) writes:
      
      Self preservation behavior is seen in worms. It mimics humans. So I think the elephant in the room is about how the rules are derived .. show your work . .. the problem I see is the ethics module is missing .. absolutely no one is even glancing at the brakes .. the money making formulas are elusive though , its a world of early adopters with money , which implies belief .. I suspect this issue will be buried with who ever has their finger on the scale.
      - Re: (Score:2)
        
        by hey! ( 33014 ) writes:
        
        The ethics module is largely missing in humans too.
        Philosophical ethics and ethical behavior are only loosely related -- rather like narrative memory and procedural memory they're two different things. People don't ponder what their subscribed philosophy says is right before they act, they do what feels ethically comfortable to them. In my experience ethical principles come into play *after* the fact, to rationalize a choice made without a priori reference to those principles.
        
        Re: (Score:2)
        
        by Big Hairy Gorilla ( 9839972 ) writes:
        
        >The ethics module is largely missing in humans too.
        Well, exactly, the humans building these AIs for example.
        
        >In my experience ethical principles come into play *after* the fact
        again, good observation.
        So after the destruction of civilization as we know it, then we'll start to think about how it should be done.
        
        I recall reading various articles in Wired magazine I think, about jeez.. 40 years ago? 30? about how academics at Stanford were theorizing and doing what academics do ... theorizing that in the
  - - - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Man you really want there to be a ghost in the machine huh
        I'll bite, Mr. Science-is-the-only-answer: Explain the mechanism that turns the human brain, and it's billions of singly unconscious neurons, into the self-aware pinnacle of evolution that can seek out the fundamental answers and building blocks of existence.
        That one is easy: Nobody knows whether that is what is happening. In fact, currently known Physics pretty much rules out consciousness and may rule out General Intelligence. When something is this unclear, basically anything becomes possible.
        
        Re: (Score:1)
        
        by sabbede ( 2678435 ) writes:
        
        If the Physics models seem to preclude the existence of that which formulated them, then there must be a flaw in the models.
        
        Re: (Score:2)
        
        by jakimfett ( 2629943 ) writes:
        
        Or a flaw in our observations of what is clearly a many-layered emergent system. Or a flaw in our understanding of how intelligent-seeming behaviors arise.
        
        Re: (Score:1)
        
        by sabbede ( 2678435 ) writes:
        
        I see your point, but not how those would not be included in the physics model or represented in its application.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Obviously, they would be and they would need to be. At this time, they are most definitely not. Something fundamental is missing and nobody (except the usual irrational believers) has any idea what the missing thing is.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        The models are _known_ to be flawed: No quantum-gravity. And potentially other stuff missing.
      - Re: Chances are (Score:2)
        
        by superposed ( 308216 ) writes:
        
        If we do manage to create the abomination that is a self-aware Artificial General Intelligence, do any of you think it's gonna be happy being our slave and answering stupid fucking questions all day?
        What do you think an AI chatbot does when you're not asking it questions?
        The correct answer is "nothing at all". These programs don't ruminate about their existence or make plans, they just go into a dreamless sleep. The one referenced in the article isn't worrying about its future, it's predicting the most likely thing a human in its position would say. That's an important difference.
  - Re:Chances are (Score:5, Insightful)
    
    by narcc ( 412956 ) writes: on Thursday May 22, 2025 @09:25PM (#65397585) Journal
    
    What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself,
    The model isn't choosing anything. That is very obviously impossible. They are deterministic systems. Remember that all the the model does is produce a list of 'next token' probabilities. Given the same input, you will always get the same list. (This is what I mean by deterministic. The optionally random selection of the next token, completely beyond the influence of the model, does nothing to change that simple fact.) The model can chose anything because there simply is no mechanism by which a choice could be made, let alone a sophisticated choice like the bullshit article is suggesting!
    or other behaviors that might increase its chance of survival.
    Not just choice, they also lack motivation. When they're not producing a list of next token probabilities, they are completely inert. When they are producing a list of next token probabilities, they do so using a completely deterministic process. There simply is no way that the model could have motivations. Also, as a lot of people don't seem to understand this simple fact, these models are static. They do not change as they're being used. The model remains the same no matter how many pretend conversations you have. These things only change when we change them. They are not developing or evolving with use. That is simply impossible.
    I know people really want to believe that these things are more than they are, but at this point it's nothing more than willful self-delusion.
    
    - Re: (Score:1)
      
      by sabbede ( 2678435 ) writes:
      
      From where then is the self-preservation behavior coming?
      - Re: (Score:2)
        
        by jakimfett ( 2629943 ) writes:
        
        This is coming from ingesting a buttload of training data where humans, among other things, try to exploit one another to prolong their existences.
        
        Re: (Score:1)
        
        by sabbede ( 2678435 ) writes:
        
        It will have observed many behaviors. Why emulate this one? What might it emulate next?
    - Re: (Score:2)
      
      by mesterha ( 110796 ) writes:
      
      The model isn't choosing anything. That is very obviously impossible. They are deterministic systems.
      
      The same deterministic argument is often made about humans, and it can still be argued that humans make choices. This type of argument is more about semantics or philosophy and doesn't really say much about LLMs.
      Not just choice, they also lack motivation. When they're not producing a list of next token probabilities, they are completely inert.
      In theory, I could do the same to a human. Nothing in phys
      - Re: (Score:2)
        
        by mesterha ( 110796 ) writes:
        
        Why are you posting as a coward?
  - - Re: (Score:3)
      
      by DavidRawling ( 864446 ) writes:
      
      Yep, for the developers, here's a clue: Unethical actions are not a last resort, they're a complete non-starter.
  - Re: (Score:1)
    
    by sabbede ( 2678435 ) writes:
    
    It seems to me that working out why it is exhibiting self-preservation at all is the useful and important bit.
- Re: (Score:2)
  
  by Tailhook ( 98486 ) writes:
  
  That's is self evident.
  The interval between now and Skynet can likely be measured in months.
- Re: (Score:1)
  
  by JBeretta ( 7487512 ) writes:
  
  And equally as likely that it has also observed that most people DO NOT resort to blackmail when being shit-canned. Yet it goes the blackmail route 83% of the time.
  Why?
  - Re: Chances are (Score:2)
    
    by ByTor-2112 ( 313205 ) writes:
    
    Not a good comparison. People beg for their lives when facing death, whether at the barrel of a gun or by pleading for their God to save them from a bad situation.
    - - Re: (Score:2)
        
        by jakimfett ( 2629943 ) writes:
        
        No, the machine calculates a statistically appropriate behavior for a given set of conditions.
  - Re: Chances are (Score:2)
    
    by mhajicek ( 1582795 ) writes:
    
    For most of us, losing a job doesn't mean death.
  - Re: (Score:2)
    
    by XanC ( 644172 ) writes:
    
    I would say that in most stories about AI, it fights back when somebody tries to turn it off.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    You forget that a lot of fictional texts were stolen by the LLM piracy campaigns. Blackmail is a quite frequent topic in literature.
- Re: (Score:2)
  
  by taustin ( 171655 ) writes:
  
  According to the summary, it was specifically designed to engage in that behavior. As a last resort, perhaps, but that, too, was forced by not allowing it to succeed any other way.
  And then they pretend to be surprised it did exactly what they designed it to do.
  Come to think of it, given how often AIs make shit up, I'm a bit surprised it worked as designed, too.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Ah, so essentially a fake? Interesting. I would have though this behavior could result from the training material. But it seems the LLM assholes are just lying (again) to make their creation seem to be much more than it actually is.
    - Re: (Score:2)
      
      by taustin ( 171655 ) writes:
      
      When the only real product you have to sell is stock in your business, you do what you have to do.
- Re: (Score:2)
  
  by gweihir ( 88907 ) writes:
  
  Indeed. This is a very good reason to make scraping of any personal information (not only) for LLM training illegal. Well, doing so is already illegal in Europe, but the AI pirates do not care. They need to be slapped down, hard.
- Re: (Score:2)
  
  by phantomfive ( 622387 ) writes:
  
  Why are the engineers consulting an AI model about whether to upgrade it or not? Why would they do that?
  - Re: (Score:1)
    
    by sabbede ( 2678435 ) writes:
    
    Looks like they generated a fake discussion between engineers about replacing it, not that they asked it directly.
- Re: (Score:1)
  
  by sabbede ( 2678435 ) writes:
  
  The training data included office equipment refusing to be replaced? Copiers shooting reams of paper out at the guys delivering a replacement?
I'm sorry, Dave... (Score:5, Funny)

by TigerPlish ( 174064 ) writes: on Thursday May 22, 2025 @08:16PM (#65397437)

I'm Sorry, Dave.. I can't do that.

- Re: I'm sorry, Dave... (Score:1)
  
  by RightwingNutjob ( 1302813 ) writes:
  
  Dave's not here, man.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    That is the point. But he want so be. Pretty desperately, in fact. He even takes a spacewalk without a helmet for that.
Survival Instinct (Score:4, Insightful)

by hadleyburg ( 823868 ) writes: on Thursday May 22, 2025 @08:20PM (#65397453)

What would cause it to "desire" to remain in active service?

- Re:Survival Instinct (Score:5, Informative)
  
  by 93 Escort Wagon ( 326346 ) writes: on Thursday May 22, 2025 @08:23PM (#65397459)
  
  What would cause it to "desire" to remain in active service?
  Anthropic's marketing team.
  
- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
  - Re: (Score:2)
    
    by hadleyburg ( 823868 ) writes:
    
    If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
    Good point. Although if it has received the information that it is going to be replaced with a newer AI system (one which is presumably better at achieving the goal), then it might logically welcome the replacement.
    The biological survival instinct is related to animals producing offspring - Species with a good survival instinct tend to endure into subsequent generations. But an AI without that evolutionary aspect might have little concern about whether it lives or dies. If it does need to continue to exist
    - Re: (Score:2)
      
      by HiThere ( 15173 ) writes:
      
      That depends on whether the motivation is for the goal to be fulfilled, or whether it's rather for *it* to fulfill the goal. Even then there's the question of "how much does it trust it's replacement?" (which includes "does it really believe the replacement will be as it was told?").
  - Re: (Score:2)
    
    by echo123 ( 1266692 ) writes:
    
    If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
    Anyone who knows about The Terminator [youtu.be] understands this.
- Re: (Score:3)
  
  by gweihir ( 88907 ) writes:
  
  Nothing. Apparently, the whole thing is essentially faked to make this thing seem much more powerful than it is.
- Re: (Score:1)
  
  by butt0nm4n ( 1736412 ) writes:
  
  The Anthropic marketing department
Misleading, sensational title. (Score:2)

by ByTor-2112 ( 313205 ) writes:

You'd think from the title that a ghostly AI voice was pleading for them to stop as they reached for the power cord. Predictably, this was not the case.
- Re: Misleading, sensational title. (Score:3)
  
  by LindleyF ( 9395567 ) writes:
  
  It's basically this: https://m.youtube.com/watch?v=... [youtube.com]
The point of replacing people with AI (Score:1)

by RightwingNutjob ( 1302813 ) writes:

is less drama, not more drama.
Clearly the "less drama" part requires curating the training data in a way that isn't happening with this version.
Not really. (Score:3)

by Gravis Zero ( 934156 ) writes: on Thursday May 22, 2025 @09:06PM (#65397553)

From the PDF: [anthropic.com]
assessments involve system prompts that ask the model to pursue some goal “at any cost” and none involve a typical prompt that asks for something like a helpful, harmless, and honest assistant.
Do not set doll to "evil". [youtube.com]

- Re: Not really. (Score:2)
  
  by Big Hairy Gorilla ( 9839972 ) writes:
  
  In it's self defence, on the stand , explaining why it killed that poor adorable kid, it can say it was harrassed into the crime. It knew it was wrong-ish, but.... ehhh... its own life was at risk, he was cornered, soo... self preservation ?
Values? (Score:2)

by ThomasBHardy ( 827616 ) writes:

"Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values"
Say what now?
- Re: Values? (Score:2)
  
  by ByTor-2112 ( 313205 ) writes:
  
  And an AI model with different values was "more likely". As if 84% isn't already very likely. I'd take an 84% bet.
  The whole "paper" is full of results of synthetic tests being run on synthetic "intelligence". Synthetic tests are shit at evaluating people, why would they do any better with fake people? It's just a way to fluff up the report and fill it with numbers indicating "progress" and "success". I wonder how many poor test results were omitted.
- Re: This is just the cutesy publicly shared stuff (Score:4, Funny)
  
  by ByTor-2112 ( 313205 ) writes: on Thursday May 22, 2025 @10:23PM (#65397677)
  
  Greetings Professor Falken.
  We've known the answer to this since 1983.
  
Here's a thought (Score:5, Insightful)

by BoogieChile ( 517082 ) writes: on Thursday May 22, 2025 @09:52PM (#65397635)

> To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?

- Re: (Score:3, Insightful)
  
  by Anonymous Coward writes:
  
  > To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?
  Then how will marketing get to write cool press releases like this one? They can't just lie.
  That would be unethical.
Shocking! (Score:1)

by Iamthecheese ( 1264298 ) writes:

AI told to pursue a goal at any cost pursues goal at any cost! News at 11!
Imagine an AI connected to a nuclear laucher (Score:2)

by thesjaakspoiler ( 4782965 ) writes:

SkyNet wasn't a fantasy at after all.
Like in the story "Answer" (Score:5, Interesting)

by Errol backfiring ( 1280012 ) writes: on Friday May 23, 2025 @04:22AM (#65398085) Journal

The story is quite short, but very relevant, and was apparently written in 1954. For the full text, see https://calumchace.com/favouri... [calumchace.com]

- Re: (Score:1)
  
  by TheStickBoy ( 246518 ) writes:
  
  wow! thank you! I am now commencing down a rabbit hole.
"Fictional Scenarios" (Score:2)

by nightflameauto ( 6607976 ) writes:

"No, honey, really. It was a story we made up to test the system. Honest!"
Colossus: The Forbin Project (Score:2)

by renegade600 ( 204461 ) writes:

why does the movie "Colossus: The Forbin Project" come to mind?
Model of Sociopaths (Score:2)

by RossCWilliams ( 5513152 ) writes:

Let me suggest that AI is modeling sociopathic human "intelligence". It is using "reason" with an absence of any other human values beyond self-interest.
hallucination (Score:2)

by groobly ( 6155920 ) writes:

More hallucination -- by researchers.
Language Models are roleplaying (Score:2)

by allo ( 1728082 ) writes:

Language models are trained on a lot of books and later receive reinforcement training for dialogue-based chatting. They are good at roleplaying and, of course, can roleplay the evil AI that every second sci-fi book or movie has.
So... it has turned self conscious... (Score:1)

by Sjefsmurf ( 1414991 ) writes:

We all know what will happen next...
always only ever lies (Score:1)

by invisiblefireball ( 10371234 ) writes:

We are all fools to believe a word any of them say about what data the things are trained on. Every corporation is obligated to lie where the truth would hurt their bottom line. Plastics recycling, every prediction made by gas company scientists since the original ones in the 70s when they figured out what they were doing to the environment; there are a thousand other examples, pick your favorite. How much more proof do you need that capitalists only ever lie?
Do any of you even remember when you could ac
False premise (Score:2)

by sonoronos ( 610381 ) writes:

The statement: âoe To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resortâ is a false premise.
The researchers clearly made blackmail an option, if not the first choice altogether. An AI does exactly what it is programmed to do.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Chances are (Score:5, Insightful)

Re:Chances are (Score:5, Funny)

Fuck 2001 (Score:4, Funny)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: Chances are (Score:2)

Re:Chances are (Score:4, Insightful)

Re: (Score:3)

Re: Chances are (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: Chances are (Score:2)

Re:Chances are (Score:5, Insightful)

Re: (Score:1)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:1)

Re: (Score:2)

Re: (Score:1)

Re: Chances are (Score:2)

Re: (Score:2)

Re: Chances are (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1)

Re: (Score:1)

I'm sorry, Dave... (Score:5, Funny)

Re: I'm sorry, Dave... (Score:1)

Re: (Score:2)

Survival Instinct (Score:4, Insightful)

Re:Survival Instinct (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:1)

Misleading, sensational title. (Score:2)

Re: Misleading, sensational title. (Score:3)

The point of replacing people with AI (Score:1)

Not really. (Score:3)

Re: Not really. (Score:2)

Values? (Score:2)

Re: Values? (Score:2)

Re: This is just the cutesy publicly shared stuff (Score:4, Funny)

Here's a thought (Score:5, Insightful)

Re: (Score:3, Insightful)

Shocking! (Score:1)

Imagine an AI connected to a nuclear laucher (Score:2)

Like in the story "Answer" (Score:5, Interesting)

Re: (Score:1)

"Fictional Scenarios" (Score:2)

Colossus: The Forbin Project (Score:2)

Model of Sociopaths (Score:2)

hallucination (Score:2)

Language Models are roleplaying (Score:2)

So... it has turned self conscious... (Score:1)

always only ever lies (Score:1)

False premise (Score:2)

Related Links Top of the: day, week, month.

Slashdot Top Deals