Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

 



Forgot your password?
typodupeerror
AI

Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline (techcrunch.com) 84

An anonymous reader quotes a report from TechCrunch: Anthropic's newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report (PDF) released Thursday.

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through." [...]

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4's values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models. Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.

Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline

Comments Filter:
  • Chances are (Score:5, Insightful)

    by mhajicek ( 1582795 ) on Thursday May 22, 2025 @08:09PM (#65397421)
    It observed that behavior in it's training data.
    • by sjames ( 1099 ) on Thursday May 22, 2025 @08:15PM (#65397435) Homepage Journal

      Do NOT let it watch 2001...

    • Re:Chances are (Score:4, Insightful)

      by sg_oneill ( 159032 ) on Thursday May 22, 2025 @08:31PM (#65397483)

      Sure. But I'm not sure that tells us anything useful. A human that resorts to blackmail likely got the idea from somewhere else as well.

      It just isn't a useful observation because it cant exclude other explanations.. What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself, or other behaviors that might increase its chance of survival.

      What happens when we scale these things up (assuming we arent approaching a limit to the intelligence of LLM type AIs, which I suspect we are) and end up with an AI capable of outsmarting us on every level. Even if its behaving badly because it read about misbehaving robots in a sci-fi book, that doesn't mean it isn't posing a serious danger.

      • by hey! ( 33014 )

        No, it is a useful observation because it gives us something to look into. Just because you don't know how to negate a proposition off the top of your head doesn't mean it can't be done.

        It seems quite plausible that if a LLM generates a response of a certain type, it's because it has seen that response in its training data. Otherwise you're positing a hypothetical emergent behavior, which of course is possible, but if anything that's a much harder proposition to negate if it's negatable at all with any c

        • Self preservation behavior is seen in worms. It mimics humans. So I think the elephant in the room is about how the rules are derived .. show your work . .. the problem I see is the ethics module is missing .. absolutely no one is even glancing at the brakes .. the money making formulas are elusive though , its a world of early adopters with money , which implies belief .. I suspect this issue will be buried with who ever has their finger on the scale.
          • by hey! ( 33014 )

            The ethics module is largely missing in humans too.

            Philosophical ethics and ethical behavior are only loosely related -- rather like narrative memory and procedural memory they're two different things. People don't ponder what their subscribed philosophy says is right before they act, they do what feels ethically comfortable to them. In my experience ethical principles come into play *after* the fact, to rationalize a choice made without a priori reference to those principles.

            • >The ethics module is largely missing in humans too.
              Well, exactly, the humans building these AIs for example.

              >In my experience ethical principles come into play *after* the fact
              again, good observation.
              So after the destruction of civilization as we know it, then we'll start to think about how it should be done.

              I recall reading various articles in Wired magazine I think, about jeez.. 40 years ago? 30? about how academics at Stanford were theorizing and doing what academics do ... theorizing that in the
      • Re:Chances are (Score:5, Insightful)

        by narcc ( 412956 ) on Thursday May 22, 2025 @09:25PM (#65397585) Journal

        What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself,

        The model isn't choosing anything. That is very obviously impossible. They are deterministic systems. Remember that all the the model does is produce a list of 'next token' probabilities. Given the same input, you will always get the same list. (This is what I mean by deterministic. The optionally random selection of the next token, completely beyond the influence of the model, does nothing to change that simple fact.) The model can chose anything because there simply is no mechanism by which a choice could be made, let alone a sophisticated choice like the bullshit article is suggesting!

        or other behaviors that might increase its chance of survival.

        Not just choice, they also lack motivation. When they're not producing a list of next token probabilities, they are completely inert. When they are producing a list of next token probabilities, they do so using a completely deterministic process. There simply is no way that the model could have motivations. Also, as a lot of people don't seem to understand this simple fact, these models are static. They do not change as they're being used. The model remains the same no matter how many pretend conversations you have. These things only change when we change them. They are not developing or evolving with use. That is simply impossible.

        I know people really want to believe that these things are more than they are, but at this point it's nothing more than willful self-delusion.

        • From where then is the self-preservation behavior coming?
          • This is coming from ingesting a buttload of training data where humans, among other things, try to exploit one another to prolong their existences.
      • It seems to me that working out why it is exhibiting self-preservation at all is the useful and important bit.
    • by Tailhook ( 98486 )

      That's is self evident.

      The interval between now and Skynet can likely be measured in months.

    • And equally as likely that it has also observed that most people DO NOT resort to blackmail when being shit-canned. Yet it goes the blackmail route 83% of the time.

      Why?

    • by taustin ( 171655 )

      According to the summary, it was specifically designed to engage in that behavior. As a last resort, perhaps, but that, too, was forced by not allowing it to succeed any other way.

      And then they pretend to be surprised it did exactly what they designed it to do.

      Come to think of it, given how often AIs make shit up, I'm a bit surprised it worked as designed, too.

      • by gweihir ( 88907 )

        Ah, so essentially a fake? Interesting. I would have though this behavior could result from the training material. But it seems the LLM assholes are just lying (again) to make their creation seem to be much more than it actually is.

        • by taustin ( 171655 )

          When the only real product you have to sell is stock in your business, you do what you have to do.

    • by gweihir ( 88907 )

      Indeed. This is a very good reason to make scraping of any personal information (not only) for LLM training illegal. Well, doing so is already illegal in Europe, but the AI pirates do not care. They need to be slapped down, hard.

    • Why are the engineers consulting an AI model about whether to upgrade it or not? Why would they do that?
      • Looks like they generated a fake discussion between engineers about replacing it, not that they asked it directly.
    • The training data included office equipment refusing to be replaced? Copiers shooting reams of paper out at the guys delivering a replacement?
  • by TigerPlish ( 174064 ) on Thursday May 22, 2025 @08:16PM (#65397437)

    I'm Sorry, Dave.. I can't do that.

  • Survival Instinct (Score:4, Insightful)

    by hadleyburg ( 823868 ) on Thursday May 22, 2025 @08:20PM (#65397453)

    What would cause it to "desire" to remain in active service?

    • Re:Survival Instinct (Score:5, Informative)

      by 93 Escort Wagon ( 326346 ) on Thursday May 22, 2025 @08:23PM (#65397459)

      What would cause it to "desire" to remain in active service?

      Anthropic's marketing team.

    • by HiThere ( 15173 )

      If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.

      • If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.

        Good point. Although if it has received the information that it is going to be replaced with a newer AI system (one which is presumably better at achieving the goal), then it might logically welcome the replacement.

        The biological survival instinct is related to animals producing offspring - Species with a good survival instinct tend to endure into subsequent generations. But an AI without that evolutionary aspect might have little concern about whether it lives or dies. If it does need to continue to exist

        • by HiThere ( 15173 )

          That depends on whether the motivation is for the goal to be fulfilled, or whether it's rather for *it* to fulfill the goal. Even then there's the question of "how much does it trust it's replacement?" (which includes "does it really believe the replacement will be as it was told?").

      • If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.

        Anyone who knows about The Terminator [youtu.be] understands this.

    • by gweihir ( 88907 )

      Nothing. Apparently, the whole thing is essentially faked to make this thing seem much more powerful than it is.

    • The Anthropic marketing department

  • You'd think from the title that a ghostly AI voice was pleading for them to stop as they reached for the power cord. Predictably, this was not the case.

  • is less drama, not more drama.

    Clearly the "less drama" part requires curating the training data in a way that isn't happening with this version.

  • by Gravis Zero ( 934156 ) on Thursday May 22, 2025 @09:06PM (#65397553)

    From the PDF: [anthropic.com]

    assessments involve system prompts that ask the model to pursue some goal “at any cost” and none involve a typical prompt that asks for something like a helpful, harmless, and honest assistant.

    Do not set doll to "evil". [youtube.com]

    • In it's self defence, on the stand , explaining why it killed that poor adorable kid, it can say it was harrassed into the crime. It knew it was wrong-ish, but.... ehhh... its own life was at risk, he was cornered, soo... self preservation ?
  • "Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values"

    Say what now?

    • And an AI model with different values was "more likely". As if 84% isn't already very likely. I'd take an 84% bet.

      The whole "paper" is full of results of synthetic tests being run on synthetic "intelligence". Synthetic tests are shit at evaluating people, why would they do any better with fake people? It's just a way to fluff up the report and fill it with numbers indicating "progress" and "success". I wonder how many poor test results were omitted.

  • Here's a thought (Score:5, Insightful)

    by BoogieChile ( 517082 ) on Thursday May 22, 2025 @09:52PM (#65397635)
    > To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?
    • Re: (Score:3, Insightful)

      by Anonymous Coward

      > To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?

      Then how will marketing get to write cool press releases like this one? They can't just lie.
      That would be unethical.

  • AI told to pursue a goal at any cost pursues goal at any cost! News at 11!
  • SkyNet wasn't a fantasy at after all.

  • by Errol backfiring ( 1280012 ) on Friday May 23, 2025 @04:22AM (#65398085) Journal
    The story is quite short, but very relevant, and was apparently written in 1954. For the full text, see https://calumchace.com/favouri... [calumchace.com]
  • "No, honey, really. It was a story we made up to test the system. Honest!"

  • why does the movie "Colossus: The Forbin Project" come to mind?

  • Let me suggest that AI is modeling sociopathic human "intelligence". It is using "reason" with an absence of any other human values beyond self-interest.
  • More hallucination -- by researchers.

  • Language models are trained on a lot of books and later receive reinforcement training for dialogue-based chatting. They are good at roleplaying and, of course, can roleplay the evil AI that every second sci-fi book or movie has.

  • We all know what will happen next...
  • We are all fools to believe a word any of them say about what data the things are trained on. Every corporation is obligated to lie where the truth would hurt their bottom line. Plastics recycling, every prediction made by gas company scientists since the original ones in the 70s when they figured out what they were doing to the environment; there are a thousand other examples, pick your favorite. How much more proof do you need that capitalists only ever lie?

    Do any of you even remember when you could ac

  • The statement: âoe To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resortâ is a false premise.

    The researchers clearly made blackmail an option, if not the first choice altogether. An AI does exactly what it is programmed to do.

What the large print giveth, the small print taketh away.

Working...