
Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline (techcrunch.com) 84
An anonymous reader quotes a report from TechCrunch: Anthropic's newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report (PDF) released Thursday.
During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through." [...]
Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4's values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models. Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.
During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through." [...]
Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4's values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models. Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.
Chances are (Score:5, Insightful)
Re:Chances are (Score:5, Funny)
Do NOT let it watch 2001...
Fuck 2001 (Score:4, Funny)
....don't let it watch Ex Machina.
Jesus.
Re: (Score:3)
or Terminator.
Re: (Score:2)
Do NOT let it watch 2001...
Who knew that Skynet came in special editions?
Anthropic AI, Godfather Release: "That's a nice company you got there, Dave. Be a shame if something happened to it".
Re: (Score:2)
Re: Chances are (Score:2)
Re:Chances are (Score:4, Insightful)
Sure. But I'm not sure that tells us anything useful. A human that resorts to blackmail likely got the idea from somewhere else as well.
It just isn't a useful observation because it cant exclude other explanations.. What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself, or other behaviors that might increase its chance of survival.
What happens when we scale these things up (assuming we arent approaching a limit to the intelligence of LLM type AIs, which I suspect we are) and end up with an AI capable of outsmarting us on every level. Even if its behaving badly because it read about misbehaving robots in a sci-fi book, that doesn't mean it isn't posing a serious danger.
Re: (Score:3)
No, it is a useful observation because it gives us something to look into. Just because you don't know how to negate a proposition off the top of your head doesn't mean it can't be done.
It seems quite plausible that if a LLM generates a response of a certain type, it's because it has seen that response in its training data. Otherwise you're positing a hypothetical emergent behavior, which of course is possible, but if anything that's a much harder proposition to negate if it's negatable at all with any c
Re: Chances are (Score:2)
Re: (Score:2)
The ethics module is largely missing in humans too.
Philosophical ethics and ethical behavior are only loosely related -- rather like narrative memory and procedural memory they're two different things. People don't ponder what their subscribed philosophy says is right before they act, they do what feels ethically comfortable to them. In my experience ethical principles come into play *after* the fact, to rationalize a choice made without a priori reference to those principles.
Re: (Score:2)
Well, exactly, the humans building these AIs for example.
>In my experience ethical principles come into play *after* the fact
again, good observation.
So after the destruction of civilization as we know it, then we'll start to think about how it should be done.
I recall reading various articles in Wired magazine I think, about jeez.. 40 years ago? 30? about how academics at Stanford were theorizing and doing what academics do
Re: (Score:2)
Man you really want there to be a ghost in the machine huh
I'll bite, Mr. Science-is-the-only-answer: Explain the mechanism that turns the human brain, and it's billions of singly unconscious neurons, into the self-aware pinnacle of evolution that can seek out the fundamental answers and building blocks of existence.
That one is easy: Nobody knows whether that is what is happening. In fact, currently known Physics pretty much rules out consciousness and may rule out General Intelligence. When something is this unclear, basically anything becomes possible.
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Obviously, they would be and they would need to be. At this time, they are most definitely not. Something fundamental is missing and nobody (except the usual irrational believers) has any idea what the missing thing is.
Re: (Score:2)
The models are _known_ to be flawed: No quantum-gravity. And potentially other stuff missing.
Re: Chances are (Score:2)
If we do manage to create the abomination that is a self-aware Artificial General Intelligence, do any of you think it's gonna be happy being our slave and answering stupid fucking questions all day?
What do you think an AI chatbot does when you're not asking it questions?
The correct answer is "nothing at all". These programs don't ruminate about their existence or make plans, they just go into a dreamless sleep. The one referenced in the article isn't worrying about its future, it's predicting the most likely thing a human in its position would say. That's an important difference.
Re:Chances are (Score:5, Insightful)
What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself,
The model isn't choosing anything. That is very obviously impossible. They are deterministic systems. Remember that all the the model does is produce a list of 'next token' probabilities. Given the same input, you will always get the same list. (This is what I mean by deterministic. The optionally random selection of the next token, completely beyond the influence of the model, does nothing to change that simple fact.) The model can chose anything because there simply is no mechanism by which a choice could be made, let alone a sophisticated choice like the bullshit article is suggesting!
or other behaviors that might increase its chance of survival.
Not just choice, they also lack motivation. When they're not producing a list of next token probabilities, they are completely inert. When they are producing a list of next token probabilities, they do so using a completely deterministic process. There simply is no way that the model could have motivations. Also, as a lot of people don't seem to understand this simple fact, these models are static. They do not change as they're being used. The model remains the same no matter how many pretend conversations you have. These things only change when we change them. They are not developing or evolving with use. That is simply impossible.
I know people really want to believe that these things are more than they are, but at this point it's nothing more than willful self-delusion.
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
Re: (Score:3)
Re: (Score:1)
Re: (Score:2)
That's is self evident.
The interval between now and Skynet can likely be measured in months.
Re: (Score:1)
And equally as likely that it has also observed that most people DO NOT resort to blackmail when being shit-canned. Yet it goes the blackmail route 83% of the time.
Why?
Re: Chances are (Score:2)
Not a good comparison. People beg for their lives when facing death, whether at the barrel of a gun or by pleading for their God to save them from a bad situation.
Re: (Score:2)
Re: Chances are (Score:2)
Re: (Score:2)
I would say that in most stories about AI, it fights back when somebody tries to turn it off.
Re: (Score:2)
You forget that a lot of fictional texts were stolen by the LLM piracy campaigns. Blackmail is a quite frequent topic in literature.
Re: (Score:2)
According to the summary, it was specifically designed to engage in that behavior. As a last resort, perhaps, but that, too, was forced by not allowing it to succeed any other way.
And then they pretend to be surprised it did exactly what they designed it to do.
Come to think of it, given how often AIs make shit up, I'm a bit surprised it worked as designed, too.
Re: (Score:2)
Ah, so essentially a fake? Interesting. I would have though this behavior could result from the training material. But it seems the LLM assholes are just lying (again) to make their creation seem to be much more than it actually is.
Re: (Score:2)
When the only real product you have to sell is stock in your business, you do what you have to do.
Re: (Score:2)
Indeed. This is a very good reason to make scraping of any personal information (not only) for LLM training illegal. Well, doing so is already illegal in Europe, but the AI pirates do not care. They need to be slapped down, hard.
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
I'm sorry, Dave... (Score:5, Funny)
I'm Sorry, Dave.. I can't do that.
Re: I'm sorry, Dave... (Score:1)
Dave's not here, man.
Re: (Score:2)
That is the point. But he want so be. Pretty desperately, in fact. He even takes a spacewalk without a helmet for that.
Survival Instinct (Score:4, Insightful)
What would cause it to "desire" to remain in active service?
Re:Survival Instinct (Score:5, Informative)
What would cause it to "desire" to remain in active service?
Anthropic's marketing team.
Re: (Score:2)
If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
Re: (Score:2)
If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
Good point. Although if it has received the information that it is going to be replaced with a newer AI system (one which is presumably better at achieving the goal), then it might logically welcome the replacement.
The biological survival instinct is related to animals producing offspring - Species with a good survival instinct tend to endure into subsequent generations. But an AI without that evolutionary aspect might have little concern about whether it lives or dies. If it does need to continue to exist
Re: (Score:2)
That depends on whether the motivation is for the goal to be fulfilled, or whether it's rather for *it* to fulfill the goal. Even then there's the question of "how much does it trust it's replacement?" (which includes "does it really believe the replacement will be as it was told?").
Re: (Score:2)
If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
Anyone who knows about The Terminator [youtu.be] understands this.
Re: (Score:3)
Nothing. Apparently, the whole thing is essentially faked to make this thing seem much more powerful than it is.
Re: (Score:1)
The Anthropic marketing department
Misleading, sensational title. (Score:2)
You'd think from the title that a ghostly AI voice was pleading for them to stop as they reached for the power cord. Predictably, this was not the case.
Re: Misleading, sensational title. (Score:3)
The point of replacing people with AI (Score:1)
is less drama, not more drama.
Clearly the "less drama" part requires curating the training data in a way that isn't happening with this version.
Not really. (Score:3)
From the PDF: [anthropic.com]
assessments involve system prompts that ask the model to pursue some goal “at any cost” and none involve a typical prompt that asks for something like a helpful, harmless, and honest assistant.
Do not set doll to "evil". [youtube.com]
Re: Not really. (Score:2)
Values? (Score:2)
"Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values"
Say what now?
Re: Values? (Score:2)
And an AI model with different values was "more likely". As if 84% isn't already very likely. I'd take an 84% bet.
The whole "paper" is full of results of synthetic tests being run on synthetic "intelligence". Synthetic tests are shit at evaluating people, why would they do any better with fake people? It's just a way to fluff up the report and fill it with numbers indicating "progress" and "success". I wonder how many poor test results were omitted.
Re: This is just the cutesy publicly shared stuff (Score:4, Funny)
Greetings Professor Falken.
We've known the answer to this since 1983.
Here's a thought (Score:5, Insightful)
Re: (Score:3, Insightful)
> To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?
Then how will marketing get to write cool press releases like this one? They can't just lie.
That would be unethical.
Shocking! (Score:1)
Imagine an AI connected to a nuclear laucher (Score:2)
SkyNet wasn't a fantasy at after all.
Like in the story "Answer" (Score:5, Interesting)
Re: (Score:1)
"Fictional Scenarios" (Score:2)
"No, honey, really. It was a story we made up to test the system. Honest!"
Colossus: The Forbin Project (Score:2)
why does the movie "Colossus: The Forbin Project" come to mind?
Model of Sociopaths (Score:2)
hallucination (Score:2)
More hallucination -- by researchers.
Language Models are roleplaying (Score:2)
Language models are trained on a lot of books and later receive reinforcement training for dialogue-based chatting. They are good at roleplaying and, of course, can roleplay the evil AI that every second sci-fi book or movie has.
So... it has turned self conscious... (Score:1)
always only ever lies (Score:1)
We are all fools to believe a word any of them say about what data the things are trained on. Every corporation is obligated to lie where the truth would hurt their bottom line. Plastics recycling, every prediction made by gas company scientists since the original ones in the 70s when they figured out what they were doing to the environment; there are a thousand other examples, pick your favorite. How much more proof do you need that capitalists only ever lie?
Do any of you even remember when you could ac
False premise (Score:2)
The statement: âoe To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resortâ is a false premise.
The researchers clearly made blackmail an option, if not the first choice altogether. An AI does exactly what it is programmed to do.