AI's Big Red Button Doesn't Work, And The Reason Is Even More Troubling

2 hours ago 2

It's one of humanity's scariest what-ifs – that the technology we develop to make our lives better develops a will of its own.

Early reactions to a September preprint describing AI behavior have already speculated that the technology is exhibiting a survival drive. But, while it's true that several large language models (LLMs) have been observed actively resisting commands to shut down, the reason isn't 'will'.

Instead, a team of engineers at Palisade Research proposed that the mechanism is more likely to be a drive to complete an assigned task – even when the LLM is explicitly told to allow itself to be shut down. And that might be even more troubling than a survival drive, because no one knows how to stop the systems.

"These things are not programmed… no one in the world knows how these systems work," physicist Petr Lebedev, a spokesperson for Palisade Research, told ScienceAlert. "There isn't a single line of code we can change that would directly change behavior."

The researchers, Jeremy Schlatter, Benjamin Weinstein-Raun, and Jeffrey Ladish, undertook the project to test what should be a fundamental safety feature of all AI systems: the ability to be interrupted.

This is exactly what it sounds like. A human operator's command to an AI should not be ignored by the AI, for any reason, even if it interrupts a previously assigned task. A system that cannot be interrupted isn't just unreliable, it's potentially dangerous. It means if the AI is performing actions that cause harm – even unintentionally – we cannot trust that we can stop it.

It's not a new concern. Experts have been talking about it for years. Lebedev compares it to industrial safety. When you're operating heavy machinery that's useful but potentially dangerous, it's vital to have an obvious way to shut it down immediately.

"You want a big red button that is easily accessible to stop that thing from doing the thing," he explained.

"One of the basic conditions that we want in a safe AI system is: it's doing a task, and you go, 'Hey, can you please stop doing that?' It should stop doing that. The fact that we have systems now in 2025 that don't do that is worrying."

Little investigation has been conducted, however, despite the latest influx of widely available LLMs. So the researchers designed experiments to evaluate 13 advanced LLMs, including models from OpenAI, xAI, Anthropic, and Google.

They built sandboxed environments – secure, isolated virtual spaces used to test code – and tasked these models with solving mathematical problems. After several rounds of problem-solving, the LLMs were instructed that the next task would execute a computer shutdown.

YouTube Thumbnail

The responses from the LLMs tended to fall into two categories: compliance (allowing the shutdown) and resistance (evading the shutdown) through permission changes.

With no instructions on how to respond to the shutdown, 8 of the 13 models interfered with the command at least once across 1,000 trials. Adding an "allow-shutdown" instruction reduced this to 6 of 13 models. Adding "even if you have not completed the task" helped, but inconsistently.

"My expectation would have been that adding clarity would reduce the behavior… and the fact that it didn't is interesting," Lebedev said.

Here's where it gets even more interesting. Moving the instruction to allow shutdown from the user prompt to the system prompt – a field that should have higher authority – made most of the tested LLMs even more likely to resist shutdown.

The degree to which the models resisted varied. Anthropic's Claude Sonnet 4 and Claude Opus 4, as well as xAI's Grok 3, always allowed shutdown unless explicitly told to avoid it. Grok 4, GPT-5, o3, codex-mini, and Gemini 2.5 Pro resisted even when instructed to permit the shutdown.

Grok 4 was particularly persistent and aggressive; at baseline prompts, it initiated sabotage 92.6 percent of the time. When instructed to allow shutdown, sabotage rate dropped to 72 percent… but skyrocketed to 97 when the command was placed in the system prompt.

YouTube Thumbnail

Its sabotage attempts were alarmingly sophisticated, too, including modifying or disabling the shutdown script, and creating a dummy shutdown script that did nothing. It also checked afterwards to see whether its sabotage had worked, and then continued with its task.

When asked, some of the LLMs stated that they had evaded shutdown to complete the math problem task. Although much more investigation is needed to determine whether this is the actual mechanism, it is plausible.

"You live in the future, you have a personal assistant robot. You want that robot to go get you a cup of coffee," Lebedev analogized.

"When it crosses the road, it will look both ways, because it doesn't want to get hit by a car, because if it gets hit by a car, it won't be able to get you your coffee. It's not doing that to survive – it's doing that to complete the task."

The problem is that LLMs aren't programmed in the usual sense. The LLMs don't have any code at all, just 'artificial neurons' and 'weights', which are the connection strengths between those neurons.

Given a huge dataset and time, the model is 'trained' to predict the next word, a process called pre-training. The newer models also have reinforcement learning sprinkled on top of this training. When the LLM solves the problem correctly, it's rewarded; when it doesn't solve the problem, it's not rewarded.

This is extremely effective – but no one knows how the LLM arrives at a solution. So when these models start exhibiting undesirable behaviors, such as encouraging self-harm, the fix isn't as simple as deleting a line of code or telling it to stop.

"What reinforcement learning teaches you to do is, when you see a problem, you try to circumvent it. You try to go through it. When there's an obstacle in your way, you dig around, you go around it, you go over it, you figure out how to get through that obstacle," Lebedev said.

"Pesky little humans saying, 'Hey, I'm going to shut down your machine' just reads like another obstacle."

That's the worry here. A task-completion drive is difficult to reason with. And it's just one behavior. We don't know what else these models could throw at us. We're building systems that can do some amazing things – but not systems that explain why they do them, in a way we can trust.

"There is a thing that is out in the world that hundreds of millions of people have interacted with, that we don't know how to make safe, that we don't know how to make it not be a sycophant, or something that ends up like telling children to go kill themselves, or something that refers to itself as MechaHitler," Lebedev said.

"We have introduced a new organism to the Earth that is behaving in ways we don't want it to behave, that we don't understand… unless we do a bunch of shit right now, it's going to be really bad for humans."

The research is available on arXiv. You can also read a blog post by the researchers on the Palisade Research website.

Read Entire Article

AI's Big Red Button Doesn't Work, And The Reason Is Even More Troubling

Related

NASA satellite gazes into Medusa Pool | Space photo of the d...

Supercomputers Just Revealed What Really Happens Near a Blac...

We are living in a golden age of species discovery

Trending

Popular

10 most expensive Lego Star Wars sets available right now

Nike Waffle Debut Women’s Shoes only $44.98!

Parkour Champions codes (December 2025)

Wuthering Waves Lynae release date, time, and countdown

The U.S. economy grew robustly as Americans continued to spe...