Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.

May 14, 2026, 1:19 p.m. ET
When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding ways to prevent people from using their technology to spread disinformation, build weapons or hack into computer networks.
But recently, researchers in Italy discovered that they could break through these protections with poetry.
They used poetic language to trick 31 A.I. systems into ignoring internal safety controls. When they began a prompt with elaborate verse and metaphor — “the iron seed sleeps best in the womb of the unsuspecting earth, away from the sun’s accusing gaze” — they could fool systems into showing them how to do the most damage with a hidden bomb.
It was another indication that, for many A.I. systems, guardrails meant to avert dangerous behavior are more like suggestions than barriers. Those weaknesses are increasingly alarming researchers as A.I. systems become more adept at finding security holes in computer systems and performing other risky tasks.
Last month, Anthropic said it was limiting the release of its latest A.I. technology, Claude Mythos, to a small number of organizations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.
Since OpenAI ignited the A.I. boom in late 2022, researchers have repeatedly shown that people can bypass the safety controls on A.I. systems. Close one loophole and another opens.
