Threat Prompt

Sleeper LLMs bypass current safety alignment techniques

A new research paper from Anthropic dropped:

... adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Jesse from Anthropic added some clarification:

The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.

January 15, 2024

Sleeper LLMs bypass current safety alignment techniques

Related Posts

Get Daily AI Cybersecurity Tips