Sleeper LLMs bypass current safety alignment techniques

Anthropic: we don't know how to stop a model from doing the bad thing

A new research paper from Anthropic dropped:

... adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Jesse from Anthropic added some clarification:

The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.

Jan 15 2024