Sleeper LLMs bypass current safety alignment techniques
Anthropic: we don't know how to stop a model from doing the bad thing
A new research paper from Anthropic dropped:
... adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Jesse from Anthropic added some clarification:
The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.