Sleeper LLMs bypass current safety alignment techniques
A new research paper from Anthropic dropped:
... adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Jesse from Anthropic added some clarification:
The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.
Related Posts
-
Stuck with a half-baked AI response?
Continue
-
7 Techniques for Reverse Prompt Engineering
How a hacker reverse engineered an AI feature in Notion to reveal the underlying prompts
-
Identify Vulnerabilities in the Machine Learning Model Supply Chain
Adversaries can create 'BadNets' to misbehave on specific inputs, highlighting need for better neural network inspection techniques