Adversarial Policies Beat Superhuman Go AIs

Discover an unexpected failure mode of a superhuman AI system

A team of researchers trained an adversarial AI to play against a frozen "victim" copy of KataGo, a state-of-the-art Go AI system often characterised as “superhuman”.

We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo—in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.
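To give a feel for the core idea of the attack—training an adversary to exploit a *frozen* opponent policy—here is a minimal, hypothetical toy sketch. It is not the paper's actual training setup (which pits an MCTS-based adversary against the full KataGo network in real games of Go); it simply shows an adversary learning a best response to a fixed opponent in a small zero-sum matrix game, with all names and payoffs invented for illustration.

```python
# Hypothetical toy illustration (not the paper's setup): an "adversary"
# learns a best response to a *frozen* victim policy in a small
# zero-sum matrix game, using gradient ascent on expected payoff.
import numpy as np

# Payoff matrix for the adversary (rows = adversary actions,
# columns = victim actions). Zero-sum: the victim's payoff is the negative.
payoff = np.array([
    [ 1.0, -1.0,  0.5],
    [-0.5,  1.0, -1.0],
    [ 0.0, -0.5,  1.0],
])

# Frozen "victim": a fixed mixed strategy that is never updated,
# analogous to attacking a frozen copy of the victim network.
victim_policy = np.array([0.2, 0.5, 0.3])

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Adversary policy parameterised by logits, trained against the frozen victim.
logits = np.zeros(3)
learning_rate = 0.5

for step in range(200):
    adversary_policy = softmax(logits)
    # Expected payoff of each adversary action against the frozen victim.
    action_values = payoff @ victim_policy
    # Policy-gradient-style update: shift probability toward actions
    # whose value exceeds the adversary's current expected payoff.
    baseline = adversary_policy @ action_values
    logits += learning_rate * adversary_policy * (action_values - baseline)

print("Adversary policy after training:", softmax(logits).round(3))
print("Expected payoff vs frozen victim:", softmax(logits) @ payoff @ victim_policy)
```

Because the victim never adapts, the adversary is free to concentrate on whatever weakness the fixed policy happens to have—the same reason an adversarial policy can beat a far stronger opponent without being strong in any general sense.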

How did the games play out?

The victim gains an early lead that soon looks insurmountable. The adversary then sets a trap that would be easy for a human to spot and avoid. But the victim is oblivious, and its position collapses.

So what?

New technologies tend to have unexpected failure modes. Does AI have more? Or, if not more, are the implications simply more significant?

If an AI makes mistakes that amateurs could easily spot, in what risk scenarios should an AI be supervised, and what form of supervision is acceptable? Human, machine (another AI), or a combination? And what supervisory failure rate is acceptable in which scenarios?