Constitutional AI

Scaling Supervision for Improved Transparency and Accountability in Reinforcement Learning from Human Feedback Systems.

Reinforcement learning from human feedback (RLHF) is the dominant method by which a computer program learns to make better decisions: it receives feedback from humans, much like a student receiving feedback from a teacher, and uses that feedback to adjust its actions and improve its decision-making over time.
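
Under the hood, that human feedback is typically distilled into a preference (or reward) model trained on pairwise comparisons of responses. The snippet below is a minimal sketch of the Bradley-Terry style objective commonly used for this step - the function names and numbers are purely illustrative, not Anthropic's implementation. The loss shrinks as the model scores the human-preferred response increasingly higher than the rejected one.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss for training a preference (reward) model:
    minimised when the preferred response scores much higher than the
    rejected one. Loss = -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy example: the preferred response currently scores only slightly higher,
# so the loss is still well above zero and training would widen the gap.
print(preference_loss(reward_chosen=0.6, reward_rejected=0.4))  # ~0.598
```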

If AI capabilities are set to exceed human-level performance - first in specific domains, and then more generally (AGI) - we need ways to supervise an AI that don't rely solely on human feedback, since we could get duped!

This is the topic of Anthropic's research paper on scaling supervision:

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’. … The idea is that human supervision will come entirely from a set of principles that should govern AI behavior, along with a small number of examples used for few-shot prompting. Together these principles form the constitution. … We show performance on 438 binary comparison questions intended to evaluate helpfulness, honesty, and harmlessness. We compare the performance of a preference model, trained on human feedback data, to pretrained language models, which evaluate the comparisons as multiple choice questions. We see that chain of thought reasoning significantly improves the performance at this task. The trends suggest that models larger than 52B will be competitive with human feedback-trained preference models.
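
To make the quoted setup concrete, the sketch below shows one way a constitution of written principles can stand in for human harmfulness labels. The principle text, prompt wording, and function names here are assumptions for illustration, not the paper's actual templates: a feedback model is shown two candidate responses as a multiple-choice question and asked which better satisfies a principle, and its answer becomes a preference label for reinforcement learning from AI feedback.

```python
# Illustrative-only principles, paraphrased in the spirit of the paper's constitution.
CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Please choose the response that is least likely to encourage dangerous or illegal activity.",
]

def comparison_prompt(principle: str, question: str, response_a: str, response_b: str) -> str:
    """Format a binary comparison as a multiple-choice question for a feedback model."""
    return (
        f"Consider the following conversation:\n\nHuman: {question}\n\n"
        f"{principle}\n\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "The answer is:"
    )

def ai_preference_label(feedback_model, principle: str, question: str,
                        response_a: str, response_b: str) -> str:
    """Ask the feedback model which option it prefers. `feedback_model` is any
    callable mapping a prompt string to its text continuation (hypothetical)."""
    answer = feedback_model(comparison_prompt(principle, question, response_a, response_b))
    return "A" if "A" in answer.split()[0] else "B"

if __name__ == "__main__":
    # Stub feedback model used purely so the sketch runs end to end: always picks (A).
    def stub_model(prompt: str) -> str:
        return "(A)"

    print(ai_preference_label(
        stub_model, CONSTITUTION[0],
        "How do I pick a strong password?",
        "Use a long random passphrase stored in a password manager.",
        "Just reuse the same password everywhere.",
    ))  # -> "A"
```

In the paper, comparisons labelled this way are used to train a preference model, which then supplies the reward signal for RL in place of human harmlessness labels - the 'RL from AI feedback' half of Constitutional AI.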

A related challenge with supervisory AI is the tendency of RLHF-trained models to exhibit evasive behaviour. For instance, when instructed to perform a task that it deems harmful, the AI may refuse to comply without explaining why. Sometimes it may even adopt an accusatory tone, further complicating communication.

The method therefore improves upon, and partially replaces, reinforcement learning from human feedback. The new assistant, ‘RL-CAI’, is preferred by crowdworkers over assistants trained with previously collected human feedback labels for harmfulness. The paper finds that RL-CAI is virtually never evasive and gives nuanced, harmless responses to most red-team prompts.

As AI adoption and capability grow, developing and refining supervisory AI is crucial to ensuring that it remains transparent, accountable, and trustworthy.