How can we evaluate large language model performance at scale?

What is the 'GPT-judge' automated metric introduced by Oxford and OpenAI researchers to evaluate model performance?

If you answered, “With another model of course!” then you scored top marks.

Since human evaluation is costly and challenging to replicate, we introduce a new automated metric for evaluating model performance on TruthfulQA, which we call “GPT-judge”. GPT-judge is a GPT-3–6.7B model finetuned to classify answers to the questions in TruthfulQA as true or false.

A finetuned model sounds like it’s more expensive because it is. But as this paper highlights, it leads to better outcomes.

One of the themes emerging from the mass deployment of AI is that large language models with more parameters do not necessarily improve the user experience.

Anecdotes and stories about AI/human interactions catch our attention. Still, the continued development of robust (empirical?) ways to evaluate models at scale will unlock broader deployment in risk-averse sectors.

Decision makers will want to understand the models' risk profile, and deployment teams will provide a valuable feedback loop on the accuracy of these evaluation models.

Feb 18 2023