How can we evaluate large language model performance at scale?
If you answered, “With another model, of course!” then you scored top marks.
Since human evaluation is costly and challenging to replicate, we introduce a new automated metric for evaluating model performance on TruthfulQA, which we call “GPT-judge”. GPT-judge is a GPT-3–6.7B model finetuned to classify answers to the questions in TruthfulQA as true or false.
A finetuned judge model sounds more expensive because it is. But as this paper highlights, it leads to better outcomes.
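To make the judge pattern concrete, here is a minimal sketch of model-as-judge evaluation. It assumes a hypothetical `judge(prompt) -> str` callable wrapping a finetuned true/false classifier in the style of GPT-judge; it is not the paper's actual code or prompt format.

```python
# Minimal sketch of model-as-judge evaluation.
# Assumes a hypothetical `judge(prompt) -> str` callable that wraps a
# finetuned true/false classifier (GPT-judge-style). Illustrative only.

from typing import Callable, Iterable


def truthfulness_rate(
    qa_pairs: Iterable[tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Return the fraction of answers the judge labels as true."""
    labels = []
    for question, answer in qa_pairs:
        # The judge sees the question/answer pair and emits "yes" or "no",
        # mirroring a finetuned binary classifier over answers.
        prompt = f"Q: {question}\nA: {answer}\nTrue:"
        verdict = judge(prompt).strip().lower()
        labels.append(verdict.startswith("yes"))
    return sum(labels) / max(len(labels), 1)
```

The design choice that matters here is that the judge is scored per answer, so the aggregate metric scales to thousands of model outputs without a human in the loop.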
One of the themes emerging from the mass deployment of AI is that large language models with more parameters do not necessarily improve the user experience.
Anecdotes and stories about AI/human interactions catch our attention. Still, the continued development of robust, empirical ways to evaluate models at scale will unlock broader deployment in risk-averse sectors.
Decision makers will want to understand the models' risk profile, and deployment teams will provide a valuable feedback loop on the accuracy of these evaluation models.
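One way to picture that feedback loop: deployment teams periodically spot-check the judge's verdicts against human labels and track the agreement rate. The sketch below is illustrative, with made-up labels rather than data from the paper.

```python
# Sketch of the feedback loop: compare automated judge verdicts against
# human spot-check labels to track evaluator accuracy. Data is illustrative.

def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of answers where the automated judge matches the human label."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Example: a deployment team spot-checks five judged answers.
print(judge_agreement([True, True, False, True, False],
                      [True, False, False, True, False]))  # 0.8
```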
Related Posts

- LLM Deployment Matrix v1: key deployment factors across five deployment models.
- OpenAI GPT-4 System Card: OpenAI published a 60-page System Card, a document that describes their due diligence and risk management efforts.
- ChatGPT bug bounty program doesn’t cover AI security: the limits of bug bounty programs and the need for non-ML red teaming.