The Human Eval Gotcha
To compare the capability of one AI with another, or to gauge how well a model handles a particular domain-specific task, the underlying LLM can be evaluated against the many benchmark suites currently available on platforms like Hugging Face.
Today, there is no single "definitive" or community-agreed test for generalised AI, as exploration and innovation in evaluation methodologies continue to lag behind the accelerated pace of AI model development.
One popular method, though not methodologically robust, is what is known as the "vibes" test. It relies on human intuition and echoes the bond between a horse and its whisperer: here the whisperer is a human with heightened sensitivity and skill in eliciting and assessing LLM responses. It turns out some people have a particular knack for it!
But some tests have downright misleading names. One such confusion arises from "HumanEval", a term that suggests human testing when, in fact, it involves no human evaluators at all. Originally designed as a benchmark for the Codex model, a GPT model fine-tuned on publicly accessible GitHub code, HumanEval tests a model's ability to convert Python docstrings into executable code; it does not evaluate any human-like attributes of the AI. So when a claim surfaces about an LLM scoring high on HumanEval, a discerning reader should remember that it reflects the model's programming prowess rather than an evaluation by humans.
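To make that concrete, here is a minimal sketch of what a HumanEval-style problem looks like: a prompt consisting of a function signature and docstring, a candidate completion produced by the model, and a functional test that decides pass or fail. The task and test below are illustrative, not taken from the actual benchmark.

```python
# The prompt is a function signature plus a docstring; the model under test
# must produce a working body, which is then checked with unit tests.
PROMPT = '''
def running_max(numbers):
    """Return a list where each element is the largest value seen so far."""
'''

# A candidate completion sampled from the model:
COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

# The benchmark executes prompt + completion, then runs assertions against it.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
assert namespace["running_max"]([1, 3, 2, 5]) == [1, 3, 3, 5]
print("candidate passed the functional test")
```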
One welcome development is that Hugging Face, the leading model-hosting service, has more aptly renamed HumanEval as CodeEval, a name that better reflects what the evaluation actually measures.
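For readers who want to try this themselves, the sketch below assumes Hugging Face's `evaluate` library and its `code_eval` metric, which scores candidate completions with the pass@k measure used by HumanEval. The toy problem and candidate completions are invented for illustration.

```python
import os

# code_eval executes untrusted, model-generated code, so Hugging Face
# requires an explicit opt-in before the metric will run.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

from evaluate import load

code_eval = load("code_eval")

# One toy problem: the reference is a test case, the predictions are
# candidate completions sampled from the model under test.
test_cases = ["assert add(2, 3) == 5"]
candidates = [[
    "def add(a, b):\n    return a * b",   # incorrect completion
    "def add(a, b):\n    return a + b",   # correct completion
]]

pass_at_k, results = code_eval.compute(
    references=test_cases,
    predictions=candidates,
    k=[1, 2],
)
print(pass_at_k)
```

Here pass@1 comes out at 0.5 because only one of the two sampled completions passes the test, while pass@2 is 1.0 because at least one of any two samples succeeds.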
Always read the eval label!
Related Posts
- Constitutional AI: Scaling Supervision for Improved Transparency and Accountability in Reinforcement Learning from Human Feedback Systems.
- ActGPT: Chatbot Converts Human Browsing Cues into Browser Actions. AI's Potential to Automate Web Browsing.
- How truthful are Large Language Models? What did a study by Oxford and OpenAI researchers reveal about the truthfulness of language models compared to human performance?