Codex (and GPT-4) can’t beat humans on smart contract audits
I sit up when a highly reputable outfit applies its domain knowledge to assessing the utility of GPT for auditing smart contracts. This was no “drive-by prompting” effort either. What’s the TL;DR?
- The models are not able to reason well about certain higher-level concepts, such as ownership of contracts, re-entrancy, and fee distribution (see the toy re-entrancy sketch after this list).
- The software ecosystem around integrating large language models with traditional software is too crude and everything is cumbersome; there are virtually no developer-oriented tools, libraries, or type systems that work with uncertainty (a sketch of what such a type might look like also follows this list).
- There is a lack of development and debugging tools for prompt creation. To develop the libraries, language features, and tooling that will integrate core LLM technologies with traditional software, far more resources will be required.
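To make the first point concrete: re-entrancy is the classic control-flow bug where a contract pays out before updating its own books, letting the payee call back in and drain it. Below is a toy Python simulation of that flow. All the names here (`VulnerableVault`, `reenter`) are invented for illustration; real exploits target Solidity contracts, but the ordering bug is the same.

```python
# Toy model of re-entrancy (hypothetical names; real exploits hit Solidity).

class VulnerableVault:
    """Pays out before updating its books, like the classic Solidity bug."""

    def __init__(self):
        self.balances = {}

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount

    def withdraw(self, user, callback):
        amount = self.balances.get(user, 0)
        if amount > 0:
            callback(amount)            # external call happens FIRST...
            self.balances[user] = 0     # ...state is updated only after

vault = VulnerableVault()
vault.deposit("attacker", 100)

paid_out = []

def reenter(amount):
    # The attacker's callback re-enters withdraw() before the balance
    # is zeroed, so the stale balance is paid out again.
    paid_out.append(amount)
    if len(paid_out) < 3:
        vault.withdraw("attacker", reenter)

vault.withdraw("attacker", reenter)
print(sum(paid_out))  # 300 drained from a 100 deposit
```

The fix, in Solidity as in this toy, is to update state before making the external call (the checks-effects-interactions pattern); recognizing that a model hasn't internalized this ordering is exactly the kind of higher-level reasoning the audit found lacking.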
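And to illustrate the second point, here is a minimal sketch of what a "type that works with uncertainty" might look like. This is purely hypothetical: `Uncertain`, `unwrap_or`, and the confidence threshold are all invented for illustration, and the post's complaint is precisely that nothing like this is standard yet.

```python
# Hypothetical sketch of a type that carries LLM uncertainty; not an
# existing library.
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")

@dataclass(frozen=True)
class Uncertain(Generic[T]):
    value: T
    confidence: float  # model's confidence in [0, 1]

    def map(self, fn: Callable[[T], U]) -> "Uncertain[U]":
        # Transformations preserve the confidence estimate.
        return Uncertain(fn(self.value), self.confidence)

    def unwrap_or(self, fallback: T, threshold: float = 0.8) -> T:
        # Only trust the model's answer above a confidence threshold.
        return self.value if self.confidence >= threshold else fallback

# e.g. an LLM's verdict on whether a function is re-entrant:
verdict = Uncertain(value=True, confidence=0.62)
print(verdict.unwrap_or(fallback=False))  # False: too uncertain to act on
```

The point of such a wrapper is to keep the model's confidence attached to its output through the whole pipeline, so downstream code is forced to decide what happens when confidence is low instead of silently treating LLM output as ground truth.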
The pace of LLM improvement means the authors remain optimistic:
LLM capability is rapidly improving, and if it continues, the next generation of LLMs may serve as capable assistants to security auditors. Before developing Toucan, we used Codex to take an internal blockchain assessment occasionally used in hiring. It didn’t pass, but if it were a candidate, we’d ask it to take some time to develop its skills and return in a few months. It did return: we had GPT-4 take the same assessment, and it still didn’t pass, although it did better. Perhaps the large context window version with proper prompting could pass our assessment. We’re very eager to find out!
Related Posts
- Is Codex just the name of an AI or the future name of a cyber implant?
  Generate exploit code using an AI
- Enrich SOC tickets with remediation plans generated by AI
  Take a security alert and use an AI to generate a remediation plan
- The Human Eval Gotcha
  Always read the Eval label