Codex (and GPT-4) can’t beat humans on smart contract audits

GPT's potential in smart contract auditing: real limitations today, but reason for optimism as model capabilities rapidly improve

I sit up and take notice when a highly reputable outfit applies its domain knowledge to assess the utility of GPT for auditing smart contracts. This was no “drive-by prompting” effort, either. What’s the TL;DR?

  1. The models cannot reason well about certain higher-level concepts, such as ownership of contracts, re-entrancy, and fee distribution (a toy illustration of re-entrancy follows this list).
  2. The software ecosystem for integrating large language models with traditional software is too crude: everything is cumbersome, and there are virtually no developer-oriented tools, libraries, or type systems that work with uncertainty (one possible shape of such a type is sketched below).
  3. Development and debugging tools for prompt creation are lacking (a minimal testing harness is sketched below). Building the libraries, language features, and tooling that will integrate core LLM technology with traditional software will require far more resources.
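
To make the first point concrete, here is a toy model of the classic re-entrancy bug, written in Python rather than Solidity for readability. The `VulnerableVault` and `Attacker` names are my own illustration, not code from the report. The vault pays out before updating its books, so a malicious recipient can re-enter `withdraw()` and drain funds that belong to other depositors:

```python
# Toy re-entrancy bug: the external call happens before the state update.
class VulnerableVault:
    def __init__(self):
        self.balances = {}   # depositor -> credited amount
        self.pot = 0         # total funds actually held

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.pot += amount

    def withdraw(self, who):
        amount = self.balances.get(who, 0)
        if amount > 0 and self.pot >= amount:
            self.pot -= amount
            who.receive(amount)        # external call happens first...
            self.balances[who] = 0     # ...the balance is zeroed too late

class Attacker:
    def __init__(self, vault):
        self.vault = vault
        self.stolen = 0

    def receive(self, amount):
        self.stolen += amount
        # Re-enter while our balance is still credited.
        if self.vault.pot > 0:
            self.vault.withdraw(self)

class Honest:
    def receive(self, amount):
        pass

vault = VulnerableVault()
vault.deposit(Honest(), 100)   # someone else's funds
thief = Attacker(vault)
vault.deposit(thief, 10)
vault.withdraw(thief)
print(thief.stolen)            # 110: the attacker drained the whole pot
```

The fix is the usual checks-effects-interactions order: zero the balance before making the external call. Spotting this requires reasoning across call boundaries and about who controls what, which is exactly the kind of higher-level reasoning the report found lacking.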
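On the second point, “type systems that work with uncertainty” is worth unpacking. Here is one possible shape such a type could take, purely a sketch of my own; nothing here is an API from the report or from any existing library:

```python
# Sketch (my own, assumed) of a type that carries a confidence score with a
# value, so a model's guess cannot silently flow into code expecting a fact.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Uncertain(Generic[T]):
    value: T
    confidence: float  # calibrated or self-reported confidence in [0, 1]

    def require(self, threshold: float) -> T:
        """Unwrap only when confidence clears a threshold, forcing callers
        to decide what to do with low-confidence answers."""
        if self.confidence < threshold:
            raise ValueError(f"confidence {self.confidence:.2f} below {threshold}")
        return self.value

# A hypothetical LLM-backed check returns Uncertain[bool] instead of bool.
finding = Uncertain(value=True, confidence=0.62)
try:
    is_reentrant = finding.require(0.9)
except ValueError:
    is_reentrant = None  # route to a human auditor instead
```

The point of the wrapper is that every unwrap site has to state how much confidence it demands; today, nothing in the mainstream toolchain pushes developers toward that discipline.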
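And for the third point, a prompt “unit test” might look something like the following. `ask_model` is a hypothetical stand-in for a real model client, and the whole harness is illustrative of the missing tooling, not something the report ships:

```python
# Minimal prompt regression test: treat a prompt like a flaky test case and
# sample it several times. ask_model is a placeholder, not a real API.
from typing import Callable

def ask_model(prompt: str) -> str:
    # Stub: in real use this would call an LLM API.
    return "yes"

def assert_prompt(prompt: str, accept: Callable[[str], bool], tries: int = 3) -> None:
    """Fail if any sampled answer is rejected by the acceptance predicate."""
    for i in range(tries):
        answer = ask_model(prompt)
        if not accept(answer):
            raise AssertionError(f"try {i}: unacceptable answer {answer!r}")

assert_prompt(
    "Does this function update state after an external call? Answer yes or no.",
    accept=lambda a: a.strip().lower() in {"yes", "no"},
)
```

Today you have to hand-roll even this much yourself; that gap is what the report is pointing at.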

Still, the pace of LLM improvement means the authors remain optimistic:

LLM capability is rapidly improving, and if it continues, the next generation of LLMs may serve as capable assistants to security auditors. Before developing Toucan, we used Codex to take an internal blockchain assessment occasionally used in hiring. It didn’t pass–but if it were a candidate, we’d ask it to take some time to develop its skills and return in a few months. It did return–we had GPT-4 take the same assessment–and it still didn’t pass, although it did better. Perhaps the large context window version with proper prompting could pass our assessment. We’re very eager to find out!