Reverse Engineering Neural Networks

Building Trust in AI: Seeking Mechanistic Interpretability for AI Explainability and Safety

Trust in AI systems should be built on understanding them, not on believing we can create sandboxes or guardrails an AI cannot escape.

So how can we trace back an AI model’s reasoning or decision-making process?

How do we explain how it arrived at a particular result? What do we say to a regulator who asks us to explain an AI-powered key control protecting a systemically important system?

Or, put another way: what is a practical path toward making AI explainable and safe?

This brings us to @NeelNanda5’s pioneering work:

Mechanistic Interpretability (MI) is the study of reverse engineering neural networks. Taking an inscrutable stack of matrices where we know _that_ it works, and trying to reverse engineer _how_ it works. And often this inscrutable stack of matrices can be decompiled to a human interpretable algorithm! In my (highly biased) opinion, this is one of the most exciting research areas in ML.
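To make "reverse engineering" a bit more concrete, here is a minimal sketch of what looking inside a model can look like in practice. It uses Neel's open-source TransformerLens library; the specific model, prompt, and activation inspected are my own illustrative choices, not something prescribed by the post.

```python
# A minimal sketch of inspecting a transformer's internals with TransformerLens.
# The model, prompt, and the particular activation examined are illustrative choices.
from transformer_lens import HookedTransformer

# Load a small pretrained model with hooks attached to every internal activation.
model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "When Mary and John went to the store, John gave a drink to"

# Run the model once and cache every intermediate activation.
logits, cache = model.run_with_cache(prompt)

# Inspect one concrete internal object: the attention pattern of layer 0.
# Shape: [batch, n_heads, query_pos, key_pos]
attn_pattern = cache["blocks.0.attn.hook_pattern"]
print(attn_pattern.shape)

# The model's most likely next token for this prompt.
next_token = logits[0, -1].argmax()
print(model.to_string(next_token))
```

From a starting point like this, the research question becomes: which heads, neurons, and circuits inside that cache are actually responsible for the behaviour we observe, and can we describe them as a human-interpretable algorithm?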

To get more folks on board, Neel just announced a Google Sheet with 341 “Concrete Problems in Interpretability”:

My main goal in writing this sequence was to give clear direction on where to start, and accessible ways to contribute. I hope this serves as a good entry point to do research in the field! In my opinion, the best way to learn is by trying to make progress on real problems.

If you or someone you know has a background in machine learning and is looking for a meaningful research challenge, pick a problem, put your name in the Sheet, and start working on it.