AI Security is Probabilistic Security

Emergent properties are a double-edged sword. They demonstrate that machines can achieve impressive language understanding and generation capabilities, but they give rise to unintended behaviours and vulnerabilities, including prompt injections.

LLMs exhibit emergent properties that arise from the complex interactions of their underlying neural networks. These properties are not explicitly programmed into the models but emerge due to their training on vast amounts of text data.

When an LLM model is fielded, there are three primary ways to express security requirements: input pre-processing, model operating parameters and output moderation.

What have we learnt from decades of experience testing user-supplied inputs for badness? Even with structured inputs (e.g. SQL) - it turns out there are many grey areas that leave even hardened communication protocols and implementations vulnerable to exploitation. Yet, here we are responding to prompt injections attempting to pre-process natural language prompts. If we struggle with injection attacks in structured languages, how will we fare dealing with the complexity of human language? And what of new languages a threat actor has the AI generate to subvert input and output stage security filters? This starts to hint at the underlying problem - syntax filtering (welcome as it is) is at the top of a very deep rabbit hole…

If LLMs have an open-ended nature, it means we cannot anticipate all possible manipulative or undesirable outputs.

Take a moment to consider the complexity and hence fallibility of any security checks we can do before feeding an untrusted prompt from a threat agent to a machine with emergent properties. And this assumes a 3rd party hosted AI rather than a raw, unfiltered first-party AI.

If we can’t predict the output from a given input; i.e. it’s not deterministic, are we deluding ourselves into believing we can reliably and predictably reduce material risk? Controls assurance for AI gains a probabilistic dimension we haven’t had to factor into commercially fielded systems and will lead us to circuit breakers; a tried and trusted control to protect against uncapped downside risk when unexpected control conditions occur.

Now we arrive at the bottom of the rabbit hole: How can we ensure that an AI doesn’t convince its human operators to deactivate its own or another AI’s circuit breakers?

Related Posts

Get Daily AI Cybersecurity Tips