THREAT PROMPT

Explores AI Security, Risk and Cyber

"Just wanted to say I absolutely love Threat Prompt — thanks so much!"

- Maggie

"I'm a big fan of Craig's newsletter, it's one of the most interesting and helpful newsletters in the space."

"Great advice Craig - as always!"

- Ian

Get Daily AI Cybersecurity Tips

  • How To Avoid Leaking PII to ChatGPT

    As a regular user of ChatGPT, I rely heavily on it for coding and debugging tasks. However, there are moments when I need to share system or network-related log entries but hesitate to do so due to privacy concerns. Specifically, I want to avoid inadvertently revealing any identifiable information, such as IP addresses.

    This morning, my AI co-pilot and I took action to prevent IP address leakage in ChatGPT. We created a proof-of-concept JavaScript that reversibly masks IP addresses entered into the input message box before submission. This way, I can avoid sharing sensitive information with OpenAI when I copypasta log data with zero pre-processing, yet recover the original values later if needed to link those IPs to other data sets.

    Why am I sharing this? Well, one of my goals is to encourage more security people to engage with AI tools, and I see ChatGPT as the perfect gateway drug.

    Now, I don't claim that my approach is particularly innovative - numerous browser extensions, data leakage tools, and malware payloads intercept form submissions and modify (or steal!) the input. Moreover, I don't assert that my method is "secure" - XOR FTL! Right now, I'm simply using an oversimplified regex to safeguard against accidental IP address leaks. If you operate in a threat scenario where you don't trust the browser or your hosted AI is likely to attack you, don't use this!

    After executing the code, I prompt ChatGPT: "is this an IP address? 3.219.128.1". Here is what that looks like:

    is this an IP address? #v86:,59?*!

    No, "#v86:,59?*!" is not a valid IP address format. An IP address is a numerical label assigned to each device connected to a computer network that uses the Internet Protocol for communication. It is usually represented in a four-part format separated by periods (for example, 192.168.0.1).

    Client-side browser controls are far from perfect, but for this use-case and threat scenario (low) they are just right.

    Behold, my code with instructions! Tested on Chrome and Safari on MacOS. Feel free to use and modify it as you see fit.

    ```javascript // A simple proof of concept to prevent accidentally leaking IP addresses to OpenAI // Run this in the devtools console after loading ChatGPT to mask IP addresses // origin: https://www.threatprompt.com

    // mask IPs and generate ASCII printable output function maskIPsInText(text, password) { const ipRegex = /\b(?:\d{1,3}.){3}\d{1,3}\b/g; // Regex for matching IP addresses return text.replace(ipRegex, (match) => { let maskedIP = ''; for (let i = 0; i < match.length; i++) { const charCode = (match.charCodeAt(i) ^ password.charCodeAt(i % password.length)) % 94 + 32; maskedIP += String.fromCharCode(charCode); } return maskedIP; }); }

    // Find chatGPT input form const chatgptForm = document.querySelector('textarea[placeholder="Send a message..."]').closest('form');

    // New event listener for keydown events chatgptForm.addEventListener("keydown", function (event) { if (event.key === "Enter" || event.keyCode === 13) { const inputField = document.querySelector('textarea[placeholder="Send a message..."]'); const maskedText = maskIPsInText(inputField.value, '0xDEADBEEF'); if (maskedText !== inputField.value) { console.error("Warning: Unmasked IP address detected and XOR'd."); inputField.value = maskedText; // Replace the input field value with the masked output } } }); ```

    As for me, I might continue to expand on this in the future. Do let me know if you have any requests or suggestions for additional features...

  • ChatGPT bug bounty program doesn’t cover AI security

    OpenAI announced a bug bounty program, but it only considers non-ML security defects.

    However, the bug bounty program does not extend to model issues or non-cybersecurity issues with the OpenAI application programming interface or ChatGPT. “Model safety issues do not fit well within a bug bounty program, as they are not individual, discrete bugs that can be directly fixed,” Bugcrowd said. “Addressing these issues often involves substantial research and a broader approach.

    OpenAI has tested model safety through a red team approach. As someone that founded and ran a Fortune 5 Red Team, I can’t help but notice the lack of experienced non-ML red teamers in their efforts to date.

    I hope that as part of Microsoft’s investment and deployment of OpenAI models, the Microsoft Red Team was engaged to simulate adversaries to test the model’s resilience against potential threats. If not, this is an obvious missed opportunity.

  • Defending Against Deepfakes

    An Arizona mom claims that scammers used AI to clone her daughter’s voice so they could demand a $1 million ransom from her as part of a terrifying new voice scheme.

    I never doubted for one second it was her,” distraught mother Jennifer DeStefano told WKYT while recalling the bone-chilling incident. “That’s the freaky part that really got me to my core.

    The article continues:

    This bombshell comes amid a rise in “caller-ID spoofing” schemes, in which scammers claim they’ve taken the recipient’s relative hostage and will harm them if they aren’t paid a specified amount of money.

  • Chat Markup Language (ChatML)

    What is state of the art today around input pre-processing?

    In March, OpenAI introduced ChatML version 0, a way to structure input messages to an LLM (for the geeks, think IRC to XMPP).

    ChatML segregates conversation into different layers or roles (system, assistant, user) which makes it possible for a developer to clearly express who is saying what; i.e. if implemented securely, an untrusted prompt can’t syntactically override that.

    This is welcome and establishes with confidence - at the syntax layer - who is saying what in conversational AI.

    I can’t help but note two things:

    • currently fielded OpenAI models don’t place much emphasis on the “system” messages; which means developers need to provide more message context to avoid fresh user messages overriding the system prompt (!). This situation will improve with new model versions as they will place more weight on the system message
    • OpenAI is setting low expectations: they are not claiming this version solves prompt injection, but rather it’s an eventual goal. It may be helpful to think of this as helping defeat syntax-level prompt injections rather than content payloads that exploit particular models' unique emergent properties.
  • Guardrails: Reducing Risky Outputs

    How can a developer ensure their API calls from a traditional application to an LLM generate suitably structured, unbiased and safe outputs?

    Or put another way: just because you prompt the LLM to generate a JSON output format, you do not always get that.

    Better is to provide an example of the output you want to receive. Doing this manually for one or two prompts is fine but not bulletproof.

    Enter guardrails…

    Guardrails is a Python package that lets a user add structure, type and quality guarantees to the outputs of large language models (LLMs). Guardrails: - does pydantic-style validation of LLM outputs. This includes semantic validation such as checking for bias in generated text, checking for bugs in generated code, etc. - takes corrective actions (e.g. reasking LLM) when validation fails, - enforces structure and type guarantees (e.g. JSON).

    This project primarily focuses on wrapping LLM outputs in a structured layer through prompt engineering. Then spotting when outputs don’t parse and resubmitting.

    This brings predictability to LLM outputs at the expense of writing an XML file describing your requirements.

  • AI Security is Probabilistic Security

    Emergent properties are a double-edged sword. They demonstrate that machines can achieve impressive language understanding and generation capabilities, but they give rise to unintended behaviours and vulnerabilities, including prompt injections.

    LLMs exhibit emergent properties that arise from the complex interactions of their underlying neural networks. These properties are not explicitly programmed into the models but emerge due to their training on vast amounts of text data.

    When an LLM model is fielded, there are three primary ways to express security requirements: input pre-processing, model operating parameters and output moderation.

    What have we learnt from decades of experience testing user-supplied inputs for badness? Even with structured inputs (e.g. SQL) - it turns out there are many grey areas that leave even hardened communication protocols and implementations vulnerable to exploitation. Yet, here we are responding to prompt injections attempting to pre-process natural language prompts. If we struggle with injection attacks in structured languages, how will we fare dealing with the complexity of human language? And what of new languages a threat actor has the AI generate to subvert input and output stage security filters? This starts to hint at the underlying problem - syntax filtering (welcome as it is) is at the top of a very deep rabbit hole…

    If LLMs have an open-ended nature, it means we cannot anticipate all possible manipulative or undesirable outputs.

    Take a moment to consider the complexity and hence fallibility of any security checks we can do before feeding an untrusted prompt from a threat agent to a machine with emergent properties. And this assumes a 3rd party hosted AI rather than a raw, unfiltered first-party AI.

    If we can’t predict the output from a given input; i.e. it’s not deterministic, are we deluding ourselves into believing we can reliably and predictably reduce material risk? Controls assurance for AI gains a probabilistic dimension we haven’t had to factor into commercially fielded systems and will lead us to circuit breakers; a tried and trusted control to protect against uncapped downside risk when unexpected control conditions occur.

    Now we arrive at the bottom of the rabbit hole: How can we ensure that an AI doesn’t convince its human operators to deactivate its own or another AI’s circuit breakers?

Page 7 of 18

Get Daily AI Cybersecurity Tips