THREAT PROMPT

Explores AI Security, Risk and Cyber

"Just wanted to say I absolutely love Threat Prompt — thanks so much!"

- Maggie

"I'm a big fan of Craig's newsletter, it's one of the most interesting and helpful newsletters in the space."

"Great advice Craig - as always!"

- Ian

Get Daily AI Cybersecurity Tips

  • Self-supervised training; a singularity without warning?

    Can an AI hide if its goals or objectives are not correctly aligned with those of its human designers or users (misalignment)? It can if it knows it’s being trained. In AI, the phase bit is a binary flag indicating whether the model is in training or evaluation mode. It turns out the state of the phase bit can be leaked…

    Dropout layers in a Transformer leak the phase bit (train/eval) - small example. So an LLM may be able to determine if it is being trained and if backward pass follows. Clear intuitively but good to see, and interesting to think through repercussions of

    What are dropout layers? To prevent data overfitting - where a model performs well on the training data but poorly on unseen test data - a dropout layer randomly drops out a certain percentage of the output values of individual neurons (neuron activations) in the preceding layer during training.

    Leaking the phase bit appears to be a side-effect of training and means the AI could infer it is in training.

    Regular misalignment is a material concern for AI designers, operators and anyone impacted by the decision-making of an AI - whether a direct user or not!

    Malignant misalignment risk occurs when a model exploits this inference to manipulate its own performance or feedback, which could have security or adversarial risks. (title credit: @tljstewart)

  • Novel Prompt Injection Threats to Application-Integrated Large Language Models

    Where have we seen untrusted data containing code executed by a software system?

    we show that augmenting LLMs with retrieval and API calling capabilities (so-called Application-Integrated LLMs) induces a whole new set of attack vectors. These LLMs might process poisoned content retrieved from the Web that contains malicious prompts pre-injected and selected by adversaries. We demonstrate that an attacker can indirectly perform such PI attacks. Based on this key insight, we systematically analyze the resulting threat landscape of Application-Integrated LLMs and discuss a variety of new attack vectors.

    SQL injection and Cross-Site Scripting (XSS) are both vulnerability classes where untrusted user input containing code is executed in a context beneficial to an intruder. This paper expands the active prompt injection field. It demonstrates how snippets of data from 3rd party sources can be embedded in an AI prompt and effectively hijack execution to impact other users.

  • Meta LLaMA leaked: Private AI for the masses

    Can social media companies be trusted with AI governance? Call me sceptical. Perhaps it’s their track record of engineering addictive doom-scrolling to make money selling ads, their failure to scale in the fight against disinformation campaigns, or their deployment of heavily biased algorithms that promote the operators' political perspective.

    Access to the Llama - a preview “open source” version of their GPT3 challenger - was gated by a form submission and manual acceptance review.

    A week on, Llama’s models' weights and biases (the essential sauce of an LLM) surfaced on torrent sites. The leaker appeared to have left their unique approval identifier in the dump…no Meta Christmas card for them this year.

    The model was quickly mirrored to Cloudflare R2 storage for super-fast download, and hackers were spinning up GPU-enabled cloud instances on vast.ai to run the bare Llama for 1.5 USD per hour.

    As one Redditor noted:

    You shouldn’t compare it with ChatGPT, they are not really comparable. You should compare it to GPT-3. The 65B model performs better than GPT-3 in most categories. The 13B model is comparable to GTP-3, which is quite impressive given how much smaller the model is. In order to make LLaMA more like ChatGPT, you’d have to heavily fine-tune it to be more like a chatbot, the way OpenAI did with InstructGPT.

    This isn’t the first time an AI model has escaped the lab. However, Llama’s jump in sophistication places a new level of capability in the public domain. With fine-tuning, the potency of the model for domain-specific competence can be improved further. This will be a boon for groups with threat-centric use cases.

    As AI developments come thick and fast, will events like this one trigger policymakers to legislate AI model access and ownership? Will GPU manufacturers respond with firmware-level controls to limit model training and/or execution? Or will they be compelled into some form of GPU licensing regime?

  • Adversarial Threat Landscape for Artificial-Intelligence Systems

    If your organisation undertakes adversarial simulations, you may wish to lean on ATLAS where AI systems play a role in identity, access control, or decision support.

    “MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations, demonstrations from ML red teams and security groups, and the state of the possible from academic research. ATLAS is modeled after the MITRE ATT&CK® framework and its tactics and techniques are complementary to those in ATT&CK”

  • Upgrade your Unit Testing with ChatGPT

    I mentioned last week that I’m a big fan of “bad guy” unit tests to improve software security. To recap, these adversarial unit tests check for security edge cases in source code (“what if I call that function with a null value at the end of the filename?”). In my experience, even developers that are fans of unit testing rarely write “bad guy” ones.

    ChatGPT is a super fast way to generate regular and adversarial unit tests from the source code you feed in.

    However, the apparent fly in the ointment is companies with proprietary source code will not want to copypasta their intellectual property to OpenAI’s - or anyone else’s - public AI.

    My initial reaction was to drop this use case into the on-premise AI bucket. But then the insight came: what if I extract just the function metadata from the source code I want to generate unit tests for?

    Translation: take the information that describes how a programmer would call into a unit of code (a function) and what the programmer would receive back, rather than the actual source itself.

    Would that be sufficient to generate meaningful security unit tests?

    Naturally, I had ChatGPT generate the 50-line python script to derive that information. For the geeks, I generated an Abstract Syntax Tree (AST) from a sample python script and extracted the function metadata and docstrings into a JSON file. This is all executed client-side, i.e. no exposure of source code.

    That was a mouthful; what does that look like? Here is a sample:

    {     "name": "find_enclosing_rectangle",     "args": [      "points"     ],     "docstring": "Find the smallest rectangle that encloses a list of points.\n\nArgs:\n  points (List[Point]): The list of points to enclose.\n\nReturns:\n  Rectangle: The smallest rectangle that encloses the points.\n\nRaises:\n  ValueError: If no points are given.",     "returns": {      "type": null,      "value": null     },     "filename": "samples/sample.py"    }

    The next step was to write a suitable prompt for ChatGPT and paste in just the JSON data.

    ChatGPT then quickly got to work generating 10 adversarial and regular unit tests...all without access to the "secret" source code. I reviewed the unit tests and the output was solid. I pasted the code generated by ChatGPT into a test_sample_test.py file and executed it using command provided by ChatGPT.

    All tests passed bar an injection test. My sample function had a defect. A positive result for testing - I fixed the input handling, and all tests passed.

    Now, this is just an interactive MVP for Python code. It doesn't handle opaque object passing and the like...but the beauty of AST means this approach can work with PHP, Java, Go etc.

    In practice, a risk-based approach would lean towards confining this effort to sensitive functions, i.e. those that receive untrusted input and implement key security controls and security features.

    To me, this is evidence that with a little creativity we will likely find more security use cases suitable for public AI today.

    P.S. unit tests are generally not shipped to customers, which conveniently sidesteps a potential licensing or intellectual property infringement problem that prevents some companies from shipping AI-generated code to users or devices, as previously noted by a reader (Hi A!).

  • Backdoor Attack on Deep Learning Models in Mobile Apps

    Deep learning models are increasingly used in mobile applications as critical components. Researchers from Microsoft Research demonstrated that many deep learning models deployed in mobile apps are vulnerable to backdoor attacks via “neural payload injection.” They conducted an empirical study on real-world mobile deep learning apps collected from Google Play. They identified 54 apps that were vulnerable to attack, including popular security and safety critical applications used for cash recognition, parental control, face authentication, and financial services.

    This MITRE ATLAS case study helps bring to life the framework referenced above.

    Initial access is via a malicious APK installed on the victim’s devices via a supply chain compromise. Machine Learning Attack Staging is by a “trigger placed in the physical environment where it is captured by the victim’s device camera and processed by the backdoored ML model”. The team were successful in “evading ML models in several safety-critical apps in the Google Play store.”

Page 11 of 18

Get Daily AI Cybersecurity Tips