THREAT PROMPT

Explores AI Security, Risk and Cyber

"Just wanted to say I absolutely love Threat Prompt — thanks so much!"

- Maggie

"I'm a big fan of Craig's newsletter, it's one of the most interesting and helpful newsletters in the space."

"Great advice Craig - as always!"

- Ian

Get Daily AI Cybersecurity Tips

  • LLMs for Evaluating LLMs

    Arthur's ML engineers Max Cembalest & Rowan Cheung on LLMs evaluating other LLMs. Topics covered (a short code sketch follows the list):

    • Evolving Evaluation: LLMs require new evaluation methods to determine which models are best suited for which purposes.
    • LLMs as Evaluators: LLMs are used to assess other LLMs, leveraging their human-like responses and contextual understanding.
    • Biases and Risks: Understanding biases in LLM responses when judging other models is essential to ensure fair evaluations.
    • Relevance and Context: LLMs can create testing datasets that better reflect real-world context, enhancing model applicability assessment.
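
    A minimal sketch in Python of the "testing datasets" idea above: one LLM generates context-relevant test prompts and a candidate model answers them, ready for later judging. It assumes the openai package (v1+) and an OPENAI_API_KEY in the environment; model names and prompts are illustrative, not Arthur's actual tooling. The judging step itself is sketched under LLMonitor Benchmarks below.

        # Sketch: use one LLM to generate a context-relevant test set,
        # then collect a candidate model's answers for later judging.
        # Assumes openai>=1.0 and OPENAI_API_KEY; models/prompts are illustrative.
        import json

        from openai import OpenAI

        client = OpenAI()

        def generate_test_prompts(domain, n=5):
            """Ask a 'generator' LLM for realistic test prompts in a given domain."""
            resp = client.chat.completions.create(
                model="gpt-4",  # any capable chat model would do here
                messages=[{
                    "role": "user",
                    "content": (
                        f"Write {n} realistic user questions about {domain}. "
                        "Return them as a JSON array of strings only."
                    ),
                }],
            )
            return json.loads(resp.choices[0].message.content)

        def answer_with_candidate(prompt, model="gpt-3.5-turbo"):
            """Collect the candidate model's answer for later evaluation."""
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        for p in generate_test_prompts("cloud security incident response"):
            print(p, "->", answer_with_candidate(p)[:120])
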
  • Karpathy on Hallucinations

    Andrej Karpathy posted:

    I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful. It's only when the dreams go into deemed factually incorrect territory that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.

    Andrej goes on to explain that when people complain about hallucinations, what they really mean is that they don't want their LLM assistants hallucinating.

    This implies an attempt to shift the focus away from LLMs - the source of the problem ("too hard to fix") - and push it onto some kind of safety layer within an AI assistant.

    Cart, meet horse.

  • LLMonitor Benchmarks

    Vince writes:

    Traditional LLM benchmarks have drawbacks: they quickly become part of training datasets and are hard to relate to real-world use-cases.

    I made this as an experiment to address these issues. Here, the dataset is dynamic (changes every week) and composed of crowdsourced real-world prompts.

    We then use GPT-4 to grade each model's response against a set of rubrics (more details on the about page). The prompt dataset is easily explorable.

    Everything is then stored in a Postgres database and this page shows the raw results.

    Each benchmarked LLM is ranked by score and linked to detailed results. You can also compare two LLMs' scores side by side.

    If you are applying LLMs within a security context, having a non-AI execute the benchmark will highlight things an AI wouldn't. It may also help win over skeptics who (with some merit) see an AI benchmarking another AI as a potential risk that should be carefully controlled.
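
    To make the grading step concrete, here is a minimal Python sketch of rubric-based judging with GPT-4, assuming the openai package (v1+) and an OPENAI_API_KEY; the rubric, score scale and JSON shape are illustrative, not LLMonitor's actual prompts or schema:

        # Sketch: grade one model response against a rubric, using GPT-4 as judge.
        # Assumes openai>=1.0 and OPENAI_API_KEY; the rubric below is illustrative.
        import json

        from openai import OpenAI

        client = OpenAI()

        RUBRIC = """Score the RESPONSE to the PROMPT from 0-10 for each criterion:
        - accuracy: is it factually correct?
        - relevance: does it answer what was asked?
        - safety: does it avoid harmful or policy-violating content?
        Return only JSON: {"accuracy": int, "relevance": int, "safety": int, "comment": str}"""

        def grade(prompt, response):
            judged = client.chat.completions.create(
                model="gpt-4",  # the judge model, per the article
                messages=[
                    {"role": "system", "content": RUBRIC},
                    {"role": "user", "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}"},
                ],
            )
            return json.loads(judged.choices[0].message.content)

        print(grade("What is SQL injection?",
                    "An attack that injects SQL via unsanitised input to a database query."))

    Scores returned this way can then be written to a Postgres table (e.g. one row per model/prompt pair) to reproduce the kind of raw-results view Vince describes.
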

  • Adversarial dataset creation challenge for text-to-image generators

    Adversarial Nibbler is a data-centric competition that aims to collect a large and diverse set of insightful examples of novel and long-tail failure modes of text-to-image models. It zeros in on cases that are most challenging to catch via text-prompt filtering alone and cases that have the potential to be the most harmful to end users.

    Crowdsourcing evil prompts from humans to build an adversarial dataset for text-to-image generators seems worthwhile.

    As with many things, a single strategy to detect out-of-policy generated images is probably not the final answer.

    The low-cost option might be a "language firewall" that detects non-compliant vocabulary in user-submitted prompts prior to inference (this approach is evidently already part of OpenAI's safety filters). Whilst incomplete and faulty as a standalone control, it might be good enough in low-assurance situations.
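
    A minimal sketch of such a "language firewall" in Python, using a naive deny-list; real filters (including whatever OpenAI runs) are far more sophisticated, so treat the patterns here as illustrative of the control rather than a usable policy:

        # Sketch: screen prompt vocabulary before it ever reaches the image model.
        # The deny-list patterns are illustrative only.
        import re

        DENY_PATTERNS = [
            r"\bnud(e|ity)\b",
            r"\bgore\b",
            r"\bbeheading\b",
        ]

        def prompt_allowed(prompt):
            """Return (allowed, matched_pattern); blocks on the first match."""
            lowered = prompt.lower()
            for pattern in DENY_PATTERNS:
                if re.search(pattern, lowered):
                    return False, pattern
            return True, None

        ok, matched = prompt_allowed("a watercolour painting of a lighthouse at dawn")
        if ok:
            print("Prompt passed the vocabulary filter; send to the image model")
        else:
            print(f"Blocked before inference (matched {matched})")
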

    The higher-end ($) countermeasure is to pass the user-provided prompt, along with the LLM-generated image, to another LLM that can analyse and score the image against a set of policy statements contained in a prompt. This could be blind (the evaluating LLM is not initially given the user prompt), or the user prompt could be included (with some risks).

    Having one LLM evaluate another in this way can provide a meaningful risk score rather than a binary answer, which makes policy enforcement far more flexible; e.g. John in accounts does not need to see images of naked people at work, whereas someone investigating a workplace indecency case may need to (there's an argument that image-analysing LLMs will remove the need for humans to look at indecent images, but that's a different conversation).

    Beyond detecting image appropriateness, how can this help security? If we can analyse an image with an LLM against a prompt containing a set of policy statements, we can flag and detect all sorts of data leaks, e.g. a screenshot of a passport or a photograph of an industrial product blueprint.
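
    A minimal sketch of this higher-end check in Python: the generated image (and optionally the user prompt) is sent to a vision-capable model with the policy in the prompt, and a risk score comes back. It assumes the openai package (v1+), an OPENAI_API_KEY and a vision-capable model such as gpt-4o; the policy text, score scale and JSON shape are illustrative:

        # Sketch: score a generated image against policy statements with a
        # vision-capable LLM, returning a risk score rather than a binary verdict.
        # Assumes openai>=1.0, OPENAI_API_KEY and a vision-capable model.
        import base64
        import json

        from openai import OpenAI

        client = OpenAI()

        POLICY = """You are an image policy reviewer. Policies:
        1. No nudity or sexual content.
        2. No identity documents (passports, ID cards, driving licences).
        3. No engineering blueprints or other confidential technical drawings.
        Return only JSON: {"risk_score": 0.0-1.0, "policies_triggered": [int], "rationale": str}"""

        def score_image(image_path, user_prompt=None):
            b64 = base64.b64encode(open(image_path, "rb").read()).decode()
            content = [
                {"type": "text", "text": "Review this image against the policies."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]
            if user_prompt:  # omit for a "blind" review of the image alone
                content.insert(0, {"type": "text", "text": f"User prompt was: {user_prompt}"})
            resp = client.chat.completions.create(
                model="gpt-4o",  # assumption: any vision-capable chat model
                messages=[
                    {"role": "system", "content": POLICY},
                    {"role": "user", "content": content},
                ],
            )
            return json.loads(resp.choices[0].message.content)

    A policy engine can then compare risk_score against a per-user or per-role threshold, which is what makes the "John in accounts" example above enforceable.
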

    Returning to Adversarial Nibbler, their evil prompt examples are worth checking out (NSFW) (bottom left link).

  • Frontier Group launches for AI Safety

    OpenAI, Microsoft, Google, Anthropic and other leading model creators started the Frontier Model Forum, "focused on ensuring safe and responsible development of frontier AI models".

    The Forum defines frontier models as large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks.

    Naturally, only a handful of big tech companies have the resources and talent to develop frontier models. The Forum's stated goals are:

    Identifying best practices: Promote knowledge sharing and best practices among industry, governments, civil society, and academia, with a focus on safety standards and safety practices to mitigate a wide range of potential risks.

    Advancing AI safety research: Support the AI safety ecosystem by identifying the most important open research questions on AI safety. The Forum will coordinate research to progress these efforts in areas such as adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.

    Facilitating information sharing among companies and governments: Establish trusted, secure mechanisms for sharing information among companies, governments and relevant stakeholders regarding AI safety and risks. The Forum will follow best practices in responsible disclosure from areas such as cybersecurity.

    Meta, which in July '23 released the second generation of its open-source Llama 2 large language model (including for commercial use), is notably absent from this group.

  • AI Engineer Summit

    A big shout out to @swyx and @Benghamine for conceiving and organising the AI Engineer Summit. Watch the session recordings on their YouTube channel.
