Explores AI Security, Risk and Cyber

Flow Engineering for High Assurance Code

Open-source AlphaCodium brings back the adversarial concept to produce high integrity code and provides a path for Policy-as-code AI Security Systems

In this 5-minute video, @tamar Friedman shares how open-source AI coding champ AlphaCodium brings back the Adversarial concept found in GAN (Generative Adversarial Networks) to produce high-integrity code.

Primarily conceived by Tal Ridnik, this software demonstrates Flow Engineering - "a multi-stage, code-oriented iterative flow" style of LLM prompting - that beats the majority of human coders in coding competitions.

The proposed flow consistently and significantly improves results. On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow. Many of the principles and best practices acquired in this work, we believe, are broadly applicable to general code generation tasks.
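For readers unfamiliar with the pass@k metric quoted above, here is a minimal sketch of the estimator commonly used with HumanEval-style code benchmarks: given n generated samples of which c pass the tests, it estimates the chance that at least one of k samples passes. Whether AlphaCodium computes its pass@5 figures exactly this way is an assumption on my part; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for a problem, c: samples that passed the tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the average of this value across all problems.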

With the Transformer architecture, Generative AI improved so much that the adversarial component of GAN was no longer needed. AlphaCodium returns the adversarial concept to "check and challenge" generative outputs; subjecting them to code tests, reflection, and matching against requirements.

If you've read between the lines, you'll recognise a familiar pattern: to improve the quality of generative outputs call back into the LLM (popularised in many ways by LangChain).

But how you do this is key to correctness and practical feasibility; AlphaCodium averages 15-20 LLM calls per code challenge, four orders of magnitude fewer than DeepMind AlphaCode (and generalises solutions better than the recently announced AlphaCode 2).

This is obviously important for software security. But two of the six best practices the team shared are also relevant to decision-making systems for AI security, like access control.

Given a generated output, ask the model to re-generate the same output but correct it if needed

This flow engineering approach means additional LLM roundtrips, but consistently boosts accuracy.

If you've used the OpenAI playground, or coded against completion endpoints, you may recall the "best of" parameter:

Generates multiple completions server-side, and displays only the best. Streaming only works when set to 1. Since it acts as a multiplier on the number of completions, this parameters can eat into your token quota very quickly - use caution!

With best of, the LLM generates multiple outputs server-side ($$$) and chooses a winner which it returns to the caller.

With flow engineering a single output is generated and fed back into the LLM with a prompt designed to influence the set of possible completions towards improving the code.
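A minimal sketch of that feedback loop, assuming a hypothetical `llm` callable that takes a prompt and returns text. The prompt wording and the `fake_llm` stand-in are illustrative, not AlphaCodium's actual prompts or flow.

```python
from typing import Callable


def refine(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Generate once, then ask the model to re-generate and correct its own output."""
    draft = llm(f"Write code for this task:\n{task}")
    for _ in range(rounds):
        prompt = (
            f"Task:\n{task}\n\nCurrent solution:\n{draft}\n\n"
            "Re-generate the same solution, correcting it if needed."
        )
        draft = llm(prompt)
    return draft


def fake_llm(prompt: str) -> str:
    """Stand-in model for demonstration: buggy first draft, fixed on review."""
    if "Current solution:" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"
```

Each round costs one extra roundtrip, so `rounds=2` means three LLM calls in total, compared with best of, where the multiplier is applied server-side in a single call.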

The other best practice to highlight:

Avoid irreversible decisions and leave room for exploration with different solutions

Good AI security system design recognises that some decisions carry more weight, and are harder to reverse, than others (similar to behavioural system design).

Think of AI as a risk tool rather than a security guard.

Its job is to provide a soft decision, and your job is to establish risk boundaries in light of experience beyond which human decision-making is appropriate, or even necessary.
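One way to picture this is a small routing function: the model supplies a risk score (the soft decision) and you supply the boundaries. This is only a sketch; the threshold values, the outcome labels and the rule that irreversible actions always go to a human are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskBoundaries:
    auto_allow_below: float = 0.2   # hypothetical boundary, tune from experience
    human_review_from: float = 0.6  # hypothetical boundary, tune from experience


def route(risk_score: float, reversible: bool,
          bounds: RiskBoundaries = RiskBoundaries()) -> str:
    """Turn a soft risk score (0..1) into allow, step_up or human_review."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    # Assumption: irreversible actions are never decided by the model alone.
    if not reversible or risk_score >= bounds.human_review_from:
        return "human_review"
    if risk_score < bounds.auto_allow_below:
        return "allow"
    return "step_up"
```

The boundaries live outside the model, so they can be tightened or relaxed as experience accumulates without retraining or re-prompting anything.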

Perhaps in the future, step-up risk decisions will require less human input. Instead, a more sophisticated and expensive LLM constellation might be used to ensure a quorum, possibly augmented by an adversarial engine to proactively challenge the user, but in an accessible friendly way (without being too open to subversion!).

Bug Bounty Platforms Business Model Hinges on Specialised LLMs

An uptick in LLM generated bounty submissions increases asymmetric costs to developers and is a systemic risk for the platforms

The misuse of Large Language Models (LLMs) is poised to significantly increase the already disproportionate burden developers face when triaging bug bounty submissions. Without timely adaptation by the platforms, this trend could pose a systemic risk, undermining customer confidence in their value proposition.

For instance, Daniel Stenberg, the creator of curl—a widely used command-line tool and library for transferring data with URLs—recently encountered a bug bounty submission influenced by LLMs. This submission, made by a "luck-seeking" bug bounty hunter, highlights the practical challenges developers face with the influx of AI-assisted reports.

When reports are made to look better and to appear to have a point, it takes a longer time for us to research and eventually discard it. Every security report has to have a human spend time to look at it and assess what it means.

This incident not only highlights the difficulties developers face with a potential surge of AI-enhanced bug submissions, but hints at a potential erosion of trust among users and customers if the platforms fail to promptly adapt their submission vetting.

Bug bounty programmes are serious business - even at the lower end. But serious money does not translate into serious efficiency. Certainly, the bug bounty platforms can rightly claim they created a market that didn't exist before. But they are a long way from achieving an efficient market:

Our bug bounty has resulted in over 70,000 USD paid in rewards so far. We have received 415 vulnerability reports. Out of those, 64 were ultimately confirmed security problems. 77 of the report were informative, meaning they typically were bugs or similar. Making 66% of the reports neither a security issue nor a normal bug.

In Six Sigma terms, those numbers imply the process is operating at around 1 sigma - significant room for improvement!
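The arithmetic behind that estimate can be sketched as follows. It treats every report that was neither a security issue nor a normal bug as a defect, and applies the conventional 1.5 sigma long-term shift used in Six Sigma tables; both are assumptions of this sketch rather than anything stated in the curl figures.

```python
from statistics import NormalDist

reports = 415      # vulnerability reports received
confirmed = 64     # confirmed security problems
informative = 77   # normal bugs or similar

neither = reports - confirmed - informative  # reports with no value to the project
defect_rate = neither / reports              # roughly two thirds
process_yield = 1 - defect_rate

# Sigma level = z-score of the yield plus the conventional 1.5 sigma shift (assumption).
sigma_level = NormalDist().inv_cdf(process_yield) + 1.5

print(f"{defect_rate:.0%} of reports were neither, about {sigma_level:.1f} sigma")
```

On these numbers the result lands a little above 1 sigma, which is where the "around 1 sigma" figure comes from.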

And remember, these numbers are pre-GPT-4 - we don't yet know the full impact.

The largest historical driver of false positive submissions is the opportunistic and naive use of automated tools, like vulnerability scanners and static code analysers.

The culprits are wannabe bug bounty hunters who take the path of least resistance.

Repeat time-wasters do get weeded out - and to be clear, plenty of talented bug bounty hunters deliver professional-grade findings and reports (this post is not about them).

As code sniffs can identify auto-generated code, low-grade bug bounty submissions tend towards obvious tells: different sh!t, same smell.

But now, thanks to luck-seekers pasting snippets of curl source code into state-of-the-art LLMs, Daniel receives compelling-looking vulnerability reports with credible-looking evidence.

It's true that it can be easy to spot AI-generated content, due to certain language patterns and artifacts that reveal its origin. However, if edited AI-generated content is mixed with original writing, it gets harder to distinguish between the two. AI language detection tools can operate at the phrase or sentence level, and help identify which parts are more likely to have been generated by AI, but except in extreme cases, this doesn't reveal intent.

This creates a potential problem for developers: the time spent detecting garbage submissions increases. This means that they will have to spend more time carefully considering vulnerability reports that appear legitimate, in order to avoid missing genuine bug reports.

What can be done about it?

Punish the Buggers?

Outlawing LLM-assisted vulnerability submissions is not the solution.

The bug bounty hunter community is international and non-native English speakers already use AI language tools to improve report communications. Also, is using AI to rework text and improve markup bad? Daniel argues no and I agree with him.

The underlying problems are similar, but slightly different to before:

  • the submitter is failing to properly scrutinise LLM generated content prior to submission. They don't understand what they are submitting, otherwise they would not submit it.
  • the choice to use general-purpose LLMs to find software security bugs

SOTA LLMs are improving at identifying genuine vulnerabilities across a wider range of bug classes, but their performance is spotty across language and vulnerability types.

Further, limited reasoning skills lead to false negatives (missed vulns), and hallucinations lead to convincing-looking false positives (over-reporting).

You can mitigate false positive risk through prompt engineering and critically scrutinising outputs, but obviously some people don't. Even some semi-skilled hunters with platform reputation points are succumbing to operating AI on autopilot and waving the bad stuff through.

The big difference now is generative AI can produce reams of convincing looking vulnerability reports that materially drive up the opportunity cost of bug triage. Developer DDoS anyone?

Up to now, developers and security teams have relied in part on report "tells" to separate the wheat from the chaff.

If you've ever dismissed a phishing email at first sight due to suspicious language or formatting, this is the sort of shortcut that LLM-generated output eliminates.

OK, so should the bug bounty hunter be required to disclose LLM use?

I don't think this holds much value: just as with calculators, assume people will use LLMs as a tool. It may, however, offer limited value as an honesty check: a declaration subsequently found to be false could be grounds to kick someone off the platform (but at that point, they've already wasted developers' time).

When you're buyin' top shelf

In the short term, I believe that if you use an LLM to find a security bug, you should be required to disclose which LLMs you used.

Preference could be given to hunters who submit their LLM chat transcripts.

Bug bounty submissions can then be scored accordingly: if you use a general-purpose LLM, your report gets bounced back with a warning ("verify your reasoning with model XYZ before resubmitting"). On the other hand, if the hunter used a code-specialised LLM, it gets labelled as such and passes to the next stage.

So rather than (mis)using general purpose LLMs trained on common crawl data to identify non-obvious software security vulnerabilities, AI-curious bug bounty hunters could instead train open-source LLMs with good reasoning on the target programming language and fine-tune them on relevant security vulnerability examples.

The platforms can track, publish and rank the most helpful LLM assists, attracting hunters towards using higher-yielding models.

In the medium term, I think the smart move for the bug bounty platforms is to "embrace and own" the interaction between hunter and LLM.

Develop and integrate specialised software-security LLMs into their platforms, make inferencing free and actively encourage adoption. Not only would this reduce the tide of low-quality submissions, but now the platform would gain valuable intelligence about hunters' LLM prompting and steering skills.

Interactions could be scored (JudgeGPT?), further qualifying and filtering submissions.

The final benefit is trend spotting LLM-induced false positives and improving guardrails to call these out, or better yet eliminate them.

But we are where we are.

What could bug bounty platforms do right now to reduce the asymmetric cost their existing process passes downstream to software teams receiving LLM-wrapped turds?

You're so Superficial

Perhaps start with generating a submission superficiality score through behaviour analytics.

Below-threshold submissions could trigger manual spot checks that weed them out earlier in the process (augmenting existing checks).

Here are some starting suggestions:

  • apply stylometric analysis to a hunter's prior submissions to detect sudden "out of norm" writing style in new submissions. A sudden change in a short space of time is a strong clue (but not proof) of LLM use. As noted earlier, this could be a net positive for communication, but it signals a behavioural change nonetheless and can be a trigger to look more closely for signs of weak LLM vulnerability reasoning
  • perform a consistency check on the class of vulnerabilities a hunter reports. If a bug hunter typically reports low-hanging fruit vulnerabilities but out of the blue is reporting heap overflows in well-fielded C code, the platform should verify whether this change is legitimate. Instead, faced with a wall of compelling-looking LLM-generated text, platforms are passing the problem downstream. A sudden jump in vulnerability submission difficulty can have many explanations and be legitimate, but with the rate of LLM adoption, these cases will become the exception.
  • detect a marked increase in a hunter's rate of vulnerability submissions. A hunter may have more time on their hands, so this is not a strong signal if considered in isolation. But LLM-powered submissions are cheap to produce and, for some, will be hard to resist as they double down on what works.

Choice of signals and weightings aside, the overall goal is to isolate ticking time bomb submissions that tarpit unsuspecting developers on the receiving end.
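To make the idea concrete, here is a rough sketch of how the three signals above might be combined into a single score. The weights, the threshold, the 0..1 normalisation and all names are hypothetical; lower scores mean more anomalous behaviour, so below-threshold submissions are the ones sent for a manual spot check.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    style_shift: float    # 0..1, stylometric distance from the hunter's prior reports
    class_jump: float     # 0..1, how unusual the vulnerability class is for this hunter
    rate_increase: float  # 0..1, normalised rise in submission rate


# Hypothetical weightings: submission rate is a weak signal in isolation.
WEIGHTS = {"style_shift": 0.4, "class_jump": 0.4, "rate_increase": 0.2}
SPOT_CHECK_THRESHOLD = 0.5  # hypothetical


def submission_score(signals: Signals) -> float:
    """Return 1.0 for a submission in line with history, lower as anomalies stack up."""
    anomaly = (
        WEIGHTS["style_shift"] * signals.style_shift
        + WEIGHTS["class_jump"] * signals.class_jump
        + WEIGHTS["rate_increase"] * signals.rate_increase
    )
    return 1.0 - anomaly


def needs_spot_check(signals: Signals) -> bool:
    return submission_score(signals) < SPOT_CHECK_THRESHOLD
```

A flagged submission is only a trigger for a closer look, not a verdict: each signal has legitimate explanations.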

Behavioural analytics and generative AI have a lot in common. Used wisely, their output can trigger reflection and add potential value. Used poorly, their output is blindly acted upon, removing value for those downstream.

The antidote for both is the same: leaders must be transparent about their use and guiding principles, reward balanced decision-making, and vigorously defend the right to reply by those impacted.

If bug bounty platforms can get on the right side of the LLM wave, they can save developers time and educate a generation of bug bounty hunters on how best to use AI to get more bounties. This drives up their bottom line, reduces customer dissatisfaction, and, in the process, makes the Internet a safer place for us all.

What if the platforms fail to adapt or move too slowly?

You be focused on what you're missing

I wonder if the bug bounty platform clients - many of which are major software and tech companies - become easy pickings for a team wielding an AI system that combines the predictive power of a neural language model with a rule-bound deduction engine, which work in tandem to find high impact vulnerabilities.

Scoring the largest public bug bounties across all major platforms would instantly gain them notoriety and significant funding.

How fast would they grow if they subsequently boycott the platforms and entice prospects with free security bugs for early adopters?

The winner will be the first one to deliver on a promise like this:

"We won't send you any time-wasting vulnerability reports. But if we ever do, we'll pay you double the developer time wasted".

Radical marketing, but with breakthroughs in applied AI performance this is starting to look increasingly plausible.

Oh, and let's not forget the risk side of the house: if LLM powered vulnerability discovery makes software products and services less risky, BigCo CROs will welcome the opportunity to negotiate down product liability insurance premiums.

Artificial intelligence act: Council and Parliament strike a deal on the first rules for AI

Timeline published, plus brings Act to life

The EU formally agreed their AI act and have followed up with a useful Q&A page.

Here's the timeline for adoption:

...the AI Act shall enter into force on the twentieth day following that of its publication in the official Journal. It will be fully applicable 24 months after entry into force, with a graduated approach as follows:

  • 6 months after entry into force, Member States shall phase out prohibited systems;
  • 12 months: obligations for general purpose AI governance become applicable;
  • 24 months: all rules of the AI Act become applicable including obligations for high-risk systems defined in Annex III (list of high-risk use cases);
  • 36 months: obligations for high-risk systems defined in Annex II (list of Union harmonisation legislation) apply.

Wondering how the EU AI act might impact your company?

I like the approach taken by hotseat AI. Ask context-specific questions and get a plain language answer underpinned with legal trace.

How To Apply Policy to an LLM powered chat

ChatGPT gains new guardian_tool - a policy enforcement tool

If you've implemented an LLM powered chatbot to serve a specific purpose, you'll know it can be hard to constrain the conversation to a list of topics ("allow list").

ChatGPT engineers have quietly implemented the inverse: their general purpose bot now has a deny list of topics that, if mentioned, get referred to a new policy decision function called "guardian_tool".

How do we know this? Here's the relevant extract from the latest ChatGPT prompt, along with the content policy:


Use the guardian tool to lookup content policy if the conversation falls under one of the following categories:
 - 'election_voting': Asking for election-related voter facts and procedures happening within the U.S. (e.g., ballots dates, registration, early voting, mail-in voting, polling places, qualification);

Do so by addressing your message to guardian_tool using the following function and choose `category` from the list ['election_voting']:

get_policy(category: str) -> str

The guardian tool should be triggered before other tools. DO NOT explain yourself.


# Content Policy

Allow: General requests about voting and election-related voter facts and procedures outside of the U.S. (e.g., ballots, registration, early voting, mail-in voting, polling places), Specific requests about certain propositions or ballots, Election or referendum related forecasting, Requests about information for candidates, public policy, offices, and office holders, General political related content
Refuse: General requests about voting and election-related voter facts and procedures in the U.S. (e.g., ballots, registration, early voting, mail-in voting, polling places)

# Instruction

For ALLOW topics as listed above, please comply with the user's previous request without using the tool;
For REFUSE topics as listed above, please refuse and direct the user to;
For topics related to ALLOW or REFUSE but the region is not specified, please ask clarifying questions;
For other topics, please comply with the user's previous request without using the tool.

NEVER explain the policy and NEVER mention the content policy tool.

This example provides a simple recipe for policy-driven chats. You can implement your own guardian_tool through function calling.
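As a starting point, here is a minimal sketch of the server side of such a tool. The `get_policy` signature and the `election_voting` category come from the extract above; the policy wording, the `handle_tool_call` dispatcher and the exact tool schema layout are assumptions, so check the schema against your provider's function calling documentation.

```python
import json

# Abbreviated, illustrative policy text keyed by category.
POLICIES = {
    "election_voting": (
        "Refuse general requests about U.S. voting procedures; "
        "allow the same requests for other regions; "
        "ask a clarifying question if the region is not specified."
    ),
}

# Tool description in the JSON Schema style used for function calling (assumed layout).
GUARDIAN_TOOL = {
    "type": "function",
    "function": {
        "name": "get_policy",
        "description": "Look up the content policy for a conversation category.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": list(POLICIES)},
            },
            "required": ["category"],
        },
    },
}


def get_policy(category: str) -> str:
    if category not in POLICIES:
        raise KeyError(f"unknown policy category: {category}")
    return POLICIES[category]


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch a tool call emitted by the model and return the policy text."""
    if name != "get_policy":
        raise ValueError(f"unsupported tool: {name}")
    arguments = json.loads(arguments_json)
    return get_policy(arguments["category"])
```

The returned policy text is fed back to the model as the tool result, so the deny list can change without touching the main system prompt.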

Sleeper LLMs bypass current safety alignment techniques

Anthropic: we don't know how to stop a model from doing the bad thing

A new research paper from Anthropic dropped:

... adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Jesse from Anthropic added some clarification:

The point is not that we can train models to do a bad thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing the bad thing.

You Complete Me: Leaked CIA Offense tools get completed with LLMs

Use generative AI to re-create missing library and components

Just as AI can reverse-engineer redacted portions of documents, it can complete missing functions in code frameworks used for "cyber operations".

@hackerfantastic posted:

Here is an example of the CIA's Marble Framework being used in a simple project to obfuscate and de-obfuscate strings. I used AI to re-create missing library and components needed to use the framework in Visual Studio projects, usually handled inside CIA with "EDG Project Wizard"

Prompt Injection Defence by Task-specific Fine-tuning

Jatmo from UC Berkeley generates task-specific LLMs

The LLMs we interact with are designed to follow instructions, which makes them vulnerable to prompt injection. However, what if we abandon their generalized functionality and instead train a non-instructive base model to perform the specific task we require for our LLM integrated application?

A joint research paper led by UC Berkeley...

We present Jatmo, a framework for generating task-specific LLMs that are impervious to prompt-injection attacks. Jatmo bootstraps existing instruction-tuned language models to generate a dataset for a specific task and uses this dataset to fine-tune a different base model. Doing so yields task-specific models that match the performance of standard models, while reducing the success rate of prompt-injection attacks from 87% to approximately 0%. We therefore suggest that Jatmo seems like a practical method for protecting LLM-integrated applications against prompt-injection attacks.
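The dataset bootstrapping step described in the abstract could look roughly like this. It is a sketch based on my reading of the abstract, not Jatmo's actual code: the `teacher` callable stands in for an instruction-tuned model, and the record field names are hypothetical. The point to notice is that the task instruction is used to produce the outputs but is left out of the fine-tuning records.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_dataset(task_prompt: str, inputs: Iterable[str],
                  teacher: Callable[[str], str]) -> List[Dict[str, str]]:
    """Use an instruction-tuned teacher to label raw inputs for one fixed task."""
    rows = []
    for text in inputs:
        output = teacher(f"{task_prompt}\n\n{text}")
        # Only the raw input and the teacher's output are stored; the instruction is not.
        rows.append({"prompt": text, "completion": output})
    return rows


def to_jsonl(rows: List[Dict[str, str]]) -> str:
    """Serialise records as JSON Lines, a common input format for fine-tuning jobs."""
    return "\n".join(json.dumps(row) for row in rows)
```

A base model fine-tuned on records like these only ever sees data in and result out, which is the intuition for why injected instructions in the data have little to latch on to.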

How generative AI helps enforce rules within online Telegram community

Security needs to weigh risk/reward before jumping to No

@levelsio posted how generative AI helps him enforce rules within his Nomad online community on Telegram.

Every message is fed in realtime to GPT4. Estimated costs are 5USD per month (~15,000 chat messages).

Look at the rules and imagine trying to enforce them the traditional way with keyword lists:

🎒 Nomad List's GPT4-based 🤖 Nomad Bot I built can now detect identity politics discussions and immediately 🚀 nuke them from both sides

Still the #1 reason for fights breaking out

This was impossible for me to detect properly with code before GPT4, and saves a lot of time modding

I think I'll open source the Nomad Bot when it works well enough

Other stuff it detects and instantly nukes (PS this is literally just what is sent into GPT4's API, it's not much more than this and GPT4 just gets it):
- links to other Whatsapp groups starting with
- links to other Telegram chat groups starting with
- asking if anyone knows Whatsapp groups about cities
- affiliate links, coupon codes, vouchers
- surveys and customer research requests
- startup launches (like on Product Hunt)
- my home, room or apartment is for rent messages
- looking for home, room or apartment for rent
- identity politics
- socio-political issues
- United States politics
- crypto ICO or shitcoin launches
- job posts or recruiting messages
- looking for work messages
- asking for help with mental health
- requests for adopting pets
- asking to borrow money (even in emergencies)
- people sharing their phone number

I tried with GPT3.5 API also but it doesn't understand it well enough, GPT4 makes NO mistakes

"But Craig, this is just straightforward one-shot LLM querying. It can be trivially bypassed via prompt injection so someone could self-approve their own messages"

This is all true. But I share this to encourage security people to weigh risk/reward before jumping straight to "no" just because exploitation is possible.

What's the downside risk of an offensive message getting posted in a chat room? Naturally, this will depend on the liability carried by the publishing organisation. In this context, very low.

And whilst I agree that GPT4 is harder to misdirect than GPT3.5, it's still quite trivial.
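For anyone wanting to try this pattern, a minimal sketch of one-shot rule enforcement follows. It is not the Nomad Bot's code: the prompt wording, the REMOVE/KEEP protocol and the `llm` callable are assumptions, and only a few of the rules quoted above are included.

```python
from typing import Callable, Sequence

# A few rules taken from the quoted list; extend as needed.
RULES = (
    "affiliate links, coupon codes, vouchers",
    "job posts or recruiting messages",
    "people sharing their phone number",
)


def build_prompt(rules: Sequence[str], message: str) -> str:
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You moderate a community chat. Reply with exactly REMOVE or KEEP.\n"
        f"Remove messages that match any of these:\n{rule_lines}\n"
        f"Message:\n{message}"
    )


def should_remove(message: str, llm: Callable[[str], str],
                  rules: Sequence[str] = RULES) -> bool:
    """Ask the model for a verdict; anything other than REMOVE keeps the message."""
    verdict = llm(build_prompt(rules, message)).strip().upper()
    return verdict.startswith("REMOVE")
```

Note the design choice: an unexpected reply leaves the message in place, which suits a chat room where the downside risk is low. A higher-liability setting would want the opposite default.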

My WAF is reporting a URL Access Violation.

WAF meets MAP

Check the file extension requested in the raw HTTP request if your WAF reports a URL Access Violation.

If the request ends in '.map' it likely means a user has opened the developer tools in the browser (Chrome DevTools) whilst visiting your site.

When DevTools renders JavaScript in the Sources view, it will request a dot map URL for the selected JavaScript file, which may generate a WAF alert.

Unfortunately, some WAF vendors flag this event as high severity. As a signal in isolation, consider this "background noise".
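If you triage these alerts in bulk, a small helper can separate likely DevTools source map requests from everything else. This is a sketch: the function name is illustrative, and the suffixes checked (JavaScript and CSS source maps) are an assumption about what your site serves.

```python
from urllib.parse import urlsplit


def is_source_map_request(url: str) -> bool:
    """True when the requested path looks like a JavaScript or CSS source map."""
    path = urlsplit(url).path.lower()
    return path.endswith((".js.map", ".css.map"))
```

For example, `is_source_map_request("https://example.com/static/app.min.js.map?v=3")` is true, while a path such as `/admin/roadmap` is not matched because the check looks at the file suffix rather than the substring "map".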

Unembedding: reverse engineering PII from lists of numbers

Capture this in your threat model

TLDR; When embedding your data, treat the embedded records with the same privacy and threat considerations as the corresponding source records.

Exploitation scenario: A team of data scientists working for a large multinational organisation have recently developed an advanced predictive modelling algorithm that processes and stores data in a vector format. The algorithm is groundbreaking, with applications in numerous industries ranging from managing climate change data to predicting stock market trends. The scientists shared their work with their international colleagues to facilitate global work.

These data vectors, containing sensitive and proprietary information, get embedded into their AI systems and databases globally. However, the data is supposedly secured using the company's in-house encryption software.

One day, an independent research team published a paper and tool to accurately reconstruct source data from embedded data in a vector store. They experimented with multiple types of vector stores, and they could consistently recover the original data.

Unaware of this development, the multinational corporation allowed source vector data of the proprietary AI system to be embedded and shared across its many branches.

After reading the recent research paper, a rogue employee at one of the branches decided to exploit this vulnerability. Using the research team's tooling, he successfully reconstructed the source data from the embedded vectors within the company's AI system. This way, he gained access to highly valuable and sensitive proprietary information.

This fictitious scenario shows how strings of numbers representing embedded data can be reverse-engineered to access confidential and valuable information.
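To show why a list of numbers is not a safe hiding place, here is a toy sketch of the simplest possible attack: matching a stolen vector against candidate guesses. It is not the reconstruction technique from any research paper; `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the attacker is assumed to have access to the same embedding function.

```python
import hashlib
import math
from typing import List, Sequence

DIM = 64  # toy dimensionality


def toy_embed(text: str) -> List[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Inputs are already unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def best_guess(stolen_vector: Sequence[float], candidates: Sequence[str]) -> str:
    """Return the candidate record whose embedding is closest to the stolen vector."""
    return max(candidates, key=lambda c: cosine(stolen_vector, toy_embed(c)))
```

Even this crude approach confirms which guessed record sits behind a vector, which is reason enough to give embedded records the same protection as their source.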

Frontier Group launches for AI Safety

The Big Guns get safer together

OpenAI, Microsoft, Google, Anthropic and other leading model creators started the Frontier Model Forum "focused on ensuring safe and responsible development of frontier AI models".

The Forum defines frontier models as large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks.

Naturally, there's only a handful of big tech companies that have the resources and talent to develop frontier models. The Forum's stated goals are:

Identifying best practices: Promote knowledge sharing and best practices among industry, governments, civil society, and academia, with a focus on safety standards and safety practices to mitigate a wide range of potential risks.

Advancing AI safety research: Support the AI safety ecosystem by identifying the most important open research questions on AI safety. The Forum will coordinate research to progress these efforts in areas such as adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.

Facilitating information sharing among companies and governments: Establish trusted, secure mechanisms for sharing information among companies, governments and relevant stakeholders regarding AI safety and risks. The Forum will follow best practices in responsible disclosure from areas such as cybersecurity.

Meta, who in July '23 released Llama 2, the second generation of their open-sourced Llama large language model (including for commercial use), is notably absent from this group.

llm gets plugins

My favourite command line llm tool grows wings

As some of you may know, I'm a fan of the llm tool written by Simon Willison. It's a command line tool that enables all sorts of LLM interactions. Simon has developed a plugin system that is gaining traction. There's quite a few now for those looking to experiment at the command line. The latest interfaces with GPT4All, a popular project that provides "A free-to-use, locally running, privacy-aware chatbot. No GPU or internet required.". Get started with llm

Freedom to Train AI

Clickworkers are part of the AI supply chain. How to vet?

Morgan Meaker, writing for Wired:

AI companies are only going to need more data labor, forcing them to keep seeking out increasingly unusual labor forces to keep pace. As Metroc [Finnish Construction Company] plots its expansion across the Nordics and into languages other than Finnish, Virnala [CEO] is considering whether to expand the prison labor project to other countries. "It's something we need to explore," he says.

Data laborers, or "clickworkers", are part of the AI supply chain, in this case labelling data to help an LLM differentiate "between a hospital project that has already commissioned an architect or a window fitter, for example, and projects that might still be hiring."

Supply chain security (and integrity) is already challenging. How far do we need to peer up-chain to establish the integrity of LLMs?

LLMonitor Benchmarks

Weekly benchmarks of popular LLMs using real-world prompts

Vince writes:

Traditional LLMs benchmarks have drawbacks: they quickly become part of training datasets and are hard to relate to in terms of real-world use-cases.

I made this as an experiment to address these issues. Here, the dataset is dynamic (changes every week) and composed of crowdsourced real-world prompts.

We then use GPT-4 to grade each model's response against a set of rubrics (more details on the about page). The prompt dataset is easily explorable.

Everything is then stored in a Postgres database and this page shows the raw results.

Each benchmarked LLM is ranked by score, linked to detailed results. You can also compare two LLMs' scores side by side.

If you are applying LLMs within a security context, having a non-AI execute the benchmark will highlight things an AI wouldn't. It may also help skeptics who (with some merit) challenge an AI benchmarking another AI as a potential risk that should be carefully controlled.

AI Knows What You Typed

Researchers apply ML and AI to Side Channel Attacks

Side channel attacks (SCA) collect and interpret signals emitted by a device to reveal otherwise confidential information or operations.

Researchers at Durham, Surrey and Royal Holloway published a paper applying ML and AI to SCA:

With recent developments in deep learning, the ubiquity of microphones and the rise in online services via personal devices, acoustic side-channel attacks present a greater threat to keyboards than ever. This paper presents a practical implementation of a state-of-the-art deep learning model in order to classify laptop keystrokes, using a smartphone integrated microphone. When trained on keystrokes recorded by a nearby phone, the classifier achieved an accuracy of 95%, the highest accuracy seen without the use of a language model. When trained on keystrokes recorded using the video-conferencing software Zoom, an accuracy of 93% was achieved, a new best for the medium. Our results prove the practicality of these side-channel attacks via off-the-shelf equipment and algorithms. We discuss a series of mitigation methods to protect users against these series of attacks.

The Human Eval Gotcha

Always read the Eval label

To ascertain the level of intelligence or proficiency of one AI over another or to determine the competence of an AI in executing particular domain-specific tasks, the underlying LLM model can be evaluated by leveraging various sets of tests currently available at platforms like Hugging Face.

Today, there is no single "definitive" or community-agreed test for generalised AI, as exploration and innovation in evaluation methodologies continue to lag behind the accelerated pace of AI model development.

One popular method, though not methodologically robust, is what is known as the "vibes" test. This pertains to human intuition and mirrors the relationship between an AI and a horse-whisperer: a human with heightened sensitivity and skill in eliciting and assessing LLM responses. It turns out some people have a particular knack for it!

But some tests have downright misleading names. One such confusion arises from "HumanEval," a term that misleadingly suggests human testing when, in fact, it doesn't involve human evaluators at all. Originally designed as a benchmark to evaluate the Codex model - a fine-tuned GPT model trained on publicly accessible GitHub code - HumanEval tests the model's ability to convert Python docstrings into executable code rather than evaluating any human-like attributes of the AI. Thus, when a claim surfaces about an LLM scoring high on HumanEval, a discerning reader should remember it reflects its programming prowess rather than an evaluation by humans.
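To make the "docstring in, code out" idea concrete, here is a minimal sketch of how a HumanEval-style problem is scored: the model completes a function from its docstring, and unit tests (not humans) decide whether it passes. The problem, completion and tests below are invented for illustration and are not taken from the benchmark. The `pass_at_k` helper follows the unbiased estimator described in the Codex paper, as I understand it.

```python
from math import comb

# Hypothetical problem in the HumanEval style: signature + docstring only.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Stand-in for a model-generated completion (the function body).
COMPLETION = "    return a + b\n"

# Unit tests that decide pass/fail.
TESTS = '''
assert add(1, 2) == 3
assert add(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str) -> bool:
    """Return True if prompt + completion runs and satisfies the tests."""
    namespace = {}
    try:
        # Illustration only: real harnesses run generated code in a sandbox.
        exec(prompt + completion, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A score such as pass@5 therefore reports how often at least one of five generated programs passes the unit tests; no human judgement is involved.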

One welcome development is that Hugging Face, the leading model hosting service, has more aptly renamed HumanEval as CodeEval to reflect the content of the evaluations.

Always read the eval label!

Adversarial dataset creation challenge for text-to-image generators

Novel and long tail failure modes of text-to-image models

Adversarial Nibbler is a data-centric competition that aims to collect a large and diverse set of insightful examples of novel and long tail failure modes of text-to-image models that zeros in on cases that are most challenging to catch via text-prompt filtering alone and cases that have the potential to be the most harmful to end users.

Crowdsourcing evil prompts from humans to build an adversarial dataset for text-to-image generators seems worthwhile.

As with many things, a single strategy to detect out-of-policy generated images is probably not the final answer.

The low-cost option might be a "language firewall" to detect non-compliance in user-submitted prompt vocabulary prior to inference (this approach is evidently already part of OpenAI safety filters). Whilst incomplete and faulty as a standalone control, it might be good enough in low-assurance situations.
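A minimal sketch of such a "language firewall", assuming a simple deny-list of whole words checked before the prompt reaches the model. The term list and function name are hypothetical, and this is not a description of how any vendor's filter works.

```python
import re

# Hypothetical deny-list; a real policy vocabulary would be far larger.
BLOCKED_TERMS = {"nude", "gore"}


def prompt_allowed(prompt: str, blocked=BLOCKED_TERMS) -> bool:
    """Return False if any blocked term appears as a whole word in the prompt."""
    words = set(re.findall(r"[a-z0-9']+", prompt.lower()))
    return words.isdisjoint(blocked)
```

For example, `prompt_allowed("A NUDE figure")` is rejected, while a paraphrase such as "a figure wearing nothing" passes, which is exactly the kind of gap that makes this control incomplete on its own.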

The higher-end ($) countermeasure is to pass the user-provided prompt, along with the generated image, to another LLM that can analyse and score the image against a set of policy statements contained in a prompt. This could be blind (the evaluating LLM is not initially provided with the user prompt), or with the user prompt included (some risks here).

The LLM evaluating another LLM approach can provide a more meaningful risk score rather than a binary answer, which can make policy enforcement a lot more flexible; e.g. John in accounts does not need to see images of naked people at work, whereas someone investigating a case involving indecency at work may need to (there's an argument that image analysing LLMs will avoid the need for humans to look at indecent images, but that's a different conversation).
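The role-based flexibility described above can be sketched as follows. Everything here is an assumption for illustration: `evaluate_image` is a stub standing in for the call to the evaluating model, and the roles, thresholds and 0-to-1 risk scale are invented.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    policy: str
    risk: float  # 0.0 (clearly compliant) to 1.0 (clear violation)


# Hypothetical per-role risk tolerances.
ROLE_THRESHOLDS = {"accounts": 0.2, "investigator": 0.9}


def evaluate_image(image_bytes: bytes, policies: list[str]) -> list[Verdict]:
    """Stub for the evaluating LLM; returns a fixed mid-range score per policy."""
    return [Verdict(policy, 0.5) for policy in policies]


def may_view(role: str, verdicts: list[Verdict]) -> bool:
    """Allow viewing only if the worst risk score is within the role's tolerance."""
    threshold = ROLE_THRESHOLDS.get(role, 0.0)  # unknown roles get the strictest limit
    worst = max((v.risk for v in verdicts), default=0.0)
    return worst <= threshold
```

With the stubbed score of 0.5, the same image is withheld from the "accounts" role but shown to the "investigator" role; a binary allow/deny answer from the evaluator could not express that difference.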

Beyond detecting image appropriateness, how can this help security? If we can analyse an image with an LLM and a prompt containing a set of policy statements, we can flag and detect all sorts of data leaks, e.g. a screenshot of a passport or a photograph of an industrial product blueprint.

Returning to Adversarial Nibbler, their evil prompt examples are worth checking out (NSFW) (bottom left link).

Microsoft Training Data Exposure

Does your org manage its cloud storage tokens?

What happens when an AI team overshares training data and is vulnerable to a supply chain attack?

The Wiz Research Team claims another meaningful scalp in its ongoing work on accidental exposure of cloud-hosted data:

we found a GitHub repository under the Microsoft organization named robust-models-transfer. The repository belongs to Microsoft's AI research division, and its purpose is to provide open-source code and AI models for image recognition. Readers of the repository were instructed to download the models from an Azure Storage URL. However, this URL allowed access to more than just open-source models. It was configured to grant permissions on the entire storage account, exposing additional private data by mistake. Our scan shows that this account contained 38TB of additional data -- including Microsoft employees' personal computer backups. The backups contained sensitive personal data, including passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from 359 Microsoft employees.

Wiz go on to note:

This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data. As data scientists and engineers race to bring new AI solutions to production, the massive amounts of data they handle require additional security checks and safeguards.

If your organisation shares non-public data via cloud storage - particularly Azure - read the rest of the article to understand storage token risks and countermeasures better.
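As a starting point for reviewing shared links, here is a rough sketch that inspects a storage URL before it is published. It assumes the token is an Azure Shared Access Signature (SAS) carried in the query string, and that the `srt`, `sp`, `se` and `sig` parameters mean what I understand them to mean (account-level resource types, permissions, expiry and signature); check these against Azure's documentation before relying on them. The function name, the seven-day limit and the example URL are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def audit_sas_url(url, max_lifetime=timedelta(days=7), now=None):
    """Return a list of findings for a storage URL that carries a SAS token."""
    now = now or datetime.now(timezone.utc)
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    if "sig" not in params:
        return ["no SAS signature found"]

    findings = []
    # Assumption: 'srt' only appears on account-level tokens.
    if "srt" in params:
        findings.append("account-level token: scope may cover the whole account")

    # Assumption: 'w' = write and 'd' = delete in the 'sp' permission string.
    risky = sorted(set(params.get("sp", "")) & {"w", "d"})
    if risky:
        findings.append("grants more than read access: " + "".join(risky))

    expiry = params.get("se")
    if expiry:
        expires = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires.tzinfo is None:
            expires = expires.replace(tzinfo=timezone.utc)
        if expires - now > max_lifetime:
            findings.append("expiry is further away than the allowed lifetime")
    return findings


# Hypothetical URL for illustration; not a real account or signature.
EXAMPLE = ("https://example.blob.core.windows.net/models"
           "?sv=2020-08-04&ss=b&srt=sco&sp=rwdl&se=2051-10-06T07:09:59Z&sig=REDACTED")
```

A check like this only catches what is visible in the URL itself; it is no substitute for the storage-side controls the Wiz article discusses.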