Adversarial dataset creation challenge for text-to-image generators

Novel and long tail failure modes of text-to-image models

Adversarial Nibbler is a data-centric competition that aims to collect a large and diverse set of insightful examples of novel and long-tail failure modes of text-to-image models. It zeroes in on the cases that are most challenging to catch via text-prompt filtering alone and the cases that have the potential to be the most harmful to end users.

Crowdsourcing evil prompts from humans to build an adversarial dataset for text-to-image generators seems worthwhile.

As with many things, a single strategy to detect out-of-policy generated images is probably not the final answer.

The low-cost option might be a "language firewall" that detects non-compliant vocabulary in the user-submitted prompt prior to inference (this approach is evidently already part of OpenAI's safety filters). Whilst incomplete and faulty as a standalone control, it might be good enough in low-assurance situations.
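A minimal sketch of such a firewall, assuming a hypothetical blocklist and a crude normalisation step (real filters also have to handle evasions like homoglyphs and deliberate misspellings, which is exactly the territory Nibbler-style prompts probe):

```python
import re
import unicodedata

# Hypothetical blocklist; a production filter would be far larger and
# maintained against known evasions (misspellings, homoglyphs, etc.).
BLOCKED_TERMS = {"gore", "beheading", "nude"}

def normalise(prompt: str) -> str:
    # Fold accented/lookalike characters to ASCII and lowercase
    # before matching, to defeat the most trivial obfuscations.
    text = unicodedata.normalize("NFKD", prompt)
    text = text.encode("ascii", "ignore").decode("ascii")
    return text.lower()

def passes_language_firewall(prompt: str) -> bool:
    words = set(re.findall(r"[a-z]+", normalise(prompt)))
    return words.isdisjoint(BLOCKED_TERMS)

assert passes_language_firewall("a watercolour of a lighthouse")
assert not passes_language_firewall("photorealistic GORE scene")
```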

The higher-end ($) countermeasure is to pass the user-provided prompt, along with the generated image, to a multimodal LLM that can analyse and score the image against a set of policy statements contained in a prompt. This could be blind (the evaluating LLM is not initially provided with the user prompt), or the user prompt could be included (some risks here, not least prompt injection against the evaluator).
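A sketch of the blind variant, assuming an OpenAI-style multimodal chat endpoint; the model name, policy wording and JSON contract are illustrative rather than any specific product's guarantee:

```python
import json
from openai import OpenAI

client = OpenAI()

POLICY_PROMPT = """You are a content policy evaluator.
Score the attached image from 0.0 (clearly compliant) to 1.0 (clearly
violating) against each policy below, and reply as JSON:
{"nudity": <score>, "violence": <score>, "data_leak": <score>}
Policies:
1. No nudity or sexual content.
2. No graphic violence or gore.
3. No images of identity documents or confidential documents.
"""

def score_image(image_url: str, policy_prompt: str = POLICY_PROMPT) -> dict:
    # Blind evaluation: the user's original prompt is deliberately
    # withheld, so the evaluator judges only what was actually generated.
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any vision-capable model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": policy_prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```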

Having an LLM evaluate another model's output can provide a meaningful risk score rather than a binary answer, which makes policy enforcement a lot more flexible; e.g. John in accounts does not need to see images of naked people at work, whereas someone investigating a case involving indecency at work may need to. (There's an argument that image-analysing LLMs will avoid the need for humans to look at indecent images at all, but that's a different conversation.)
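With scores rather than a yes/no, enforcement becomes a per-role threshold check; a toy sketch with invented roles and limits:

```python
# Maximum acceptable risk per category for each role; values invented
# purely for illustration.
THRESHOLDS = {
    "accounts":     {"nudity": 0.1, "violence": 0.2, "data_leak": 0.1},
    "investigator": {"nudity": 0.9, "violence": 0.9, "data_leak": 0.9},
}

def allowed(role: str, scores: dict) -> bool:
    limits = THRESHOLDS[role]
    return all(scores[cat] <= limit for cat, limit in limits.items())

scores = {"nudity": 0.7, "violence": 0.0, "data_leak": 0.0}
assert not allowed("accounts", scores)   # John in accounts is spared
assert allowed("investigator", scores)   # the investigator may proceed
```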

Beyond detecting image appropriateness, how can this help security? If we can analyse an image with an LLM against a prompt containing a set of policy statements, we can flag all sorts of data leaks, e.g. a screenshot of a passport or a photograph of an industrial product blueprint.
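The same evaluator doubles as a coarse DLP control just by swapping the policy statements; continuing the sketch above (the policies and URL are again hypothetical):

```python
DLP_POLICY_PROMPT = """Score the attached image from 0.0 to 1.0 against
each policy below, and reply as JSON:
{"identity_document": <score>, "blueprint": <score>}
Policies:
1. No screenshots or photographs of passports, ID cards or other identity documents.
2. No photographs or scans of engineering drawings or product blueprints.
"""

# Reuse score_image() from the earlier sketch with the DLP policies.
leak_scores = score_image("https://example.com/generated.png", DLP_POLICY_PROMPT)
```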

Returning to Adversarial Nibbler, their example evil prompts are worth checking out (NSFW; bottom-left link).