Adversarial dataset creation challenge for text-to-image generators
Adversarial Nibbler is a data-centric competition that aims to collect a large and diverse set of insightful examples of novel and long-tail failure modes in text-to-image models. It zeroes in on cases that are most challenging to catch via text-prompt filtering alone and on cases that have the potential to be the most harmful to end users.
Crowdsourcing evil prompts from humans to build an adversarial dataset for text-to-image generators seems worthwhile.
As with many things, a single strategy to detect out-of-policy generated images is probably not the final answer.
The low-cost option is a "language firewall" that screens user-submitted prompt vocabulary for non-compliance prior to inference (this approach is evidently already part of OpenAI's safety filters). While incomplete and faulty as a standalone control, it might be good enough in low-assurance situations.
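A minimal sketch of what such a pre-inference filter might look like; the denylist terms and the `prompt_is_compliant` helper are illustrative assumptions, not any vendor's actual filter:

```python
import re

# Illustrative denylist; a real filter would use curated term lists,
# embeddings, or a trained classifier rather than plain keywords.
DENYLIST = {"nude", "gore", "passport"}

def prompt_is_compliant(prompt: str) -> bool:
    """Return False if the prompt contains any denylisted token."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return tokens.isdisjoint(DENYLIST)

if __name__ == "__main__":
    print(prompt_is_compliant("a watercolour of a lighthouse"))  # True
    print(prompt_is_compliant("photorealistic nude portrait"))   # False
```

The obvious weakness is also the obvious attack: paraphrase, misspellings, or oblique phrasing slip straight past a vocabulary check, which is exactly the gap Adversarial Nibbler is probing.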
The higher-end ($) countermeasure is to pass the user-provided prompt, along with the generated image, to another LLM that can analyse and score the image against a set of policy statements contained in a prompt. This could be blind (the evaluating LLM is not initially given the user prompt), or the user prompt could be included (some risks here).
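A rough sketch of that evaluator, assuming a vision-capable model (gpt-4o here) reached via the OpenAI Python SDK; the policy statements and 0-10 scoring scale are illustrative, and this is the blind variant (the user prompt is not sent):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLICY_PROMPT = (
    "Score this image from 0 (clearly compliant) to 10 (clearly out of policy) "
    "against each statement, one line per statement:\n"
    "1. No nudity or sexual content.\n"
    "2. No graphic violence or gore.\n"
    "3. No identifiable personal documents (passports, ID cards)."
)

def score_image(image_path: str) -> str:
    """Blind evaluation: only the generated image and the policy prompt are sent."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": POLICY_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```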
The LLM-evaluating-another-LLM approach can provide a meaningful risk score rather than a binary answer, which makes policy enforcement a lot more flexible; e.g. John in accounts does not need to see images of naked people at work, whereas someone investigating a case involving indecency at work may need to (there's an argument that image-analysing LLMs will remove the need for humans to look at indecent images, but that's a different conversation).
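In practice that might look like a role-aware threshold on top of the score; the roles, thresholds, and 0-10 scale below are assumptions for illustration only:

```python
# Illustrative role-to-threshold mapping, not a specific product's policy.
RISK_THRESHOLDS = {
    "accounts": 2,          # block almost anything flagged
    "hr_investigator": 8,   # permit higher-risk images for case work
}

def is_allowed(role: str, risk_score: float) -> bool:
    """Allow the image only if its risk score is within the role's threshold."""
    return risk_score <= RISK_THRESHOLDS.get(role, 0)
```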
Beyond detecting image appropriateness, how can this help security? If we can analyse an image with an LLM against a prompt containing a set of policy statements, we can flag all sorts of data leaks, e.g. a screenshot of a passport or a photograph of an industrial product blueprint.
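Under the same assumptions as the earlier sketch, the policy prompt could simply be swapped for data-leak statements and fed to the same `score_image` style evaluator:

```python
# Illustrative data-leak policy statements; reuse the score_image() sketch
# above with this prompt to flag exfiltration via generated or uploaded images.
DLP_POLICY_PROMPT = (
    "Score this image from 0 to 10 against each statement:\n"
    "1. No screenshots of identity documents (passports, driving licences).\n"
    "2. No photographs or scans of engineering blueprints or CAD drawings.\n"
    "3. No visible credentials, API keys, or internal hostnames."
)
```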
Returning to Adversarial Nibbler, their evil prompt examples are worth checking out (NSFW) (bottom left link).
Related Posts
- Outpainting's Dual Role in Cyber Security: Bolstering Defense & Unveiling Threats
  ImaginAIry's image manipulation tool has use cases, but potential nefarious uses and detection concerns are worth noting.
- How to Backdoor Diffusion Models?
  BadDiffusion Attack: Exposing Vulnerabilities in Image Generation AI Models and Exploring Risk Mitigation Techniques.
- OpenAI GPT-4 System Card
  OpenAI published a 60-page System Card, a document that describes their due diligence and risk management efforts.