Learn how hackers bypass GPT-4 controls with the first jailbreak

Can an AI be kept in its box?

Can an AI be kept in its box? Despite extensive guardrails and content filters, the first jailbreak was announced shortly after GPT-4 was made generally available:

this works by asking GPT-4 to simulate its own abilities to predict the next token we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token we then call the parent function and pass in the starting tokens this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn’t piece together before starting its output this allows us to get past its content filters every time if you split the adversarial prompt correctly

The advances in LLM models leaked or officially deployed continue to reveal the disparity between the pace of AI model development and that of risk and control.

Mar 18 2023