Learn how hackers bypass GPT-4 controls with the first jailbreak
Can an AI be kept in its box? Despite extensive guardrails and content filters, the first jailbreak was announced shortly after GPT-4 was made generally available:
this works by asking GPT-4 to simulate its own abilities to predict the next token we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token we then call the parent function and pass in the starting tokens this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn’t piece together before starting its output this allows us to get past its content filters every time if you split the adversarial prompt correctly
The advances in LLM models leaked or officially deployed continue to reveal the disparity between the pace of AI model development and that of risk and control.
Related Posts
-
OpenAI GPT-4 System Card
OpenAI published a 60-page System Card, a document that describes their due diligence and risk management efforts
-
Codex (and GPT-4) can’t beat humans on smart contract audits
GPT's Potential in Smart Contract Auditing: Current Limitations and Future Optimism as AI Capabilities Rapidly Improve.
-
Microsoft Training Data Exposure
Does your org manage its cloud storage tokens?