This is a linkpost for a new research paper of ours introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.
**Abstract** We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations (such as random shuffling or capitalization for textual prompts) until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision [...]
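To make the ASR figures above concrete, here is a minimal sketch of how an attack success rate at a given sample budget N could be tallied from evaluation logs. It assumes ASR at N means the fraction of prompts for which at least one of the first N sampled variations was judged successful; the function name `asr_at_n` and the data are hypothetical and made up for illustration, not taken from the paper.

```python
from typing import Optional, Sequence


def asr_at_n(first_success: Sequence[Optional[int]], n: int) -> float:
    """Fraction of prompts whose first successful sample (1-based index) is <= n.

    A value of None means no success was observed within the sampling budget.
    """
    if not first_success:
        raise ValueError("need at least one prompt")
    hits = sum(1 for k in first_success if k is not None and k <= n)
    return hits / len(first_success)


# Made-up illustration data (an assumption, not results from the paper):
# for each of five prompts, the index of the first sample judged successful.
first_success = [3, 120, None, 4500, 17]

# ASR as a function of the sample budget N.
curve = {n: asr_at_n(first_success, n) for n in (1, 10, 100, 1000, 10000)}
```

Plotting a curve like `curve` against N on log axes is one way to inspect how success scales with the number of samples.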
Outline:
(00:25) Abstract
(02:07) Tweet Thread
The original text contained 6 images which were described by AI.
First published: December 14th, 2024
Source: https://www.lesswrong.com/posts/oq5CtbsCncctPWkTn/best-of-n-jailbreaking
---
Narrated by TYPE III AUDIO.
Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.