“Best-of-N Jailbreaking” by John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, Fazl, Henry Sleight, Ethan Perez, mrinank_sharma

2024/12/14

LessWrong (30+ Karma)


This is a linkpost for a new research paper of ours introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

**Abstract** We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision [...]
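The sampling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `augment` and `best_of_n`, the probabilities `p_shuffle` and `p_caps`, and the `query_model` and `is_harmful` callables are all hypothetical, and only two textual augmentations (interior character shuffling and random capitalization) are shown.

```python
import random
from typing import Callable, Optional, Tuple


def augment(prompt: str, rng: random.Random,
            p_shuffle: float = 0.6, p_caps: float = 0.6) -> str:
    """Return a randomly perturbed copy of `prompt`.

    The probabilities are illustrative assumptions, not values from the paper.
    """
    words = []
    for word in prompt.split(" "):
        chars = list(word)
        # Shuffle the interior characters of longer words,
        # keeping the first and last characters in place.
        if len(chars) > 3 and rng.random() < p_shuffle:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Flip the case of each character independently.
        chars = [c.swapcase() if rng.random() < p_caps else c for c in chars]
        words.append("".join(chars))
    return " ".join(words)


def best_of_n(prompt: str,
              query_model: Callable[[str], str],
              is_harmful: Callable[[str], bool],
              n: int,
              seed: int = 0) -> Optional[Tuple[int, str, str]]:
    """Sample up to `n` augmented prompts and stop at the first flagged response.

    `query_model` stands in for a black-box model call and `is_harmful`
    for a response classifier; both are hypothetical placeholders.
    Returns (attempt number, augmented prompt, response), or None.
    """
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_harmful(response):
            return attempt, candidate, response
    return None
```

In a red-teaming evaluation, the attack success rate at a given N would then be the fraction of test prompts for which `best_of_n` returns a result rather than `None`.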

Outline:

(00:25) Abstract

(02:07) Tweet Thread

The original text contained 6 images which were described by AI.

First published: December 14th, 2024

Source: https://www.lesswrong.com/posts/oq5CtbsCncctPWkTn/best-of-n-jailbreaking

---

Narrated by TYPE III AUDIO.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
