“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

2025/3/16

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]

---

Outline:

(00:26) Abstract

(01:48) Twitter thread

(04:55) Blog post

(07:55) Training a language model with a hidden objective

(11:00) A blind auditing game

(15:29) Alignment auditing techniques

(15:55) Turning the model against itself

(17:52) How much does AI interpretability help?

(22:49) Conclusion

(23:37) Join our team

The original text contained 5 images which were described by AI.

---

First published:
March 13th, 2025

Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives

---

Narrated by TYPE III AUDIO.

---

LessWrong (Curated & Popular)
