We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.
This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.
Abstract
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]
---
Outline:
(00:26) Abstract
(01:48) Twitter thread
(04:55) Blog post
(07:55) Training a language model with a hidden objective
(11:00) A blind auditing game
(15:29) Alignment auditing techniques
(15:55) Turning the model against itself
(17:52) How much does AI interpretability help?
(22:49) Conclusion
(23:37) Join our team
The original text contained 5 images which were described by AI.
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives
---
Narrated by TYPE III AUDIO.
---
Images from the article: