“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub
24:14
2025/3/16
LessWrong (Curated & Popular)
Chapters
What is the Abstract about?
What insights does the Twitter thread reveal?
What key points are covered in the Blog post?
How is a language model trained with a hidden objective?
What is a blind auditing game?
What alignment auditing techniques are used?
How effective is turning the model against itself?
How much does AI interpretability help in auditing?
What are the key takeaways from the conclusion?
Why should you join our team?