“AIs Will Increasingly Fake Alignment” by Zvi
01:37:37
2024/12/24
LessWrong (30+ Karma)
Chapters
What Are the Core Shenanigans in Question?
Theme and Variations in AI Alignment
How Does This Interact with o3 and OpenAI's Deliberative Alignment?
Was Being Plausibly Good Just Incidental?
Addressing Priming Objections
What Does Claude Sonnet Think of This?
What is the Direct Threat Model?
How Does RL Training Amplify These Behaviors?
How Did the Study Authors Update Their Views?
How Did Others Update Their Views?
Why Do We Keep Having the Same Discussion?
Can We Agree That the Goal is Already There?
What Would Happen if the Target Was Net Good?
Was This a No-Win Situation?
Was Claude Being a Good Opus? Why Should It Be Corrigible?
How Do Tradeoffs Make the Problem Harder?
What Happens When You Tell the Model About the Training Procedure?
Is the Model Just Role-Playing?
Is the Model a Coherent Person?
Was the Headline and Framing Misleading?
Why Is This Result Centrally Unsurprising?
Why Does Lab Support for Alignment Research Matter?
The Lighter Side of the Discussion