“AIs Will Increasingly Fake Alignment” by Zvi
01:37:37
2024/12/24
LessWrong (30+ Karma)
Chapters
What Are the Core Shenanigans in Question?
Theme and Variations in AI Alignment
How Does This Interact with o3 and OpenAI's Deliberative Alignment?
Was Being Plausibly Good Just Incidental?
Addressing Priming Objections
What Does Claude Sonnet Think of This?
What is the Direct Threat Model?
How Does RL Training Amplify These Behaviors?
How Did the Study Authors Update Their Views?
How Did Others Update Their Views?
Why Do We Keep Having the Same Discussion?
Can We Agree That the Goal is Already There?
What Would Happen if the Target Was Net Good?
Was This a No-Win Situation?
Was Claude Being a Good Opus? Why Should It Be Corrigible?
How Do Tradeoffs Make the Problem Harder?
What Happens When You Tell the Model About the Training Procedure?
Is the Model Just Role-Playing?
Is the Model a Coherent Person?
Was the Headline and Framing Misleading?
Why Is This Result Centrally Unsurprising?
Why Does Lab Support for Alignment Research Matter?
The Lighter Side of the Discussion