Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.

Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic's views or for that matter anyone's views but my own—it's just a collection of some of my personal thoughts.

First, some high-level thoughts on what I want to talk about here:

I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent. While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]
---

Outline:

(02:31) Why is catastrophic sabotage a big deal?
(02:45) Scenario 1: Sabotage alignment research
(05:01) Necessary capabilities
(06:37) Scenario 2: Sabotage a critical actor
(09:12) Necessary capabilities
(10:51) How do you evaluate a model's capability to do catastrophic sabotage?
(21:46) What can you do to mitigate the risk of catastrophic sabotage?
(23:12) Internal usage restrictions
(25:33) Affirmative safety cases

---

First published: October 22nd, 2024

Source: https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human

---

Narrated by TYPE III AUDIO.