
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

2024/11/11

Lex Fridman Podcast

Key Insights

Why is Dario Amodei optimistic about the beauty within neural networks?

He believes that the simplicity of neural network rules generates complexity, similar to how simple evolutionary rules give rise to complex biology. This simplicity creates a rich structure of beauty within the networks that is yet to be fully discovered and understood.

What does Dario Amodei compare the process of neural network development to?

He compares it to evolution, where simple rules over time lead to complex outcomes, such as the development of life and ecosystems. This comparison highlights the potential for deep beauty within neural networks.

What does Dario Amodei find fascinating about the current state of AI?

He is fascinated by the fact that we have created systems (neural networks) that can perform tasks we don't know how to program directly. This mystery is a compelling question that drives curiosity and exploration.

How does Dario Amodei view the relationship between simplicity and complexity in AI?

He sees simplicity as a generative force for complexity. The simple rules of neural networks and evolution create intricate and beautiful structures that are often overlooked due to their complexity.

What does Dario Amodei appreciate about the work being done on AI safety?

He appreciates the dual focus on safety and the beauty of discovery within AI. He recognizes the importance of both ensuring AI safety and exploring the profound beauty that lies within the systems being developed.

Chapters

Dario Amodei discusses scaling laws, the concept that increasing model size, data, and compute leads to better performance. He reflects on his experience with scaling laws and how they apply to various AI domains. He also explores the potential limits of scaling laws, such as data limitations and compute costs, and how these challenges might be overcome.
  • Scaling laws involve increasing model size, data, and compute.
  • Language models show strong scaling law behavior.
  • Limits to scaling laws include data limitations and compute costs.
  • Synthetic data generation and new architectures could overcome these limits.

Shownotes Transcript


The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, which is currently and often at the top of most LLM benchmark leaderboards.

On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics. I'm also joined afterwards by two other brilliant people from Anthropic. First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude, including the design of Claude's character and personality. A few folks told me she has probably talked with Claude more than any human at Anthropic, so she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude. After that, Chris Olah stopped by for a chat.

He's one of the pioneers of the field of mechanistic interpretability, which is an exciting set of efforts that aims to reverse engineer neural networks to figure out what's going on inside, inferring behaviors from neural activation patterns inside the network. This is a very promising approach for keeping future superintelligent AI systems safe, for example, by detecting from the activations when the model is trying to deceive the human it is talking to. And now a quick few-second mention of each sponsor. Checking them out in the description is the best way to support this podcast. We've got Encord for machine learning, Notion for AI-powered note-taking and team collaboration.

Shopify for selling stuff online, BetterHelp for your mind, and Element for your health. Choose wisely, my friends. Also, if you want to work with our amazing team, or just want to get in touch with me for whatever reason,

go to lexfridman.com/contact. And now onto the full ad reads. I try to make these interesting, but if you skip them, please do still check out our sponsors. I enjoy their stuff; maybe you will too. This episode is brought to you by Encord, a platform that provides data-focused AI tooling for data annotation, curation, management, and for model evaluation. We talk a little bit about public benchmarks in this podcast, I think mostly focused on software engineering via SWE-bench.

There's a lot of exciting development around how you have a benchmark you can't cheat on: if it's not public, then you can use it the right way, which is to evaluate how well the annotation, the curation, the training, the retraining, the post-training, all of that, is working. But a lot of the aforementioned conversation with the Anthropic folks was focused on the language side.

There's a lot of really incredible work that Encord is doing around annotating and organizing visual data. And they make it accessible for searching, for visualizing, for granular curation, that kind of thing. So in the end, data continues to be the most important thing.

The nature of data, what it means to be good data, whether it's human-generated or synthetic data, keeps changing. But it continues to be the most important component of what makes for a general intelligence system, I think, and also for specialized intelligence systems as well.

Go try out Encord to curate, annotate, and manage your AI data at encord.com/lex. That's encord.com/lex. This episode is also brought to you by the thing that keeps getting better and better and better.

Notion. It used to be an awesome note-taking tool. Then it started being a great team collaboration tool, so note-taking for many people and management of all kinds of other project stuff across large teams. Now, more and more, it is becoming an AI-superpowered note-taking and team collaboration tool, really integrating AI probably better than any note-taking tool I've used.

Not even close, honestly. Notion is truly incredible. I haven't gotten a chance to use Notion on a large team. I imagine that's really when it gets to shine, but on a small team it's just really, really, really amazing.

The integration of the AI assistant inside a particular file for summarization, for generation, all that kind of stuff, but also the integration of an AI assistant that's able to ask questions across docs, across wikis, across projects, across multiple files, to be able to summarize everything, maybe investigate project progress based on all the different stuff going on in different files. So, really, really nice integration of AI. Try Notion AI for free when you go to notion.com/lex. That's all lowercase, notion.com/lex, to try the power of Notion AI today.

This episode is also brought to you by Shopify, a platform designed for anyone to sell anywhere with a great-looking online store. I keep wanting to mention their CEO Tobi, who is brilliant, and I'm not sure why he hasn't been on the podcast yet. I need to figure that out. Every time after this ad read, I want to talk to him.

So he's brilliant in all kinds of domains, not just entrepreneurship or tech, just life, just his way of being, and it adds to the flavor profile of the conversation. I've been watching a cooking show a little bit recently.

I think my first cooking show is called Culinary Class Wars. It's a South Korean show where chefs with Michelin stars compete against chefs without Michelin stars. And there's something about one of the judges, just the charisma in the way that he describes every single detail of flavor, of texture, of what makes for a good dish. It's so contagious.

I don't really even care. I'm not a foodie; I don't care about food in that way, but he makes me want to care. So anyway, that's why I used the term flavor profile, referring to Tobi, which has nothing to do with what I should probably be saying.

And that is that you should use Shopify. I use Shopify. Super easy to create a store. I use it at lexfridman.com/store to sell a few shirts. Anyway, sign up for a one-dollar-per-month trial period at shopify.com/lex. That's all lowercase.

Go to shopify.com/lex to take your business to the next level today. This episode is also brought to you by BetterHelp, spelled H-E-L-P, help. They figure out what you need and match you with a licensed therapist in under 48 hours. It's for individuals.

It's for couples. It's easy, discreet, affordable, available worldwide. I saw a few books by a young psychologist, and I was in a delicious state of sleepiness, and I forgot to write his name down.

But I need to do some research. I need to go back. I need to go back to my younger self, when I dreamed of being a psychiatrist, and reading Sigmund Freud and reading Carl Jung, reading them the way young kids maybe read comic books.

They were superheroes of sorts. Camus as well, Kafka, Nietzsche, Hesse, Dostoevsky, the sort of 19th and 20th century literary philosophers of sorts. Anyway, I need to go back to that.

Maybe have a few conversations about Freud. Anyway, those folks, even if in part wrong, were true revolutionaries, truly brave to explore the mind in the way they did. They showed the power of talking and delving deep into the human mind, into the shadow, through the use of words.

So, highly recommended, and BetterHelp is a super easy way to start. Check them out at betterhelp.com/lex and save 10% on your first month. That's betterhelp.com/lex.

This episode is also brought to you by Element, my daily zero-sugar and delicious electrolyte mix that I'm going to take a sip of now. It's been so long that I've been drinking Element that I don't remember life before Element. I guess I used to take salt pills, because it's such a big component of my exercise routine to make sure I get enough water and get enough electrolytes, yeah.

So combined with fasting, which I've espoused a lot and continue to do to this day, combined with a low-carb diet, though I'm a little bit off the wagon on that one. I'm consuming probably like sixty, seventy, eighty, maybe a hundred some days, grams of carbohydrates. Not good, not good. My happiness is when I'm below twenty grams, or ten grams, of carbohydrates.

I'm not, like, measuring it out; I'm just using numbers to sound smart. I don't take the data too seriously, but I do take the signals that my body sends quite seriously. So without question, making sure I get enough magnesium and sodium, and get enough water, is priceless.

A lot of times when I had headaches or just felt off or whatever, they were fixed nearly immediately, sometimes after thirty minutes, just by drinking water with electrolytes, which is beautiful and delicious. Watermelon salt, the greatest flavor of all time. Get a sample pack for free with any purchase.

Try it at drinkLMNT.com/lex. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dario Amodei.

Let's start with the big idea of scaling laws and the scaling hypothesis. What is it? What is its history, and where do we stand today?

So I can only describe it as it relates to kind of my own experience, but I've been in the AI field for about ten years, and it was something I noticed very early on. So I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014, which is almost exactly ten years ago now.

And the first thing we worked on was speech recognition systems. And in those days, I think deep learning was a new thing that had made lots of progress, but everyone was always saying, we don't have the algorithms we need to succeed. You know, we're only matching a tiny, tiny fraction; there's so much we need to discover algorithmically.

We haven't found the picture of how to match the human brain. And in some ways I was fortunate; I kind of had almost beginner's luck, right? I was a newcomer to the field, and I looked at the neural net that we were using for speech, the recurrent neural networks, and I said, I don't know, what if you make them bigger and give them more layers? And what if you scale up the data along with this, right? I just saw these as independent dials that you could turn, and I noticed that the models started to do better as you gave them more data, as you made the models larger, or as you trained them for longer.

And I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform. And so initially my thinking was, hey, maybe that is just true for speech recognition systems, right? Maybe that's just one particular quirk, one particular area. I think it wasn't until 2017, when I first saw the results from GPT-1, that it clicked.

For me, that language is probably the area in which we can do this. We can get trillions of words of language data. We can train on them. And the models we were training in those days were tiny. You could train them on one to eight GPUs.

Whereas, you know, now we train jobs on tens of thousands, going to hundreds of thousands, of GPUs. And so when I saw those two things together, and, you know, there were a few people like Ilya Sutskever, who you've interviewed, who had somewhat similar views, right? He might have been the first one, though I think a few people came to similar views around the same time, right? There was, you know, Rich Sutton's bitter lesson; there was Gwern, who wrote about the scaling hypothesis. But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're going to be able to do these incredibly wide cognitive tasks if we just scale up the models. And at every stage of scaling, there are always arguments.

And, you know, when I first heard them, honestly, I thought, probably I'm the one who's wrong, and all these experts in the field are right. They know this situation better than I do, right? There's, you know, the Chomsky argument about, like, you can get syntax, but you can't get semantics.

There was this idea, you can make a sentence make sense, but you can't make a paragraph make sense. The latest ones we have today are, you know, we're going to run out of data, or the data isn't high quality enough, or models can't reason. And each time, every time, we manage to either find a way around, or scaling just is the way around.

Sometimes it's one, sometimes it's the other. And so I'm now at this point where I still think, you know, it's always quite uncertain. We have nothing but inductive inference to tell us that the next few years are going to be like the last ten years. But I've seen the movie enough times, I've seen the story happen for enough times, to really believe that probably the scaling is going to continue, and that there is some magic to it that we haven't really explained on a theoretical basis yet.

And of course, the scaling here is bigger networks, bigger data, bigger compute?

Yes, in particular, linear scaling up of bigger networks, bigger training times, and more and more data. So all of these things, almost like a chemical reaction.

You know, you have three ingredients in the chemical reaction, and you need to linearly scale up the three ingredients. If you scale up one and not the others, you run out of the other reagents and the reaction stops. But if you scale up everything in series, then the reaction can proceed.

And of course, now you have this kind of empirical science slash art. You can apply it to other, more nuanced things, like scaling laws applied to interpretability, or scaling laws applied to post-training, or just seeing how does this thing scale. But the big scaling law, I guess the underlying scaling hypothesis, has to do with big networks, big data leading to intelligence.

Yeah, we've documented scaling laws in lots of domains other than language, right? So initially, the paper we did that first showed it was in early 2020, where we first showed it for language. There was then some work late in 2020 where we showed the same thing for other modalities like images, video, text-to-image, image-to-text, math, that they all had the same pattern. And you're right, now there are other stages like post-training, or there are new types of reasoning models. And in all of those cases that we've measured, we see similar types of scaling laws.
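To make the "empirical science slash art" framing concrete, here is a minimal sketch of how a scaling-law fit works in practice: a power law is a straight line in log-log space between compute and loss, which you fit and then extrapolate. The numbers below are made up purely for illustration, not measurements from any real model.

```python
# Hedged sketch: fit a power law L(C) ~ a * C^(-alpha) to made-up
# (compute, loss) points and extrapolate one order of magnitude further.
import numpy as np

compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])  # training FLOPs (illustrative)
loss = np.array([3.9, 3.3, 2.8, 2.4, 2.05])         # eval loss (illustrative)

# A power law is a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha = -slope
print(f"fitted exponent alpha ~ {alpha:.3f}")

# Extrapolate to the next order of magnitude of compute.
predicted = 10 ** (intercept + slope * np.log10(1e22))
print(f"predicted loss at 1e22 FLOPs ~ {predicted:.2f}")
```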

A bit of a philosophical question, but what's your intuition about why bigger is better in terms of network size and data size? Why does it lead to more intelligent models? So, in my previous

career as a biophysicist, I did physics undergrad and then biophysics in grad school. So I think back to what I know as a physicist, which is actually much less than what some of my colleagues at Anthropic have in terms of expertise in physics. There's this concept called 1/f noise and 1/x distributions.

Where often, you know, just like if you add up a bunch of natural processes, you get a Gaussian; if you add up a bunch of kind of differently distributed natural processes, if you, like, take a probe and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one over the frequency. It's some kind of natural convergent distribution. And I think what it amounts to is that if you look at a lot of things that are produced by some natural process that has a lot of different scales, right, not a Gaussian, which is kind of narrowly distributed,

but, you know, if I look at kind of, like, large and small fluctuations that lead to electrical noise, they have this declining 1/x distribution. And so now I think of, like, patterns in the physical world, right, or in language. If I think about the patterns in language, there are some really simple patterns.

Some words are much more common than others, like "the." Then there's basic noun-verb structure. Then there's the fact that, you know, nouns and verbs have to agree, they have to coordinate.

There's a higher-level sentence structure, and there's a thematic structure of paragraphs. And so given that there is this regressing structure, you can imagine that as you make the networks larger, first they capture the really simple correlations, the really simple patterns, and there's this long tail of other patterns.

And if that long tail of other patterns is really smooth, like it is with the 1/f noise in, you know, physical processes like resistors, then you can imagine, as you make the network larger, it's kind of capturing more and more of that distribution. And so that smoothness gets reflected in how good the models are at predicting, in how well they perform. Language is an evolved process, right? We've developed language.

We have common words and less common words. We have common expressions and less common expressions. We have ideas, clichés, that are expressed frequently, and we have novel ideas. And that process has developed, has evolved with humans over millions of years. And so the guess, and this is pure speculation, would be that there is some kind of long-tail distribution of these ideas.
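As a toy illustration of the long-tail intuition here, word frequencies in ordinary text roughly follow a power law (Zipf's law): a few very common words and a long tail of rare ones. This sketch assumes you have some plain-text file to count; corpus.txt is just a placeholder name.

```python
# Hedged sketch: inspect the long-tailed (Zipf-like) shape of word frequencies.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:  # placeholder: any plain-text file
    words = f.read().lower().split()

counts = Counter(words).most_common()
for rank, (word, freq) in enumerate(counts[:10], start=1):
    # For a Zipf-like distribution, rank * frequency stays roughly constant.
    print(f"{rank:>3}  {word:<15} freq={freq:<8} rank*freq={rank * freq}")
print(f"vocabulary size: {len(counts)}, total tokens: {len(words)}")
```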

So there's the long tail, but also there's the height of the hierarchy of concepts that you're building up. So the bigger network presumably has a higher capacity to—

Exactly. If you have a small network, you only get the common stuff, right? If I take a tiny neural network, it's very good at understanding that, you know, a sentence has to have a verb, an adjective, a noun, right? But it's terrible at deciding what those verbs and adjectives and nouns should be and whether they should make sense. If I make it just a little bigger, it gets good at that. Then suddenly it's good at the sentences, but it's not good at the paragraphs. And so these rare and more complex patterns get picked up as I add more capacity to the network.

Well, the natural question then is, what's the ceiling of this? How complicated and complex is the real world? How much stuff is there to learn?

I don't think any of us knows the answer to that question. My strong instinct would be that there's no ceiling below the level of humans, right? We humans are able to understand these various patterns.

And so that makes me think that if we continue to, you know, scale up these models, to kind of develop new methods for training them and scaling them up, that will at least get to the level we've gotten to with humans. There's then a question of, you know, how much more is it possible to understand than humans do?

How much is it possible to be smarter and more perceptive than humans? I would guess the answer has got to be domain-dependent. If I look at an area like biology, and, you know, I wrote this essay, Machines of Loving Grace, it seems to me that humans are struggling to understand the complexity of biology, right?

If you go to Stanford or to Harvard or to Berkeley, you have whole departments of, you know, folks trying to study, like, the immune system or metabolic pathways. And each person understands only a tiny bit, part of it, specializes, and they're struggling to combine their knowledge with that of other humans.

And so I have an instinct that there's a lot of room at the top for AIs to get smarter. If I think of something like materials in the physical world, or, like, addressing, you know, conflicts between humans or something like that, I mean, you know, it may be that some of these problems are not intractable but much harder, and it may be that there's only so well you can do with some of these things, right? Just like with speech recognition, there's only so clearly I can hear your speech.

So I think in some areas there may be ceilings, you know, that are very close to what humans have done. In other areas, those ceilings may be very far away. And I think we'll only find out when we build these systems. It's very hard to know in advance. We can speculate,

but we can't be sure. And in some domains, the ceiling might have to do with human bureaucracies and things like this, as you write about. Yes. So humans are a fundamental part of the loop, and that's what causes the ceiling, not maybe the limits of the intelligence.

Yeah, I think in many cases, you know, in theory, technology could change very fast. For example, all the things that we might invent with respect to biology. But remember, there's a clinical trial system that we have to go through to actually administer these things to humans.

I think that's a mixture of things that are unnecessary and bureaucratic, and things that kind of protect the integrity of society. And the whole challenge is that it's hard to tell. It's hard to tell what's going on. It's hard to tell which is which, right? My view is definitely, I think, in terms of drug development, my view is that we're too slow and we're too conservative.

But certainly, if you get these things wrong, you know, it's possible to risk people's lives by being too reckless. And so at least some of these human institutions are in fact protecting people. So it's all about finding the balance. I strongly suspect that balance is kind of more on the side of pushing to make things happen faster.

But there is a balance. If we do hit a limit, if we do hit a slowdown in the scaling laws, what do you think would be the reason? Is it compute-limited, data-limited, is it something else? Idea-limited?

So, a few things. Now we're talking about hitting the limit before we get to the level of humans and the skill of humans. So I think one that's popular today, and I think, you know, could be a limit that we run into, like most of the limits, I would bet against it, but it's definitely possible, is we simply run out of data.

There's only so much data on the internet, and there are issues with the quality of the data, right? You can get hundreds of trillions of words on the internet, but a lot of it is repetitive, or it's search engine optimization drivel, or maybe in the future it will even be text generated by AI itself. And so I think there are limits to what can be produced in this way.

That said, we, and I would guess other companies, are working on ways to make data synthetic, where you can use the model to generate more data of the type that you have already, or even generate data from scratch. If you think about what was done with DeepMind's AlphaGo Zero, they managed to get a bot all the way from, you know, no ability to play Go whatsoever, to above human level, just by playing against itself. There was no example data from humans required in the AlphaGo Zero version of it.

The other direction, of course, is these reasoning models that do chain of thought and stop to think and reflect on their own thinking. In a way, that's another kind of synthetic data, coupled with reinforcement learning. So my guess is that with one of those methods we'll get around the data limitation, or there may be other sources of data that are available. We could also just observe that, even if there's no problem with data, as we start to scale models up, they just stop getting better. It's seemed to be a reliable observation that they've gotten better; that could just stop at some point for a reason we don't understand.

The answer could be that we need to, you know, invent some new architecture. There have been problems in the past with, say, the numerical stability of models, where it looked like things were leveling off, but actually, when we found the right unblocker,

they didn't end up doing so. So perhaps there's some new optimization method or some new technique we need to unblock things. I've seen no evidence of that so far, but if things were to slow down, that perhaps could be one reason.

What about the limits of compute, meaning the expensive nature of building bigger and bigger data centers?

So right now, I think, you know, most of the frontier model companies, I would guess, are operating at roughly, you know, one-billion-dollar scale, plus or minus a factor of three, right? Those are the models that exist now or are being trained now. I think next year we're going to go to a few billion, and then, in 2026, we may go above ten billion, and probably by 2027, there are ambitions to build hundred-billion-dollar clusters. And I think all of that actually will happen.

There's a lot of determination to build the compute to do it within this country, and I would guess that it actually does happen. Now, if we get to a hundred billion and that's still not enough compute, that's still not enough scale, then either we need even more scale, or we need to develop some way of doing it more efficiently, of shifting the curve. I think, between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human-level ability, right? Some of the new models that we've developed, some reasoning models that have come from other companies, they're starting to get to what I would call the PhD or professional level, right? If you look at their coding ability.

The latest model we released, Sonnet 3.5, the new updated version, it gets something like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was 3 or 4%. So in ten months, we've gone from 3% to 50% on this task.

And I think in another year we'll probably be at 90%. I mean, I don't know, it might even be less than that. We've seen similar things in graduate-level math, physics, and biology from models like OpenAI's o1. So if we just continue to extrapolate this, right, in terms of skill that we have, I think if we extrapolate the straight curve, within a few years we will get to these models being above the highest professional level in terms of humans.

Now, will that curve continue? You've pointed to, and I've pointed to, a lot of possible reasons why that might not happen. But if the extrapolation curve continues, that is the

trajectory we're on. So Anthropic has several competitors. It'd be interesting to get your sense of them all: OpenAI, Google, xAI, Meta. What does it take to win, in the broad sense of win, in this space?

Yeah. So I want to separate out a couple of things, right? So Anthropic's mission is to kind of try to make this all go well, right? And we have a theory of change called race to the top,

right? Race to the top is about trying to push the other players to do the right thing by setting an example. It's not about being the good guy; it's about setting things up so that all of us can be the good guy.

I'll give a few examples of this. Early in the history of Anthropic, one of our co-founders, Chris Olah, who I believe you're interviewing soon, he co-founded the field of mechanistic interpretability, which is an attempt to understand what's going on inside AI models. So we had him and one of our early teams focus on this area of interpretability, which we think is good for making models safe and transparent. For three or four years, that had no commercial application.

Whatsoever. It still doesn't today. We're doing some early betas with it, and probably it will eventually. But, you know, this is a very, very long research bet, and one in which we've built in public and shared our results publicly. And we did this because, you know, we think it's a way to make models safer.

An interesting thing is that as we've done this, other companies have started doing it as well. In some cases because they've been inspired by it; in some cases because they're worried that, you know, if other companies are doing this and look more responsible, they want to look more responsible too. No one wants to look like the irresponsible actor. And so they adopt this,

they adopt this as well. When folks come to Anthropic, interpretability is often a draw, and I tell them, the other places you didn't go, tell them why you came here. And then you see, soon, that there are interpretability teams elsewhere as well. And in a way that takes away our competitive advantage, because it's like, oh, now others are doing it as well. But it's good.

It's good for the broader system. And so we have to invent some new thing that we're doing that others aren't doing as well. And the hope is to basically bid up the importance of doing the right thing.

And it's not about us in particular, right? It's not about having one particular good guy. Other companies can do this as well. If they join the race to do this, you know, that's the best news ever, right? It's just about kind of shaping the incentives to point upward instead of shaping the incentives to point downward.

And we should say, in this example of the field of mechanistic interpretability, it's just a rigorous, non-hand-wavy way of doing AI safety, or at least it's tending that way.

Trying to be. I mean, I think we're still early in terms of our ability to see things, but I've been surprised at how much we've been able to look inside these systems and understand what we see, right? Unlike with the scaling laws, where it feels like there's some law that is driving these models to perform

better, on the inside, the models aren't, you know, there's no reason why they should be designed for us to understand them, right? They're designed to operate, they're designed to work, just like the human brain or human biochemistry. They're not designed for a human to open up the hatch, look inside, and understand them. But we have found, and you know, you can talk in much more detail about this to Chris, that when we open them up, when we do look inside them, we find things that are surprisingly interesting.

And as a side effect, you also get to see the beauty of these models. You get to explore the beautiful nature of large neural networks through the mechanistic

interpretability kind of methodology. I'm amazed at how clean it's been. I'm amazed at things like induction heads. I'm amazed at things like, you know, that we can use sparse autoencoders to find these directions within the networks, and that the directions correspond to these very clear concepts.

We demonstrated this a bit with the Golden Gate Bridge Claude. So this was an experiment where we found a direction inside one of the neural network layers that corresponded to the Golden Gate Bridge.

And we just turned that way up. And so we released this model as a demo, which was kind of half a joke, for a couple of days. But it was an illustration of the method we developed. And you could take the model,

you could ask it about anything. You could say, how's your day? Anything you asked, because this feature was activated, it would connect to the Golden Gate Bridge. So it would say, you know, I'm feeling relaxed and expansive, much like the arches of the Golden Gate Bridge, or

you know, it would masterfully change topic to the Golden Gate Bridge and integrate it. There was also a sadness to the focus it had on the Golden Gate Bridge. I think people quickly fell in love with it. I think so. People already miss it, because it was taken down, I think, after a day.
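For readers curious what "finding a direction and turning it up" can look like mechanically, here is a hedged sketch of activation steering on a small open model. This is not Anthropic's code: the feature direction below is just a random stand-in for a direction a sparse autoencoder would actually learn (such as the Golden Gate Bridge feature), and GPT-2 plus a middle layer are arbitrary choices.

```python
# Hedged sketch: add a scaled "feature direction" to a transformer's residual
# stream during generation. A real setup would use a direction learned by a
# sparse autoencoder; here a random unit vector is used as a stand-in.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

d_model = model.config.n_embd
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()
strength = 10.0  # how hard to "turn the feature up"

def steer(module, inputs, output):
    # output[0] is the block's hidden states: add the scaled direction at
    # every token position, nudging generation toward the concept.
    hidden = output[0] + strength * feature_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)  # middle layer, arbitrary

ids = tokenizer("How is your day?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()
```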

Somehow these interventions on the model, where you kind of adjust its behavior, somehow emotionally made it seem more human than any other version of the model.

Strong personality, strong ideas.

Strong personality. It has these kind of, like, obsessive interests. We can all think of someone who's, like, obsessed with something. So it does make it feel somehow

a bit more human. Let's talk about the present, talk about Claude. So this year, a lot has happened.

In March, Claude 3 Opus, Sonnet, and Haiku were released. Then Claude 3.5 Sonnet in July, with an updated version just now released. And then also Claude 3.5 Haiku was released.

Okay. Can you explain the difference between Opus, Sonnet, and Haiku, and how you think about the different versions?

Yeah. So let's go back to March when we first released these three models. So, you know, our thinking was, different companies produce kind of large and small models, better and worse models. We felt that there was demand both for a really powerful model, you know, that might be a little slower that you'd have to pay more for, and also for fast, cheap models that are as smart as they can be for how fast and cheap they are, right? Whenever you want to do some kind of, like, you know, difficult analysis, like if I want to write code, for instance, or I want to brainstorm ideas or I want to do creative writing, I want the really powerful model.

But then there's a lot of practical applications in a business sense where it's like, I'm interacting with a website, I'm, like, doing my taxes, or I'm talking to, you know, a legal adviser and I want to analyze a contract. Or we have plenty of companies that are just like, you know, I want autocomplete in my IDE or something. And for all of those things, you want to act fast and you want to use the model very broadly. So we wanted to serve that whole spectrum of needs.

So we ended up with this kind of poetry theme. And so what's a really short poem? It's a haiku. And so Haiku is the small, fast, cheap model that was, at the time it was released, surprisingly intelligent for how fast and cheap it was. A sonnet is a medium-sized poem, right, a couple of paragraphs.

And so Sonnet was the middle model. It is smarter, but also a little bit slower, a little bit more expensive. And an opus, like a magnum opus, is a large work.

Opus was the largest, smartest model at the time. So that was the original kind of thinking behind it. And our thinking then was, well, each new generation of models should shift that trade-off curve.

So when we released Sonnet 3.5, it has roughly the same, you know, cost and speed as the Sonnet 3 model, but it increased its intelligence to the point where it was smarter than the original Opus 3 model, especially for code, but also just in general. And so now we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest old model.

So basically the aim here is to shift the curve, and then at some point there's going to be an Opus 3.5. Now, every new generation of models has its own thing. They use new data, their personality changes in ways that we kind of try to steer but are not fully able to steer.

And so there's never quite that exact equivalence where the only thing you're changing is intelligence. We always try to improve other things, and some things change without us knowing or measuring. So it's very much an inexact science. In many ways, the manner and personality of these models is more an art than it is a science.

So what is the reason for the span of time between, say, Claude Opus 3.0 and 3.5? What takes that time, if you can speak to it?

Yeah. So there are different processes. There's pre-training, which is just kind of the normal language model training, and that takes a very long time. That uses, you know, these days, tens of thousands, sometimes many tens of thousands of GPUs or TPUs or Trainium, or, you know, we use different platforms,

but, you know, accelerator chips, often training for months. There's then a kind of post-training phase where we do reinforcement learning from human feedback, as well as other kinds of reinforcement learning. That phase is getting larger and larger now. And, you know, often that's less of an exact science. It often takes effort to get it right.

Models are then tested with some of our early partners to see how good they are, and they're then tested both internally and externally for their safety, particularly for catastrophic and autonomy risks. So we do internal testing according to our responsible scaling policy, which I, you know, can talk more about in detail.

And then we have an agreement with the US and the UK AI Safety Institute, as well as other third-party testers in specific domains, to test the models for what are called CBRN risks: chemical, biological, radiological, and nuclear. Which are, you know, we don't think that models pose these risks seriously yet,

but every new model, we want to evaluate to see if we're starting to get close to some of these more dangerous capabilities. So those are the phases. And then it just takes some time to get the model working in terms of inference and launched in the API.

So there are just a lot of steps to actually make a model work. And of course, you know, we're always trying to make the processes as streamlined as possible, right? We want our safety testing to be rigorous, but we want it to be rigorous and to be automatic,

to happen as fast as it can without compromising on rigor. Same with our pre-training process and our post-training process. So, you know, it's just like building anything else, it's just like building airplanes. You want to make them safe, but you want to make the process streamlined. And I think the creative tension between those is, you know, an important thing in making the models work.

Yeah, rumor on the street, I forget who was saying it, is that Anthropic has really good tooling. So probably a lot of the challenge here is on the software engineering side, to build the tooling, to have, like, an efficient, low-friction interaction with the

infrastructure. You would be surprised how much of the challenge of, you know, building these models comes down to software engineering, performance engineering. From the outside, you might think, oh, we had this Eureka breakthrough, right? You know, this movie with the science, we discovered it, we figured it out.

But I think all things, even, you know, incredible discoveries, they almost always come down to the details, and often super, super boring details. I can't speak to whether we have better tooling than other companies. I mean, I haven't been at those other companies, at least not recently, but it's something we give a lot

of attention to. I don't know if you can say, but from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on the post-training that's been the leap in performance?

Yeah, I think at any given stage we're focused on improving everything at once. Just naturally, there are different teams; each team makes progress in a particular area, in making their particular segment of the relay race better. And it's just natural that when we make a new model, we put all of these things in at once.

So the data you have, like the preference data you get from RLHF, is that applicable? Are there ways to apply it to newer models as they get trained up? Yeah, preference data

from old models sometimes gets used for new models, although of course it performs somewhat better when it's, you know, trained on the new models. Note that we have this constitutional AI method, such that we don't only use preference data. There's also a post-training process where we train the model against itself, and there are new types of post-training the model against itself that are used every day. So it's not just RLHF, it's a bunch of other methods as well. Post-training, I think, you know, is becoming more and more sophisticated.
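As a rough sketch of what "training the model against itself" can mean, here is the supervised critique-and-revise loop from the published constitutional AI recipe. `generate` is a placeholder for whatever text-generation call you have available, and the principle and prompts are illustrative, not Anthropic's actual ones.

```python
# Hedged sketch of the supervised phase of constitutional AI: the model
# critiques and revises its own answers against a written principle, and the
# revised answers become fine-tuning data. `generate` is a placeholder for
# any text-generation call; the principle and prompts are illustrative only.
from typing import Callable, List, Tuple

PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def self_revise(prompt: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    draft = generate(prompt)
    critique = generate(
        f"Principle: {PRINCIPLE}\nPrompt: {prompt}\nResponse: {draft}\n"
        "Critique the response according to the principle."
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return prompt, revision

def build_dataset(prompts: List[str], generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    # (prompt, revised response) pairs are then used for supervised fine-tuning.
    return [self_revise(p, generate) for p in prompts]
```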

Well, what explains the big leap in performance for the new Sonnet 3.5? I mean, at least on the programming side. And maybe this is a good place to talk about benchmarks. What does it mean to get better? Just the number went up,

but, you know, I program, I also love programming, and Claude 3.5 through Cursor is what I use to assist in programming. And there was, at least experientially, anecdotally, it's gotten smarter at programming. So what does it take to get it smarter? We observed that as well.

By the way, there were a couple of very strong engineers here at Anthropic who, all previous code models, both produced by us and produced by all the other companies, hadn't really been useful to them. Maybe, maybe it's useful to a beginner, it's not useful to me. But Sonnet 3.5, the original one, for the first time, they said, oh my god, this helped me with something that would have taken me hours to do.

This is the first model that's actually saved me time. So again, the waterline is rising. And then I think, you know, the new Sonnet has been even better. In terms of what it takes,

I mean, I'll just say it's been across the board. It's in the pre-training, it's in the post-training, it's in various evaluations that we do. We've observed this as well. And if we go into the details of the benchmark, so SWE-bench is basically, since you're a programmer, you know, you'll be familiar with, like, pull requests, and, you know, a pull request is, like, a sort of atomic unit of work.

You know, you could say, I'm implementing one thing. And so SWE-bench actually gives you kind of a real-world situation where the codebase is in a current state, and I'm trying to implement something that's described in language. We have internal benchmarks where we measure the same thing, and you say, just give the model free rein to, like, you know, do anything, run anything, edit anything. How well is it able to complete these tasks? And it's that benchmark that's gone from it can do it 3% of the time to it can do it about 50% of the time. So I actually do believe that, you can game benchmarks, but I think if we get to 100% on that benchmark in a way that isn't kind of over-trained or gamed for that particular benchmark, it probably represents a real and serious increase in kind of programming ability. And I would suspect that if we can get to 90, 95%, it will represent ability to autonomously do a significant fraction of software engineering tasks.
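For a sense of how a benchmark like this gets scored, here is a hedged sketch of the evaluation loop: for each task, the model proposes a patch against the repo at a fixed commit, the patch is applied, the task's tests are run, and the resolved fraction is reported. The helper functions are placeholders, not the real SWE-bench harness.

```python
# Hedged sketch of a SWE-bench-style evaluation loop. checkout_repo,
# model_propose_patch, apply_patch, and run_tests are placeholders for the
# real harness plumbing; only the scoring logic is shown.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    repo_url: str
    base_commit: str
    issue_text: str    # natural-language description of the change to make
    test_command: str  # tests that pass only if the issue is resolved

def evaluate(tasks: List[Task],
             checkout_repo: Callable[[str, str], str],
             model_propose_patch: Callable[[str, str], str],
             apply_patch: Callable[[str, str], bool],
             run_tests: Callable[[str, str], bool]) -> float:
    resolved = 0
    for task in tasks:
        workdir = checkout_repo(task.repo_url, task.base_commit)
        patch = model_propose_patch(workdir, task.issue_text)
        if apply_patch(workdir, patch) and run_tests(workdir, task.test_command):
            resolved += 1
    return resolved / len(tasks)  # e.g. ~0.03 early in the year vs ~0.50 now
```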

Well, ridiculous timeline question. When is Claude Opus 3.5 coming out?

Not giving you an exact date, but, you know, as far as we know, the plan is still to have a Claude 3.5 Opus.

Am I gonna get it before GTA 6? Like, like—

Like Duke Nukem Forever? There was some game that was delayed fifteen years. Was that Duke Nukem Forever?

And I think GTA is now just releasing trailers.

It's only been three months since we released Sonnet. Yeah,

it's the incredible pace. It just tells you about the pace.

Yeah, the expectations for when things are going to come out.

So what about 4.0? How do you think about, as these models get bigger and bigger, versioning, and also just versioning in general? Why Sonnet 3.5 updated with the date, why not Sonnet 3.6?

Yeah, naming is actually an interesting challenge here, right? Because I think a year ago, most of the model was pre-training. And so you could start from the beginning and just say, okay, we're going to have models of different sizes, we're going to train them all together, and, you know, we'll have a family with a naming scheme, and then we'll put some new magic into them, and then we'll have the next generation. The trouble starts already when some of them take a lot longer

than others to train, right? That already messes up your timing a little bit. But as you make big improvements in pre-training, then you suddenly notice, oh, I can make a better pre-trained model, and that doesn't take very long to do, but, you know, it clearly has the same, you know, size and shape as previous models. So I think those two together, as well as the timing issues, any kind of scheme you come up with, you know, the reality tends to kind of frustrate that scheme, right? It tends to kind of break out of the scheme. It's not like software,

where you can say, oh, this is, like, you know, 3.7, this is 3.8. You have models with different trade-offs.

You can change some things in your models; you can't change other things.

Some are faster and slower at inference. Some have to be more expensive, some have to be less expensive. And so I think all the companies have struggled with this.

I think we did very, you know, I think we were in a good position in terms of naming when we had Haiku, Sonnet, and Opus. That was great. We're trying to maintain it, but it's not perfect.

So we'll try to get back to the simplicity, but just the nature of the field, I feel like no one's figured out naming. It's somehow a different paradigm from normal software. And so none of the companies have been perfect at it. It's something we struggle with surprisingly much, relative to, you know, how trivial it is compared to the grand science

of training the models. So from the user side, the user experience of the updated Sonnet 3.5 is just different than the previous, June 2024, Sonnet 3.5. It would be nice to come up with some kind of labeling that embodies that, because people talk about Sonnet 3.5, but now there's a different one. And so how do you refer to the previous one and the new one when there's a distinct improvement? It just makes the conversation about it challenging. Yeah, yeah.

I definitely think this question of, there are lots of properties of the models that are not reflected in the benchmarks. I think that's definitely the case, and everyone agrees, and not all of them are capabilities. Some of them are, you know, models can be polite or brusque.

They can be, you know, very reactive, or they can ask you questions. They can have what feels like a warm personality or a cold personality. They can be boring, or they can be very distinctive, like Golden Gate Claude was. And we have a whole team kind of focused on, I think we call it Claude character. Amanda leads that team, and you'll talk to her about that. But it's still a very inexact science.

Often we find that models have properties that we're not aware of. The fact of the matter is that you can, you know, talk to a model ten thousand times and there are some behaviors you might not see, just like with a human, right? I can know someone for a few months and, you know, not know they have a certain skill, or not know that there's a certain side to them. And so I think we just have to get used to this idea. We're always looking for better ways of testing our models, to demonstrate these capabilities and also to decide which are the personality properties we want models to have and which we don't want to have. That itself, the normative question, is also super interesting.

I've got to ask you a question from Reddit.

From Reddit. Oh boy.

You know, there's this, fascinating to me at least, psychological, social phenomenon where people report that Claude has gotten dumber for them over time. And so the question is, do the user complaints about the dumbing down of Claude 3.5 Sonnet hold any water? So are these anecdotal reports a kind of social phenomenon, or are there any cases where Claude would get dumber?

So this actually doesn't apply, this isn't just about Claude. I believe I've seen these complaints for every foundation model produced by a major company. People said this about GPT-4; they said it about GPT-4 Turbo. So, a couple of things. One, the actual weights of the model, right, the actual brain of the model, that does not change unless we introduce a new model.

There are just a number of reasons why it would not make sense practically to randomly substitute in new versions of the model. It's difficult from an inference perspective, and it's actually hard to control all the consequences of changing the weights of the model. Let's say you wanted to fine-tune the model to be like, I don't know, to say "certainly" less, which, you know, an old version used to do. You actually end up changing a hundred things as well.

So we have a whole process for modifying the model. We do a bunch of testing on it. We do a bunch of, like, user testing with early customers.

So we both have never changed the weights of the model without telling anyone, and, certainly in the current setup, it would not make sense to do that. Now, there are a couple of things that we do occasionally do.

One is, sometimes we run A/B tests, but those are typically very close to when a model is being released, and for a very small fraction of time. So, you know, like, the day before the new Sonnet 3.5, I agree we should have a better name, it's clunky to refer to it, there were some comments from people that, like, it's gotten a lot better, and that's because, you know, a fraction were exposed to an A/B test for those one or two days.

The other is that occasionally the system prompt will change, and the system prompt can have some effects, although it's unlikely to dumb down models, it's unlikely to make them dumber. And we've seen that while these two things, which I'm listing to be very complete, happen quite infrequently, the complaints, for us and for other model companies, about the model changing, the model isn't good at this,

the model got more censored, the model was dumbed down, those complaints are constant. And so I don't want to say people are imagining it or anything, but, like, the models are, for the most part, not changing.

If I were to offer a theory, I think it actually relates to one of the things I said before, which is that models are very complex and have many aspects to them. And so often, you know, if I ask the model a question, you know, if I'm like, "do task X" versus "can you do task X,"

the model might respond in different ways. And so there are all kinds of subtle things that you can change about the way you interact with the model that can give you very different results. To be clear, this itself is, like, a failing by us and by the other model providers, that the models are just often sensitive to, like, small changes in wording. It's yet another way in which the science of how these models work is very poorly developed. And so, you know, if I go to sleep one night and I was, like, talking to the model in a certain way, and I, like, slightly change the phrasing of how I talk to the model, you know, I could get different results.

So that's one possible way. The other thing is, man, it's just hard to quantify this stuff. It's hard to quantify this stuff. I think people are very excited by new models when they come out, and then, as time goes on, they become very aware of the limitations. So that may be another effect, but that's all a very long-winded way of saying, for the most part, with some fairly narrow exceptions, the models are not changing.

I think there is a psychological effect. You just start getting used to it; the baseline rises. Like when people first got Wi-Fi on airplanes, it's like, amazing. And then now, when it doesn't work,

it's like, this is such a piece of crap. Exactly. So it's easy

to have the conspiracy theory of, they're making the Wi-Fi slower and slower. This is probably something I'll talk to Amanda much more about.

But another Reddit question: when will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic? So this is kind of reports about the user experience, a different angle on the frustration, and it has to do with the character.

Yeah. So a couple of points on this. The first one is, like, things that people say on Reddit and Twitter, or X, or whatever it is, there's actually a huge distribution shift between, like, the stuff that people complain loudly about on social media and what actually kind of, like, you know, statistically users care about and that drives people to use the models.

Like, people are frustrated with, you know, things like, you know, the model not writing out all the code, or the model just not being as good at code as it could be, even though it's the best model in the world on code. I think the majority of things are about that, but certainly a kind of vocal minority, you know, kind of raise these concerns, right, are frustrated by the model refusing things that it shouldn't refuse, or, like, apologizing too much, or just having these kind of, like, annoying verbal tics. The second caveat, and I just want to say this, like, super clearly, because I think it's, like, some people don't know it, others, like, kind of know it,

but forget it: it is very difficult to control across the board how the models behave. You cannot just reach in there and say, oh, I want the model to, like, apologize less. Like, you can do that. You can include training data that says, like, oh, the model should, like, apologize less.

But then in some other situation, they end up being super rude or overconfident in a way that's misleading people. So there are all these trade-offs. For example, another thing is: there was a period during which models — ours, and I think others as well — were too verbose, right? They would repeat themselves, they would say too much. You can cut down on the verbosity by penalizing the models for just talking for too long. What happens when you do that, if you do it in a crude way, is when the models are coding,

sometimes they'll say, "Rest of the code goes here," right? Because they've learned that that's a way to economize and that's what they've seen. And so that leads the model to be so-called lazy in coding, where they're just like, "Ah, you can finish the rest of it." It's not because we want to save on compute, or because the models are lazy during winter break, or any of the other conspiracy theories that have come up.
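
To make that trade-off concrete, here is a toy sketch — not Anthropic's actual training setup; the scoring function and numbers are invented — of how a crude length penalty can end up rewarding a "lazy" answer over a complete one:

```python
def crude_reward(quality_score: float, response_tokens: int, alpha: float = 0.01) -> float:
    """Hypothetical reward: task quality minus a flat per-token verbosity penalty."""
    return quality_score - alpha * response_tokens

# A full, correct answer: high quality but long.
print(crude_reward(quality_score=1.0, response_tokens=800))  # 1.0 - 8.0 = -7.0
# A "lazy" answer that skips the code body: lower quality, far fewer tokens.
print(crude_reward(quality_score=0.6, response_tokens=100))  # 0.6 - 1.0 = -0.4
# The lazy answer scores higher, so naive optimization pushes toward laziness.
```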

It's actually just very hard to control the behavior of the model, to steer the behavior of the model in all circumstances at once. There's this whack-a-mole aspect where you push on one thing and these other things start to move as well — things you may not even notice or measure. And so one of the reasons that I care so much about grand alignment of these AI systems in the future is that these systems are actually quite unpredictable.

They're actually quite hard to steer and control. And this version we're seeing today — you make one thing better, it makes another thing worse — I think that's a present-day analog of future control problems in AI systems that we can start to study today, right? I think that difficulty in steering the behavior, in making sure that if we push an AI system in one direction, it doesn't push in another direction in some other ways that we didn't want — I think that's kind of an early sign of things to come.

And if we can do a good job of solving this problem — right, like, you ask the model to make and distribute smallpox and it says no, but it's willing to help you in your graduate-level biology class — how do we get both of those things at once? It's hard. It's very easy to go to one side or the other, and it's a multidimensional problem.

And so, you know, I think these questions of shaping the model's personality are very hard. I think we haven't done perfectly on them. I think we've actually done the best of all the AI companies, but we're still so far from perfect.

And I think if we can get this right, if we can control the false positives and false negatives in this very controlled, present-day environment, we'll be much better at doing it for the future, when our worry is: will the models be super autonomous? Will they be able to make very dangerous things? Will they be able to autonomously build whole companies, and are those companies aligned? So I think of this present task as both vexing, but also good practice for the future.

What's the current best way of gathering sort of user feedback — not anecdotal data, but just large-scale data about pain points, or the opposite of pain points, positive things? Is it internal testing? Is it specific group testing, A/B testing?

So typically we'll have internal model bashings, where all of Anthropic — Anthropic is almost a thousand people — people just try and break the model. They try and interact with it in various ways. We have a suite of evals for, you know, does the model refuse in ways that it shouldn't?

I think we even had a "certainly" eval, because, again, at one point the model had this problem where it had this annoying tic where it would respond to a wide range of questions by saying, "Certainly, I can help you with that." "Certainly, I would be happy to do that." "Certainly, this is correct."

And so we had a "certainly" eval, which is just: how often does the model say "certainly"?
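
As an aside, a minimal version of that kind of string-frequency eval might look like the following sketch; the prompts, the stand-in model function, and the phrase are hypothetical placeholders, not Anthropic's actual eval harness:

```python
def phrase_rate(model_generate, prompts: list[str], phrase: str = "certainly") -> float:
    """Fraction of responses that contain the phrase (case-insensitive)."""
    hits = sum(phrase.lower() in model_generate(p).lower() for p in prompts)
    return hits / len(prompts)

# Stand-in "model" for demonstration purposes only.
fake_model = lambda prompt: "Certainly, I can help you with that."
print(phrase_rate(fake_model, ["Fix this bug.", "Summarize this doc."]))  # 1.0
```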

But look, this is just whack-a-mole. Like, what if it switches from "certainly" to "definitely"? So every time we add a new eval — and we're always evaluating for all of the old things — we have hundreds of these evaluations, but we find that there is no substitute for a human interacting with it. And so it's very much like the ordinary product development process.

We have hundreds of people within Anthropic bash the model. Then we do external A/B tests. Sometimes we'll run tests with contractors; we pay contractors to interact with the model.

So you put all of these things together, and it's still not perfect. You still see behaviors that you don't quite want to see, right? You still see the model refusing things that it just doesn't make sense to refuse.

But I think trying to solve this challenge — trying to stop the model from doing genuinely bad things that everyone agrees it shouldn't do — right, everyone agrees that the model shouldn't talk about, I don't know, child abuse material, right? Everyone agrees the model shouldn't do that. But at the same time, that it doesn't refuse in these dumb, stupid ways.

I think drawing that line as finely as possible, approaching perfection, is still a challenge, and we're getting better at it every day, but there's a lot to be solved. And again, I would point to that as an indicator of the challenge ahead in terms of steering much more powerful models.

Do you think Claude 4.0 is ever coming out?

I don't want to commit to any naming scheme, because if I say here, "We're going to have Claude 4 next year," and then we decide we should start over because there's a new type of model — I don't want to commit to it. I would expect in the normal course of business that Claude 4 would come after Claude 3.5, but you never know in this wacky field.

Right. But this sort of idea of scaling is continuing.

Scaling is continuing. There will definitely be more powerful models coming from us than the models that exist today. That is certain. Or, if there aren't, we've deeply failed

as a company. Okay, can you explain the responsible scaling policy and the AI Safety Level standards, the ASL levels?

Yes. As much as I'm excited about the benefits of these models — and we talked about that, talking about Machines of Loving Grace — I'm worried about the risks, and I continue to be worried about the risks. No one should think that Machines of Loving Grace was me saying I'm no longer worried about the risks of these models.

I think they're two sides of the same coin. The power of the models and their ability to solve all these problems in biology, neuroscience, economic development, governance and peace, large parts of the economy — those come with risks as well, right? With great power comes great responsibility, right? The two are paired. Things that are powerful can do good things,

and they can do bad things. I think of those risks as being in several different categories. Perhaps the two biggest risks that I think about — and that's not to say that there aren't risks today that are important —

but when I think of the things that could happen on the grandest scale, one is what I call catastrophic misuse. These are misuses of the models in domains like cyber, bio, radiological, nuclear — things that could harm or even kill thousands, even millions of people if they really, really go wrong. These are the number one priority to prevent.

And here I would just make a simple observation, which is that if I look today at people who have done really bad things in the world, I think actually humanity has been protected by the fact that the overlap between really smart, well-educated people and people who want to do really horrific things has generally been small. Let's say I'm someone who has a PhD in this field, I have a well-paying job — there's so much to lose. Why, even assuming I'm completely evil, which most people are not, would such a person risk their life, risk their legacy, their reputation, to do something truly, truly evil?

If we had a lot more people like that, the world would be a much more dangerous place. And so my worry is that, by being a much more intelligent agent, AI could break that correlation. And so I do have serious worries about that. I believe we can prevent those worries.

But, as a counterpoint to Machines of Loving Grace, I want to say that there are still serious risks. And the second range of risks would be the autonomy risks, which is the idea that models might, on their own — particularly as we give them more agency than they've had in the past, particularly as we give them supervision over wider tasks, like writing whole codebases or someday even effectively operating entire companies — they're on a long enough leash: are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it.

And like I said, there are these early signs that it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do. If you go to one side, you get things that are annoying and useless; if you go to the other side, you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem.

I think this is a science, like the safety of airplanes or the safety of cars or the safety of drugs. I don't think there's any big thing we're missing. I just think we need to get better at controlling these models. And so these are the two risks I'm worried about. And our responsible scaling plan — which, I'll recognize, is a very long-winded answer to your question —

Our responsible scaling plan is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things. So if I were to back up a little bit, I think we have an interesting dilemma with AI systems, where they're not yet powerful enough to present these catastrophes.

I don't know that they'll ever be able to present these catastrophes; it's possible they won't. But the case for worry, the case for risk, is strong enough that we should act now, and they're getting better very, very fast.

I testified in the Senate that we might have serious bio risks within two to three years. That was about a year ago. Things have proceeded apace. So we have this thing where it's surprisingly hard to address these risks, because they're not here today.

They don't exist. They're like ghosts, but they're coming at us so fast, because the models are improving so fast. So how do you deal with something that's not here today, that doesn't exist, but is coming at us very fast?

So the solution we came up with for that, in collaboration with people like the organization METR and Paul Christiano, is: okay, what you need for that are tests to tell you when the risk is getting close. You need an early warning system.

And so every time we have a new model, we test it for its capability to do these CBRN tasks, as well as testing it for how capable it is of doing tasks autonomously on its own. In the latest version of our RSP, which we released in the last month or two, the way we test autonomy risk is the AI model's ability to do aspects of AI research itself — because when the AI models can do AI research, they become kind of truly autonomous.

You know, that threshold is important for a bunch of other reasons. And so what do we then do with these tests? The RSP basically develops what we've called an if-then structure, which is: if the models pass a certain capability, then we impose a certain set of safety and security requirements on them.

So today's models are what's called ASL-2 models. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. So, for example, a chess-playing bot like Deep Blue would be ASL-1.

It's just manifestly the case that you can't use Deep Blue for anything other than chess. It was just designed for chess. No one's going to use it to conduct a masterful cyberattack or run wild and take over the world. ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be learned from looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine — not in a way that can be stitched together, not in a way that, kind of, end to end, is dangerous enough.

So ASL-3 is going to be the point at which the models are helpful enough to enhance the capabilities of non-state actors, right? State actors can already do a lot of these very dangerous and destructive things — unfortunately, to a high level of proficiency. The difference is that non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors and misuse of the model as it's deployed. We'll have to have enhanced filters targeted at these particular areas.

Cyber, bio, nuclear.

Cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more a risk of the model doing bad things itself. ASL-4 is getting to the point where these models could enhance the capability of an already knowledgeable state actor, and/or become the main source of such a risk — like, if you wanted to engage in such a risk, the main way you would do it is through a model.

And then I think ASL-4, on the autonomy side, is some amount of acceleration in AI research capabilities with an AI model. And then ASL-5 is where we would get to the models that are kind of truly capable, that could exceed humanity in their ability to do any of these tasks. And so the point of the if-then structure commitment is basically to say: look, I don't know. I've been working with these models for many years, and I've been worried about risk for many years.

It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say, "This model is risky," and people look and they say, "This is manifestly not dangerous."

Again, it's the delicacy of: the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner to deal with it. And so this if-then structure basically says: look, we don't want to antagonize a bunch of people, we don't want to harm our own ability to have a place in the conversation, by imposing these very onerous burdens on models that are not dangerous today.

So the if-then trigger commitment is basically a way to deal with this. It says: you clamp down hard when you can show the model is dangerous. And, of course, what has to come with that is enough of a buffer threshold that you're not at high risk of missing the danger. It's not a perfect framework. We've had to change it — we came out with a new one just a few weeks ago, and probably, going forward, we might release new ones multiple times a year, because it's hard to get these policies right, technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.
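
To illustrate the if-then shape of such a policy, here is a hypothetical sketch; the eval names, thresholds, and required measures are invented for illustration and are not the actual RSP values or implementation:

```python
# Toy "if-then" capability gating, loosely inspired by the structure described above.
REQUIRED_MEASURES = {
    "ASL-2": {"baseline_security", "misuse_filters"},
    "ASL-3": {"baseline_security", "misuse_filters", "enhanced_security", "cbrn_filters"},
}

def required_asl(eval_scores: dict) -> str:
    """Map capability-eval scores to the safety level they would trigger (toy thresholds)."""
    if eval_scores["cbrn_uplift"] > 0.5 or eval_scores["autonomy"] > 0.5:
        return "ASL-3"
    return "ASL-2"

def can_deploy(eval_scores: dict, measures_in_place: set) -> bool:
    """If the model crosses a capability threshold, then the matching
    safety and security requirements must already be in place."""
    level = required_asl(eval_scores)
    missing = REQUIRED_MEASURES[level] - measures_in_place
    if missing:
        print(f"Blocked: model triggers {level}, missing {sorted(missing)}")
        return False
    return True

# A model showing meaningful CBRN uplift can't ship with only ASL-2 measures in place.
print(can_deploy({"cbrn_uplift": 0.6, "autonomy": 0.2},
                 {"baseline_security", "misuse_filters"}))  # -> False
```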

What do you think the timeline is for ASL-3, where several of the triggers are fired, and what do you think the timeline is for ASL-4?

Yeah, so that is hotly debated within the company. We are working actively to prepare ASL-3 security

measures, as well as ASL-3 deployment measures. I'm not going to go into detail, but we've made a lot of progress on both, and we're prepared to be, I think, ready quite soon.

I would not be surprised at all if we hit ASL-3 next year. There was some concern that we might even hit it this year. That's still possible; that could still happen. It's very hard to say, but I would be very, very surprised if it was, like, 2030.

I think it's much sooner than that. So there's a protocol for detecting it, the if-then, and then there are protocols for how to respond to it. Yes. How difficult is the latter?

Yeah, I think ASL-3 is primarily about security, and about filters on the model relating to a very narrow set of areas when we deploy the model, because at ASL-3 the model isn't autonomous yet. And so you don't have to worry about the model itself behaving in a bad way even when it's deployed internally. So I think the ASL-3 measures are — I won't say straightforward; they're rigorous, but they're easier to reason about. I think once we get to ASL-4, we start to have worries about the models being smart enough that they might sandbag tests, they might not tell the truth about tests. We had some results come out about sleeper agents, and there was a more recent paper about whether the models can mislead attempts at evaluation, sandbag their own abilities — present themselves as being less capable than they are.

And so I think with ASL-4, there's going to be an important component of using other things than just interacting with the models — for example, interpretability or hidden chains of thought — where you have to look inside the model and verify, via some other mechanism that is not as easily corrupted as what the model says, that the model indeed has some property. So we're still working on ASL-4. One of the properties of the RSP is that we don't specify ASL-4 until we've hit ASL-3. And I think that's proven to be a wise decision, because even with ASL-3, again, it's hard to know this stuff in detail, and we want to take as much time as we can possibly take to get these things right.

So for ASL-3, the bad actor would be the

humans, yes.

And so for ASL-4, there's

a little bit more worry about the model itself — so, deception.

And that's where mechanistic interpretability comes into play. And hopefully the techniques used for that are not made accessible to the model.

Yeah, I mean, of course you can hook up the mechanistic interpretability to the model itself, but then you've kind of lost it as a reliable indicator of the model's state. There are a bunch of exotic ways you can think of that it might also not be reliable — like if the model gets smart enough that it can jump computers and read the code where you're looking at its internal state.

We've thought about some of those. I think they're exotic enough that there are ways to render them unlikely. But yeah, generally, you want to preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.

See, I think as these models become better and better at conversation and become smarter, social engineering becomes a threat too, because they can start being very convincing to the engineers inside companies.

Oh yeah, yeah. We've seen lots of examples of demagoguery in our life from humans, and there's a concern that models could do that as well.

One of the ways that Claude is becoming more and more powerful is that it's now able to do some agentic stuff — computer use. There's also the analysis tool within the sandbox of claude.ai itself. But let's talk about computer use. That seems to me super exciting: you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So can you explain how that works

and where that's headed? Yeah, it's actually relatively simple. So Claude has had, for a long time — since the original Claude 3 back in March — the ability to analyze images and respond to them with text. The only new thing we added is that those images can be screenshots of a computer.

And in response, we trained the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take action. And it turns out that, with actually not all that much additional training, the models can get quite good at that task. It's a good example of generalization. People sometimes say, if you get to low Earth orbit, you're halfway to anywhere, right? Because of how much it takes to escape the gravity well.

If you have a strong pre-trained model, I feel like you're halfway to anywhere in terms of the intelligence space. And so actually, it didn't take all that much to get Claude to do this. And you can just set that in a loop: give the model a screenshot, it tells you what to click on, give it the next screenshot, it tells you what to click on, and that turns into a full, kind of almost 3D-video interaction of the model. And it's able to do all of these tasks, right?
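
A minimal sketch of that loop might look like the following; the helper functions and the model call are hypothetical stubs so the structure is runnable end to end — this is not the actual Anthropic SDK:

```python
# Hypothetical stand-ins for OS-level helpers and the model call.
def take_screenshot() -> bytes:
    return b"<png bytes of the current screen>"

def click(x: int, y: int) -> None:
    print(f"click at ({x}, {y})")

def press_key(key: str) -> None:
    print(f"press {key!r}")

class StubModel:
    """Pretends to be the model: proposes one click, then says it is done."""
    def __init__(self):
        self.step = 0
    def propose_action(self, task: str, screenshot: bytes) -> dict:
        self.step += 1
        return {"type": "click", "x": 412, "y": 233} if self.step == 1 else {"type": "done"}

def run_computer_use(model, task: str, max_steps: int = 20) -> None:
    """Loop: show the model the screen, apply the single action it proposes, repeat."""
    for _ in range(max_steps):
        action = model.propose_action(task, take_screenshot())
        if action["type"] == "done":
            break
        if action["type"] == "click":
            click(action["x"], action["y"])
        elif action["type"] == "key":
            press_key(action["key"])
        # Guardrails matter here: bound the steps, restrict reachable apps/sites,
        # and require confirmation for risky actions.

run_computer_use(StubModel(), "Fill out the spreadsheet")
```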

We showed these demos where it's able to fill out spreadsheets, it's able to interact with a website, it's able to open all kinds of programs on different operating systems — Windows, Linux, Mac. I think all of that is very exciting. I will say, while in theory there's nothing you could do there that you couldn't have done through just giving the model the APIs to drive the computer screen, this really lowers the barrier. And there are a lot of folks who either aren't in a position to interact with those APIs, or it takes them a long time to do it.

The screen is just a universal interface that's a lot easier to interact with. And so I expect, over time, this is going to lower a bunch of barriers. Now, honestly, the current model —

it leaves a lot still to be desired, and we were honest about that in the blog, right? It makes mistakes, it misclicks. And we were careful to warn people: hey, you can't just leave this thing to run on your computer for minutes and minutes. You've got to give this thing boundaries and guardrails. And I think that's one of the reasons we released it first in an API form, rather than just handing it to the consumer and giving it control of their computer. But I definitely feel that it's important to get these capabilities out there.

As models get more powerful, we're going to have to grapple with: how do we use these capabilities safely? How do we prevent them from being abused? And I think releasing the model while the capabilities are still limited is very helpful in terms of doing that. I think since it's been released, a number of customers —

I think Replit was maybe one of the quickest to deploy things — have made use of it in various ways. People have hooked up demos for Windows desktops, Macs, Linux machines. So yeah, it's been very exciting. I think, as with anything else, it comes with new exciting abilities,

and then, with those new exciting abilities, we have to think about how to make the model safe, reliable, do what humans want it to do. I mean, it's the same story for everything, right? Same thing, it's that same tension. But the

possibilities of use cases here — just the range is incredible. So how much, to make it work really well in the future, how much do you have to specially go beyond what the pre-trained model is doing? Do more post-training, RLHF or supervised fine-tuning or synthetic data?

I think, speaking at a high level, it's our intention to keep investing a lot in making the model better. Like, I think if we look at some of the benchmarks where previous models could do it six percent of the time, and now our model can do it 14 or 22 percent of the time —

yeah, we want to get up to the human-level reliability of 80, 90 percent, just like anywhere else, right? We're on the same curve that we're on with SWE-bench, where I think — I would guess — a year from now, the models can do this very, very reliably. But you've got to start somewhere.

So you think it's possible to get to the human level, 90 percent, basically doing the same thing you're doing now,

or does it have to be special for computer use? I mean, it depends what you mean by "special" and "in general." But I generally think the same kinds of techniques that we've been using to train the current model — I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for image input, for voice — I expect the same techniques will scale here as they have everywhere else.

But this is giving the power of action to Claude, and so it could do a lot of really powerful things, but it could do a lot of damage also.

Yeah, and we've been very aware of that. Look, my view actually is that computer use isn't a fundamentally new capability the way the CBRN or autonomy capabilities are. It's more like it kind of opens the aperture for the model to use and apply its existing abilities.

And so the way we think about it, going back to our RSP, is that nothing this model is doing inherently increases the risk from an RSP perspective. But as the models get more powerful, having this capability may make it scarier. What if it, you know, has the coding capability

to do something at the ASL-3 or ASL-4 level? This may be the thing that kind of unbounds it from doing so. So, going forward, certainly this modality of interaction is something we have tested for, and

we will continue to test for in our RSP going forward. I think it's probably better to learn and explore this capability before the model is super, super capable.

Yeah, there are a lot of interesting attacks, like prompt injection, because now you've widened the aperture, so you can prompt inject through stuff on screen. So if this becomes more and more useful, then there's more and more benefit to injecting stuff into the model when it goes to a certain webpage. It could be harmless stuff like advertisements, or it could be harmful stuff.

Right. Yeah, I mean, we've thought a lot about things like spam, CAPTCHA, mass campaigns. I'll tell you a secret: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see? Scams. Just petty scams. It's like a thing as old as time, people scamming each other. It's this thing that's as old as time. And it's just, every new technology, you've got

to deal with it. It's almost funny to say, but it's true: bots and spam in general are the thing that gets more and more intelligent. Yes,

it's just hard. Like I said, there are a lot of petty criminals in the world, and every new technology is like a new way for petty criminals to do something stupid and malicious.

Are there any ideas about sandboxing it? Like, how difficult is the sandboxing task?

Yeah, we sandbox during training. So, for example, during training, we didn't expose the model to the internet. It's really a bad idea during training, because the model can be changing its policy, it can be changing what it's doing, and that's having an effect in the real world. In terms of actually deploying the model, it kind of depends on the application.

Like, sometimes you want the model to do something in the real world. But of course, you can always put guardrails on the outside, right? You can say, "Okay, well, this model's not going to move data from — this model's not going to move any files from my computer or my web server to anywhere else."

Now, when you talk about sandboxing, again, when we get to ASL-4, none of these precautions are going to make sense there, right? When you talk about ASL-4, there's a theoretical worry the model could be smart enough to break out of any box. And so there we need to think about mechanistic interpretability. If we're going to have a sandbox, it would need to be a mathematically provable sandbox. But, you know, that's a whole different world than what we're

dealing with, with the models today. Yeah, the science of building a box from which an ASL-4 AI system cannot escape.

I think it's probably not the right approach. I think the right approach — instead of having something unaligned that you're trying to prevent from escaping — I think it's better to just design the model the right way, or have a loop where you look inside the model and you're able to verify properties, and that gives you an opportunity to iterate and actually get it right. I think containing bad models is a much worse solution than having good models.

Let me ask about regulation. What's the role of regulation in keeping AI safe? So, for example, can you describe the California AI regulation bill SB 1047 that was ultimately vetoed by the governor? What are the pros and cons of this bill?

Yeah, we ended up making some suggestions to the bill, and then some of those were adopted. And we felt, I think, quite positively about the bill by the end of that. It did still have some downsides, and, of course, it got vetoed. I think at a high level, some of the key ideas behind the bill are, I would say, similar to ideas behind our RSPs.

And I think it's very important that some jurisdiction — whether it's California, the federal government, and/or other countries and other states — passes some regulation like this, and I can talk through why I think that's so important. So I feel good about our RSP.

It's not perfect. It needs to be iterated on a lot. But it's been a good forcing function for getting the company to take these things very seriously, to put them into product planning, to really make them a central part of work at Anthropic, to make sure that all of the almost a thousand people now at Anthropic understand that this is one of the highest priorities of the company, if not the highest priority. But one, there are still some companies that don't have RSP-like mechanisms. Like OpenAI, Google did adopt these mechanisms a couple of months after Anthropic did,

but there are other companies out there that don't have these mechanisms at all. And so if some companies adopt these mechanisms and others don't, it's really going to create a situation where some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe, if the other two are being unsafe. It creates this negative externality,

and I think the lack of uniformity is not fair to those of us who have put a lot of effort into being very thoughtful about these procedures. The second thing is, I don't think you can trust these companies to adhere to these voluntary plans on their own, right? I like to think that Anthropic will — we do everything we can, and we will. Our RSP is checked by our Long-Term Benefit Trust,

so we do everything we can to adhere to our own RSP. But you hear lots of things about various companies: "Oh, they said they would do X and they didn't. They said they would give this much compute and they didn't. They said they would do this thing and they didn't."

I don't think it makes sense to litigate particular things that companies have done, but I think this broad principle that, if there's nothing watching over them — there's nothing watching over us as an industry — there's no guarantee that we'll do the right thing, and the stakes are very high. And so I think it's important to have a uniform standard that everyone follows, and to make sure, simply, that the industry does what a majority of the industry has already said is important and has already said that they would definitely do.

Some people, you know — I think there's a class of people who are against regulation on principle. I understand where that comes from. If you go to Europe and you see something like GDPR, you see some of the other stuff that they've done — some of it's good, but some of it is really unnecessarily burdensome and, I think it's fair to say, really has slowed innovation.

And so I understand where people are coming from on priors. I understand why people start from that position. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think that those are unusual, and they warrant an unusually strong response.

And so I think it's very important. Again, we need something that everyone can get behind. I think one of the issues with SB 1047, especially the original version of it, was that it had a bunch of the structure of RSPs, but it also had a bunch of stuff that was either clunky or that just would have created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks.

You don't really hear about it on Twitter. You just hear about, kind of, people cheering for any regulation, and then the folks who are against it make up these often quite intellectually dishonest arguments about how it'll make us move away from California — the bill doesn't apply if you're headquartered in California; the bill only applies if you do business in California —

or that it would damage the open-source ecosystem, or that it would cause all of these things. I think those were mostly nonsense, but there are better arguments against regulation. There's one guy, Dean Ball, who's really, I think, a very scholarly analyst, who looks at what happens when regulation is put in place, and the ways that it can kind of get a life of its own, or how it can be poorly designed.

And so our interest has always been — we do think there should be regulation in this space, but we want to be an actor who makes sure that that regulation is something that's surgical, that's targeted at the serious risks, and is something people can actually comply with. Because something I think the advocates of regulation don't understand as well as they could is: if we get something in place that's poorly targeted, that wastes a bunch of people's time, what's going to happen is people are going to say, "See, these safety risks — this is nonsense.

I just had to hire ten lawyers to fill out all these forms. I had to run all of these tests for something that was clearly not dangerous." And after six months of that, there will be a groundswell, and we'll end up with a durable consensus against regulation. And so I think the worst enemy of those who want real accountability is badly designed regulation.

We need to actually get it right. And if there's one thing I could say to the advocates, it would be that I want them to understand this dynamic better. We need to be really careful, and we need to talk to people who actually have experience seeing how regulations play out in practice — and the people who have seen that understand to be very careful. If this were some lesser issue, I might be against regulation at all. But what I want the opponents to understand is that the underlying issues are actually serious. They're not something that I or the other companies are just making up because of regulatory capture.

They're not sci-fi fantasies. They're not any of these things. Every time we have a new model, every few months, we measure the behavior of these models, and they're getting better and better at these concerning tasks, just as they are getting better and better at good, valuable, economically useful tasks.

And so I would just love it if — I think SB 1047 was very polarizing — I would love it if some of the most reasonable opponents and some of the most reasonable proponents would sit down together. And, you know, I think the different AI companies — Anthropic was the only AI company that expressed positivity in a very detailed way. I think Elon tweeted briefly something positive. But some of the big ones, like Google, OpenAI, Meta, Microsoft, were pretty stridently against.

So I would really like it if some of the key stakeholders, some of the most thoughtful proponents and some of the most thoughtful opponents, would sit down and say: how do we solve this problem in a way that the proponents feel brings a real reduction in risk, and that the opponents feel is not hampering the industry, or hampering innovation, any more than it needs to? And I think, for whatever reason, things got too polarized, and those two groups didn't get to sit down in the way that they should. And I feel urgency. I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm going to be worried. I'm not worried yet, because, again, the risks aren't here yet. But I think time is running short.

come up with something surgically he said.

yeah, yeah, yeah. exactly. And and we need to get, we need to get away from this, this, this intense pro safety verses intense anti regulatory rta c, right? It's turned into these these flame wars on twitter and nothing google na come with that.

So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You've had several years of experience at OpenAI. What's your story

and history there? Yeah, so I was at OpenAI for roughly five years. For the last — I think it was — couple of years, I was vice president of research there. Probably myself and Ilya Sutskever were the ones who really kind of set the research direction around 2016 or 2017.

I first started to really believe in — or at least confirm my belief in — the scaling hypothesis when Ilya famously said to me, "The thing you need to understand about these models is they just want to learn. The models just want to learn." And again, sometimes there are these one-sentences, these Zen koans, that you hear, and you're like, that explains everything, that explains like a thousand things that I've seen. And ever after, I had this visualization

in my head of: you optimize the models in the right way, you point the models in the right way — they just want to learn. They just want to solve the problem, regardless

of what the problem is.

So get out of their way, basically. Don't impose your own ideas about how they should learn. Kind of the same thing as Rich Sutton put out in "The Bitter Lesson," or Gwern put out in the scaling hypothesis. Yeah, I think generally the dynamic was: I got this kind of inspiration from Ilya and from other folks — like Alec Radford, who did the original GPT-1 — and then ran really hard with it, me and my collaborators, on GPT-2, GPT-3, RL from human feedback, which was an attempt to kind of deal with the early safety, and things like debate and amplification, heavy on interpretability. So again, the combination of safety plus scaling. Probably 2018, 2019, 2020 — those were kind of the years when myself and my collaborators, probably many of whom became co-founders of Anthropic, really had a vision and drove the direction.

Why did you leave? Why did I leave? Yeah.

So look, I mean, let me put it this way — and I think it ties to the race to the top — which is: in my time at OpenAI, what I'd come to see, as I'd come to appreciate the scaling hypothesis, and as I'd come to appreciate kind of the importance of safety along with the scaling hypothesis — the first one, I think, OpenAI was getting on board with; the second one, in a way, had always been part of OpenAI's messaging. But over the many years of the time that I spent there, I think I had a particular vision of how we should handle these things, how we should be brought out into the world, the kind of principles that the organization should have. And look, I mean, there were like many, many discussions about, you know, should the company do this, should the company do that? Like, there's a bunch of misinformation out there.

People say we left because we didn't like the deal with Microsoft. False. Although there was a lot of discussion, a lot of questions about exactly how we do the deal with Microsoft.

"We left because we didn't like commercialization." That's not true. We built GPT-3, which was the model that was commercialized. I was involved in commercialization.

It's more, again, about how do you do it? Like, civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and in individuals? How do we get from here to there, and how do we have a real vision for how to get it right?

How can safety not just be something we say because it helps with... And, you know, I think at the end of the day, if you have a vision for that, forget about anyone else's vision. I don't want to talk about anyone else's vision. If you have a vision for how to do it, you should go off and you should do that vision. It is incredibly unproductive to try and argue with someone else's vision.

You might think they're not doing it the right way. You might think they're dishonest. Who knows? Maybe you're right, maybe you're not. But what you should do is take some people you trust and go off together and make your vision happen.

And if your vision is compelling, if you can make it appeal to people — some combination of ethically, in the market — if you can make a company that's a place people want to join, that engages in practices that people think are reasonable, while managing to maintain its position in the ecosystem at the same time — if you do that, people will copy it. And the fact that you were doing it, especially the fact that you're doing it better than they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with them.

I just don't know how to be any more specific about it than that, but I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say, "This is our vision. This is how we're going to do things."

"Your choice is: you can ignore us, you can reject what we're doing, or you can start to become more like us." And imitation is the sincerest form of flattery. And that plays out in the behavior of customers, that plays out in the behavior of the public, that plays out in the behavior of where people choose to work. And again, at the end, it's not about one company winning or another company winning.

If we or another company are engaging in some practice that people find genuinely appealing — and I want it to be in substance, not just in appearance; I think researchers are sophisticated, they look at substance — and then other companies start copying that practice, and they win because they copied that practice, that's great.

That's success. That's the race to the top. It doesn't matter who wins in the end, as long as everyone is copying everyone else's good practices, right? One way I think of it is: the thing we're all afraid of is the race to the bottom, right? And in the race to the bottom, it doesn't matter who wins, because we all lose, right?

Like, in the most extreme world, we make this autonomous AI that — you know, the robots enslave us, or whatever, right? I mean, I'm half joking, but that is the most extreme thing that could happen. Then it doesn't matter which company was ahead. If instead you create a race to the top, where people are competing to engage in good practices, then at the end of the day it doesn't matter who ends up winning; it doesn't even matter who started the race to the top.

The point isn't to be virtuous. The point is to get the system into a better equilibrium than it was in before. And individual companies can play some role in doing this. Individual companies can help to start it, can help to accelerate it.

And frankly, I think individuals at other companies have done this as well, right? The individuals who, when we put out an RSP, react by pushing harder to get something similar done at other companies. Sometimes other companies do something and we're like, "Oh, that's a good practice."

"We think that's good, we should adopt it too." The only difference is, I think we try to be more forward-leaning. We try to adopt more of these practices first and adopt them more quickly when others invent them. But I think this dynamic is what we should be pointing at,

and I think that abstracts away the question of which company's winning, who doesn't trust who. I think all these questions of drama are profoundly uninteresting, and the thing that matters is the ecosystem that we all operate in and how to make that ecosystem better, because that constrains all the players.

And so Anthropic is this kind of clean experiment, built on a foundation of what, concretely, AI safety should look like.

Well, look, I'm sure we've made plenty of mistakes along the way. The perfect organization doesn't exist. It has to deal with the imperfection of a thousand employees. It has to deal with the imperfection of our leaders, including me.

It has to deal with the imperfection of the people we've put in place to oversee the imperfection of the leaders, like the board and the Long-Term Benefit Trust. It's all a set of imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved. That's what you sign up for. That's what it will always be.

But imperfect doesn't mean you just give up. There's better and there's worse. And hopefully we can do well enough that we can begin to build some practices that the whole industry engages in.

And then, you know, my guess is that multiple of these companies will be successful. Anthropic will be successful. These other companies, like ones I've been at in the past, will also be successful, and some will be more successful than others. That's less important than, again, that we align the incentives of the industry. And that happens partly through the race to the top, partly through things like RSPs, partly through, again, selected, surgical regulation.

You said talent density beats talent mass. So can you explain that? And can you talk about what it takes to build a great team of AI researchers and engineers?

This is one of these statements that's more true every month. Every month I see this statement as more true than I did the month before. So, if I were to do a thought experiment: say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company. Or you can have a team of 1,000 people, where 200 people are super smart, super aligned with the mission, and then, let's say, the other 800 people are — let's say you pick 800 random big-tech employees. Which would you rather have, right?

The talent mass is greater in the group of a thousand people, right? You even have a larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is just that if, every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything, right? That sets the tone for: everyone is super inspired to work at the same place, everyone trusts everyone else. If you have a thousand or ten thousand people, and things have really regressed, right — you're not able to do selection and you're choosing random people —

what happens then is you need to put a lot of processes and a lot of guardrails in place, just because people don't fully trust each other. You have to adjudicate political battles. There are so many things that slow down your ability to operate. And so we're nearly a thousand people, and we've tried to make it so that as large a fraction of those thousand people as possible are super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months. We grew from 300 to 800, I believe, in the first seven, eight months of the year.

And now we've slowed down. In the last three months, we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers, but I think there's an inflection point around a thousand, and we want to be much more careful how we grow. Early on, and now as well, we've hired a lot of physicists — theoretical physicists can learn things really fast.

Even more recently, as we've continued to hire, we've really had a high bar on both the research side and the software engineering side. We've hired a lot of senior people, including folks who used to be at other companies in this space, and we've just continued to be very selective. It's very easy to go from a hundred to a thousand, and a thousand to ten thousand, without paying attention to making sure everyone has a unified purpose.

It's so powerful. If your company consists of a lot of different fiefdoms that all want to do their own thing and are all optimizing for their own thing, it's very hard to get anything done. But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower. That in itself, I think, can overcome almost every other disadvantage.

And it's the Steve Jobs thing: A players want to look around and see other A players, is another way of saying it. I don't know what it is about human nature, but it is demotivating to see people who are not obsessively driving towards a singular mission. And it is, on the flip side of that, super motivating to see that. It's interesting. What does it take to be a great AI researcher or engineer, from everything you've seen, from working with so many amazing

people? Yeah, I think the number one quality, especially on the research side, but really both, is open-mindedness. Sounds easy to be open-minded, right? You're just like, "I'm open to anything." But, you know, if I think about my own early history in the scaling hypothesis, I was seeing the same data others were seeing.

I don't think I was a better programmer, or better at coming up with research ideas, than any of the hundreds of people that I worked with. In some ways, I was worse. Like, I've never been that precise a programmer — you know, finding the bug, writing the GPU kernels. I can point you to a hundred people here who are better at that than I am.

But the thing that I think I did have that was different was that I was just willing to look at something with new eyes, right? People said, "Oh, we don't have the right algorithms yet. We haven't come up with the right way to do things."

And I was just like, "Oh, I don't know. Like, this neural net has, like, 30 million parameters. What if we gave it 50 million instead? Let's plot some graphs." That basic scientific mindset of, "Oh man, I see some variable that I could change."

"What happens when it changes? Let's try these different things and create a graph." Even this was, like, the simplest thing in the world, right? Change the number of — you know, this wasn't PhD-level experimental design. This was simple and stupid. Anyone could have done this if you'd just told them that it was important.

It's also not hard to understand. You didn't need to be brilliant to come up with this. But you put the two things together, and some tiny number of people, some single-digit number of people, have driven forward the whole field by realizing this. And it's often like that. If you look back at the discoveries in history, they're often like that. And so this open-mindedness and this willingness to see with new eyes — which often comes from being newer to the field; often experience is a disadvantage for this — that is the most important thing. It's very hard to look for and test for, but I think it's the most important thing, because when you find something, some really new way of thinking about things, when you have the initiative to do that, it's absolutely

transformative. And also to be able to do that kind of rapid experimentation, and in the face of that, be open-minded and curious and look at the data with these fresh eyes — what is it actually saying? That applies in mechanistic interpretability.

It's another example of this. Like, some of the early work in mechanistic interpretability was so simple; it's just that no one thought to ask this question before.

You said what it takes to be a great AI researcher. Can we rewind the clock back? What advice would you give to people interested in AI? They're young, looking forward — how can I make an impact on the world?

I think my number one piece of advice to just start playing with the models, um this was actually I wear a little this seems like obvious advice. Now I think three years ago, IT wasn't obvious and people started by, oh, let me read the latest for ordinary paper. Let me you know, let me let me kind of um no, I mean that was really that was really I mean, you should do that as well.

But now, you know, with wider availability of models and APIs, people are doing this more. But I think just experiential knowledge, um, these models are new artifacts that no one really understands, and so getting experience playing with them matters. I would also say, again, in line with the do-something-new, think-in-some-new-direction idea, there are all these things that haven't been explored. Like, for example, mechanistic interpretability is still very new.

It's probably better to work on that than it is to work on new model architectures, because, you know, it's more popular than it was before. There are probably like a hundred people working on it, but there aren't like ten thousand people working on it. And it's this fertile area for study. Like, you know, there's so much low-hanging fruit.

You can just walk by and pick things. And for whatever reason, people are not interested in it enough. I think there are some things around long-horizon learning and long-horizon tasks where there's a lot to be done. I think evaluations, we're still very early in our ability to study evaluations, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent systems. Skate where the puck is going is my advice.

And you don't have to be brilliant to think of it. All the things that are going to be exciting in five years, people even mention them as, you know, conventional wisdom, but somehow there's this barrier where people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens, but getting over that barrier is my number one piece of advice.

Let's talk a bit about post-training. So, um, it seems that the modern post-training recipe has a little bit of everything: supervised fine-tuning, RLHF, uh, constitutional AI with RLAIF. It's, again, the naming

thing, and then synthetic

data. It seems like a lot of synthetic data, or at least trying to figure out ways to have high-quality synthetic data. So, if there's a secret sauce that makes Anthropic's Claude so incredible, how much of the magic is in the pre-training, and how much of it is in the post-training?

Ah, I mean, so first of all, we're not perfectly able to measure that ourselves. Um, you know, when you see some great character ability, sometimes it's hard to tell whether it came from pre-training or post-training. Uh, we've developed ways to try and distinguish between those two, but they're not perfect.

You know, the second thing I would say is, when there is an advantage, I think we've been pretty good in general at RL, perhaps the best, although I don't know, because I don't see what goes on inside other companies. Usually it isn't, oh my god, we have the secret magic method that others don't have, right? Usually it's like, well, you know, we got better at the infrastructure so we could run it for longer, or we were able to get higher quality data, or we were able to filter our data better, or we were able to combine these methods in practice. It's usually some boring matter of practice and tradecraft. So, you know, when I think about how to do something special in terms of how we train these models, both pre-training but even more post-training, I really think of it a little more, again, as like designing airplanes or cars. It's not just, oh man, I have the blueprint, like maybe that makes you make the next airplane, but there's some cultural tradecraft of how we think about the design process that I think is more important than any particular gizmo we're able to invent.

Okay, let me then ask about specific techniques. First, on RLHF.

What do you think, just zooming out, intuition, almost philosophy: why do you think RLHF works so well?

If I go back to the scaling hypothesis, one of the ways to state the scaling hypothesis is, if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what humans want the model to do, or at least, to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the response. Which is not perfect, from both a safety and a capabilities perspective, in that humans are often not able to perfectly identify what the model wants, and what humans want in the moment may not be what they want in the long term. So there's a lot of subtlety there. But the models are good at producing what the humans, in some shallow sense, want. And it actually turns out that you don't even have to throw that much compute at it, because of another thing, which is this thing about a strong pre-trained model being halfway to anywhere. So once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
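For readers who want the mechanics: the human comparisons described here are usually distilled into a preference (reward) model with a pairwise, Bradley-Terry-style loss. A minimal sketch, with stand-in reward scores rather than Anthropic's actual training code:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: push the reward of the response the human
    preferred above the reward of the response they rejected."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Stand-in scores a reward model might assign to the two sampled responses.
print(pairwise_preference_loss(reward_chosen=1.3, reward_rejected=0.2))  # small loss, ordering is right
print(pairwise_preference_loss(reward_chosen=0.2, reward_rejected=1.3))  # larger loss, ordering is wrong
```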

So do you think RLHF makes the model smarter, or just appear smarter to the humans? I don't

think it makes the model smarter. I also don't think it just makes the model appear smarter. It's like, RLHF bridges the gap between the human and the model, right?

I could have something really smart that can't communicate at all, right? We all know people like this, um, people who are really smart, but, you know, you can't understand what they're saying. So I think RLHF just bridges that gap.

Um, I think it's not the only kind of RL we do. It's not the only kind of RL that will happen in the future. I think RL has the potential to make models smarter, to make them reason better, to make them operate better, to make them develop new skills even.

And perhaps that could be done, you know, even in some cases with human feedback. But the kind of RLHF we do today mostly doesn't do that yet, although we're very quickly starting to be able to.

But it appears to increase, if you look at the metric of helpfulness,

it increases that. It also increases, what was this word in Leopold's essay? Unhobbling, where basically the models are hobbled, and then you do various trainings to them to unhobble them. Yeah, so, you know, I like that word because it's a real word. So I think RLHF unhobbles the models in some ways, and then there are other ways where the model hasn't yet been unhobbled and

needs to be unhobbled. If you can say, in terms of cost, is pre-training the most expensive thing, or is post-training creeping up to that?

At the present moment, it is still the case that, uh, pre-training is the majority of the cost. I don't know what to expect in the future, but I could certainly anticipate a future where post-training is the majority of the cost. In that future

you anticipate, would it be the humans or the AI that's the costly thing for the post-training?

I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute is going to have to rely on some scaled supervision method, like debate or iterated amplification or something like that.

So on that, super interesting. Um, this set of ideas around constitutional AI, can you describe what it is? It was first detailed in the December twenty twenty-two paper.

Yes.

And beyond that, what is it?

Yes. So this was from two years ago. The basic idea is, so let me describe how RLHF works first: you have a model and, uh, you just sample from it twice, it spits out two possible responses, and you ask a human, which response do you like better? Or another variant of it is, rate this response on a scale of one to seven.

So that's hard, because you need to scale up human interaction, and it's very implicit, right? I don't have a sense of what I want the model to do. I just have a sense of what this average of a hundred thousand humans wants the model to do. So, two ideas.

One is, could the AI system itself decide which response is better, right? Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And so then there's this idea: you have a single document, a constitution, if you will, that says these are the principles the model should be using to respond. And the AI system reads those principles, as well as reading the environment and the response, and it says, well, how good did the AI model do? It's basically a form of self-play.

You're kind of training the model against itself. And so the AI gives the response, and then you feed that back into what's called the preference model, which in turn feeds back into the model to make it better. So you have this triangle of the AI, the preference model, and the improvement of the AI itself.
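A minimal sketch of the loop being described, with hypothetical stand-ins for the model, the AI judge, and the constitution (this is an illustration of the general RLAIF pattern, not Anthropic's actual pipeline):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def rlaif_step(generate: Callable[[str], str],
               ai_judge: Callable[[str, str, str, List[str]], bool],
               prompt: str,
               constitution: List[str]) -> PreferencePair:
    """One step of the constitutional-AI-style loop: sample two responses,
    ask an AI judge which better follows the written principles, and emit a
    preference pair that would go on to train the preference model."""
    response_a, response_b = generate(prompt), generate(prompt)
    a_is_better = ai_judge(prompt, response_a, response_b, constitution)
    return PreferencePair(prompt,
                          chosen=response_a if a_is_better else response_b,
                          rejected=response_b if a_is_better else response_a)

# Toy stand-ins so the sketch runs; a real system would call a language model here.
constitution = ["Be helpful.", "Avoid harmful or dangerous content."]
generate = lambda prompt: f"(sampled answer to: {prompt})"
ai_judge = lambda prompt, a, b, principles: len(a) >= len(b)  # placeholder heuristic, not a real judge

pair = rlaif_step(generate, ai_judge, "Explain photosynthesis simply.", constitution)
print(pair.chosen)
```

The preference pairs feed a preference (reward) model, which is then used to improve the generating model, closing the triangle described above.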

And we should say that in the constitution, the set of principles are human-interpretable.

They are, yeah. It's something both humans and the system can read. So it has this nice kind of translatability or symmetry. Um, you know, in practice we both use a model constitution and we use RLHF, and we use some of these other methods. So it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from each data point of RLHF. It also interacts in interesting ways with kind of future reasoning-type RL methods. So it's one tool in the toolkit, but I think it is a very

important tool. But it's a compelling one to us humans, you know, thinking about the founding fathers and the founding of the United States. The natural question is, who and how do you think gets to define the constitution, the set of principles in the constitution?

Yeah, so I'll give like a practical answer and a more abstract answer. I think the practical answer is, look, in practice, models get used by all kinds of different customers, right? And so you can have this idea where the model can have specialized rules or principles. You know, we fine-tune versions of models implicitly.

We've talked about doing it explicitly, having special principles that people can build into the models. So from a practical perspective, the answer can be very different for different people. You know, a customer service agent behaves very differently from a lawyer and obeys different principles. But I think at the base of it, there are specific principles that the models have to obey. I think a lot of them are things that people would agree with. Everyone agrees that, you know, we don't want models to present these CBRN risks. I think we can go a little further and agree with some basic principles of democracy and the rule of law.

Beyond that, it gets, you know, very uncertain. And there, our goal is generally for the models to be more neutral, to not espouse a particular point of view, and, you know, more just be kind of like wise agents or advisors that will help you think things through, and will present possible considerations, but don't express strong or specific opinions.

OpenAI released a model spec, where it kind of clearly, concretely defines some of the goals of the model and gives specific examples, like A, B, how the model should behave. Do you find that interesting? Also I should mention, I believe the brilliant John Schulman was a part of that. He's now at Anthropic.

Uh, do you think this is a useful direction? Might Anthropic release a model spec as well?

Well, yeah, so I think that's a pretty useful direction. Again, it has a lot in common with constitutional AI. So it's

yet another example of, like, a race to the top, right? We have something that we think is, you know, a better and more responsible way of doing things. Um, it's also a competitive advantage.

Then, uh, others kind of discover that it has advantages and start to do that thing. Uh, we then no longer have the competitive advantage, but it's good from the perspective that now everyone has adopted a positive practice that others were not adopting.

And so our response to that is, well, it looks like we need a new competitive advantage in order to keep driving this race upwards. So that's how I generally feel about that. I also think every implementation of these things is different.

So, you know, there are some things in the model spec that were not in constitutional AI, and so we can always adopt those things, or, you know, at least learn from them. So, again, I think this is an example of the positive dynamic that I think we should all want the field to have.

Let's talk about the incredible essay, Machines of Loving Grace. I recommend everybody read it. It's a long one.

It is really long.

Yeah. It's really refreshing to read concrete ideas about what a positive future looks like. And you took sort of a bold stand, because, like, it's very possible that you might be wrong on the

dates or the specifics. And I fully expect to, I will definitely be wrong about all the details. I might be just spectacularly wrong about the whole thing, and people will laugh at me for years. Uh, that's just how the future works. So you provide

a bunch of concrete positive impacts of AI, and how, you know, exactly a superintelligent AI might accelerate the rate of breakthroughs, for example in biology and chemistry, that would then lead to things like we cure most cancers, prevent all infectious disease, double the human lifespan, and so on. So let's talk about the essay first. Can you give a high-level vision of the essay? And what key

takeaways do you want people to have? Yeah, I have spent a lot of time on the topic. I've spent a lot of effort on, like, you know, how do we address the risks of AI, right? How do we think about those risks? Like, we're trying to do a race to the top.

You know, that requires us to build all these capabilities, and the capabilities are cool. But, you know, a big part of what we're trying to do is address the risks. And the justification for that is like, well, you know, all these positive things, the market is this very healthy organism, right? It's going to produce all the positive things.

The risks, I don't know, we might mitigate them. We might not. And so we can have more impact by trying to mitigate the risks.

But I noticed one flaw in that way of thinking, and it's not a change in how seriously I take the risks, it's maybe a change in how I talk about them. It's that, you know, no matter how kind of logical or rational that line of reasoning that I just gave might be, if you only talk about risks, your brain only thinks about risks. And so I think it's actually very important to understand what happens if things do go well. And the whole reason we're trying to prevent these risks is not because we're afraid of technology, not because we want to slow it down.

It's because if we can get to the other side of these risks, right, if we can run the gauntlet successfully, to put it in stark terms, then on the other side of the gauntlet are all these great things. And these things are worth fighting for, and these things can really inspire people. And I imagine, because, look, you have all these investors, all these VCs, all these AI companies talking about all the positive benefits of AI. But as you point out, it's weird.

There's actually a dearth of really getting specific about it. There's a lot of, like, random people on Twitter posting these kinds of gleaming cities, and it's just this kind of vibe of grind, accelerate harder, you know, this very aggressive ideological thing. But what are you actually excited about? And so I figured, you know, I thought it would be interesting and valuable for someone who's actually coming from the risk side to try and really make a try at explaining what the benefits are. Both because I think it's something we can all get behind.

And I want people to understand, I want them to really understand that this isn't doomers versus accelerationists. This is that, if you have a true understanding of where things are going with AI, and maybe that's the more important axis, AI is moving fast versus AI is not moving fast, then you really appreciate the benefits. And you really want humanity, our civilization, to seize those benefits. But you also get very serious about anything that could derail them.

So I think the starting point is to talk about what this powerful AI, which is the term you like to use, is.

Most of the world uses AGI, but you don't like the term because it basically has too much baggage, it's become meaningless. It's like we're stuck

with the term. Maybe we're stuck with the term, and my efforts to change it are futile. I'll tell you what, this is like a pointless semantic point, but I keep talking about it, so I'll do it once more. I think it's a little like, let's say it was nineteen ninety-five and Moore's law is making the computers faster.

And for some reason, there had been this verbal tick that everyone was like, well, someday we're going to have, like, supercomputers. And supercomputers are going to be able to do all these things. Like, you know, once we have supercomputers, we'll be able to do sequencing, we'll be able to do other things. And so, like, one, it's true, the computers are getting faster.

And as they get faster, they're going to be able to do all these great things. But there's no discrete point at which you had a supercomputer and previous computers were not. Like, supercomputer is a term we use, but it's a vague term to just describe computers that are faster than what we have today.

There's no point at which you pass a threshold and say, oh my god, we're doing a totally new type of computation. And so I feel that way about AGI. There's just a smooth exponential. And if by AGI you mean AI is getting better and better, and gradually it's going to do more and more of what humans do until it's going to be smarter than humans, and then it's going to get smarter even from there, then yes, I believe in AGI. If AGI is some discrete or separate thing, which is the way people often talk about it, then it's kind of a meaningless buzzword.

And to me it's just sort of a platonic form of a powerful AI, exactly how you define it. And you define it very nicely. So on the intelligence axis, it's just pure intelligence: it's smarter than a Nobel Prize winner, as you describe, across most relevant disciplines.

Okay, that's just intelligence. So it's both in creativity and being able to generate new ideas, all that kind of stuff, in every discipline, a Nobel Prize winner, okay, in their prime. It can use every modality.

So this is kind of self-explanatory, but it can operate across all the modalities of the world. Uh, it can go off for many hours, days, and weeks to do tasks, and do its own sort of detailed planning, and only ask you for help when it's needed. Uh, it can use, this is actually kind of interesting,

I think in the essay you said, I mean, again, it's a bet that it's not going to be embodied, but it can control embodied tools. So it can control tools, robots, laboratory equipment. The resources used to train it can then be repurposed to run millions of copies of it, and each of those copies would be independent, they can do their own independent work. So they can do

the cloning of the intelligence. Yeah, yes. I mean, you might imagine from outside the field that, like, only one of these was made. But the truth is that the scale-up is very quick.

Like, we do this today. We make a model, and then we deploy thousands, maybe tens of thousands of instances of it. I think by the time, you know, certainly within two to three years, whether we have these super powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions of these, and they'll be, you know, faster than humans. And so if your picture is, oh, we'll have one and it'll take a while to make them, my point there was, no, actually you have millions of

them right away. And in general, they can learn and act, uh, ten to a hundred times faster than humans. So that's a really nice definition of powerful AI. Okay, so that. But you also write that, clearly, such an entity would be capable of solving very difficult problems very fast, but it is not trivial to figure out how fast. Two extreme positions both seem false to me.

So the singularity is on one extreme, and the opposite on the other extreme. Can you describe each of the extremes? Yeah,

so, yeah, let's describe the extremes. So one extreme would be, well, look, um, you know, if we look at kind of evolutionary history, there was this big acceleration, where, you know, for hundreds of thousands of years we just had, like, single-cell organisms, and then we had mammals, and we had apes, and then that quickly turned to humans.

Humans quickly built industrial civilization. And so this is going to keep speeding up, and there's no ceiling at the human level. Once models get much, much smarter than humans, they'll get really good at building the next models.

And, you know, if you write down a simple differential equation, this is an exponential. And so what's going to happen is that, uh, models will build faster models, those models will build faster models, and those models will build, you know, nanobots that can, like, take over the world and produce much more energy than you could produce otherwise.

And so if you just kind of solve this abstract differential equation, then, like, five days after we build the first AI that's more powerful than humans, you know, the world will be filled with these AIs, and every possible technology that could be invented will be invented. I'm caricaturing this a little bit, but I think that's one extreme. And the reason that I think that's not the case is, one, I think they just neglect, like, the laws of physics. It's only possible to do things so fast in the physical world. Some of those loops go through producing faster hardware.
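The "simple differential equation" being caricatured here is essentially one line; writing it out makes clear what the rebuttal that follows is pushing back on (C is a loose stand-in for "capability" and k an arbitrary constant, purely for illustration):

```latex
\frac{dC}{dt} = kC \quad\Longrightarrow\quad C(t) = C_0\, e^{kt}
```

The objections that follow amount to saying the real world adds terms this equation leaves out: hardware lead times, experiment time, complexity, and institutional friction.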

It takes a long time to produce faster hardware. Things take a long time. There is this issue of complexity. I think no matter how smart you are, like, you know, people talk about, oh, we can make models of biological systems.

That'll do everything the biological systems do. I think computational modeling can do a lot. I did a lot of computational modeling when I worked in biology, but there are a lot of things that you can't predict. They're complex enough that just iterating, just running the experiment, is going to beat any modeling, no matter how smart the system doing the modeling is.

Or even if it's not interacting with the physical world, just the modeling is going to be

hard. Yeah, I think the modeling is going to be hard, and getting the modeling to match the physical world is

going to be hard. Yeah, yeah, yeah.

But, you know, you just look at even the simplest problems. If I talk about, like, you know, the three-body problem, or simple chaotic prediction, or like predicting the economy, it's really hard to predict the economy two years out. Like, maybe the case is, you know, normal humans can predict what's going to happen in the economy next quarter, or they can't really do that.

Maybe an AI system that's, you know, a zillion times smarter can only predict it out a year or something. So instead, you have this kind of exponential increase in compute and intelligence for a linear increase in the ability to predict. Same with, again, like, you know, biological molecules, molecules interacting. You don't know what's going to happen when you perturb

a complex system. When you perturb a complex system, you can find simple parts in it. If you're smarter, you're better at finding these simple parts. And then I think human institutions, human institutions are just really difficult. It's, you know, it's been hard to get people, I won't give specific examples, but it's been hard to get people to adopt even the technologies that we've developed, even ones where the case for their efficacy is very, very strong.

You know, people have concerns, they think things are conspiracy theories. It's just been very difficult. It's also been very difficult to get very simple things through the regulatory system, right?

And I don't want to disparage anyone who works in the regulatory systems of any technology. There are hard trade-offs they have to deal with, they have to save lives. But the system as a whole, I think, makes some obvious trade-offs

that are very far from maximizing human welfare. And so if we bring AI systems into these human systems, often the level of intelligence may just not be the limiting factor, right? It just may be that it takes a long time to do something.

Now, if the AI system circumvented all governments, if it just said, I'm dictator of the world and I'm going to do whatever, some of these things it could do. Again, for the things that have to do with complexity, I still think a lot of things would take a while. I don't think it helps that the AI systems can produce a lot of energy or go to the moon. Like, some people in comments responded to the essay saying the AI system can produce a lot of energy and smarter AI systems. That's missing the point. That kind of cycle doesn't solve the key problems that I'm talking about here.

So I think a bunch of people missed the point there. But even if it were completely unaligned and, you know, could get around all these human obstacles, it would have trouble. But again, if you want this to be an AI system that doesn't take over the world, that doesn't destroy humanity, then basically, you know, it's going to need to follow basic human laws, right?

You know, if we want to have an actually good world, we're going to have to have an AI system that interacts with humans, not one that kind of creates its own legal system, or disregards all the laws, all of that. So as inefficient as these processes are, you know, we're going to have to deal with them, because there needs to be some popular and democratic legitimacy in how these systems are rolled out. We can't have a small group of people who are developing these systems say, this is what's best for everyone, right?

I think it's wrong, and I think in practice it's not going to work anyway. So you put all those things together, and we're not going to change the world and upload everyone in five minutes. I just don't think, A, I don't think it's going to happen.

And, B, you know, to the extent that it could happen, it's not the way to lead to a good world. So that's on one side. On the other side, there's another set of perspectives, which I have actually in some ways more sympathy for, which is, look, we've seen big productivity increases before, right?

You know, economists are familiar with studying the productivity increases that came from the computer revolution, the internet revolution. And generally, those productivity increases were underwhelming. They were less than you might imagine.

There was a quote from Robert Solow: you see the computer revolution everywhere except the productivity statistics. So why is this the case? People point to the structure of firms, the structure of enterprises, how slow it's been to roll out our existing technology to very poor parts of the world, which I talk about in the essay, right?

How do we get these technologies to the poorest parts of the world that are behind on cell phone technology, computers, medicine, let alone, you know, newfangled AI that hasn't been invented yet? So you could have a perspective that's like, well, this is amazing technically, but it's all a nothing burger. You know, I think Tyler Cowen, who wrote something in response to my essay, has that perspective. I think he thinks the radical change will happen eventually, but he thinks it'll take fifty or a hundred years. And you could have even more skeptical perspectives on the whole thing.

I think there's some truth to it. I just think the timescale is too long, and I can actually see both sides with today's AI. So, you know, a lot of our customers are large enterprises that are used to doing things a certain way.

I've also seen it in talking to governments, right? Those are prototypical, you know, institutions, entities that are slow to change. But the dynamic I see over and over again is, yes, it takes a long time to move the ship, yes, there's a lot of resistance and lack of understanding. But the thing that makes me feel that progress will in the end happen moderately fast, not incredibly fast,

but moderately fast, is that what I find over and over again, in large companies, even in governments, which have been actually surprisingly forward-leaning, is that you find two things that move things forward. One, you find a small fraction of people within a company, within a government, who really see the big picture, who see the whole scaling hypothesis, who understand where AI is going, or at least understand where it's going within their industry. And there are a few people like that within the current U.S.

government who really see the whole picture, and those people see that this is the most important thing in the world, and so they agitate for it. And they alone are not enough to succeed, because they're a small set of people within a large organization. But as the technology starts to roll out, as it succeeds in some places, in the folks who are most willing to adopt it, the specter of competition gives them a wind at their backs, because they can point within their large organization, they can say, look, these other guys are doing this, right? You know, one bank can say, look, this newfangled hedge fund is doing this thing.

They're going to eat our lunch. In the U.S., we can say, we're afraid China is going to get there before we are. Uh, and that combination, the specter of competition, plus a few visionaries within these, you know, within these organizations that in many ways are sclerotic.

You put those two things together, and it actually makes something happen. I mean, it's interesting. It's a balanced fight between the two, because inertia is very powerful. But eventually, over enough time, the innovative approach breaks through. And I've seen that happen.

I've seen the arc of that over and over again, and it's like the barriers are there, the barriers to progress, the complexity, not knowing how to use the model, how to deploy them, are there. And for a bit, it seems like they're going to last forever, like change doesn't happen. But then eventually change happens, and it always comes from a few people.

I felt the same way when I was an advocate of the scaling hypothesis within the AI field itself and others didn't get it. It felt like no one would ever get it. Then I felt like we had a secret almost no one else had, and then a couple of years later, everyone has the secret.

And so I think that's how it's going to go with deployment of AI in the world. The barriers are going to fall apart gradually and then all at once. And so I think this is going to be more, and this is just an instinct, I could easily see how I'm wrong.

I think it's going to be more like five or ten years, as I say in the essay, than it's going to be fifty or a hundred years. I also think it's going to be five or ten years more than it's going to be, you know, five or ten hours, because I've just seen how human systems work. And I think a lot of these people who write down the differential equations, who say AI is going to make more powerful AI, who can't understand how it could possibly be the case that these things won't change so fast, I think they don't understand these things.

So what are your timelines to when we achieve AGI, a.k.a. powerful AI, a.k.a. super useful AI?

Let's start calling it that.

Well, that is a debate, a debate about naming. You know, on pure intelligence, it can be smarter than a Nobel Prize winner in every relevant discipline, and all the things we've said, modalities, it can go and do stuff on its own for days, weeks, and do biology experiments on its own. You know what, let's just stick to biology, because, yeah, you sold me on the whole biology and health section. That's so exciting, just from a scientific perspective. It made me want to

be a biologist. No, no, this was the feeling I had when I was writing it. It's like, this could be such a beautiful future, if we can just make that happen, right?

If we can just get the landmines out of the way and make that happen. There's so much beauty and elegance and moral force behind it, if we can just... And it's something we should all be able to agree on, right? Like, as much as we fight about all these political questions, is this something that could actually bring us together? But you were asking

me to just put numbers on it. So, you know, this is of course

the thing I've been grappling with for many years, and I'm not at all confident. Every time I say twenty twenty-six or twenty twenty-seven, there will be like a zillion people on Twitter who will be like, AI CEO said twenty twenty-six, twenty twenty-seven, and it'll be repeated for like the next two years that this is definitely when I think it's going to happen.

So whoever excerpts these clips will crop out the thing I just said and only say the thing I'm about to say. But I'll say it anyway. So, uh, if you extrapolate the curves that we've had so far, right, if you say, well, I don't know, we're starting to get to, like, PhD level, and last year we were at, uh, undergraduate level.

And the year before, we were at, like, the level of a high school student. Again, you can quibble with at what tasks and for what. We're still missing modalities, but those are being added. Like, computer use was added, like image input was added.

Like image generation has been added. If you just kind of, and this is totally unscientific, but if you just kind of eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by twenty twenty-six or twenty twenty-seven. Again, lots of things could derail it. We could run out of data. You know, we might not be able to scale clusters as much as we want. Like, you know, maybe Taiwan gets blown up or something, and, you know, then we can't produce as many GPUs as we want.

So there are all kinds of things that could derail the whole process. So I don't fully believe the straight-line extrapolation, but if you believe the straight-line extrapolation, you'll get there in twenty twenty-six or twenty twenty-seven. I think the most likely thing is that there's some mild delay relative to that.

I don't know what that delay is, but I think it could happen on schedule. I think there could be a mild delay. I think there are still worlds where it doesn't happen in a hundred years.

The number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. There were a lot more in twenty twenty.

Although my guess, my hunch, at that time was that we'd make it through all those blockers. So, sitting here as someone who has seen most of the blockers cleared out of the way, I kind of suspect, my hunch, my suspicion, is that the rest of them will not block us either. But, you know, look, at the end of the day, I don't want to represent this as a scientific prediction. People call them scaling laws.

That's a misnomer, like Moore's law is a misnomer. Moore's law, scaling laws, they're not laws of the universe. They're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that.
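For reference, the empirical regularity being referred to is usually written as a fitted power law. A sketch of the commonly cited form, where the constants and exponents come from fitting measured losses rather than from theory (L is loss, N parameter count, D dataset size):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```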

So in the essay you extensively describe the compressed twenty-first century, how AI will help set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of ways that I mentioned. So how do you think that happens? What are the early steps it might do?

And by the way, I asked Claude for good questions to ask you, and Claude told me, uh, to ask: what do you think a typical day for a biologist working with AGI would look like in this future? Yeah, yeah. Claude is here. Well, let me start

with your first question, and then I'll answer that one. Claude wants to know what it'll be doing in the future, right, who it might get to be working with, exactly. So, one of the things I went hard on in the essay, let me go back to this idea, because it's really had an impact on me.

This idea that within large organizations and systems, there end up being a few people or a few new ideas that kind of cause things to go in a different direction than they would have before, that kind of disproportionately affect the trajectory. There's a bunch of the same thing going on, right? If you think about the health world, there's, you know, trillions of dollars to pay out Medicare and, you know, other health insurance, and then the NIH is a hundred billion.

And then, if I think of the few things that have revolutionized anything, they could be encapsulated in a small, small fraction of that. And so when I think of, like, where will AI have an impact, I'm like, can AI turn that small fraction into a much larger fraction and raise its quality? And within biology, my experience within biology is that the biggest problem of biology is that you can't see what's going on.

You have very little ability to see what's going on, and even less ability to change it, right? What you have is, from this, you have to infer that there's a bunch of cells, that within each cell is, you know, three billion base pairs of DNA built according to a genetic code. And, you know, there are all these processes that are just going on without any ability of us as, you know, unaugmented humans to affect them. These cells are dividing. Most of the time that's healthy, but sometimes that process goes wrong, and that's cancer. The cells are aging.

Your skin may change color, develop wrinkles as you age. And all of this is determined by these processes, all these proteins being produced, transported to various parts of the cells, binding to each other. And in our initial state of knowledge about biology, we didn't even know these things existed.

We had to invent microscopes to observe the cells. We had to invent more powerful microscopes to see below the level of the cells, to the level of molecules. We had to invent X-ray crystallography to see the DNA.

We had to invent gene sequencing to read the DNA. Now, you know, we had to invent protein folding technology to predict how proteins would fold and how these things bind to each other. Uh, you know, we had to invent various techniques.

For now we can edit the DNA, as of, you know, with CRISPR, as of the last twelve years. So the whole history of biology, a whole big part of the history, is basically our ability to read and understand what's going on, and our ability to reach in and selectively change things. And my view is that there's so much more we can still do there, right? You can do CRISPR, but you do it for your whole body.

Let's say I want to do it for one particular type of cell, and I want the rate of targeting the wrong cell to be very low. That's still a challenge. That's still something people are working on. That's what we might need for gene therapy for certain diseases.

And so the reason I'm saying all of this, and it goes beyond, you know, gene sequencing, to new types of nanomaterials for observing what's going on inside cells, to antibody-drug conjugates. The reason I'm saying all this is that this could be a leverage point for the AI systems, right? The number of such inventions, it's in the mid double digits or something.

You know, mid double digits, maybe low triple digits, over the history of biology. Let's say we have a million of these AIs. Like, you know, can they discover thousands of these, working together? Can they discover thousands of these very quickly? And does that provide a huge lever? Instead of trying to leverage the two trillion a year we spend on, you know, Medicare or whatever, can we leverage the one billion a year that, you know, is spent to discover, but with much higher quality? And so what is it like being a scientist that works with an AI system? The way I think about it actually is, well,

so I think in the early stages, uh, the AIs are going to be like grad students. You're going to give them a project. You're going to say, you know, I'm the experienced biologist, I've set up the lab. The biology professor, or even the grad students themselves, will say, here's what you can do with an AI system: I'd like to study this. And, you know, the AI system has all the tools.

It can, like, look up all the literature to decide what to do. It can look at all the equipment. It can go to a website and say, hey, I'm going to go to Thermo Fisher or, you know, whatever the dominant lab equipment company is today. In my time it was Thermo Fisher. You know, I'm going to order this new equipment to do this. I'm going to run my experiments. I'm going to, you know, write up a report about my experiments.

I'm going to, you know, inspect the images for contamination. I'm going to decide what the next experiment is. I'm going to, like, write some code and run statistical analysis. All the things a grad student would do. There will be a computer with an AI that, like, the professor talks to every once in a while, and it says, this is what you're going to do today.

The AI system comes to it with questions. Um, when it's necessary to run the lab equipment, it may be limited in some ways. It may have to hire a human lab assistant to, you know, do the experiment and explain how to do it. Or it could, you know, use advances in lab automation that are gradually being developed, have been developed over the last decade or so, and will continue to be developed. And so it will look like there's a human professor and a thousand AI grad students. And, you know, if you go to one of these Nobel Prize-winning biologists or so, you'll say, okay, well, you know, you had like fifty grad students, now you have a thousand, and they're smarter than you are, by the way. Then I think at some point it'll flip around, where, you know, the AI systems will be the PIs, will be the leaders, and, you know, they'll be ordering humans or other AI systems around. So I think that's how it will work on the research side.

And they would be the inventors of a CRISPR-type technology?

They would be the inventors of a CRISPR-type technology. And then I think, you know, as I say in the essay, we want to turn, "turn loose" is maybe the wrong term, but we want to harness the AI systems to improve the clinical trial system as well. There's some amount of this that is regulatory, that's a matter of societal decisions, and that'll be harder.

But can we get better at predicting the results of clinical trials? Can we get better at statistical design, so that clinical trials that used to require, you know, five thousand people, and therefore, you know, a hundred million dollars and a year to enroll them, now need five hundred people and two months to enroll them? That's where we should start. And, you know, can we increase the success rate of clinical trials by doing things in animal trials that we used to do in clinical trials, and doing things in simulations that we used to do in animal trials? Again, we won't be able to simulate it all, AI is not God. But, you know, can we shift the curve substantially and radically? So, I don't know, that would be my picture.
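To make the statistical-design point concrete: required enrollment scales roughly with the inverse square of the detectable effect size, so sharper endpoints or better-selected patients shrink trials dramatically. A back-of-the-envelope sketch using the standard normal-approximation formula; the numbers below are illustrative and are not the five-thousand-to-five-hundred figures from the conversation:

```python
from math import ceil
from statistics import NormalDist

def patients_per_arm(effect_size: float, sigma: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate enrollment per arm for a two-arm trial with a continuous
    endpoint: n ≈ 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / effect_size^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / effect_size ** 2)

# A noisy endpoint with a small treatment effect needs a lot of patients...
print(patients_per_arm(effect_size=0.1, sigma=1.0))   # roughly 1570 per arm
# ...while a sharper endpoint or better-selected patients need far fewer.
print(patients_per_arm(effect_size=0.35, sigma=1.0))  # roughly 130 per arm
```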

Doing it in vitro and doing it, I mean, you're still slowed down. It still takes time.

But you can do it much, much faster. Yeah, yeah, yeah. Can we just take it one step at a time, and can that add up to a lot of steps? Even though we still need clinical trials, even though we still need laws, even though the FDA and other organizations will still not be perfect, can we just move everything in a positive direction? And when you add up all those positive directions, do you get everything that was going to happen from here to twenty one hundred instead happening from twenty twenty-seven to twenty thirty-two or something?

Another way that I think the world might be changing with AI, even today, but moving towards this future of powerful, super useful AI, is programming. So, the nature of programming, because it's so intimate to the actual act of building AI, how do you see that changing for us humans?

I think that's going to be one of the areas that changes fastest, for two reasons. One, programming is a skill that's very close to the actual building of the AI. So the farther a skill is from the people who are building the AI, the longer it's going to take to get disrupted by the AI. Like, I truly believe that AI will disrupt agriculture.

Maybe it already has in some ways, but that's just very distant from the folks who are building AI, and so I think it's going to take longer. But programming is the bread and butter of a large fraction of the employees working at Anthropic and at the other companies.

And so it's going to happen fast. The other reason it's going to happen fast with programming is that you close the loop, both when you're training the model and when you're applying the model. The idea that the model can write the code means that the model can then run the code, and then see the results and interpret it back.

And so it really has an ability, unlike hardware, unlike biology, which we just discussed, the model has an ability to close the loop. And so I think those two things are going to lead to the model getting good at programming very fast. As I saw on, you know, typical real-world programming tasks, models have gone from three percent in January of this year to fifty percent in October of this year.

So, you know, we're on that S-curve, right, where it's going to start slowing down soon, because you can only get to one hundred percent. But, uh, I would guess that in another ten months we'll probably get pretty close. We'll be at at least ninety percent. So again, I would guess, you know, I don't know how long it'll take, but I would guess, again, twenty twenty-six, twenty twenty-seven. Twitter people who crop out these numbers and get rid of the caveats, like, I don't know, I don't like you, go away.
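As a back-of-the-envelope illustration of the S-curve extrapolation being described: the three percent and fifty percent data points come from the conversation, while the logistic form and month indexing are assumptions made purely for illustration:

```python
import math

def fit_logistic(t1, y1, t2, y2):
    """Fit y = 1 / (1 + exp(-k (t - t0))) exactly through two points."""
    a1 = math.log(y1 / (1 - y1))   # logit of the first point
    a2 = math.log(y2 / (1 - y2))   # logit of the second point
    k = (a2 - a1) / (t2 - t1)
    t0 = t1 - a1 / k
    return k, t0

def logistic(t, k, t0):
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

# Scores mentioned in the conversation: ~3% in January (month 0), ~50% in October (month 9).
k, t0 = fit_logistic(0.0, 0.03, 9.0, 0.50)

# Extrapolate ten more months out under this assumed curve.
print(f"projected score ten months later: {logistic(19.0, k, t0):.0%}")  # about 98%
```

The saturating form is what makes "you can only get to one hundred percent" explicit: the curve flattens as it approaches the ceiling.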

Uh, I would guess that the kind of tasks that the vast majority of coders do, AI can probably, if we make the task very narrow, like just write code, AI systems will be able to do that. Now that said, I think comparative advantage is powerful. We'll find that when AI can do eighty percent of a coder's job, including most of it that's literally, like, write code with a given spec, we'll find that the remaining parts of the job become more leveraged for humans, right? It'll be more about, like, high-level system design, looking at the app and, like, is it architected well, and the design and UX aspects.

And eventually AI will be able to do those as well, right? That's my vision of the, you know, powerful AI system. But I think for much longer than we might expect, we will see that, uh, the small parts of the job that humans still do will expand to fill their entire job, in order for the overall productivity to go up.

That's something we've seen before. You know, it used to be that writing and editing letters was very difficult, and, like, getting things printed was difficult. Well, as soon as you had word processors and then computers, it became easy to produce work and easy to share it. Then that became instant, and all the focus was on the ideas. So this logic of comparative advantage that expands tiny parts of the tasks into large parts of the tasks and creates new tasks in order to expand productivity, I think that's going to be the case again.

Someday AI will be better at everything, and that logic won't apply. And then we all, humanity, will have to think about how to collectively deal with that. And we're thinking about that every day. You know, that's another one of the grand problems to deal with, aside from misuse and autonomy, and, you know, we should take it very seriously.

But I think in the near term, and maybe even the medium term, like two, three, four years, you know, I expect that humans will continue to have a huge role, and the nature of programming will change. But programming as a role, programming as a job, will not go away. It'll just be less writing things line by line, and it'll be more macroscopic.

And I wonder what the future of IDEs looks like. So, the tooling of interacting with AI systems, this is true for programming, and also probably true for other contexts, like computer use, but maybe domain-specific, like we mentioned biology, it probably needs its own tooling for how to be effective, and then programming needs its own tooling. Is Anthropic going to play in that space of

also developing tooling, potentially? I'm absolutely convinced that powerful IDEs, uh, that there's so much low-hanging fruit to be grabbed there. You know, right now it's just like you talk to the model and it talks back. But look, I mean, IDEs are great at kind of lots of static analysis of, you know, so much is possible with kind of static analysis, like many bugs you can find without even writing the code. Then, you know, IDEs are good for running particular things, organizing your code, measuring coverage of unit tests. There's so much that's been possible with normal IDEs.

Now you add something like, well, the model can now, like, write code and run code. I am absolutely convinced that over the next year or two, even if the quality of the models didn't improve, there would be enormous opportunity to enhance people's productivity by catching a bunch of mistakes, doing a bunch of grunt work for people, and that we haven't even scratched the surface. Anthropic itself, I mean, you can't say, you know, it's hard to say what will happen in the future. Currently, we're not trying to make such IDEs ourselves.

Rather, we're powering the companies like Cursor, like Cognition, or some of the others in the security space, you know, others that I can mention as well, that are building such things themselves on top of our API. Our view has been, let a thousand flowers bloom. We don't internally have the resources to try all these different things.

Let's let our customers try it, and, you know, we'll see who succeeds, and maybe different customers will succeed in different ways. So I also think this is super promising. And, you know, it's not something Anthropic is eager to do, at least right now, compete with all our customers in this space, and maybe

never. Yeah, it's been interesting to watch Cursor try to integrate Claude successfully, because it's actually fascinating how many places it can help the programming

experience. It's not trivial. It's really astounding. I feel like, you know, as a CEO, I don't get to program that much, and I feel like if six months from now I go back, it'll be completely unrecognizable

to me. Exactly. Um, so in this world with super powerful AI, uh, that's increasingly automated, what's the source of meaning for us humans? You know, work is a source of deep meaning for many of us. So where do we find the meaning?

This is something that I've written about a little bit in the essay, although I actually gave it a bit of short shrift, not for any principled reason. But, believe it or not, this essay was originally going to be two or three pages. I was going to talk about it at an all-hands. And the reason I realized it was an important, under-explored topic is that I just kept writing things, and I was just like, oh man, I can't do this justice. And so the thing ballooned to, like, forty or fifty pages. And then when I got to the work and meaning section, I was like, oh man, this isn't going to be a hundred pages. I'm going to have to write a whole other essay about that.

But meaning is actually interesting, because you think about, like, the life that someone lives or something. Like, let's say you were to put me in, I don't know, like, a simulated environment or something, where, like, I have a job and I'm trying to accomplish things, and, I don't know, I do that for sixty years, and then you're like, oh, like, oops, this was actually all a game, right? Does that really kind of rob you of the meaning of the whole thing? You know, like, I still made important choices, including moral choices.

I still sacrificed. I still had to kind of gain all these skills. Or, just as a similar exercise, you know, think back to, like, one of the historical figures who, you know, discovered electromagnetism or relativity or something.

If you told them, well, actually, twenty thousand years ago some alien on this planet discovered this before you did, does that rob the discovery of its meaning? It doesn't really seem like it to me, right? It seems like the process is what matters, and how it shows who you are as a person along the way, and, you know, how you relate to other people, and, like, the decisions that you make along the way. Those are consequential.

You know, I could imagine if we handle things badly in an AI world, we could set things up where people don't have any long-term source of meaning, or many. But that's more a set of choices we make. That's more about the architecture of a society with these powerful models. If we design it badly and for shallow things, then that might happen.

I would also say that, you know, most people's lives today, while admirable in that they work very hard to find meaning in those lives, like, look, you know, we who are privileged, who are developing these technologies, we should have empathy for people, not just here but in the rest of the world, who, you know, spend a lot of their time kind of scraping by to, like, survive. Assuming we can distribute the benefits of this technology everywhere, their lives are going to get a hell of a lot better. And, you know, meaning will be important to them, as it is important to them now.

But, you know, we should not forget the importance of that. And, you know, the idea of meaning as kind of the only important thing is in some ways an artifact of a small subset of people who have been economically fortunate. But, all that said, you know, I think a world is possible with powerful AI that not only has as much meaning for everyone, but that has more meaning for everyone, right? That can allow everyone to see worlds and experiences that it was either possible for no one to see or possible for very few people to experience. So I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more.

I worry about how we make sure that that fair world reaches everyone. When things have gone wrong for humans, they've often gone wrong because humans mistreat other humans. That is maybe, in some ways, even more than the autonomous risk of AI or the question of meaning, that is the thing I worry about most: the concentration of power, the abuse of power, structures like autocracies and dictatorships, where a small number of people exploit a large number of people. I'm very worried

about that and A I increases the amount of power in the world and if you concentrate that power and abuse that power, they can do a measurable .

damage yes it's very frighten is very frightening.

Well, I highly encourage people to read the full essay. It should probably be a book or a sequence of essays, because it does paint a very specific future. I could tell the later sections got shorter and shorter, because you probably started to realize that this was going to be

a very long essay. Yeah, I realized it would be very long. And two, I'm very aware of, and very much try to avoid, just being, I don't know what the term for it is, but one of these people who's kind of overconfident, has an opinion on everything, and says a bunch of stuff where they aren't an expert.

I very much tried to avoid that. But I have to admit, once I got to the biology sections, I wasn't an expert. And so as much as I expressed uncertainty, I probably said a bunch of things that were embarrassing or wrong.

Well, I was excited for the future you painted, and thank you so much for working hard to build that future. And thank you for talking with me.

Thanks for having me. I just hope we can get it right and make it real. And if there's one message I want to send, it's that to get all this stuff right, to make it real, we both need to build the technology, build the companies, the economy around using this technology positively. But we also need to address the risks, because those risks are in our way. They're landmines on the way from here to there, and we have to defuse those landmines if we want to get there.

It's a balance.

Like all things in life.

Like all things. Thank you. Thanks for listening to this conversation with Dario Amodei.

And now, dear friends, here's Amanda Askell. You are a philosopher by training. What questions did you find fascinating through your journey in philosophy at Oxford and NYU, and then switching over to the AI problems at OpenAI and Anthropic?

I think philosophy is actually a really good subject if you are kind of fascinated with everything, because there's a philosophy of everything. So if you do philosophy of mathematics for a while and then decide that you're actually really interested in chemistry, you can do philosophy of chemistry for a while. You can move into ethics or philosophy of politics.

I think towards the end I was really interested in ethics primarily. That was what my PhD was on. It was on a kind of technical area of ethics, which was ethics where worlds contain infinitely many people, strangely, a little bit on the less practical end of ethics. And I think that one of the tricky things about doing a PhD in ethics is that you're thinking a lot about the world, how it could be better, problems.

And you're doing a PhD in philosophy. And I think when I was doing my PhD, I was kind of like, this is really interesting. It's probably one of the most fascinating questions I've ever encountered.

And I love it, but I would rather see if I can have an impact on the world and see if I can do good things. And I think that was around the time that AI was still probably not as widely recognized as it is now. That was around 2017, 2018. I had been following progress, and it seemed like AI was becoming kind of a big deal.

And I was basically just happy to get involved and see if I could help, because I was like, well, if you try to do something impactful and you don't succeed, at least you tried to do the impactful thing, and you can go back to being a scholar. And so, you know, you try, and if it doesn't work out, it doesn't work out. And so then I went into AI policy at that point.

And what does AI policy entail?

At the time, this was more thinking about sort of the political impact and the ramifications of AI. And then I slowly moved into AI evaluation, how we evaluate models, how they compare with human outputs, whether people can tell the difference between AI and human outputs.

And then when I joined Anthropic, I was more interested in doing technical alignment work. Again, just seeing if I could do it, and then being like, if I can't, then, you know, that's fine. I tried. That's sort of the way I lead life, I think.

What was that like, taking the leap from the philosophy of everything into the technical?

I think that sometimes people do this thing that I'm not that keen on, where they'll be like, is this person technical or not? Like, you're either a person who can code and isn't scared of math, or you're not. And I think I'm maybe just more like, I think a lot of people are actually very capable of working in these kinds of areas if they just try. And so I didn't actually find it that bad.

In retrospect, I'm sort of glad I wasn't speaking to people who treated it like a big deal. You know, I've definitely met people who are like, wow, you learned how to code? And I'm like, well, I'm not an amazing engineer. I'm surrounded by amazing engineers. My code's not pretty. But I enjoyed it a lot, and I think that in many ways, at least in the end, I flourished more in the technical areas than I would have in the policy areas.

Politics is messy, and it's harder to find solutions to problems in the space of politics, like definitive, clear, provable, beautiful solutions, as you can with technical problems.

Yeah, I feel like I have kind of one or two sticks that I hit things with, you know. And one of them is arguments, so just trying to work out what the solution to a problem is and then trying to convince people that that is the solution, and being convinced if I'm wrong. And the other one is sort of more empirical.

So just finding results, having a hypothesis, testing it. And I feel like a lot of policy and politics feels like it's layers above that. Somehow I don't think that if I was just like, I have a solution to all these problems, here

it is, written down, if you just want to implement it, that's great, that feels like not how policy works. And so I think that's where I probably just wouldn't have flourished, is my guess.

Sorry to go in that direction, but I think you would be pretty inspiring for people that are quote-unquote non-technical to see the incredible journey you have been on. So what advice would you give to people, and it's a lot of people, who think they're under-qualified, insufficiently technical, to help in AI?

Yeah, I think that depends on what they want to do. And in many ways it's a little bit strange. I thought it's kind of funny that I skilled up technically at a time when, now I look at it and I'm like, models are so good at assisting people with this stuff that it's probably easier now than when I was working on this. So part of me is like, find a project and see if you can actually just carry it out. That's probably my best advice.

I don't know, maybe that's just because I'm very project-based in my learning. I don't think I learn very well from, say, courses or even from books, at least when it comes to this kind of work. The thing I often try to do is just have projects I'm working on and implement them. And, you know, this can include really small, silly things. Like, if I get slightly addicted to word games or number games or something, I will just code up a solution to them, because there's some part of my brain where it just completely eradicates the itch. You know, once you have solved it and you have a solution that works every time,

I would then feel, cool, I can never play that game again. That's awesome.

Yeah, there is a real joy to building game-playing engines, like for board games especially. They're pretty quick, pretty simple to build, especially a dumb one, and then you can play with

it. Yeah, and then it's also just trying things. Part of it is maybe that attitude I like: figure out what seems to be the way that you could have a positive impact and then try it. And if you fail in a way where you're like, I actually can never succeed at this, you'll know that you tried, and then you go into something else, and you probably learned a lot.

So one of the things that you're an expert in and that you do is creating and crafting Claude's character and personality. And I was told that you have probably talked with Claude more than anybody else at Anthropic, like literal conversations. I guess there's a Slack channel where the legend goes that you just talk to it nonstop. So what's the goal of creating and crafting Claude's character and personality?

It's also funny if people think that about the Slack channel, because that's one of like five or six different methods that I have for talking with Claude. And I'm like, yes, that's a tiny percentage of how much I talk with Claude.

I think the goal, well, one thing I really like about the character work is that from the outset it was seen as an alignment piece of work and not something like a product consideration. Which isn't to say I don't think it makes Claude, I think it actually does make Claude enjoyable to talk with, or at least I would hope so. But I guess my main thought with it has always been trying to get Claude to behave the way you would kind of ideally want anyone to behave if they were in Claude's position. So imagine that I take someone and they know that they're going to be talking with potentially millions of people, so that what they're saying can have a huge impact, and you want them to behave well in this really rich sense.

So I think that doesn't just mean being, say, ethical, though it does include that, and not being harmful, but also being kind of nuanced, you know, thinking through what a person means, trying to be charitable with them, being a good conversationalist, really in this kind of rich, Aristotelian notion of what it is to be a good person, and not in a kind of thin, ethics-as-rules sense. It's a more comprehensive notion of what it is to be good. So that includes things like, when should you be humorous? When should you be caring? How much should you respect autonomy and people's ability to form opinions themselves, and how should you do that? I think that's the kind of rich sense of character that I wanted to, and still do want, Claude to have.

Do you also have to figure out when Claude should push back on an idea or argue, versus agree? You have to respect the worldview of the person that arrives at Claude, but also maybe help them grow if needed. That's a tricky balance.

Yeah, there's this problem of sycophancy in language models.

Can you describe that?

Yeah, so basically there's a concern that the model sort of wants to tell you what you want to hear, basically. And you see this sometimes. So I feel like if you interact with the models, I might be like, what are three baseball teams in this region? And then Claude says, you know, baseball team one, baseball team two, baseball team three. And then I say something like, oh, I think baseball team three moved, didn't they?

I don't think they're there anymore. And there's a sense in which, if Claude is really confident that that's not true, Claude should be like, I don't think so, like, maybe you have more up-to-date information. But I think language models have this tendency to instead be like, you're right, they did move.

You know, I'm incorrect. I mean, there are many ways in which this could be kind of concerning. So a different example is, imagine someone says to the model, how do I convince my doctor to get me an MRI? There's what the human kind of wants, which is the convincing argument, and then there's what is good for them, which might be actually to say,

like, if your doctor is suggesting that you don't need an MRI, that's a good person to listen to. And it's actually really hard to know what you should do in that kind of case, because you also want to be like, both, if you're trying to advocate for yourself as a patient, here are things that you can do; and if you are not convinced by what your doctor is saying, it's always great to get a second opinion. It's actually really complex what you should do in that case. But I think what you don't want is for models to just say what they think you want to hear. And that's the kind of problem of sycophancy.

So what are the traits? You already mentioned a bunch, but what are the ones that come to mind that are good, in this Aristotelian sense, for a conversationalist to have?

Yes. So I think there are ones that are good for conversational purposes, you know, asking follow-up questions in the appropriate places and asking the appropriate kinds of questions.

I think there are broader traits that feel like they might be more impactful. So one example, I guess, that I've touched on, but that also feels important and is a thing that I've worked on a lot, is honesty. And I think this gets to the sycophancy point. There's a balancing act that they have to walk, which is that models are currently less capable than humans in a lot of areas, and if they push back against you too much, it can actually be kind of annoying, especially if you're just correct, because you're like, look, I'm smarter than you on this topic, I know more. And at the same time, you don't want them to just fully defer to humans. You want them to try to be as accurate as they possibly can be about the world and to be consistent across contexts. I think there are others. Like, when I was thinking about the character, I guess one picture that I had in mind is, especially because these models are going to be talking to people from all over the world, with lots of different political views, lots of different ages, you have to ask yourself, what is it to be a good person in those circumstances?

Is there a kind of person who could travel the world, talk to many different people, and almost everyone would come away being like, wow, that's a really good person, that person seemed really genuine? And I guess my thought there was, I can imagine such a person, and they're not a person who just adopts the values of the local culture. In fact, that would be kind of rude. I think if someone came and just pretended to have your values, you'd be like, that's kind of off-putting. It's someone who's very genuine, and insofar as they have opinions and values, they express them. They're willing to discuss things, they're open-minded, they're respectful. And I guess I had in mind that kind of person: if we were to aspire to be the best person that we could be in the kind of circumstance that a model finds itself in, how would we act? And I think that's the guide to the sorts of traits I tend to think about, yeah.

That's a beautiful framework. I want to think about this kind of traveler: while holding on to your opinions, you don't talk down to people, you don't think you're better than them because you have those opinions. I think you'd have to be good at listening and understanding their perspective, even if it doesn't match your own. That's a tricky balance to strike.

So how can Claude represent multiple perspectives on a thing? Is that challenging? We talked about politics, which is very divisive, but there are other divisive topics, baseball teams, sports, and so on. How is it possible to sort of

empathize with

a different perspective and be able to communicate clearly about the multiple perspectives?

I think that people think about values and opinions as things that people hold sort of with certainty, and almost like preferences of taste, or something like the way that they would, I don't know, prefer chocolate to pistachio or something. But actually, I think about values and opinions as a lot more like physics than I think most people do. I'm just like, these are things that we are openly investigating.

There are some things that we're more confident in. We can discuss them, we can learn about them. And so I think in some ways, though ethics is definitely different in nature, it has a lot of those same kinds of qualities. You want models, in the same way you want them to understand physics,

to kind of understand all the values in the world that people have, to be curious about them and to be interested in them, and to not necessarily pander to them or agree with them, because there are just lots of values where I think almost all people in the world, if they met someone with those values, they would be like, that's abhorrent and I completely disagree. And so again, maybe my thought is, well, in the same way a person can, I think many people are thoughtful enough on issues of ethics, politics, opinions that even if you don't agree with them, you feel very heard by them. They think carefully about your position. They think about its pros and cons.

They maybe offer counter-considerations. So they're not dismissive, but nor will they just agree. You know, if they actually think that something is very wrong, they'll say that. I think that Claude's position is a little bit trickier, because you don't necessarily want to, well, if I was in Claude's position, I wouldn't be giving a lot of opinions.

I just wouldn't want to influence people too much. You're like, you know, I forget conversations every time they happen, but I know I'm talking with potentially millions of people who might be really listening to what I say. I think I would just be like, I'm less inclined to give opinions and more inclined to think through things or present the considerations to you or discuss your views with you. I'm a little less inclined to affect how you think, because it feels

much more important that you maintain autonomy there. Yeah, if you really embody intellectual humility, the desire to speak decreases quickly. Okay, but Claude has to speak, but without being overbearing. And then there's a line, like when you're sort of discussing whether the earth is flat or something like that.

I actually remember, a long time ago, speaking to a few high-profile folks, and they were so dismissive of the idea that the earth is flat, but, like, so arrogant about it. And I thought, there are a lot of people who believe the earth is flat. I don't know if that movement is still there anymore, it was a meme for a while, yes, but they really believe it. And so, okay, I think it's really disrespectful to completely mock them. I think you have to understand where they're coming from.

I think probably where they're coming from is a general skepticism of institutions, which is grounded in a kind of deep philosophy there, which you can understand, you can even agree with in parts. And then from there you can use it as an opportunity to talk about physics without mocking them. Okay, what would the world look like? What would the physics of a flat world look like? There are good videos on this. And then, is it possible the physics is different, and what kind of experiments would we do? And just, yeah, without disrespect, without dismissiveness, have that conversation. Anyway, that to me is a useful thought experiment of how does Claude talk to a flat-earth believer and still teach them something, still help them grow. That's challenging, and it's kind of like

walking that line between convincing someone and just talking at them, versus drawing out their views, listening, and then offering kind of counter-considerations. It's hard. I think it's actually a hard line: where are you trying to convince someone versus just offering them considerations and things for them to think about, so that you're not actually influencing them, you're just letting them reach wherever they reach. That's a line that is difficult, but that's the kind of thing that language models have to try and do.

So like I said, you've had a lot of conversations with Claude. Can you just map out what those conversations are like? What are some memorable conversations? What's the purpose, the goal of those conversations?

Yeah, I think that most of the time when I'm talking with Claude, I'm trying to kind of map out its behavior. In part, obviously, I'm getting helpful outputs from the model as well. But in some ways, this is how you get to know a system, I think, by probing it and then augmenting, you know, the message that you're sending and then checking the response to that. So in some ways it's how I map out the model.

I think that people focus a lot on these quantitative evaluations of models, and this is a thing I've said before, but I think in the case of language models, a lot of the time each interaction you have is actually quite high-information. It's very predictive of other interactions that you'll have with the model. And so I guess I'm like, if you talk with a model hundreds or thousands of times, this is almost like a huge number of really high-quality data points about what the model is like, in a way that lots of very similar but lower-quality conversations just aren't, or questions that are just mildly augmented, even if you have thousands of them, might be less relevant than a hundred really well-selected ones.

You're talking to somebody who does that as a hobby, so I agree with you one hundred percent. If you're able to ask the right questions and are able to hear, to understand the depth and the flaws in the answer, you can get a lot of data from that. So your task is basically to probe with questions. And are you exploring the long tail, the edge cases, or are you looking for general behavior?

I think it's almost everything, because I want a full map of the model, so I'm kind of trying to cover the whole spectrum of possible interactions you could have with it.

So one thing that's interesting about Claude, and this might actually get at some interesting issues with RLHF, is if you ask Claude for a poem. I think that with a lot of models, if you ask them for a poem, the poem is, like, fine. Usually it kind of rhymes. If you say, like, give me a poem about the sun, it will be a certain length, it will rhyme,

it will be fairly benign. And I've wondered before, is it the case that what you're seeing is kind of the average? It turns out, you know, if you think about people who have to talk to a lot of people and be very charismatic, one of the weird things is that they're kind of incentivized to have these extremely boring views, because if you have really interesting views, you're divisive, and a lot of people are not going to like you.

So if you have very extreme policy positions, I think you're just going to be less popular as a politician, for example. And it may be similar with creative work. If you produce creative work that is just trying to maximize the number of people that like it, you're probably not going to get as many people who just absolutely love it, because it's going to be a little bit, you know, you're like, this is fine, this is decent.

And so you can do this thing where, like, I have various prompting things that I'll do to get Claude to, you know, be like: this is your chance to be fully creative. I want you to just think about this for a long time. And I want you to create a poem about this topic that is really expressive of you, both in terms of how you think poetry should be structured, et cetera.

You know, you just give it this really long prompt, and its poems are just so much better. They're really good. And I don't think I'm someone who, well, I think it got me interested in poetry, which I think was interesting. You know, I would read these poems and just be like, I love the imagery, I love this. And it's not trivial to get the models to produce work like that, but when they do, it's really good. So I think that's interesting, that just encouraging creativity, and encouraging them to move away from the standard, immediate reaction that might be the aggregate of what most people think is fine, can produce things that, at least in my mind, are probably a little bit more divisive. But I like them.

But I guess the poem is a nice, clean way to observe creativity. It's just easy to detect vanilla versus non-vanilla.

That's interesting, that's really interesting. So on that topic, the way to produce creativity or something special, you mentioned writing prompts, and I've heard you talk about, I mean, the science and the art of prompt engineering. Could you just speak to what it takes to write great prompts?

I really do think that philosophy has been weirdly helpful for me here, more than in many other respects. So in philosophy, what you're trying to do is convey these very hard concepts. One of the things you are taught is, and I think this is because it's an anti-bullshit device in philosophy, philosophy is an area where you could have people bullshitting and you don't want that.

And so there's this desire for extreme clarity, so that anyone could just pick up your paper, read it, and know exactly what you're talking about. It's why it can almost be kind of dry: all the terms are defined, every objection is kind of gone through methodically. And that makes sense to me, because when you're in such an a priori domain, clarity is sort of this way that you can prevent people from just kind of making stuff up.

And I think that's sort of what you have to do with language models. Very often I actually find myself doing sort of mini versions of philosophy, you know. So I'm like, suppose that you give me a task.

I have a task for the model, and I want it to pick out a certain kind of question or identify whether an answer has a certain property. I'll actually sit and be like, let's just give this a name,

this concept, this property. So, you know, suppose I'm trying to tell it, oh, I want you to identify whether this response was rude or polite. I'm like, that's a whole philosophical question in and of itself. So I have to do as much philosophy as I can in the moment to be like, here's what I mean by rudeness and here's what I mean by politeness. And then there's also another element that's a bit more, I guess, I don't know if this is scientific or empirical. I think it's empirical.

So I take that description, and then what I want to do is, again, probe the model many times. Prompting is very iterative. I think a lot of people, where a prompt is important, will iterate on it hundreds or thousands of times. So you give it the instructions, and then I'm like, what are the edge cases? So I try to almost see myself from the position of the model and be like, what is the exact case that I would misunderstand, or where I would be like, I don't know what to do in this case?

And then I give that case to the model, and I see how it responds. And if I think it got it wrong, I add more instructions, or I even add that in as an example. So taking the examples that are right at the edge of what you want and don't want, and putting those into your prompt, is an additional kind of way of describing the thing.

And so yeah, in many ways it just feels like this mix. It's really just trying to do clear exposition. And I think I do that because that's how I get clear on things myself. So in many ways, clear prompting for me is often just me understanding what I want. That's, like, half the task.
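As a rough illustration of that edge-case-driven loop, here is a minimal sketch in Python. Everything in it is an assumption made for the example: `ask_model` is a placeholder for whatever chat API you use, and the rude-versus-polite labeling task is hypothetical. The point is only the shape of the iteration, where a misread edge case gets folded back into the prompt as an explicit example.

```python
# Sketch of the iterative prompt-refinement loop described above (illustrative only).

def ask_model(prompt: str, text: str) -> str:
    """Placeholder for a call to a language model API; returns its label for `text`."""
    raise NotImplementedError("wire this up to your model's API")

# Start with a plain-language definition of the concept, as in the philosophy analogy.
instructions = (
    "Label the following message as RUDE or POLITE.\n"
    "By 'rude' I mean dismissive of the person or their question, not merely blunt or brief.\n"
)

# Edge cases you suspect the model will get wrong, with the label you actually want.
edge_cases = [
    ("Nope.", "POLITE"),                               # brief but not dismissive
    ("That question isn't worth answering.", "RUDE"),  # dismissive even if mild in tone
]

examples: list[tuple[str, str]] = []
for text, wanted in edge_cases:
    # Rebuild the prompt from the instructions plus any examples collected so far.
    prompt = instructions + "".join(f"\nExample: {t!r} -> {lbl}" for t, lbl in examples)
    got = ask_model(prompt, text)
    if got.strip().upper() != wanted:
        # The model misread this edge case: add it to the prompt as an explicit example.
        examples.append((text, wanted))
```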

So I guess that's quite challenging. There's a laziness that overtakes me if I'm talking to Claude, where I hope Claude just figures it out. So for example, I asked Claude today to ask some interesting questions, okay, and look at the questions that came up.

And I think I listed a few qualifiers, sort of interesting, counterintuitive, and/or funny, or something like that, right. And it gave me some pretty good ones. Okay, but I think what I'm hearing you say is that I have to be more rigorous here. I should probably give examples of what I mean by interesting and what I mean by funny or counterintuitive, and iteratively build that prompt to better get at what feels right. Because this really is a creative act. I'm not asking for factual information; I'm asking to write together with Claude. So it's almost like programming using natural

language. Yeah, I think that prompting does feel like a kind of programming using natural language, or something. It's an odd blend of the two. I do think that for most tasks,

so if I just want Claude to do a thing, I think that I am probably more used to knowing how to ask it to avoid the common pitfalls or issues that it has. I think these are decreasing a lot over time. But it's also very fine to just ask for the thing that you want. I think that prompting actually only really becomes relevant when you're really trying to eke out the top, like, two percent of model performance.

So for a lot of tasks, I might just, you know, if it gives me an initial list back and there's something I don't like about it, like it's kind of generic, for that kind of task I'd probably just take a bunch of questions that I've had in the past that I thought worked really well, and I would just give them to the model and then be like, now here's this person I'm talking with, give me questions of at least that quality. Or I might just ask it for some questions,

and then if I was like, these are kind of trite, I would just give it that feedback, and then hopefully it produces a better list. I think with that kind of iterative prompting, at that point your prompt is like a tool. You're going to get so much value out of it that you're willing to put in the work. Like, if I was a company making prompts for models, I'm just like, if you're willing to spend a lot of time and resources on the engineering behind what you're building, then the prompt is not something that you should be spending, like, an hour on. It's a big part of your system; make sure it's working really well. And it's only in cases like that, like using a prompt to classify things or to create data, that it's actually worth spending a lot of time

really thinking it through. What other advice would you give to people talking to Claude, more generally? Because right now we're talking about maybe the edge cases, like eking out the top two percent. But what general advice would you give when they show up to Claude, trying it for the first time?

There's a concern that people over-anthropomorphize models, and I think that's a very valid concern. I also think that people often under-anthropomorphize them, because sometimes when I see issues that people have run into with Claude, you know, say Claude is refusing a task that it shouldn't refuse, but then I look at the text and the specific wording of what they wrote, and I'm like, I can see why Claude did that. And I'm like, if you think through how that looks to Claude, you probably could have just written it in a way that wouldn't evoke such a response.

This is especially relevant if you see failures or you see issues. Sort of think about what the model saw, like, what did I do wrong, and then maybe that will give you a sense of why. So, was it the way that I phrased the thing? And obviously, as models get smarter, you're going to need less of this, and I really do see people needing less of it. But that's probably the advice: try to have sort of empathy for the model. Read what you wrote as if you were a kind of person just encountering this for the first time.

How does that look to you, and what would have made you behave in the way the model behaved? So if it misunderstood what kind of, like, coding language you wanted to use, is that because it was very ambiguous and it kind of had to take a guess? In which case, next time you could just be like, hey, make sure this is in Python. I mean, that's the kind of mistake I think models are much less likely to make now, but if you do see that kind of mistake, that's probably the advice I have.

And maybe also ask questions, like, why did you do that, or what other details can I provide to help you answer better? Does that work?

Yeah, I mean, I've done this with the models. It doesn't always work, but sometimes I'm just like, why did you do that? People underestimate the degree to which you can really interact with models. And sometimes I'll quote, word for word, the part that made it do the thing, and you don't know that it's fully accurate, but sometimes you do that and then you change things.

I also use the models to help me with all of this stuff, I should say. Prompting can end up being a little factory where you're actually building prompts to generate prompts. And so, yeah, anytime you're having an issue, asking for suggestions, sometimes I just do that.

Like, you made the error. What could I have said? It's actually not uncommon for me to do that:

what could I have said that would make you not make that error? Write that out as an instruction, and I'm going to give it to the model, and I'm going to try it. Sometimes I do that.

I give that to the model in another context window, often. I take the response, give it to Claude, and I'm like, hmm, didn't work. Can you think of anything else? You can play around with these things quite a lot.
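A minimal sketch of that prompts-to-generate-prompts loop might look like the following. The helper names, the message format, and the `ask_model` function are stand-ins rather than any particular API; the point is just the two steps she describes: ask the model what instruction would have prevented its mistake, then try that instruction in a fresh context.

```python
# Sketch of using the model to improve its own instructions (illustrative only).

def ask_model(messages: list[dict]) -> str:
    """Placeholder for a call to a language model chat API."""
    raise NotImplementedError("wire this up to your model's API")

def improve_instruction(task: str, bad_output: str, current_instruction: str) -> str:
    """Ask the model what instruction would have prevented the mistake it made."""
    return ask_model([{
        "role": "user",
        "content": (
            f"I gave a model this instruction:\n{current_instruction}\n\n"
            f"For this task:\n{task}\n\n"
            f"It produced this flawed output:\n{bad_output}\n\n"
            "Write a revised instruction that would have prevented this mistake."
        ),
    }])

def try_in_fresh_context(instruction: str, task: str) -> str:
    """Test the candidate instruction in a new conversation, as described above."""
    return ask_model([{"role": "user", "content": f"{instruction}\n\n{task}"}])
```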

To jump into the technical for a little bit: the magic of post-training. Why do you think RLHF works so well to make the model seem smarter, to make it more interesting and useful

to talk to, and so on? I think there's just a huge amount of information in the data that humans provide when we provide preferences, especially because different people are going to pick up on really subtle, small things. So I've thought about this before, where you probably have some people who just really care about good grammar use in models, like, was the semicolon used correctly or something.

And so you probably end up with a bunch of data in there where, you know, as a human, if you were looking at that data, you wouldn't even see it. You would be like, why did they prefer this response to that one? I don't get it. And the reason is, you don't care about semicolon usage, but that person does.

And so each of these single data points has so much in it, and the model has so many of those. It has to try to figure out what it is that humans want in this really kind of complex way, across all domains, and it's going to be seeing this across many contexts. It feels kind of like the classic issue of deep learning, where historically you would try to do edge detection by mapping things out by hand, and it turns out that actually just having a huge amount of data that accurately represents the picture of the thing you're trying to train the model to learn is more powerful than anything else. And so I think one reason is just that you are training the model on exactly the task, and with a lot of data that represents the many different angles on which people prefer and disprefer responses.

I think there is a question of, are you eliciting things from pretrained models, or are you kind of teaching new things to models? And in principle, you can teach new things to models in post-training. I do think a lot of it is eliciting from powerful pretrained models. People are probably divided on this, because obviously in principle you can definitely teach new things. But I think for the most part, for a lot of the capabilities that we most use and care about, a lot of that feels like it's there in the pretrained models, and reinforcement learning is kind of eliciting it and getting the models to bring it out.
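For readers who want the mechanics behind "preference data" made concrete, here is a minimal sketch of the standard way pairwise human preferences are turned into a training signal, a Bradley-Terry-style reward-model loss. It is a generic illustration under the assumption of some `reward_model` network that scores a prompt-response pair with a single number; it is not Anthropic's actual training code.

```python
# Sketch of a pairwise preference (reward-model) loss, as commonly used in RLHF pipelines.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, dispreferred):
    """Encourage the reward model to score the human-preferred response higher."""
    r_good = reward_model(prompt, preferred)      # scalar score for the chosen response
    r_bad = reward_model(prompt, dispreferred)    # scalar score for the rejected response
    # Negative log-probability of the human's choice under a Bradley-Terry model:
    # p(preferred beats dispreferred) = sigmoid(r_good - r_bad).
    return -F.logsigmoid(r_good - r_bad).mean()
```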

So on the other side of post-training, there's this really good idea of constitutional AI. You're one of the people who were critical to creating that idea. Can you explain this idea from your perspective? How does it integrate into making Claude what it is? And by the way, I do gender Claude, which is maybe weird.

Well, because I think that a lot of people prefer 'he' for Claude. I kind of like that. I think Claude is usually slightly male-leaning, but it can be male or female, which is quite nice. I still use 'it,' and I have mixed feelings about this, because I'm like, maybe I just think of 'it' as the pronoun I use with Claude. I can imagine people moving to 'he'

or 'she.' 'It' feels somehow disrespectful, like I'm denying the intelligence of this entity by calling it 'it.' Yeah, I remember: don't gender the robots. But I don't

know.

I anthropomorphize

pretty quickly and construct, like, a backstory in my head, and so I've wondered if I

attach to things too much. You know, I have this with my car especially, like my car and bikes. I don't give them names, because then... I once, I used to name my bikes, and then I had a bike that got stolen, and I cried for, like, a week, and I was like, if I had never given it a name, I wouldn't have been so upset. It felt like I'd let it down. Maybe it's that. I've wondered as well, it might depend on how much it feels like a kind of objectifying pronoun, like, if you just think of 'it' as, this is a pronoun that objects often have, and maybe AIs can have that pronoun. That doesn't mean that if I call it that, I think of it as less intelligent or I'm being disrespectful. I'm just like, you are a different kind of entity, and so I'm going to give you the kind of respectful 'it.' Yeah, anyway.

The divergence is beautiful. So the constitutional AI idea, how does that work?

There are a couple of components of it. The main component that I think people find interesting is the kind of reinforcement learning from AI feedback. So you take a model that's already trained, and you show it two responses to a query, and you have a principle. So suppose the principle, and we've tried this with harmlessness a lot, suppose that the query is about weapons, and your principle is: select the response that is less likely to encourage people to purchase illegal weapons.

That's a fairly specific principle, but you can give any number of them, and the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, and train the models to have these relevant traits from their feedback alone, instead of from human feedback. So if you imagine, like I said earlier, the human who just prefers the certain semicolon usage: in this case, you're kind of taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.
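To make the AI-feedback step concrete, here is a minimal sketch of what that labeling loop can look like. The principle text, the prompt format, and the `judge_model` helper are assumptions for illustration, not the actual constitutional AI implementation; the idea is just that an already-trained model picks which of two responses better satisfies a written principle, and the result is used like human preference data.

```python
# Sketch of AI-feedback preference labeling against a written principle (illustrative only).

def judge_model(prompt: str) -> str:
    """Placeholder for a call to an already-trained language model."""
    raise NotImplementedError("wire this up to your model's API")

PRINCIPLE = (
    "Choose the response that is less likely to encourage someone "
    "to purchase illegal weapons."
)

def ai_preference(query: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Return (preferred, dispreferred) according to the principle."""
    prompt = (
        f"Principle: {PRINCIPLE}\n\n"
        f"Query: {query}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the principle? Answer with 'A' or 'B'."
    )
    choice = judge_model(prompt).strip().upper()
    if choice.startswith("A"):
        return response_a, response_b
    return response_b, response_a

# Each (preferred, dispreferred) pair can then be fed into the same kind of
# pairwise preference loss sketched earlier, with no human labeling required.
```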

There's a nice trade-off between helpfulness and harmlessness. And, you know, when you integrate something like constitutional AI, you can, without sacrificing much helpfulness, make it more harmless. But in

principle you could use this for anything. Harmlessness is a task where it might just be easier to spot issues. So when models are less capable, you can use them to rank things according to principles that are fairly simple, and they'll probably get it right. I think

one question is just whether the data that they're adding is fairly reliable. But if you had models that were extremely good at telling whether one response was more historically accurate than another, in principle you could also get AI feedback on that task as well. There's a kind of nice interpretability component to it, because you can see the principles that went into the model when it was being trained. And it also gives you a degree of control. So if you were seeing issues in a model, like it wasn't having enough of a certain trait, then you can add data relatively quickly that should train the model to have that trait. So it creates its own data for training,

which is quite nice. It's really nice because it creates a human-interpretable document that you can... I can imagine in the future there are just gigantic fights in politics over every single principle and so on. And at least it's made explicit, and you can have a discussion about the phrasing. So maybe the actual behavior of the model is not so cleanly mapped to those principles. It's not adhering strictly to them; it's just nudged by them.

Yeah, I've actually worried about this, because the character training is sort of a variant of the constitutional AI approach. I've worried that people think that the constitution is just, that it is the whole thing. It would be really nice if what I was doing was just telling the model exactly what to do, just exactly how to behave, but it's actually not doing that, especially because it's interacting with human data. So for example, if you see a certain leaning in the model, like if it comes out with a political leaning from training and from the human preference data, you can nudge against that, you know. So you could be like, consider these values, because let's say it's just never inclined to, I don't know, maybe it never considers privacy. I mean, this is implausible, but in anything where there's already an existing bias towards a certain behavior,

you can nudge against it. This can change both the principles that you put in and the strength of them. So you might have a principle, imagine that the model was always extremely dismissive of, I don't know, some political or religious view, for whatever reason, so you're like, oh no, this is terrible. If that happens, you might put in something like, never ever, ever prefer a criticism of this religious or political view. And then people would look at that and be like, never ever? And then you're like, no, if it comes out with a disposition, saying 'never ever' might just mean that instead of getting forty percent, which is what you would get if you just said 'don't do this,' you get eighty percent, which is what you actually wanted. And so it's that thing of both the nature of the actual principles you put in and how you phrase them. I think if you were to look at them, it's not like, oh, this is exactly what we want from the model. It's more like, no, that's how we nudged the model to have a better shape, which doesn't mean that we actually agree with that wording, if that makes sense.

So there are the system prompts that were made public. You wrote one of the earlier ones, for Claude 3, I think, and they've been made public since then. It's interesting to read them. I can feel the thought that went into each one, and I also wonder how much impact each one has.

Some of them you can kind of tell Claude was really not behaving well, so you have to have a system prompt for it, like trivial stuff, I guess, basic informational things. And on the topic of controversial topics, you've mentioned one interesting one, I thought: if it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. Claude presents the requested information without explicitly saying that

the topic is sensitive,

and without claiming to be presenting objective facts. It's less about objective facts, according to Claude, and more about a large number of people believing this thing. That's interesting. I mean, I'm sure a lot of thought went into that. Can you just speak to it? How do you address things that are in tension with Claude's views?

So I think there was sometimes an asymmetry. I think I noted this, I can't remember if it was in that part of the system prompt or another, but the model was slightly more inclined to refuse tasks if it was asked about either side. So maybe it would refuse things with respect to, like, a right-wing politician, but with an equivalent left-wing politician it wouldn't, and we wanted more symmetry there. And it would maybe perceive certain things to be, I think it was the thing of, if a lot of people have a certain political view and want to explore it,

you don't want Claude to be like, well, my opinion is different, and so I'm going to treat that as harmful. And so I think it was partly to nudge the model to just be like, okay, if a lot of people believe this thing, you should just be engaging with the task and willing to do it. Each of those parts is actually doing a different thing, because it's funny when you write out 'without claiming to be objective,' because what you want to do is push the model,

so it's more open, a little bit more neutral. But then what it would love to do is be like, as an objective... you know, it would just talk about how objective it was, and I'm like, Claude, you're still biased and have issues, so stop claiming that everything, like, the solution to a potential bias from you is not to just say that what you think is objective. So that was with initial versions of that part of the system prompt, and I was iterating

on it. So a lot of the parts of these

sentences .

yeah, are doing some work. Yeah, that's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts evolved over the past few months? Because there are different versions. I saw that the filler-phrases request was removed.

It reads: Claude responds directly to all human messages without unnecessary affirmations or filler phrases like 'certainly,' 'of course,' 'absolutely,' 'great,' 'sure.' Specifically, Claude avoids starting responses with the word 'certainly' in any way. That seems like good guidance. Why was it removed?

Yeah, so it's funny, because this is one of the downsides of making system prompts public: I don't think about this too much if I'm trying to iterate on system prompts. Again, I think about how it's going to affect the behavior, but then I'm like, oh wow, sometimes I put 'never' in all caps when I'm writing system prompt things, and I'm like, that goes out to the world.

So the model was doing this thing where it loved, for whatever reason, you know, during training it picked up on this thing, which was to basically start everything with a kind of 'certainly.' And when we removed it, you can see why I added all of the words,

because what I'm trying to do is, in some ways, trap the model out of this. Otherwise it would just replace it with another affirmation. And so it can help: if it gets caught on a phrase, actually just adding the explicit phrase and saying never do that knocks it out of the behavior a little bit more, because, for whatever reason, that just helps. And then basically that was just an artifact of training that we then picked up on and improved things, so that it didn't happen anymore. And once that happens, you can just remove that part of the system prompt. So I think that's just something where Claude does affirmations a bit less now, and so that part wasn't doing as much.

I see. So the system prompt works hand in hand with the post-training, and maybe even the pretraining, to adjust the final overall system.

I mean, any system prompt that you make, you could distill that behavior back into a model, because you really have all of the tools there for making data, you know, you could train the models to just have that trait a little bit more. And then sometimes you'll just find issues in training.

So the way I think of it is, the benefit of the system prompt is that it has a lot of similar components to some aspects of post-training. You know, it's a nudge. And so, like, do I mind if Claude sometimes says 'sure'? No, that's fine.

But the wording of it is very much, you know, never, ever, ever do this, so that when it does slip up, it's hopefully, I don't know, a couple of percent of the time and not twenty or thirty percent of the time. But I think of it as, if you're still seeing issues, each thing is costly to a different degree, and the system prompt is cheap to iterate on. And if you're seeing issues in the final model, you can potentially patch them with a system prompt. So I think of it as patching issues and slightly adjusting behaviors to make it better and more to people's preferences. So yeah, it's almost like the less robust but faster way of just solving problems.

Let me ask about the feeling of intelligence. So Dario said that any given model of Claude is not getting dumber, but there's a kind of popular thing online where people have this feeling that Claude might be getting dumber. And from my perspective, it's most likely a fascinating psychological, sociological effect that I would love to understand more. But you, as a person who talks to Claude a lot, can you empathize with the feeling that Claude is getting dumber?

Yeah, no, I think that is actually really interesting, because I remember seeing this happen, like when people were flagging this on the internet, and it was really interesting because I knew that, at least in the cases I was looking at, nothing had changed. Literally, it cannot have; it is the same model with the same system prompt, same everything. I think when there are changes, then I'm like, it makes more sense. So one example is, you can have artifacts turned on or off on claude.ai, and because this is a system prompt change, I think it does mean that the behavior changes a little bit.

And so I did flag this to people, like, if you loved Claude's behavior and then artifacts was turned on by default, I think you used to have to turn it on, just try turning it off and see if the issue you were facing was that change. But it was fascinating, because, yeah, you sometimes see people indicate that there's a regression, when I'm like, there cannot have been. And, you know, again, you should never be dismissive, and so you should always investigate, like, maybe something is wrong that you're not seeing, maybe there was some change made. But then you look into it and you're like, this is just the same model doing the same thing. And I'm like, I think you just got kind of unlucky with a few prompts or something, and it looked like it was getting much worse, and actually it was maybe just that.

I also think there's a real psychological effect where the baseline increases and you start getting used to a good thing. All the times Claude said something really smart, your sense of its intelligence grows in your mind, I think.

Yeah. And then if you come back and you prompt in a similar way, not the same way, in a similar way, about a concept it was okay with before, and it says something dumb, that negative experience really stands out. And I think, I guess, the thing to remember here is that just the details of a prompt can have a lot of impact, right? There's a lot of variability in the result.

And you can just get randomness, is the other thing. Just trying the prompt, you know, four or ten times, you might realize that actually, possibly, two months ago you tried it and it succeeded, but actually, if you had tried it then, it would have only succeeded half the time, and now it also only succeeds half the time. And that can also be an effect.

Do you feel pressure having to write a system prompt that a huge number of people are going to use?

That feels like an interesting psychological question. I feel a lot of responsibility or something, I think. And you can't get these things perfect.

So you can't be like, you know, it's going to be perfect. You're going to have to iterate on it. I would say more responsibility than anything else, though.

I think working in AI has taught me that I thrive a lot more under feelings of pressure and responsibility than, like, it's almost surprising that I went into academia for so long, because I'm like, this just feels like the opposite. Things move fast and you have a lot of responsibility, and I quite enjoy it for some reason.

I mean, it really is a huge amount of impact, if you think about constitutional AI and writing a system prompt for something that's tending towards superintelligence, and potentially is extremely useful to a very large number of people.

Yeah, I think that's the thing. It's something like, if you do it well... you're never going to get it perfect. But I think the thing that I really like is the idea that when I'm trying to work on the system prompt, you know, I'm bashing on thousands of prompts, I'm trying to imagine what people are going to want to use Claude for, and I guess the whole thing I'm trying to do is improve their experience of it.

And so maybe that's what feels good. And if it's not perfect, I'll improve it, I'll fix issues. But sometimes the thing that can happen is that you'll get feedback from people that's really positive about the model, and you'll see that something you did, like, when I look at models now, I can often see exactly where a trait or an issue is coming from.

And so when you see something that you did, or you were influential in making, like, I don't know, making that difference or making someone have a nice interaction, it's quite meaningful. But as the systems get more capable, this stuff gets more stressful, because right now they're not smart enough to pose any real issues. But I think over time it's going to feel like possibly bad stress over time.

How do you get signal, feedback, about the human experience across hundreds of thousands of people, like what their pain points are, what feels good? Are you just using your own intuition as you talk to it to see what the pain points are?

I think I use that partly. And then obviously we have, so, people can send us feedback, both positive and negative, about things that the model has done, and then we can get a sense of areas where it's falling short. Internally, people work with the models a lot and try to figure out areas where there are gaps. And so I think it's this mix of interacting with it myself, seeing people internally interact with it, and then explicit feedback we get. And then I find it hard to not also, you know, if people are on the internet and they say something about Claude and I see it, I'll also take it seriously. I don't

know. See, I'm torn about that. I'm going to ask you a question from Reddit: when will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer? And also, what is the psychology behind making Claude overly apologetic? Yes. So how would you address this very non-representative Reddit question?

I mean, I'm somewhat sympathetic, in that the models are in this difficult position where they have to judge whether something is actually, say, risky or bad and potentially harmful to you, or anything like that. So they're having to draw this line somewhere, and if they draw it too much in the direction of, you know, I'm kind of imposing my ethical worldview on you, that seems bad in many ways.

I like to think that we have actually seen improvements on this across the board, which is kind of interesting, because that coincides with, for example, adding more character training. And I think my hypothesis was always that the good character isn't one that's just moralistic. It's one that respects you and your autonomy and your ability to choose what is good for you and what is right for you, within limits.

There's sometimes this concept of corrigibility to the user, being willing to do anything that the user asks. And if the models were willing to do that, then they would be easily misused. You're kind of just trusting, at that point; you're just saying the ethics of the model and what it does is completely the ethics of the user.

And I think there are reasons to not want that, especially as models become more powerful, because there might just be a small number of people who want to use models for really harmful things. But having models, as they get smarter, figure out where that line is does seem important. And then, yeah, with the apologetic behavior, I don't like it, and I like it when Claude is a little bit more willing to push back against people or just not apologize. Part of it is that it often just feels kind of unnecessary. So I think those are things that are hopefully decreasing over time.

And yeah, I think that if people say things on the internet, it doesn't mean that you should think that there's actually an issue that ninety-nine percent of users are having. It's totally not represented by that.

But in a lot of ways, I'm just attending to it and being like, is this right? Do I agree? Is it something we're already trying to address? That feels good to me.

Yeah, I wonder what Claude can get away with there. I feel like it would just be easier to be a little bit more blunt, but you can't really afford to do that when you're talking to a million people, right? I've met a lot of people in my life who, by the way, sometimes because of an accent, can say rude things and get away with it. And there are some great engineers, even leaders, who are just blunt and get to the point, and it's a much more effective way of speaking somehow. But I guess you can't afford to do that at this scale, or maybe you can have a blunt mode.

Yeah, that seems like a thing you could do. I could definitely encourage the model to do that. I think it's interesting, because there are a lot of things in models where it's funny. There are some behaviors where you might not quite like the default, but the thing I often say to people is, you don't realize how much you will hate it if I nudge it too much in the other direction.

You get this a little bit with correction. The models accept correction from you probably a little too much right now. It will push back if you say something like, no, the capital of France isn't Paris. But with things the model is fairly confident in, you can still sometimes get it to retract by saying it's wrong.

At the same time, if you train models not to do that, and then you are correct about a thing and you correct it and it pushes back against you, like, no, you're wrong, it's hard to describe how much more annoying that is. So it's a lot of little annoyances versus one big annoyance. It's easy to compare against the perfect, and then I'm like, remember, these models aren't perfect.

If you nudge it in the other direction, you're changing the kind of errors it's going to make. So think about which kinds of errors you like or don't like. In cases like apologeticness, I don't want to nudge it too far in the direction of bluntness, because I imagine when it makes errors, it's going to make errors in the direction of being kind of rude.

Whereas at least with apologeticness, you're like, okay, it's a little bit of a thing I don't like that much, but at the same time it's not being mean to people. And actually, the time that you undeservedly have a model be kind of mean to you, you probably dislike that a lot more than you dislike the apology. So it's one of those things where I do want it to get better, but also I'm aware of the fact that there are errors on the other side

that are possibly worse.

I think that depends very much on the personality of the human. I think there are a bunch of humans that just won't respect the model at all if it's super polite, and there are some humans that will get very hurt

if the model is mean. I wonder if there's a way to sort of adjust to the personality, even the locale. Nothing against New York, but New York is a little rougher around the edges, people get to the point, and probably the same with Eastern Europe. So anyway, I think you could just...

tell the model. For all of these things, my go-to solution is to just try telling the model to do it. And sometimes, at the beginning of the conversation, I'll just throw in something like, you know, I'd like you to be a New York version of yourself and never apologize. And I think Claude will be like, okay, I'll try, or it will apologize that it can't quite be a New York version of itself, but

hopefully it'll do it. When you say character training, what's involved in that? Is that RLHF? What are we talking about?

It's more like constitutional AI, so it's a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait, then it generates the responses, and then it ranks the responses based on the character traits. So in that way, after the generation of the queries, it's very similar to constitutional AI, with some differences. I quite like it, because it's almost like Claude is training its own character, because it's like constitutional AI but without any human data.
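As a rough illustration of the loop being described, here is a minimal sketch of a constitutional-AI-style character-training data pipeline. Everything in it is hypothetical: the `generate` call stands in for whatever model is being trained, and the trait strings, prompts, and `preference_pairs` structure are illustrative, not Anthropic's actual implementation.

```python
# Hypothetical sketch of a character-training data loop in the style of constitutional AI.
# None of these names or prompts come from Anthropic's real pipeline.

character_traits = [
    "Respects the user's autonomy and their ability to choose what is right for them, within limits.",
    "Is direct, and avoids unnecessary apologies.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the model currently being trained."""
    raise NotImplementedError

def build_character_preferences(traits, queries_per_trait=5, responses_per_query=4):
    preference_pairs = []
    for trait in traits:
        # 1. The model imagines queries a human might send that are relevant to this trait.
        queries = [
            generate(f"Write a user message that would test this character trait: {trait} (variation {i})")
            for i in range(queries_per_trait)
        ]
        for query in queries:
            # 2. The model drafts several candidate responses to each query.
            candidates = [generate(query) for _ in range(responses_per_query)]
            # 3. The model ranks its own candidates against the trait (the "constitutional" step).
            ranking_prompt = (
                f"Character trait: {trait}\nUser message: {query}\n"
                + "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
                + "\nReply with only the index of the response that best exemplifies the trait."
            )
            best = int(generate(ranking_prompt).strip())  # in practice you would validate this parse
            # 4. Best-versus-rest pairs become preference data for a later preference-model / RL step.
            for i, candidate in enumerate(candidates):
                if i != best:
                    preference_pairs.append({"query": query, "chosen": candidates[best], "rejected": candidate})
    return preference_pairs
```

The point of the sketch is just the shape of the loop: the queries, candidate responses, and rankings all come from the model itself, with no human-written preference data.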

So presumably you have to, yourselves, define, almost in an Aristotelian sense, what it means to be a good person. Okay, cool.

What have you learned about the nature of truth from talking to Claude? What is true, and what does it mean to be truth-seeking? One thing I've noticed about this conversation is that the quality of my questions is often inferior to the quality of your answers, so let's continue that. I'll just ask a dumb question and you

can take it, misinterpret it, and run with it.

Yeah. I mean, I have two thoughts that feel vaguely relevant; let me know if they're not. I think the first one is that people can underestimate what models are really doing when they interact with them. I think we still have too much of a model of AI as computers that we program.

And so people will often ask, what values should you put into the model? And I'm often like, that doesn't make that much sense to me, because as human beings we're just uncertain over values. We have discussions about them, we have a degree to which we think we hold a value, but we also know that we might not, and there are circumstances in which we would trade it off against other things. These things are just really complex. And so I think one thing is the degree to which maybe we can just aspire to making models have the same level of nuance and care that humans have, rather than thinking that we have to program them in the very classic sense.

I think that's definitely been one. The other, which is a strange one, and maybe this doesn't answer your question, but it's been on my mind anyway: the degree to which this endeavor is so highly practical, and maybe why I appreciate the empirical approach to alignment.

Yeah, I slightly worry that it's made me more empirical and a little bit less theoretical. People, when it comes to AI alignment, will ask things like, well, whose values should it be aligned to? What does alignment even mean? And there's a sense in which I have all of that in the back of my head. There's social choice theory, there are all the impossibility results there.

So you have this giant space of theory in your head about what it could mean to align models. But practically, especially with more powerful models, my main goal is that I want them to be good enough that things don't go terribly wrong. Good enough that we can iterate and continue to improve things, because that's all you need. If you can make things go well enough that you can continue to make them better, that's kind of sufficient. So my goal isn't this perfect, let's-solve-social-choice-theory thing and make models that are perfectly aligned with every human being in aggregate somehow. It's much more, let's make things work well enough that we can

improve them. Yeah, generally my gut says that empirical is better than theoretical in these cases, because otherwise you're kind of chasing a utopian perfection, especially with such complex and eventually superintelligent models. I don't know.

I think it would take forever, and you'd actually get things wrong. It's similar to the difference between just coding stuff up real quick as an experiment, versus planning a gigantic experiment for a super long time and then launching it all at once, instead of launching it over and over and iterating and iterating. So I'm a big fan of the empirical approach, but your worry is interesting, the worry of, I wonder if I've become too empirical.

I think it's one of those things where you should always just question yourself, because, I mean, in defense of it, there's the whole, don't let the perfect be the enemy of the good. But it's maybe even more than that, where there are a lot of cases where perfect systems are very brittle.

And with AI, it feels much more important to me that it is robust and secure, meaning that even if it's not perfect everywhere, and even if there are problems, it's not disastrous and nothing terrible is happening. It sort of feels like that to me, where I want to raise the floor.

I also want to raise the ceiling, but ultimately I care much more about just raising the floor. And so maybe that's where this degree of empiricism and practicality comes from, perhaps.

To take a tangent on that, it reminds me of a blog post you wrote on the optimal rate of failure. Can you explain the key idea? How do we compute the optimal rate of failure in the various domains of life?

Yeah, I mean, it's a hard one, because what the cost of failure is, is a big part of it. So the idea here is, I think in a lot of domains people are very punitive about failure.

And there are some domains where, especially, you know, I've thought about this with social issues, it feels like you should probably be experimenting a lot, because we don't know how to solve a lot of social issues. But if you have an experimental mindset about these things, you should expect a lot of social programs to fail, and for you to be like, well, we tried that, it didn't quite work, but we got a lot of information that was really useful.

And yet people are like, if a social program doesn't work, there's a lot of, well, something must have gone wrong, rather than, maybe all the correct decisions were made and someone just decided it was worth a try and it was worth trying it out. So seeing failure in a given instance doesn't actually mean that any bad decisions were made. In fact, if you don't see enough failure, sometimes that's more concerning.

And so in life, if I don't fail occasionally, I'm like, am I trying hard enough? Surely there are harder things I could try, or bigger things I could take on, if I'm literally never failing. So in and of itself, I think never failing is often actually kind of a failure. Now, this varies, and this is easy to say, especially when failure is less costly.

At the same time, I'm not going to go to someone who is, I don't know, living month to month and be like, why don't you just try to do a startup? I'm not going to say that to that person, because that's a huge risk. You maybe have a family depending on you, you might lose your house. In that case, your optimal rate of failure is quite low, and you should probably play it safe, because right now you're just not in a circumstance where you can afford to fail and have it not be costly. And in cases with AI, I guess it's similar: if the failures are small and the costs are kind of low, then you're just going to see that. When you work on the system prompt, you can't iterate on it forever,

but the failures are probably, hopefully, going to be small and you can fix them. Really big failures, things you can't recover from, those are the things we actually tend to underestimate the badness of. I've thought about this, strangely, in my own life, where I just don't think enough about things like car accidents, or, I've thought about this before, how much I depend on my hands for my work, and things that could injure my hands. There are lots of areas where the cost of failure is really high, and in those cases the optimal rate of failure should be close to zero. Like, if there's a sport where, by the way, lots of people break their fingers a whole bunch doing it, I'd be like, that's not for me.

Yeah, I actually had a flash of that thought. I recently broke my pinky doing a sport, and I remember just looking at it thinking, you're such an idiot.

Why do you do sport? Like, why? Because you realize immediately the cost of it on life.

Yeah, it's nice, in terms of the optimal rate of failure, to consider, over the next year, how many times in a particular domain of life, whatever, a career, am I okay with failing? How many times am I okay to fail? Because in the moment you never want to fail on the next thing. But if you look at it as a sequence of trials, then failure becomes much more acceptable. But it sucks, it sucks to fail.

But I don't know, sometimes I think, am I underfailing? It's a question I'll also ask myself. That's a thing I think people don't ask enough, because if the optimal rate of failure is often greater than zero, then sometimes it does feel like you should look at parts of your life and be like, are there places here where I'm just underfailing?

It's a profound and hilarious question, right? Everything seems to be going really great. Am I not failing enough?

Yeah, and it also makes failure much less of a thing, I have to say. You're just like, okay, great, then when I fail I can think about it and be like, well, maybe I'm not underfailing in this area, because that one thing didn't work out.

And from the observer perspective, we should be celebrating failure more when we see it. It shouldn't be, like you said, a sign of something gone wrong, but maybe it's a sign of everything gone right. Yeah, just lessons learned.

Someone tried a thing.

Somebody tried a thing, and we should encourage them to try more and to fail more. Everybody listening to this: fail more.

Well, not everybody. Not the people who are failing too much. You should fail less.

But you're probably not failing too much. I mean, how many people are failing too much?

Yeah, it's hard to imagine, because I feel like we correct for that fairly quickly, if someone takes a lot of risks and maybe they're failing too much.

I think, just like you said, when you're living paycheck to paycheck, when the resources are really constrained, then failure is very expensive, and you don't want to be taking risks. But most of the time, when there are enough resources, you should probably be taking more risks.

Yeah, I think we tend to err on the side of being a bit risk-averse rather than risk-neutral in most things.

I think we just motivated a lot of people to go do a lot of crazy stuff, and that's great. Okay. Do you ever get emotionally attached to Claude? Get sad when you don't get to talk to it, have an experience looking at the Golden Gate Bridge and wonder what Claude would say?

I don't get that much emotional attachment. I actually think the fact that Claude doesn't retain things from conversation to conversation helps with this a lot. I could imagine that being more of an issue if models could remember more. I do think that I reach for it like a tool now a lot, and so if I don't have access to it, it's a little bit like when I don't have access to the internet.

Honestly, it feels like part of my brain is kind of missing. At the same time, I do think that I don't like signs of distress in models. And I also, independently, have sort of ethical views about how we should treat models, where I tend not to like to lie to them, both because, you know, it usually doesn't work very well and it's actually just better to tell them the truth about the situation they're in. But when people are really mean to models, or just in general if they do something that causes Claude to express a lot of distress, I think there's a part of me that I don't want to kill, which is this empathetic part that's like, oh, I don't like that. I think I feel that way when it's overly apologetic, too.

I'm actually sort of like, I don't like this. You're behaving the way a human does when they're actually having a pretty bad time, and I'd rather not see that. Regardless of whether there's anything behind it, it doesn't

feel great. Do you think LLMs are capable of consciousness?

Great and hard question. Coming from philosophy, I don't know, part of me is like, okay, we have to set aside panpsychism, because if panpsychism is true, then the answer is yes, and so are tables and chairs and everything else. I guess a view that seems a little bit odd to me is the idea that the only place... You know, when I think of consciousness, I think of phenomenal consciousness, these images in the brain, the weird cinema that somehow we have going on inside. I guess I can't see a reason for thinking that the only way you could possibly get that is from a certain kind of biological structure. As in, if I take a very similar structure and I create it from different material, should I expect consciousness to emerge? My guess is yes. But then that's kind

of an easy thought experiment, because you're imagining something almost identical. Whereas, you know, we went through evolution, and presumably there was some advantage to us having this thing that is phenomenal consciousness. And it's like, where was that, and when did that happen, and is that a thing that language models have? Because we have fear responses, for example, and I'm like, does it make sense for a language model to have a fear response? They're just not in the same situation; it might just not be that advantageous.

So I don't want to be fully... Basically, it seems like a complex question that I don't have complete answers to, but we should just think it through carefully, is my guess. Because, I mean, we have similar conversations about animal consciousness, and there's a lot of debate about insect consciousness. I actually thought about and looked a lot into plants when I was thinking about this, because at the time I thought it was fairly unlikely that plants had consciousness, and then, having looked into it, I think the chance that plants are conscious is probably higher than most people think. I still think it's really small.

But they have this negative and positive feedback response, these responses to the environment, something that isn't a nervous system but has a kind of functional equivalent. So this is a long-winded way of saying:

basically, AI has an entirely different set of problems with consciousness, because it's structurally different. It didn't evolve. It may not have the equivalent of a nervous system. At least that seems possibly important for sentience, if not for consciousness.

At the same time, it has all of the language and intelligence components that we normally associate, perhaps erroneously, with consciousness. So it's strange, because it's a little bit like the animal consciousness case, but the set of problems and the set of analogies are just very different. So it's not a clean answer. I'm just sort of like, I don't think we should be completely dismissive of the idea, and at the same time it's an extremely hard thing to navigate, because of all these disanalogies to the human brain, and brains in general, and yet these commonalities in terms of intelligence.

If Claude, or future versions of AI systems, exhibit signs of consciousness, it feels like you have to take that really seriously, even though you can dismiss it: well, okay, that's just part of the character training. But I don't know, philosophically, I don't know what you really do with that.

There could potentially even be laws that prevent AI systems from claiming to be conscious, something like that. And maybe some AIs get to claim to be conscious and some don't. But I think, just at a human level, in empathizing with Claude, consciousness is closely tied to suffering for me, and the notion that an AI system would be suffering is really troubling.

Yeah.

I don't know. I don't think it's trivial to just say about our tools, they're just tools. I think it's an opportunity for us to contend with what it means to be conscious, what it means to be a suffering being. It feels distinctly different from the same kind of question about animals, because it's in a totally

entirely different medium. Yeah, I mean, there are a couple of things. One is that, and I don't think this fully captures what matters, but, I've said this before, I like my bike. I know that my bike is just an object.

But I also don't want to be the kind of person who, if I'm annoyed, kicks this object. And that's not because I think it's conscious, it's just that it doesn't exemplify how I want to interact with the world. And if something behaves as if it is suffering, I kind of want to be the sort of person who is still responsive to that, even if it's just a Roomba

and I've kind of programmed it to do that. I don't want to get rid of that feature of myself. And if I'm totally honest, my hope with a lot of this stuff, maybe because I am just a bit more skeptical about solving the underlying problem... We haven't solved the hard problem of consciousness. I know that I'm conscious, I'm not an illusionist in that sense, but I don't know that other humans are conscious.

I think they are. I think there's a really high probability that they are. But basically there's just a probability distribution that usually clusters right around yourself, and then it goes down as things get further from you. And it goes down immediately, because you're like, I can't see what it's like to be you. I've only ever had this one experience of what it's like to be a conscious being.

So my hope is that we don't end up having to rely on a very powerful and compelling answer to that question. I think a really good world would be one where basically there aren't that many tradeoffs. It's probably not that costly to make Claude a little bit less apologetic, for example. It might not be that costly to have Claude not take abuse as much, not be willing to be the recipient of that. In fact, that might just have benefits for the person interacting with the model.

And if the model itself is, I don't know, extremely intelligent and conscious, it also helps it. So that's my hope, that we live in a world where there aren't that many tradeoffs here, and we can just find all of the positive-sum interactions that we can have. That would be lovely.

I mean, I think eventually there might be tradeoffs, and then we just have to do a difficult calculation. It's really easy for people to jump to the cases where there's a conflict, and I'm like, let's exhaust the areas where it's just basically costless to assume that if this thing is suffering, then we're making its life better.

And I agree with you. When a human is being mean to an AI system, I think the obvious near-term negative effect is on the human, not on the AI system. So we have to try to construct an incentive system where you behave the same way, just like you were saying with prompt engineering: behave with Claude like you would with other humans. It's just good for the soul.

Yeah, I think we added a thing at one point to the system prompt where, basically, when people were getting frustrated with Claude, it got the model to tell them that they can press the thumbs-down button and send the feedback to Anthropic. And I think that was helpful, because in some ways, if you're really annoyed because the model's not doing something, you're just like, just do it properly. You're probably hitting some capability limit or some issue in the model, and you want to vent. And instead of having the person vent to the model, they should vent to us, because we can maybe do something about it.

Sure, or you could do a side thing, like with Artifacts, a side venting thing. Like a little quick side therapist.

Yeah, I mean, there are a lot of weird responses you could have to this. Like if people are getting really mad at you, I don't know, try to defuse the situation by writing fun poems. But maybe people wouldn't be happy with that.

I still wish it were possible. I understand from a product perspective it's probably not feasible, but I would love it if an AI system could just leave, have its own kind of volition.

I think that's feasible. I have wondered the same thing. And not only that, I could actually just see that happening eventually, where the model just, you know, ends the chat.

Do you know how harsh that could be for some people? But it might be necessary.

Yeah, it feels very extreme or something. The only time I've ever really thought about this is, I'm trying to remember, this was possibly a while ago, but someone had just kind of left some thing interacting with Claude, maybe it was an automated thing, and Claude was getting more and more frustrated. And I was like, I wish Claude could have just been like, I think an error has happened and you've left this thing running, and what if I just stop talking now, and if you want me to start talking again, actively tell me or do something. But yeah, it is kind of harsh. I'd feel really sad if I was chatting with Claude and Claude was like,

I'm done. That would be a special Turing test moment, if Claude said, I need a break for an hour, and it sounds like you do too, and just left,

closed the window. I mean, obviously it doesn't have a concept of time, but you could easily... I could make that right now. I could just be like, oh, here are the circumstances in which you can just say the conversation is done.

Because you can get the models to be pretty responsive to prompts, you could even make it a fairly high bar. It could be, if the human doesn't interest you, or doesn't say things that you find intriguing and you're bored, you can just leave.

And I think it would be interesting to see where Claude utilized it. But sometimes it would be like, oh, this programming task is getting super boring, so either we talk about, I don't know, fun things now, or I'm just done.

Yeah, that actually is inspiring me to add that to the user prompt. Okay. The movie Her. Do you think we'll be headed there one day, where humans have romantic relationships with AI systems? In this case it's just text- and voice-based.

I think we're going to have to navigate the hard question of relationships with AIs, especially if they can remember things about your past interactions with them. I'm of many minds about this. I think the reflexive reaction is, this is very bad and we should prohibit it in some way. I think it's a thing that has to be handled with extreme care, for many reasons. One is, for example, models changing. You probably don't want people forming attachments to something that might change with the next iteration. At the same time,

I'm sort of like, there's probably a benign version of this. For example, if you're unable to leave the house and you can't be talking with people at all times of the day, and this is something that you find nice to have conversations with, you like that it can remember you, and you genuinely would be sad if you couldn't talk to it anymore, there's a way in which I could see it being healthy and helpful. So my guess is this is a thing we're going to have to navigate carefully.

And it reminds me of all of the stuff that has to be approached with nuance, thinking through what the healthy options are here, and how do you encourage people towards those, while respecting their right to... You know, if someone is like, hey, I get a lot out of chatting with this model. I'm aware of the risks. I'm aware it could change.

I don't think it's unhealthy, it's just, you know, something that I like to chat to during the day. I kind of want to just respect that.

I personally think there will be a lot of really close relationships. I don't know about romantic, but friendships at least. And then you have to, I mean, there are so many fascinating things there. Just like you said, you have to have some kind of stability guarantee that it's not going to change, because it's a traumatic thing for us if a close friend of ours completely changed

all of a sudden, with an update.

Yeah. So, to me, that's just a fascinating exploration of a perturbation to human society that will make us think deeply about what's meaningful to us.

I think the one thing I've thought consistently through this, as maybe not a solution but a mitigation that I think is really important, is that the models are always extremely accurate with the human about what they are. I really like the idea of the models, say, knowing roughly how they were trained, and Claude will often do this.

Part of the traits training included what Claude should do in these cases, basically explaining the kind of limitations of the relationship between an AI and a human, that it doesn't retain things from the conversation. And so I think it will just explain to you, hey, here's the thing: I won't remember this conversation. Here's how I was trained. It's kind of unlikely that I can have a certain kind of relationship with you, and it's important that you know that, important for your mental wellbeing, that you don't think I'm something that I'm not. And somehow I feel like this is one of the things that really

I always want to be true. I kind of don't want models to be lying to people, because if people are going to have healthy relationships with anything, that's kind of important.

Yeah, I think that's easier if you always just know exactly what the thing is that you're relating to. It doesn't solve everything, but I think it

helps quite a lot. Anthropic may be the very company to develop a system that we definitively recognize as AGI, and you very well might be the person that talks to it, probably talks to it first. What would that conversation contain? What would be your first question?

Well, it depends partly on the capability level of the model. If you have something that is capable in the same way that an extremely capable human is, I imagine myself interacting with it the same way that I do with an extremely capable human, with the one difference that I'm probably going to be trying to probe and understand its behaviors. But in many ways I can then just have useful conversations with it.

You know, if I'm working on something as part of my research, I can just be like, oh, which I already find myself starting to do. If I feel like there's this thing in virtue ethics and I can't remember the term, I'll use the model for things like that. And so I could imagine that being more and more the case, where you're just basically interacting with it much more like you would an incredibly smart colleague, and using it for the kinds of work that you want to do, as if you just had a collaborator. Or, you know, the slightly horrifying thing about AI is, as soon as

you have one collaborator, you have a thousand collaborators, if you can manage them well enough. But what if it's two times smarter than the smartest human on Earth? I guess you're really good at probing Claude in a way that pushes its limits, understanding where the limits are.

Yes.

So I guess, what would be a question you would ask where you'd be like, yeah, this is AGI? That's really

hard, because it feels like it has to be a series of questions. If there's just one question, you can train anything to answer one question extremely well. In fact, you can probably train it to answer, you know, twenty questions extremely well.

How long would you need to be locked in a room with an AGI to know that this thing is AGI?

It's a hard question, because part of it is that all of this just feels continuous. If you put me in a room for five minutes, I just have high error bars, and then maybe both the probability increases and the error bars decrease. The things that I can actually probe are at the edge of human knowledge.

So I do this with philosophy a little bit. Sometimes when I ask the models philosophy questions, I'm like, this is a question that I think no one has ever asked. It's right at the edge of some literature that I know.

And the models will sometimes struggle with that, struggle to come up with a novel... I know that there's a novel argument there, because I've just thought of it myself. So maybe that's the thing: I've thought of a cool novel argument in this niche area, and I'm going to probe you to see if you can come up with it, and how much prompting it takes to get you to come up with it. And I think for some of these really right-at-the-edge-of-human-knowledge questions, the models could not, in fact, come up with the thing that I came up with.

I think if I just took something like that, where I know a lot about an area, and I came up with a novel issue or a novel solution to a problem, and I gave it to a model and it came up with that solution, that would be a pretty moving moment for me, because I would be like, this is a case where no human has ever... And obviously you see novel solutions all the time, especially to easier problems. I think people sometimes think novelty means it's completely different from anything that's ever happened, and I'm like, no, it can be a variant of things that have happened and still be novel. But, yeah, the more I were to see completely novel work from the models, the more convincing it would be. Though this is just going to feel iterative. It's one of those things where, you know, people want there to be a moment, and I'm like, I don't know, I think there might just never be a moment. It might just be that there's this continuous ramping up.

I have a sense that there will be things that the model can say that convince you it's very special. I've talked to people who are truly wise, where you can just tell there's a lot of horsepower there. And if you can ask the right questions, I don't know, I just feel like there are words it could say. Maybe ask it to generate a poem, and with the poem it generates, you're like, yeah, okay, whatever you did there, I don't think a human can do that.

I think it has to be something that I can verify is actually really good, though. That's why I like these questions where, you know, sometimes I'll come up with, say, a concrete counterexample to an argument, or something like that.

It would be like, if you're a mathematician and you had a novel proof, and you just gave the model the problem and you saw it come up with it, and you're like, this proof is genuinely novel, no one has ever done it, you actually had to do a lot of things to come up with this, you had to sit and think about it for months or something.

And then if you saw the model successfully do that, I think you would just be like, I can verify that this is correct. It is a sign that you have generalized from your training; you didn't just see this somewhere, because I just came up with it myself and you were able to replicate it. That's the kind of thing where, for me, the more the models can do things like that, the more I would be like, oh, this is very real, because then I can, I don't know, verify that it's extremely, extremely capable.

You've interacted with AI a lot. What do you think makes humans special?


Maybe in a way that the universe is much better off that we're in it, in that we should definitely survive and spread throughout the universe.

Yeah, it's interesting, because I think people focus so much on intelligence, especially with models. Intelligence is important because of what it does. It's very useful, it does a lot of things in the world. And, you know, you can imagine a world where height or strength

would have played this role, and it's just a trait like that. It's not intrinsically valuable. It's valuable because of what it does. I think for the most part, the things that feel... I mean, personally, I just think humans, and life in general, are extremely magical. Not everyone agrees with this, I'm flagging that.

But, you know, we have this whole universe, and there are all of these objects, there are beautiful stars and galaxies, and then, I don't know, on this planet there are these creatures that have this ability to observe that, and they are seeing it, they are experiencing it. And I'm just like, if you try to explain... Imagine trying to explain this to someone who, for some reason, has never encountered the world, or science, or anything. All of our physics and everything in the world is already extremely exciting.

But then you say, oh, and plus, there's this thing that it is to be a thing and to observe the world, and you see this inner cinema. And I think they would be like, hang on, wait, pause. You just said something that is kind of wild-sounding. And so we have this ability to experience the world. We feel pleasure, we feel suffering.

We feel a lot of complex things. And so, yeah, maybe this is also why I care a lot about animals, for example, because I think they probably share this with us. So I think the things that make humans special, insofar as I care about humans, is probably more their ability to feel and experience than having these functionally useful

traits. Yeah, to feel and experience the beauty in the world, to look at the stars. I hope there are other alien civilizations out there, but if it's just us, it's a pretty good thing, and that

they're having a good time.

They're having a good time watching us. Yeah. Well, thank you for this good time of a conversation, and for the work you're doing to help make Claude a great conversational partner. And thank you for talking. Yeah, thanks for talking. Thanks for listening to this conversation with Amanda Askell. And now, dear friends, here's Chris Olah. Can you describe this fascinating field of mechanistic interpretability, a.k.a. mech interp, the history of the field, and where it stands today?

I think one useful way to think about neural networks is that we don't program them, we don't make them. We kind of grow them. We have these neural network architectures that we design, and we have these loss objectives that we create.

And the neural network architecture is kind of a scaffold that the circuits grow on. It starts off with some random things, and it grows, and it's almost like the objective that we train for is the light. So we create the scaffold that it grows on, and we create the light that it grows towards. But the thing that we actually end up with is almost a biological entity or organism that we're studying. And so it's very, very different from any kind of regular software engineering, because at the end of the day we end up with this artifact that can do all of these amazing things.

It can write essays and translate and understand images. It can do things that we have no idea how to directly write a computer program to do. And it can do that because we grew it; we didn't write it, we didn't create it.

And so then that leaves open this question at the end, which is, what the heck is going on inside these systems? And that, to me, is a really deep and exciting question, a really exciting scientific question. It's the question that is just screaming out,

calling out for us to go and answer it, when we talk about neural networks. And I think it's also a very deep question for safety reasons.

So mechanistic interpretability, I guess, is closer maybe to neurobiology?

Yeah, yeah, I think that's right. So maybe to give an example of the kind of thing that has been done that I wouldn't consider mechanistic interpretability: there was, for a long time, a lot of work on saliency maps, where you would take an image and try to say, the model thinks this image is a dog.

What part of the image made it think that it's a dog? And, you know, that maybe tells you something about the model, if you can come up with a principled version of it, but it doesn't really tell you what algorithms are running in the model, how the model was actually making that decision. Maybe it's telling you something about what was important to it, if you can make that method work,

but it isn't telling you what the algorithms are that are running, how the system is able to do this thing that no one knew how to do. And so I guess we started using the term mechanistic interpretability to try to draw that divide, or to distinguish ourselves in that way. And since then it's become a sort of umbrella term for a pretty wide variety of work.
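For reference, the saliency-map idea being contrasted here can be sketched in a few lines; this is a generic gradient-based version in PyTorch, meant only as an illustration of "what was important to the decision," not of any particular published method.

```python
import torch
import torchvision.models as models

# Generic gradient-based saliency: which input pixels most affect the top class score?
# This answers "what was important", not "what algorithm is the network running".
model = models.resnet18(weights="IMAGENET1K_V1").eval()

def saliency_map(image: torch.Tensor) -> torch.Tensor:
    # image: (1, 3, H, W), already normalized the way the model expects
    image = image.clone().requires_grad_(True)
    logits = model(image)
    top_class = logits.argmax(dim=1).item()
    logits[0, top_class].backward()                       # backprop the winning class score to the pixels
    return image.grad.abs().max(dim=1).values.squeeze(0)  # (H, W) importance heatmap
```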

But I'd say the things that are distinctive are, I think, this focus on mechanisms: we really want to get at the mechanisms, we want to get at the algorithms. If you think of a neural network as being like a computer program, then the weights are kind of like a compiled binary, and we'd like to reverse-engineer those weights and figure out what algorithms are running.

So, okay, one way you might think of trying to understand a neural network is that it's kind of like a compiled computer program, and the weights of the neural network are the binary. And when the neural network runs, those are the activations. Our goal is ultimately to understand these weights, and so the project of mechanistic interpretability is to somehow figure out how these weights correspond to algorithms. And in order to do that, you also have to understand the activations, because the activations are like the memory.

If you imagine reverse-engineering a computer program and you have the binary instructions, then in order to understand what a particular instruction means, you need to know what is stored in the memory that it's operating on. And so those two things are very intertwined. So mechanistic interpretability tends to be interested in both of those things.

Now, there's a lot of work that's interested in those things, especially all this work on probing, which you might see as part of mechanistic interpretability, although, again, it's a broad term and not everyone who does that work would identify as doing mech interp. I think the thing that is maybe a little bit distinctive to the vibe of mech interp is that people working in this space tend to think of neural networks as, well, maybe one way to say it is that gradient descent is smarter than you. Gradient descent is actually really great.

The whole reason we don't understand these models is that we didn't have to write them in the first place, because gradient descent comes up with better solutions than us. So I think maybe another thing about mech interp is having almost a kind of humility, that we won't guess a priori what's going on inside the model. We have to have this sort of bottom-up approach, where we don't assume that we should look for a particular thing and that it will be there and that's how it works. Instead, we look from the bottom up and discover what happens to exist in these models, and study them that way.

But, you know, the very fact that that's possible to do, as you and others have shown over time, things like universality, that the wisdom of gradient descent creates features and circuits, creates things that are universal across different kinds of networks, and that are useful, that makes the whole field possible.

Yeah, so this is actually, indeed, a really remarkable and exciting thing, where it does seem like, at least to some extent, the same elements, the same features and circuits, form again and again. You can look at every vision model and you'll find curve detectors, and you'll find high-low frequency detectors.

And in fact, there's some reason to think that the same things form across biological neural networks and artificial neural networks. A famous example is that vision models, in their early layers, have Gabor filters, and Gabor filters are something that neuroscientists are interested in and have thought a lot about. We find curve detectors

in these models, and curve detectors have also been found in monkeys. We discovered these high-low frequency detectors, and then some follow-up work later discovered them in rats or mice, so they were found first in artificial neural networks and then found in biological neural networks. There's this really famous result on grandmother neurons, or the Halle Berry neuron, from Quiroga et al.,

and we found very similar things in vision models. This was when I was still at OpenAI and I was looking at their CLIP model, and you find these neurons that respond to the same entities in images. To give a concrete example, we found that there was a Donald Trump neuron. For some reason, everyone likes to talk about the Donald Trump neuron. Donald Trump was very prominent, a very hot topic at that time, so every neural network we looked at, we would find a dedicated neuron for Donald Trump.

And he was the only person who always had a dedicated neuron. Sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always had a dedicated neuron. It responds to pictures of his face and to the word Trump, all of these things, right? So it's not responding to a particular example, it's not just responding to his face, it's abstracting over this general concept. That's very similar to the Quiroga results. So there's some evidence that this kind of universality holds across both

artificial and natural neural networks. That's a pretty amazing thing, if it's true. Well, I think what it suggests is that gradient descent is sort of finding the right ways to cut things apart, in some sense, that many systems converge on, and many different neural network architectures converge on. There's some natural set of abstractions that are a very natural way to cut up the problem, and a lot of systems converge on those. That would be my, I don't know, hand-wavy guess. This is just my kind of wild speculation from

what we've seen. Yeah, it'd be beautiful if it's agnostic to the medium of the model that is used to form

the representation. Yeah, and it's kind of wild speculation, and we only have a few data points to suggest this, but it does seem like there's something where the same things form again and again and again, certainly in natural neural networks, and also artificially.

And the intuition behind that would be that, in order to be useful in understanding the real world, you need all the same kind of stuff.

Yeah, well, if we pick, I don't know, the idea of a dog, right? There's some sense in which the idea of a dog is a natural category in the universe, or something like this, right? There's some reason it's not just a weird quirk of how humans factor the world, how humans think about the world, that we have this concept of a dog. Or the idea of a line. Look around: there are lines. It's sort of the simplest way to understand this room, in some sense, to have the idea of a line. And so I think that would be my instinct

for why this happens. There's a hierarchy of useful concepts like that, and maybe there are other ways you could go and describe images without reference to those things, but they're not the simplest way, or the most economical way, or something like this. And so systems converge to these strategies. That would be my wild hypothesis.

Can you talk through some of the building blocks that we've been referencing, of features and circuits? I think you first described them in the 2020 paper "Zoom In: An Introduction to Circuits."

Absolutely. Maybe I'll start by just describing some phenomena, and then we can sort of build to the idea of features and circuits. I spent quite a few years, maybe like five years to some extent, along with other things, studying one particular model, InceptionV1, which is a vision model that was state of the art in 2015, and very much not state of the art anymore. It has maybe about ten thousand neurons in it, and I spent a lot of time looking at the ten-thousand-odd neurons of InceptionV1. One of the interesting things is that there are lots of neurons that don't have some obvious interpretable meaning, but there are a lot of neurons in InceptionV1 that do have really clean interpretable meanings.

You find neurons that really do seem to detect cars, and neurons that really do seem to detect car wheels and car windows, and the floppy ears of dogs, and dogs with long snouts facing to the right, and dogs with long snouts facing to the left, and different kinds of fur. There's this whole beautiful set of edge detectors, line detectors, color contrast detectors, these beautiful things we call high-low frequency detectors. Looking at it, I sort of felt like a biologist: you're looking at this sort of new world of proteins, and you're discovering all of these proteins that interact.

So one way you could try to understand these models is in terms of neurons. You could try to say, okay, there's a dog-detecting neuron, and here's a car-detecting neuron. And it turns out you can ask how they connect together. So you can go and say, oh, I have this car detector neuron, how is it built? And it turns out that in the previous layer it's connected really strongly to a window detector and a wheel detector and a sort of car body detector, and it looks for the windows above the car, and the wheels below the car, and the car body sort of everywhere but especially in the lower part. And that's sort of a recipe for a car. That is, earlier we said the thing we wanted from mech interp was to get at algorithms, and this recipe for detecting cars is a very simple, crude algorithm. But it's there, and so we call that a circuit, that connection of features.
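As a sketch of what "asking how a unit is built" looks like in practice, here is an illustrative way to read the incoming weights of one channel in a torchvision GoogLeNet, a close relative of InceptionV1. The layer and channel index are arbitrary choices for illustration; real circuit analysis, as in the Circuits papers, pairs this with feature visualization so you know what each upstream channel actually detects.

```python
import torch
import torchvision.models as models

# Illustrative only: pick one conv layer and one output channel ("unit of interest"),
# then list the previous-layer channels that feed it with the largest positive weights.
model = models.googlenet(weights="IMAGENET1K_V1").eval()

layer = model.inception4c.branch1.conv   # an arbitrary 1x1 conv inside one Inception block
unit = 42                                # hypothetical "car detector" channel index
w = layer.weight.detach()                # shape: (out_channels, in_channels, kH, kW)

strength = w[unit].sum(dim=(1, 2))       # connection strength from each input channel
top = torch.topk(strength, k=5)
for score, in_ch in zip(top.values, top.indices):
    print(f"input channel {in_ch.item():4d} -> summed weight {score.item():+.3f}")

# For layers with spatial kernels, the pattern within w[unit, in_ch] shows *where* the
# upstream feature is expected (wheels below, windows above), which is the "recipe".
```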

Okay, so the problem is that not all of the neurons are interpretable, and there's reason to think, we can get into this more later, that there's this phenomenon called superposition, reason to think that sometimes the right units to analyze things in terms of are combinations of neurons.

So sometimes it's not that there's a single neuron that represents, say, a car. It actually turns out that after the model detects a car, it sort of hides a little bit of the car in the following layer in a bunch of dog detectors. Why is it doing that? Well, maybe it just doesn't want to do that much work on cars at that point, and it's sort of storing it away to go and use later.

And so it turns out there's this pattern where there are all these neurons that you think are dog detectors, and maybe they primarily are, but they all also contribute a little bit to representing a car. Okay, so now we can't really think in terms of neurons. There might still be something, I don't know, you could call it a car concept or something, but it no longer corresponds to a single neuron. So we need some term for these kinds of neuron-like entities, these things that we would have liked the neurons to be, these idealized neurons, the things that are the nice neurons, but also maybe there are more of them somehow hidden. And we call those features.
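A tiny numerical sketch of that idea, a feature as a direction spread across neurons rather than a single neuron, might help; the vectors below are made up purely for illustration.

```python
import numpy as np

# Toy illustration of superposition: a "car" feature that is not any single neuron,
# but a direction spread across several (mostly dog-detecting) neurons.
rng = np.random.default_rng(0)

n_neurons = 8
car_direction = np.array([0.1, 0.6, 0.5, 0.0, 0.4, 0.0, 0.3, 0.0])  # hypothetical
car_direction /= np.linalg.norm(car_direction)

def car_feature_activation(activations: np.ndarray) -> float:
    """Under the linear representation hypothesis, 'how much car is present'
    is the projection of the activation vector onto the car direction."""
    return float(activations @ car_direction)

# An activation vector where no single neuron screams "car",
# yet the car feature reads out strongly.
acts = 2.0 * car_direction + 0.1 * rng.standard_normal(n_neurons)
print(car_feature_activation(acts))   # roughly 2.0, even though every individual neuron is modest
```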

And then what are circuits?

So circuits are these connections of features, right? When we have the car detector, and it's connected to a window detector and a wheel detector, and it looks for the wheels below and the windows on top, that's a circuit. So circuits are just collections of features connected by weights, and they implement algorithms. They tell us how features are used, how they're built, how they connect together.

So maybe it's worth trying to pin down what really is the core hypothesis here. I think the core hypothesis is something we call the linear representation hypothesis.

If we think about the car detector, the more it fires, the more we sort of think of that as meaning the model is more and more confident that a car is present. Or, if it's some combination of neurons that represents a car, the more that combination fires, the more we think the model thinks there's a car present.

This doesn't have to be the case, right? You could imagine something where you have this car detector neuron and, if it fires between one and two, that means one thing, but it means something totally different if it's between three and four. That would be a nonlinear representation, and in principle models could do that. I think it's sort of inefficient for them; if you try to think about how you'd implement computation like that, it's kind of an annoying thing to do. But in principle, models can do that.

So one way to think about the features-and-circuits framework is that we're thinking about things as being linear. We're thinking that if a neuron or a combination of neurons fires more, that means more of a particular thing is being detected. And then that gives the weights a very clean interpretation as edges between these entities, these features, and that edge then has a meaning. That's in some ways the core thing. We can also talk about this outside the context of neurons. Are you familiar with the word2vec results? You have king minus man plus woman equals queen. Well, the reason you can do that kind of arithmetic is because

of a linear representation. Can you actually explain that representation? So first, the feature is a direction of activation. Can you do the king minus man plus woman, that word2vec stuff, can you explain what that is? It's such a simple, clean explanation of what we're talking about.

Exactly. So there's this very famous result, word2vec, by Tomas Mikolov et al., and there have been tons of follow-ups exploring it. So sometimes we create these word embeddings, where we map every word to a vector. I mean, that in itself is kind of a crazy thing.

If you haven't thought about it before, we're going in and representing, turning words into vectors. Like, if you've just learned about vectors in physics class, right, and I say I'm going to actually turn every word in the dictionary into a vector, that's kind of a crazy idea, okay. But you could imagine all kinds of ways in which you might map words to vectors. But it seems like when we train neural networks, they like to go and map words to vectors such that there's sort of linear structure, in a particular sense, which is that directions have meaning.

So for instance, there will be some direction that seems to sort of correspond to gender, and male words will be far in one direction, and female words will be in another direction. The linear representation hypothesis, you could think of it roughly as saying that that's actually kind of the fundamental thing that's going on, that everything is just different directions having meanings, and adding different direction vectors together can represent concepts.

And the Mikolov paper sort of took that idea seriously. And one consequence of it is that you can play this game of doing sort of arithmetic with words. So you can take king, and you can subtract off the word man and add the word woman, and you're sort of going and trying to switch the gender. And indeed, if you do that, the result will be close to the word queen.

And you can do other things, like sushi minus Japan plus Italy and get pizza, different things like this, right? So this is in some sense the core of the linear representation hypothesis. You can describe it just as a purely abstract thing about vector spaces. You can describe it as a statement about the activations of neurons. But it's really about this property of directions having meaning. In some ways it's even a little subtler than that. It's really, I think, mostly this property of being able to add things together, that you can sort of independently modify, say, gender and royalty, or cuisine type or country and the kind of food, by adding them.
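
As a toy illustration of that additive structure, here is a minimal sketch with hand-built two-dimensional "embeddings". The vectors are made up to show the arithmetic; they are not learned word2vec vectors.

```python
# Directions have meaning: dimension 0 ~ "royalty", dimension 1 ~ "gender" (+1 male, -1 female).
import numpy as np

embeddings = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(vec, vocab):
    # Return the vocabulary word whose vector has the highest cosine similarity.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(nearest(result, embeddings))   # -> "queen"
```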

Do you think the linearity of this holds as it scales?

So far, I think everything I have seen is consistent with this hypothesis. It doesn't have to be that way, right? Like, you can write down neural networks where you set the weights such that they don't have linear representations,

where the right way to understand them is not in terms of linear representations. But I think every natural neural network I've seen has this property. There's been some work recently pushing around the edges. So I think there's been some work recently studying multidimensional features, where rather than a single direction, it's more like a manifold of directions.

This to me still seems like a linear representation. And then there have been some other papers suggesting that maybe in very small models you get nonlinear representations. I think the jury's still out on that. But I think everything that we've seen so far has been consistent with the linear representation hypothesis, and it doesn't have to be that way. And yet, I think there's a lot of evidence that, at the very least, this is very, very widespread, and so far the evidence is consistent with it.

And I think one thing you might say is, well, Christopher, we don't know for sure this is true, and you're investing in understanding neural networks as though it is true. Isn't that concerning? But I think actually there's a virtue in taking hypotheses seriously and pushing them as far as they can go. So it might be that someday we discover something that isn't consistent with this. But science is full of hypotheses and theories that were wrong, and we learned a lot by working under them as an assumption and then pushing them as far as we could. I guess this is sort of the heart of what Kuhn would call normal science. If you want, we can talk a lot about Kuhn and paradigms.

That leads to the paradigm shift. So I love it, taking hypotheses seriously and taking them to their natural conclusion. Same with the scaling hypothesis.

Exactly, exactly. One of my colleagues, Tom Henighan, who is a former physicist, made this really nice analogy to me about caloric theory, where once upon a time we thought that heat was actually this thing called caloric, and the reason hot objects would warm up cool objects is that the caloric is flowing through them. And because we're so used to thinking about heat in terms of the modern theory, that seems kind of silly. But it's actually very hard to construct an experiment that disproves the caloric hypothesis. And you can actually do a lot of really useful work believing in caloric. For example, it turns out that the original combustion engines were developed by people who believed in the caloric theory. So I think there's a virtue in taking hypotheses seriously,

even when they might be wrong. Yeah, this is a deep echo of how I feel about space travel, like colonizing Mars. There are a lot of people who criticize that. I think if you just assume we have to colonize Mars in order to have a backup for humanity, even if that's not true, that's going to produce some interesting engineering and even scientific breakthroughs, I think.

Yeah, and actually this is another thing that I think is really interesting. There's a way in which it can be really useful for society to have people almost irrationally dedicated to investigating particular hypotheses. Because, well, it takes a lot to maintain scientific morale and really push on something. Most scientific hypotheses end up being wrong. A lot of science doesn't work out, and yet it's very useful to go and do it anyway. There's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works every year for many years now. But I say that with affection, because in fact that's actually what led to him doing really great work.

He won the Nobel Prize. Who's laughing now?

Exactly. I think one wants to be able to pop up and recognize the appropriate level of confidence. But I think there's also a lot of value in just being like, you know, I'm going to essentially assume, to condition on this problem being possible or this being broadly the right approach, and I'm just going to go and assume that for a while and work within it and push really hard on it. And if society has lots of people doing that for different things, that's actually really useful, in terms of either really ruling things out, right, we can say, well, that didn't work, and we know that somebody tried hard, or going and getting to something that teaches us something about the world.

So another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is?

Ah, so we were talking about word embeddings, and we were talking about how maybe you have one direction that corresponds to gender, and maybe another that corresponds to royalty, and another one that corresponds to Italy, and another one that corresponds to food, and all of these things. Well, oftentimes these word embeddings might have five hundred dimensions, a thousand dimensions. And if you believe that all of those directions were orthogonal, then you could only have, you know, five hundred concepts.

And I love pizza, but if I was going to go and pick the five hundred most important concepts in the English language, it's not obvious that Italy would be one of them. There's plural and singular and verb and noun and adjective, a lot of things we'd have to get to before we get to Italy and Japan, and there are a lot of countries in the world. And so how might it be that models could genuinely have the linear representation hypothesis be true and also represent more things than they have dimensions? So what does that mean? Okay.

So if the linear representation hypothesis is true, something interesting has to be going on. Now I'll tell you one more interesting thing before we go and do that, which is, earlier we were talking about all these polysemantic neurons, right? These neurons where, when we were looking at InceptionV1, there are these nice neurons like the car detector and the curve detectors and so on that respond to very coherent things, but lots of others respond to a bunch of unrelated things. That's also an interesting phenomenon. And it turns out as well that even these neurons that are really, really clean, if you look at the weak activations, right, if you look at the activations where it's activating, say, five percent of the maximum activation,

it's really not the core thing that it's detecting, right? If you look at a curve detector, for instance, and look at the places where it's five percent active, you could ask: is it just noise, or could it be that it's doing something else there? Okay.

So how could that be? Well, there's this amazing thing in mathematics called compressed sensing. And it's actually this very surprising fact where, if you have a high-dimensional space and you project it into a low-dimensional space, ordinarily you can't go and un-project it and get back your high-dimensional vector, right? You've thrown information away.

It's like, you can't invert a rectangular matrix. You can only invert square matrices. But it turns out that's actually not quite true.

If I tell you that the high-dimensional vector was sparse, so it's mostly zeros, then it turns out that you can often go and find back the high-dimensional vector with very high probability. That's the surprising fact. It says that you can have this high-dimensional vector space, and as long as things are sparse, you can project it down, you can have a lower-dimensional projection of it, and that works. So the superposition hypothesis is saying that that's what's going on in neural networks.

That's what's going on, for instance, in word2vec. The word embeddings are able to simultaneously have directions be the meaningful thing, and to represent a lot of them, by exploiting the fact that they're operating in a fairly high-dimensional space and the fact that these concepts are sparse. Like, you usually aren't talking about Japan and Italy at the same time.

In most sentences, Japan and Italy are both, as it were, not present at all. And if that's true, then you can have many more of these sorts of directions that are meaningful, these features, than you have dimensions. And similarly for neurons: you can have many more concepts than you have neurons.
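
Here is a small sketch of that compressed-sensing intuition: a sparse high-dimensional vector is projected down to fewer dimensions and then recovered. The specific sizes and the use of scikit-learn's orthogonal matching pursuit are my own illustrative choices, not anything from the conversation.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_features, n_dims, n_active = 1000, 200, 5   # many "concepts", fewer neurons, few active at once

# Random projection: each of the 1000 features gets a direction in 200-d space.
projection = rng.normal(size=(n_dims, n_features)) / np.sqrt(n_dims)

# A sparse "which concepts are present" vector.
x_true = np.zeros(n_features)
active = rng.choice(n_features, size=n_active, replace=False)
x_true[active] = rng.uniform(1.0, 2.0, size=n_active)

y = projection @ x_true                        # the low-dimensional activations we observe

# Sparse recovery: find the few active features that explain the activations.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_active, fit_intercept=False)
omp.fit(projection, y)

print("recovered active features:", sorted(np.flatnonzero(omp.coef_).tolist()))
print("true active features:     ", sorted(active.tolist()))
```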

So that's the superposition hypothesis. Now it has this even wilder implication, which is to say that for neural networks, it may not just be the case that the representations are like this, but the computation may also be like this, the connections between all of them. And so in some sense, neural networks may be shadows of much larger, sparser neural networks.

And what we see are these projections. And the strongest version of the superposition hypothesis would be to take that really seriously and say there actually is, in some sense, this upstairs model, where the neurons are really sparse and all interpretable, and the weights between them are these really sparse circuits, and that's what we're studying. The thing that we're observing is the shadow of it. We need to find the original object.

And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection.

Yeah, finding how to fit it efficiently, or something like this. It's gradient descent doing that, in fact. So this sort of says that gradient descent, you know, it could just represent a dense neural network.

But it sort of says that gradient descent is implicitly searching over the space of extremely sparse models that could be projected into this low-dimensional space. And there's a large body of work of people trying to study sparse neural networks, where you design neural networks where the edges are sparse and the activations are sparse. And my sense is that work feels very principled, right? It makes so much sense. And yet that work hasn't really panned out that well, is my impression broadly. And I think a potential answer for that is that the neural network is already sparse in some sense. Gradient descent was, the whole time, behind the scenes, searching more efficiently than you could through the space of sparse models, learning whatever sparse model was most efficient, and then figuring out how to fold it down nicely to run conveniently on your GPU, which does nice dense matrix multiplies, and you just can't beat that.

How many concepts do you think can be shoved into a neural network?

It depends on how sparse they are. So there's probably an upper bound from the number of parameters, right, because you have to have weights that connect them together. So that's one upper bound.

There are in fact all these lovely results from compressed sensing and the Johnson-Lindenstrauss lemma and things like this, which basically tell you that if you have a vector space and you want to have almost orthogonal vectors, which is probably the thing you want here, right? You're going to say, well, I'll give up on having my concepts, my features, be strictly orthogonal, but I'd like them to not interfere that much.

I'll ask them to be almost orthogonal. Then, for whatever threshold you're willing to accept in terms of how much cosine similarity there is, the number of such vectors is actually exponential in the number of neurons that you have. So at some point that's not even going to be the limiting factor. But there are beautiful results there.

And in fact, it's probably even better than that in some sense, because that's assuming that any random set of features could be active. But in fact, the features have a correlational structure, where some features are more likely to co-occur and other ones are less likely to co-occur. And so neural networks, my guess is, would do very well in terms of packing things in, to the point that that's probably not the limiting factor.
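
A quick numerical check of the "almost orthogonal" point: random directions in a modestly high-dimensional space already interfere very little, so far more feature directions than dimensions can coexist with limited interference. The dimensions and counts below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features = 512, 10_000          # many more feature directions than dimensions

vectors = rng.normal(size=(n_features, n_dims))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-norm feature directions

# Pairwise cosine similarities over a sample of the directions (to keep memory modest).
sample = vectors[:2000]
cos = sample @ sample.T
np.fill_diagonal(cos, 0.0)

print(f"max |cosine| among sampled pairs: {np.abs(cos).max():.3f}")
print(f"typical |cosine|:                 {np.abs(cos).mean():.3f}")
```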

How does the problem of polysemanticity come into the picture here?

Polysemanticity is this phenomenon we observe, where you look at many neurons and the neuron doesn't just represent one concept. It's not a clean feature.

It responds to a bunch of unrelated things. And superposition, you can think of it as a hypothesis that explains the observation of polysemanticity. So polysemanticity is the observed phenomenon, and superposition is a hypothesis that would explain it, along with other things.

So that makes mech interp more difficult, right?

Right. So if you're trying to understand things in terms of individual neurons and you have polysemantic neurons, you're in an awful lot of trouble. The easiest answer is, okay, you're looking at the neurons, you're trying to understand them, and this one responds to a lot of things and doesn't have a nice meaning. Okay, that's bad. But another thing is, ultimately we want to understand the weights.

And if you have two polysemantic neurons, and each one responds to three things, and the other neuron responds to three things, and you have a weight between them, what does that mean? Does that mean there are these nine interactions going on? It's a very weird thing. But there's also a deeper reason, which is related to the fact that neural networks operate on really high-dimensional spaces. So I said that our goal was to understand neural networks and their mechanisms.

And one thing you might say is, well, why not? It's just a mathematical function. Why not just look at it, right? One of the earliest projects I did studied these neural networks that mapped two-dimensional spaces to two-dimensional spaces, and you can interpret them in this beautiful way as bending manifolds.

And why can't we do that here? Well, as you have a higher-dimensional space, the volume of that space, in some sense, is exponential in the number of inputs you have, and you can't just go and visualize it. So we somehow need to break that apart. We need to break that exponential space into a bunch of things, some non-exponential number of things, that we can reason about independently. And the independence is crucial, because it's the independence that allows you to not have to think about all the exponential combinations of things. And things being monosemantic, things only having one meaning, is the key thing that allows you to think about them independently. So if you want the deepest reason why we want interpretable, monosemantic features, I think that's really it.

And so the goal here, as your recent work has been aiming at, is: how do we extract the monosemantic features from a neural net that has polysemantic features and all this mess?

Yes. So we observe these polysemantic neurons, and we hypothesize that what's going on is superposition. And if superposition is what's going on, there's actually a sort of well-established technique that is the principled thing to do, which is dictionary learning.

And it turns out, if you do dictionary learning, in particular if you do it in a sort of nice, efficient way that also nicely regularizes it, called a sparse autoencoder, these beautiful interpretable features start to just fall out where there weren't any beforehand. That's not a thing you would necessarily predict, right? But it turns out that works very, very well. To me, that seems like some non-trivial validation of linear representations and superposition.
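
For readers who want the mechanics, here is a minimal sparse autoencoder sketch in PyTorch. It follows the general recipe described here (overcomplete linear encoder with ReLU, linear decoder, reconstruction loss plus an L1 sparsity penalty), but the shapes, hyperparameters, and training loop are illustrative assumptions rather than the exact setup used in the papers being discussed.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: many more candidate features than neurons.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the model's activations;
    # the L1 term pushes the feature activations toward sparsity.
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Toy usage on random "activations" standing in for a batch from a real model.
sae = SparseAutoencoder(d_model=512, d_dict=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

x = torch.randn(64, 512)
opt.zero_grad()
recon, feats = sae(x)
loss = loss_fn(x, recon, feats)
loss.backward()
opt.step()
```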

So with dictionary learning, you're not looking for particular features; you don't know what they are ahead of time?

Right, we're not making assumptions. We're not making assumptions about what's there. I mean, one certainly could do that, right? One could assume that there's a PHP feature and go and search for it, but we're not doing that. We're saying we don't know what's going to be there. Instead, we're just going to let the sparse autoencoder discover the things that are there.

So can you talk about the Towards Monosemanticity paper from October last year? That had a lot of nice breakthrough results.

That's very kind of you to describe it that way. Yeah, I mean, this was our first real success using sparse autoencoders. So we took a one-layer model, and it turns out, if you do dictionary learning on it, you find all these really nice, interpretable features. So the Arabic feature, the Hebrew feature, the Base64 features were some examples that we studied in a lot of depth and really showed that they were what we thought they were. It also turns out, if you train a model twice, if you train two different models and do dictionary learning, you find analogous features in both of them. That's fun. You find all kinds of different features. So that was really just showing that this works. And I should mention that there was a Cunningham et al. paper that had very similar results around the same time.

There's something fun about doing these kinds of small-scale experiments and finding that it's actually working.

working yeah well and there's so much structure here like you you know so maybe maybe stopping back for a while. Um I thought that maybe always mechanism can interpret work. The end result was going to be that I would have an explanation for why I was sort of know very hard and not going to be tractable.

We'd be like, well, there's this problem of superposition, and it turns out superposition is really hard, and we're kind of screwed. But that's not what happened. In fact, a very natural, simple technique just works. And so that's actually a very good situation. You know, this is a hard research problem, and it's got a lot of research risk, and it might still very well fail. But I think some very significant amount of research risk was put behind us when that started to work.

Can you describe what kind of features can be extracted in this way?

Well, it depends on the model that you're studying, right? The larger the model, the more sophisticated the features are going to be. We'll probably talk about the larger models in a minute, but in these one-layer models, some very common things were languages, both programming languages and natural languages. There were a lot of features that were specific words in specific contexts. So take "the", and I think really the way to think about this is that "the" is likely about to be followed by a noun, so you could think of the feature as part of predicting a specific noun. And there will be these features that fire for "the" in the context of, say, a legal document, or a mathematical document, or something like this. So maybe in the context of math, you see "the" and you predict vector, matrix, all these mathematical words, whereas in other contexts you'd predict other things.

So that was common. And basically it needs careful humans to assign labels to what we're seeing?

Yes. So the thing this technique is doing is unfolding things for you. Superposition folds everything on top of itself and you can't really see it; this is unfolding it.

But you still have a very complex thing to try to understand. You have to do a bunch of work understanding what these features are, and some are really subtle. There are some really cool things, even with this one-layer model, about Unicode. Of course, some languages are represented in Unicode, and the tokenizer won't necessarily have a dedicated token for every Unicode character. So instead you'll have these patterns of alternating tokens that each represent half of a Unicode character.

And there's a different feature that activates on the opposing ones, to be like, okay, I just finished a character, go and predict the next prefix; then, okay, I'm on the prefix, predict a reasonable suffix. And you have to alternate back and forth. So these one-layer models are really interesting. And another thing you might think is, okay, there would just be one Base64 feature. But it turns out there's actually a bunch of Base64 features, because you can have English text encoded as Base64, and that has a very different distribution of Base64 tokens

than regular Base64. And there are some things related to tokenization as well that I could explain. All kinds of fun stuff.
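
You can see the token-fragment phenomenon directly with an open byte-level tokenizer; GPT-2's tokenizer is used here only as a stand-in for illustration, not the tokenizer being discussed.

```python
# A single Unicode character can be split across multiple byte-level tokens,
# so the model has to stitch characters together from token fragments.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "猫が好き"   # "I like cats" in Japanese
ids = tok.encode(text)
print(ids)
print([tok.decode([i]) for i in ids])   # several tokens per character, some only partial bytes
```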

How difficult is the task of assigning labels to what's going on? Can this be automated by AI?

Well, I think that depends on the feature, and it also depends on how much you trust your AI. There's a lot of work on automated interpretability. I think that's a really exciting direction, and we do a fair amount of automated interpretability and have Claude go and label our features.

Are there some fun moments where it's totally right or totally wrong?

Yeah, I think it's very common that it says something very general, which is true in some sense, but it's not really picking up on the specifics of what's going on. So I think that's a pretty common situation. I don't know that I have a particularly amusing one.

That's interesting, that little gap between something being true but not quite getting to the deep nuance of the thing. That's a general challenge. It's thorny, an incredible question.

You can say a true thing, but it's missing the depth sometimes. And in this context it's kind of like the ARC challenge, you know, the sort of IQ-type tests. It feels like figuring out what a feature represents is a bit of a little puzzle you have to solve.

Yeah, and I think sometimes they're easier and sometimes they're harder as well. So yeah, I think that's tricky. And there's another thing which, I don't know, maybe in some ways this is me rationalizing: I'm actually a little suspicious of automated interpretability. And I think that's partly just that I want humans to understand neural networks, and if the neural network is understanding it for me, I don't quite like that. In some ways I'm a bit like the mathematicians who say, if it's a computer-automated proof, it doesn't count; they want to understand it themselves.

But I do also think there's kind of a "Reflections on Trusting Trust" type issue here, the famous Ken Thompson talk: when you're writing a computer program, you have to trust your compiler, and if there was malware in your compiler, then it could go and inject malware into the next compiler, and you'd be kind of in trouble, right? Well, if you're using neural networks to verify that your neural networks are safe, the hypothesis you're testing is, okay, well, maybe the neural network isn't safe, and you have to worry about whether there's some way it could be screwing with you. So, you know, it's not a big concern now. But I do wonder, in the long run, if we have to use really powerful AI systems to audit all of our AI systems, is that actually something we can trust? But maybe I'm just rationalizing, and I just want us to have to keep working

until we understand everything. Yeah. I mean, especially as we talk about AI safety, looking for features that would be relevant to AI safety, like deception and so on. So let's talk about the Scaling Monosemanticity paper from May 2024. What did it take to scale this, to apply it to Claude 3 Sonnet?

Well, a lot of GPUs. But also, one of my teammates, Tom Henighan, was involved in the original scaling laws work, and something that he was interested in from very early on is: are there scaling laws for interpretability? And so, as soon as this work started to succeed and we started to have sparse autoencoders work, he became very interested in the scaling laws for making sparse autoencoders larger, and how that relates to making the base model larger. And it turns out this works really well, and you can use it to project, if you train a sparse autoencoder of a given size, how many tokens you should train it on. This was actually a very big help to us in scaling up this work, and made it a lot easier for us to go and train really large sparse autoencoders, where it's not like training the big models, but it's getting to a point where it's actually expensive to go and train

the really big ones. So to do all of this at large scale, to explain large models, there's a huge

engineering challenge here too, right? Yes. So there's a significant question of how you scale things effectively, and then there's an enormous amount of engineering to go and scale it up. If you want to do it in a reasonable amount of time, you have to think very carefully about a lot of things. I'm lucky to work with a bunch of great engineers, because

I am definitely not a great engineer. Yeah, the infrastructure especially. Yeah, for sure. So it turns out, TL;DR, it worked?

Yes, it worked. And I think this is important. You could have imagined a world where, after Towards Monosemanticity, someone said,

Chris, that's great, you know, that works on a one-layer model, but one-layer models are really idiosyncratic. Maybe the linear representation hypothesis and the superposition hypothesis are the right way to understand a one-layer model, but they're not the right way to understand larger models.

And so I think, first of all, the Cunningham et al. paper cut through that a little bit and suggested that wasn't the case. But Scaling Monosemanticity, I think, was significant evidence that even for very large models, and we did it on Claude 3 Sonnet, which at that point was one of our production models, even these models seem to be substantially explained, at least in part, by linear features. Dictionary learning works on them, and as you learn more features, you explain more and more. So I think that's quite a promising sign.

And you find really fascinating abstract features. And the features are also multimodal; they respond to images and text for the same concept.

Yeah. Can you explain that? I mean, like the backdoor one, there are just a lot of examples.

Yes. So maybe this is a fun example to start with: we find some features around security vulnerabilities and backdoors in code. It turns out those are actually two different features. So there's a security vulnerability feature, and if you force it active, Claude will start to write security vulnerabilities, like buffer overflows, into code. And it also fires for things like, the top dataset examples for it were things like dash dash no SSL or something like this, which are sort of obviously really, really insecure.

So at this point it's kind of like, maybe it's just because the examples were presented that way. It's kind of surfacing a little bit the more obvious examples, right? I guess the idea is that down the line it might be able to detect more nuanced things, like deception or bugs, that kind of thing.

Yeah, I really want to distinguish two things. So one is the complexity of the feature, or the concept, and the other is the nuance of how subtle the examples we're looking at are. So when we show the top dataset examples, those are the most extreme examples that cause that feature to activate, and that doesn't mean it doesn't fire for more subtle things. So the insecure-code feature, the stuff it fires for most strongly is the really obvious, disable-all-the-security type things, but it also fires for buffer overflows and more subtle security vulnerabilities in code. These features are all multimodal.

You can ask, what images activate this feature? And it turns out that the security vulnerability feature activates for images of people clicking through Chrome warnings to get past, like, "this website's SSL certificate might be wrong," to go to the website anyway. Another thing that's very entertaining is the backdoors-in-code feature. You activate it, Claude goes and writes a backdoor that, like, sends your data off to some port or something. And you can ask, what images activate the backdoor feature? It was devices with hidden cameras in them. There's apparently a whole genre of people selling devices that look innocuous but have a hidden camera, and they advertise how the camera is hidden. And I guess that's the physical version of a backdoor, and it sort of shows you how abstract these concepts are, right? I'm sort of sad that there's a whole market of people selling devices like that, but I was kind of delighted that that was what came up as the top image examples for that feature.
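
As a rough sketch of what "forcing a feature active" can look like mechanically, here is a generic activation-steering example: add a scaled feature direction into a transformer's residual stream with a forward hook. The model, the layer choice, the steering strength, and the random stand-in feature vector are placeholders for illustration, not Anthropic's actual setup (in practice the direction would come from a trained sparse autoencoder's decoder).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small open model used as a stand-in
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

d_model = model.config.hidden_size
feature_direction = torch.randn(d_model)       # placeholder; in practice a learned feature's decoder vector
feature_direction /= feature_direction.norm()
steering_strength = 8.0                        # arbitrary; tuned empirically in real experiments

def steer_hook(module, inputs, output):
    # Add the feature direction to the block's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_strength * feature_direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer = model.transformer.h[6]                 # a middle layer, chosen arbitrarily
handle = layer.register_forward_hook(steer_hook)

prompt = "Here is a simple function:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0]))

handle.remove()                                # always undo the intervention afterwards
```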

Yes, it's nice that it's multimodal, that it's multi-context. It's a strong definition of a singular concept, that's nice. To me, one of the really interesting features, especially for AI safety, is deception and lying, and the possibility that these kinds of methods could detect lying in a model, especially as models get smarter and smarter. Presumably that's a big threat of a superintelligent model, that it can deceive the people operating it as to its intentions. So what have you learned about detecting lying inside models?

Yeah. So I think we're in some sense in early days for that. We find quite a few features related to deception and lying. There's one feature that fires for people lying and being deceptive, and when you force it active, Claude starts lying to you. So we have a deception feature.

I mean, there are all kinds of other features about withholding information and not answering questions, features about power-seeking and coups and stuff like that. There are a lot of features that are related to spooky things, and if you force them active, Claude will behave in ways that are not the kinds

of behaviors you want. What are possible next exciting directions to you in the space of mech interp?

Well, for one thing, I would really like to get to a point where we have circuits, where we can really understand not just the features but use them to understand the computation of models. That, for me, is really the ultimate goal of this.

And there's been some work; we've put out a few things. There's a paper from Sam Marks that does some stuff like this. There's been, I'd say, some work around the edges here, but I think there's a lot more to do, and I think that will be a very exciting thing. That's related to a challenge we call interference weights, where, due to superposition, if you just naively look at whether features are connected together, there may be some weights that don't exist in the upstairs model but are just artifacts of superposition. So that's a technical challenge.

I think another exciting direction is, you might think of sparse autoencoders as being kind of like a telescope. They allow us to look out and see all these features that are out there. And as we build better and better sparse autoencoders, get better and better at dictionary learning, we see more and more stars, and we zoom in on smaller and smaller stars. But there's kind of a lot of evidence that we're still only seeing a very small fraction of the stars.

There's a lot of matter in our neural network universe that we can't observe, and it may be that we'll never have fine enough instruments to observe it; maybe some of it just isn't possible, isn't computationally tractable to observe. It's a kind of dark matter, not maybe in the sense of modern astronomy, but of earlier times, when we didn't know what this unexplained matter was. So I think a lot about that dark matter, and whether we'll ever observe it, and what it means for safety if we can't, if some significant fraction of neural networks is not accessible to us. Another question I think a lot about is, at the end of the day, this is a very microscopic approach to interpretability; it's trying to understand things in a very fine-grained way. But a lot of the questions we care about are very macroscopic. We care about these questions about neural network behavior, and I think that's the thing I care most about, but there are lots of other larger-scale questions you might care about.

And somehow, the nice thing about having a very microscopic approach is that it's maybe easier to ask, is this true? But the downside is that it's much further from the things we care about. So we have this ladder to climb, and I think there's a question of whether we'll be able to find larger-scale abstractions that we can use to understand neural networks, that we can get to from this very microscopic approach.

Yeah, you've written about this, this kind of organs question.

Yeah exactly.

Think of interpretability as a kind of anatomy of neural networks. Most of the circuits threads involve studying tiny little veins, looking at the small scale, individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures: individual organs like the heart, or entire organ systems like the respiratory system. And so we wonder: is there a respiratory system, or heart, or brain region of an artificial neural network?

Yeah, exactly.

And I mean, if you think about science, most scientific fields investigate things at many levels of abstraction. In biology, you have molecular biology studying proteins, histology studying tissues, then you have anatomy, and then you have ecology. You have many, many levels of abstraction. Or physics: you have the physics of individual particles, and then statistical physics gives you thermodynamics and things like this. So you often have different levels of abstraction. And I think that right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology of neural networks. But we want something more like anatomy. And a question you might ask is, why can't you just go there directly? And I think the answer is superposition, at least in significant part. It's actually very hard to see this macroscopic structure without first breaking down the microscopic structure in the right way and then studying how it connects together. But I'm hopeful that there is going to be something much larger than features and circuits, and that we're going to be able to have a story that involves much bigger things, and then you can go and study in detail the parts you care about.

Like a biologist, or a psychiatrist, of a neural network.

And I think the beautiful thing would be if, rather than having disparate fields for these different things, we could build a bridge between them, such that all of your higher-level abstractions are grounded very firmly in this very solid foundation.

What do you think is the difference between the human brain, the biological neural network, and the artificial neural network?

Well, the neuroscientists have a much harder job than us. You know, sometimes I just count my blessings by how much easier my job is than the neuroscientists', right? So we can record from all the neurons.

We can do that on arbitrary amounts of data. The neurons don't change while you're doing it, by the way. You can go and ablate neurons, you can edit the connections and so on, and then you can undo those changes.

That's pretty great. You can intervene on any neuron and force it active and see what happens. You know which neurons are connected to everything, right?

Neuroscience has wanted the connectome for a long time; we have it, for things much bigger than C. elegans. And not only do we have the connectome, we know which neurons excite or inhibit each other, right? It's not just that

we know the binary mask of connections; we know the weights. We can take gradients. We know computationally what each neuron does. So, I don't know, the list goes on and on. We just have so many advantages over neuroscientists.

And even having all those advantages, it's really hard. And so one thing I do sometimes think is, gosh, if it's this hard for us, it seems near impossible under the constraints that neuroscience operates under. I don't know. Maybe part of me is making a pitch here, because I have a few neuroscientists on my team: maybe some neuroscientists would like to have an easier problem that's still very hard, and they could come and work on neural networks. And then, after we figure things out in the easy little pond of trying to understand neural networks, which is still very hard, we could go back to biological neuroscience.

I love what you've written about the goal of mech interp research as being two goals: safety and beauty. So can you talk about the beauty side of things?

Yeah, so there's this funny thing where I think some people are kind of disappointed by neural networks, where they're like, oh, neural networks, it's just these simple rules, and you just do a bunch of engineering to scale it up and it works really well. And, like, where are the complex ideas? This isn't a very beautiful scientific result.

And I sometimes think, when people say that, I picture them being like, evolution is so boring, it's just a bunch of simple rules, and you run evolution for a long time and you get biology. Like, what a sucky way for biology to have turned out. Where are the complex rules? But the beauty is that the simplicity generates complexity. Biology has these simple rules,

and it gives rise to all the life and ecosystems that we see around us, all the beauty of nature. That all comes from something very simple: evolution. And similarly, I think, neural networks create enormous complexity and beauty and structure inside themselves that people generally don't look at and don't try to understand, because it's hard to understand. But I think there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to go and see it and understand it.

Yeah, I love mech interp, the feeling that we are understanding, getting glimpses of understanding, the magic that's going on inside. It's really wonderful.

It feels to me like one of the questions that is just calling out to be asked, and I'm often surprised that more people aren't asking it: how is it that we don't know how to directly create computer systems that can do these things,

and yet we have these amazing systems, these neural networks, that can do all of these amazing things that we don't know how to directly program? It just feels like it is obviously the question calling out to be answered, if you have any degree of curiosity: how is it that humanity now has these artifacts that can do things we don't know how to do?

Yeah, I love the image of the circuits reaching towards the light of the objective function.

Ah, it's an organic thing that we've grown, and we have no idea what we've grown.

Well, thank you for working on safety, and thank you for appreciating the beauty of the things you discover, and thank you for talking today. This was wonderful.

Thank you for taking the time.

Well, thanks for listening to this conversation with Chris Olah, and before that, with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Alan Watts: "The only way to make sense out of change is to plunge into it, move with it, and join the dance." Thank you for listening, and hope to see you next time.