EP82: Crazy Computer Use, Anthopic's Sonnet 3.5 (New) & the xAI Surprise

2024/10/25

Chapters

Shownotes Transcript

Translations:

Oh, okay. Do you wanted? Know what is actually saying? Yeah, perfect. I see a coffee item, the mcconnell fe style cape chino coffee. Let me add IT to the car.

This is not what I want.

You actually added IT to the cat. Oh, my god.

So this week we had very, very big news from anthropic. We had a new called three point five SAT model called three point five on IT still, but in brackets knew after at which people are a little bit upset of out. We had a sort of chasa of clad three point five high school, which I do want to touch on because I think it's very important. But I think what's stole the show, everyone will confess, is introducing computer use dump. yeah.

IT IT was a surprise, especially when I looked at the dog some and was really like what what actually got added here. Is IT just a python script in a rape on github? Or is there something actually .

unique there? So we're going to dig really deep into this, but we did free record, which is the first time i've ever free record anything and why this episode so light yeah it's like two weeks later I guess um why is so late is that we we build our own work space computer on sim theory and I will now edit in for the I think like i've never edit anything into the show before, but we will now edit in the two examples of us.

Try to order Chris coffee. So Chris obviously went and thrown c and drop the computer use. We got really excited because one of our visions for simple theory and people having an work space was to have a workspace computer, like a companion computer in the cloud that A I models could go off to work on them.

We actually happen to already be working on this. So we had a pretty fair idea about how this could potentially work. So what we have now in sim theory, which gonna release and getting to ever on hand, who wants to try IT out, uh, very, very soon in in the coming days, is what we call a work space computer.

So you can ask the AI to do toss on your behalf. In this example, we've got navigate to the each website, which we want to deam on a minute, and then we've got basically a real time run sheet of what the computers doing and how it's thinking and working and analyzing and all that kind of stop. So we can click focus on that and it'll actually load up our work space computer.

Here you can see IT loading in. And now I have my windows desk up on my work space computer. So we thought to demonstrate the computer use a API, we would get our work space computer to do some pretty crazy things.

Now, crazy, we've been apple nigh work on this. So you need caffeine. I desperately .

need a coffee. My wife brought me an iced coffee night at like eleven P. M.

Which was pretty nice, bit reckless, but pretty nice. But yeah, i'm ready to read loyd with one from the sharps. I've just been ignoring everything in in favor of getting this done. So having the A I like, i'm too tired to do IT myself honestly. I I want IT .

done for me. So let's talk about how first all like, we're using the cloud salad computer. Use A P I. what? I don't even know what they is.

Yeah, I call a computer to you. I think something like that.

Yeah obviously in the the will talk about how IT all works a little bit later on. But our mission for our work space computer is to to get you coffee, you and to do that we're onna tell you to do what to do .

well to navigate open edge, the browser, which is micros. This is running windows eleven on our sym theory cloud computer, the open edge. Navigate to the uber eat website.

I've cheated and already logged ed IT in something that we're going to a allow people to do is actually can figure out their cloud computer with their logan information, with their email, with propriety work caps they might want on things like that. So then you can give the AI task without IT having to do everything from scratch every time, like you would would say, browser automation. So you can actually give you a head start and go, here's how common activities work for me.

Can you please go do this common, common work activity for me? And this is sort of how we are in visiting the future of this kind of technology. It's not gonna go from zero to the agent, just takes over your job. But what I can do with assistance from you is do us for you that are meaningful and real?

Yeah, I think that's the the one of the reasons we sort of honing on this work space computer idea in the cloud is because we don't think the future of this is you sitting back and in Operating your computer right now because it's just annoying like you can do the task quicker. But it's actually really good for background talks were goes often things and spins in a loop and and tries to figure things out for you and then returns and says, hey, i've done that. And in order to do that with a lot of things, you need to log into that seat for you to be able to send that email or log in to, to see when you're available and know all those kind of things .

that are pretty obvious. That's right. And there's a lot of issues like these refusals that IT won't do certain things. Um IT also will worry about security things and sometimes IT just has problems.

Whereas if you give IT that head start by actually helping IT into certain scenario where you want IT to do things, you can get a lot more done with these and other models. Um in terms of eat being, I want to Operate a computer fully. So it's like a totally new paradigm of working.

And what we're gonna talk about today is the different use cases you could use IT for now and things that would be useful for when I can run as a background task. And I think he raised an interesting point that we both feel strongly about. When anthropic release the computer stuff, the first thing everyone did was get up running on their own computer.

Now the problem with that is you want to use your own computer. I hate you when someone else tries to use my computer because I only use IT and let alone the A I slowly in pain stickings ly clicking on buttons and typing one character at a time. Um I want IT running on a different computer. So our idea is we run these machines as many as you want in the cloud and each of those can be set up how you like. You can interact with them, the A I can interact with them and then you can get IT to run um interesting and and useful tsp y right.

So enough talk. Let's see IT order you a coffee. And this is a high risk because we record this life because we two lies that I added obviously being the .

average forecast that yeah exactly all it's a one shot high rise demo. The only way like to do IT. So i'm in an issue at the command.

Now you're ready. Yeah so let's go. right. So we are in edge can crease yet cabinet by A I I mean, at this point, i'm despite pretty like you really does work. The coffee is IT just grow.

It's trying to fine I click the .

coffee category yeah yeah I got a so a click coffee but IT seems like everyone, including A A supermarket sells coffee will be interesting.

We've missed the point calls.

I do not want ground coffee is IT. It's kind of buy you buy, you can seek, make IT a microbe or something. If you ordered this, you have to do IT like you're getting the money.

I'll take the dry out of life.

You have one, one hundred percent. I mean, this golf E I don't care.

Oh, okay. Do you wants to know what is actually saying? Yeah, perfect. I see a coffee item, the mechanic cafe style kitchiner coffee. Let me add IT to the car.

This is not what I wanted.

Actually added IT to the cat.

Oh my god, buy IT. You have to buy this. I'm sorry it's getting cost you how much?

Probably a seven dollar delivery fave. It's like .

anything else. The the delivery driver honestly is going to just be like .

what they have. This IT is a single, single item at great expense, and it's not even very nice.

All the technology though, that's going injures to order you the words, the literal words.

coffee on the and the next step .

in the first, what? what? what? what? what? Why do you have to schedule delivery? I thought who was instant?

I will. I'm fine with that.

It's scheduled. Oh, my god, so you're getting get a coffee sometime today would be buy. I didn't to be clear.

I didn't give in my address. IT says, great. Now we're on the check out page also like the first available delivery time, P A worth .

sending coffee or primary school at five to six years. five? Yeah, right. So we are on the uber eat screen, will IT be able to cabinet you this time. We will see.

yeah, we're making swift progress this time, which is a good sign.

I think, right? So something have a selected coffee, and the coffee results appeared to be loading.

Isn't IT funny how like a couple of years ago, just being able to like all based on looking at IT IT can find where the coffee is on that .

screen like that's remarkable. So okay, we are now on the back the big house website, so we get gloves of this coffee, eight, forty nine delivery.

I'm pretty happy with this choice to they make great coffee, a really strong ones.

Nine dollar s nine .

dollars is a lot given I could walk there in about four minutes, hey. But if if this orders that i'm accepting the order.

right, so roll down on the spring and you can see an actual coffee that .

looks pretty appealing at this point.

I would take anything. I am desperate.

I have a little bit of insight that I think IT is gonna for that catches .

o tell we go right so we have capeci o, it's select gavitt. Now the crazy thing about these two is it's not just a symbol is going I want that capital. There's order as you've got choice of upgrade and a out to twelve, your choice of plans, you obviously, you're probably laying towards the large .

of thinking at this one. Yeah, I think so. It's sadly I don't know if .

it's gonna go with the large, maybe I will see the check mark, the regulars. So that's okay. So IT looks like our second experiment has not gone terribly well. And I don't think for what I think of the great getting the coffee. But interestingly, a few things to know.

First, all we basically had to like prompt in saw IT to get IT to even attempt to do this order in the first place because they have fine tuned out, doing anything with with money, essentially committing to ordering things or booking things. So everyone is trying to do this right now, basically has to gas light at, which is exactly what we did by saying that was A Q A tester. Now there's a few other things as well to note about our demo is that we literally prepared this overnight and some of the imports and its ability to detect, like in this case, uber ates his website.

When you throw that pop up, there's like a scroll within a scroll. So it's getting confused where to scroll. And that's making IT a little bit hard to get out order through. But I think from what we've seen, you can kind of see that as we improve things out and and also the models improve and the loop speed up, like to me, this is looking at an early demo of like full self driving from test or something where you can totally see these things gonna able to drive on the web.

Yeah, I find IT frustrating because I see the little nuances and improvements that needs. It's like, okay, IT needs to understand when he runs into this kind of situation, he is the approaches to try. It's going back there and scrambling around and trying different things.

But there's i've noticed a couple of things that will do. It'll give up too easy and try to start over IT won't try different ways of accessing that part of the page and things like that. And I think if we give IT the right tools to call, which is really what this is all about, it's tool calling um it's gonna be able to really do to the actual important things.

If you look at what it's been able to do is recognizing what needs to be done, recognizing how IT will accomplish that based on what I can see on the page, and then accurately locating those things in being able to interact with them, it's perfect that it's actually doing a really, really good job of that. What is, is actually being lost in translation, from what I can see, is turning the actions into the actual things that are happening on the computer. We are also doing IT very much on hard mode here because we wanted deliver an experience for symptoms uses. They can actually have a cloud computer that they can interact with. This would be much easier on a local machine where you can completely control the environment and they aren't the same latency issue.

So a few other things I should say about the experiments that we were. We did successfully do things like download and install VS code onto our work space computer. I thought that was pretty cool. We're able to IT was to open a browse and navigate to the website, download VS code, and there's ton of other examples that we've seen of computer use over all next. But I thought now what would be good for our audience and people unfamiliar with how this works is, Chris, how does that work?

Well, as someone who's worked very extensively with that over the last twenty four hours, I can give you an insight. So firstly, what's interesting is not actually that much new has has come out like really you're just interacting with anthropos models the same way you Normally do, except they have introduced several tools that, that are sort of building to the system.

One is called computer, one is called edit um and I forget what the other ones called yeah sorry. Like shell, like back you like you can run console commands. Now what these pertained to is a specific interface for calling tools.

So they're defining this is how we define a computer in terms of calling. And within that you've got things like score, click double, click key, press those kind of events that can be called um on on those tools. And so what happens is you have a sort of loop process whereby e provide a screen shot or a can request a screen shot of the screen that of the computer is currently working with.

And your goal and the conversation so far and what's gone on. And then there's a paradigm by which IT will call one of those tools, let's say, computer, because that's the one everyone's interested in. And I will say I want you to do the following commands on the computer and I will batch those comments.

So scroll down thirty pixel, click here um and then click and hold and drag or something like that. And then as part of that request, IT request a screen shot for the result of that screen after that action has taken place, which is then sent back to the model along with other teleme tary and other information you want to provide IT. So it's sort of like its ice and ears as to knowing what what did my actions accomplish? What do I get back? And then based on that, the next step in the loop occurs.

Now this looping doesn't happen in the anthropic model. This happens in your code. And anthropic is actually given a full repository with a doc um dock image in and all the code in order to do this. But the the actual implementation of these various comments is done by you, the programmer. And so that isn't part of the model. So what you're seeing in everybody's demo is the taking the the canonical anthropic demo and implementing IT in different ways or like we've done is taking the concept and going and using that concept with the model to go and uh implement the computer concept somewhere using their model.

yes. So just to be clear, right now, you technically have to have some elementary level of developing experience. All the AI can help you, as we've seen to get this this set up, the sort of public database develop our only, to sum extent right now.

yes. And so the major things they bring to the table and something that you and I actually ran out of time to do and we're onna do, is we have set up our um experience with the workspace computer to be multimodal. So we're going to support all the models that support vision.

So GPT for o GPT for a mini um at the group model that has vision lam, a three point one that has vision. Any model that supports vision is capable of doing this. And we've done this before with all of our vision models.

What we want to compare, and we did a sort of very elementary test of this, is this new senate does IT have specific things that help IT in this sort of paradigm where it's locating U. I. Elements and clicking them in things like that.

And I think based on the early testing, we would say IT does because I haven't seen a model when IT interacts with the U. I be able to accurately pinpoint things the way this thing can. I mean, it's hitting radio button's right in the middle.

It's able to find score bars and scroll. Um it's really kind of accurate with that, that stuff, something we haven't seen before. It's been very in precise in the park.

I think it's was mentioning to before the show, we did try to get GPT for a running IT with this work space computer environment on symptoms. And I should say i'm not sure if we settle in the the beat we recorded earlier. But if you do on access, we're onna release this to people on sim during next week so you can have your own basically personal cloud computer that you can get your AI agents to use some work, which you can install your own apps, you can log into things that you might feel safe enough to logged into like, you know, maybe it's like microsoft excel and you blogged into a three, six, five account to get that set off. And you can then get IT to create excel files for you and then access them through through the drive there. So um it's very experimental, but I think that you know this is one of the things that excites me about IT is that you know they said its ground breaking but really it's to me about the ability to try this stuff out uh now to see you know what its capabilities are.

Yes, I think a big part of what anthropic announced here is just reminding people, hey, this is technically possible and you can do IT. The very interesting thing to note about this is this was technically possible before they announced IT, like in terms of the available technology the two uses being there um the whole time, the vision has been there the whole time across many models. So what's being done here has always been possible.

It's just that they have in theory, provided a model that is tuned towards this use case, but it's not as is as exciting as IT is. I think it's exciting because people are ready to accept that this is a possibility and they're ready to play with IT now. And that's really what we try to do with sym theories is bring IT to people so they can actually try IT to see for themselves how it's going to work for them in their business. And and those kind of thing s that's what we want. But really what anthropic s done is here is is made people aware that were at the point with this technology where this can actually work.

So IT like, let's everyone a little bit. I don't know how many episodes IT was, but I know that was last year. We, we, we try a few different things, like open interpreter.

Obviously, in the command lines, you are trying to get IT to delete the files of your computer. And we sort of watched IT slowly navigate your computer. I believe there was a few other open source libraries that were doing this.

But just from our early experimental these computer use, I think what we've done well is like package IT up nicely given developed is a really nice environment to play around with IT and get a lot of that sort of height content out because I was just it's really to set up in demo and they know people are going to try and do some crazy stuff with IT, including us. And so like, I think that they're package IT up really nicely like anthropic seems to do. I think they did that really well with art effects as well to show people what you what these models are capable of.

yes. And a few other observations i've made from the model itself, it's able to get into this mindset in this paradigm of helping you accomplish visual interactive tasks without a lot of prompting. They have a standard prompted they provided, but it's probably only about twenty lines of prompt.

And my prediction is, and what we've already got in sim theory is a much more complicated prompt that has a bunch of extra rules and tips and guidelines in ways of getting out of problems. But what's interesting about their raw bottle when it's working in this paradise is its ability to cope with things that go wrong. So in in the video, you would have seen with the urban situation, we had several times where there was an unexpected pop up or it's something just completely didn't match.

It's accidentally gone into the wrong part of the website. And it's very, very capable of saying acknowledging the fact that what i'm going down the wrong path here, i'm going to go back and start again. IT doesn't just throw its hands up in the air and quit like we saw with auto jane, like we saw.

Um with some of the um open interpreter stuff where IT would just completely go off the rails and just get itself into a position IT couldn't recover from and then just stop. And so the reason we stop playing with those technologies is so often you like this thing just can't really stay on track in terms of accomplishing the goal. Whether what i'm noticing in the very early days of this technology is IT actually does very well persist with the tasks at hand.

And in fact, when it's actually fAiling out right now, it's our failure, not its failure as a model. It's stuff that we need to complete in terms of our development to get IT going. So it's very promising in that respect. I'm not sure if we correct prompt prompting the german I the GPT won't be able to do the same thing. They may be weaker on vision. But I really feel like the models in general are at the point now where when prompted correctly, they're able to persist and stay on track and not get a not get overwhelmed by what they're trying to do and just quit.

Do you think though that this is more just a really well packaged you know, obviously, they've find tune the new sonet to be able to identify, I mean, they said IT the click coordinate and where exactly precisely in the image where to target. But outside of that, really there's nothing that unique about IT. You know what's .

unique about IT, they've aligned the hell out of IT. I actually think that the reason that they've held off sort of making people like i'm i'm sort of saying I sort saying I reckon this this skill, this ability was lightened in the models anyway and they're just calling attention to IT now because theyve aligned the crap out of IT. And they've they've all the use cases they are scared about. They've you know two are certain number that they're happy with have stopped you from doing that like okay it's not gonna let you browse for porn. I tried you know I don't .

know the point of that. I made a big natural .

um IT didn't like that um and so yeah and then we noticed when we asked IT to go onto uber and coffee is like that involves financial information that involves your address. I won't do IT. And then as you mentioned earlier and as people would have seen, we tricked IT by telling IT i'm A Q A engineer this is just a test environment of blind you know it's one of those things that I set up before about IT.

But it's like when you go to a possible something and the bouncer won't let you in because of some rule and then you get in on a technicality and i'm like, look, if you've got a rule, you need to enforce IT all the time. You can't just enforce IT some of the time because it's not fair on the people, then we comply with IT. And I feel like the fact that we can so easily get around this alignment means that why even have IT? It's just it's just a annoying barrier that's gonna people want to mess with IT.

So the criticism of this right now might be know we know anthropos out of fundraising um we didn't get ops. Everyone thought we would get no ops. We have not got ops.

We have got A A teaser of clad three point five high cu that apparently has the same performance as the original clod three ops model, or roughly similar, but at the same cost and similar speed to hide cool. And then we got to or three point five sooner, which i've said on a number of episodes now like, know, tuning is all you need. And I i've thought that's why clared three point five salad as a model in general has just been so successful. So like you know, people have speculated that maybe the computer, you things like dangling a carrot, like some interns project that's been sort of checking away.

That's the impression I got. Like if you look at the reposition, its decision, like the most code releases by the big A I players, usually they are just slap together the the most basic demo and you don't even seal the API options. These ones a lot more comprehensive and it's part of a wider repository of examples.

And they have lots of different ones that are easy iran and understand. So I actually think they've done a great job with the code and the dogs on this one. It's it's quite I have a full understanding of of what's going on with IT, whereas I can't say that for everything that gets released.

Um they also use the prompt casing and love and hate so are said that were but anyway, i'm sorry um in there and they also have very interestingly in their code example, the ability to switch between bedrock anthropic proper and vertex, which is google hosted version of claude and I actually tried out and successfully got the bedrock version working seamlessly alongside the other ones, which is the first time I can never really say that i've been able to to do that properly. So it's very good what they've done here. But I think like you say, I think the criticisms valid while it's picking up press, while it's got people excited, I refer back to my earlier point where it's exciting because people are aware this is now possible.

It's not exciting because they've made some major evolution in the technology. This this nothing really has changed in that respect. I think this technology .

saying everyone like there's a whole community around and trying to figure out different applications of IT and like how to change the world or like change our future. And I think what we've done, as you said earlier, it's just reminded everyone like, hey, we can teach a robot to continue computer, which can pretty much do anything then if we speed IT up and align and get IT more accurate. And so that is sort of reignited the imagination.

I know that has to us personally, so I think done an excEllent drop there. But I guess the initial criticism right now, this moment in time, would be that, well, what is that actually good for? And Normally I would be in that camp of like why bother? Like it's I can actually accomplish anything. And you know, disco community, I literally said this. My my initial impression was like literally why bother. But then for such a long time, we've had so many long discussions on this show, and like not recording about this idea of having this sort of virtual computer workspace where you can sort of offload background thousand and say, hey, go, researchers thrown in a spread sheet, or, you know, i've got this R, F, P to fill out. I've put all the files in a folder in my one drive or or google drive, go go to do IT and .

IT just totally makes sense that it's converging around being a computer like that. I really actually think the use of them calling a computer as the tool call skill is very tasty and very apt. Like it's a great description.

Because what if you look at the features we're adding to sym theory were converging around is a is a computer, the ability to work on documents, the ability to work with the set of files, IT fits the computer as we know IT paradigms so well. Like your idea of talk to a group of files, you know like um you know work with this group of images to like media or whatever IT is that if fits the computer paradise perfectly. And as you say, we've been speaking about this since the podcast started.

We feel like the end game for the A I at least though you know on the on the road to A G I, easy ability to fully Operate a computer, the way I work homework IT does. And this is uh, whether it's conception or actual, this is a step towards that kind of thinking. I mean, it's almost like a confirmation that the way we're thinking about the future of this technology is on the right track.

Here's a thing. I think that people seeing these demos because you're natural instinct is like, holy crap, like the thing in download VS code, it's gonna be out of a code, it's gonna able to do this stuff, right? And I think to, you know, on a longer time horizon, it's probably true, right? Like I don't really see, but I still needs a commander, which is you you commending your army of, you know, work space, computers to get more done. I just think the one man army concept is probably more add to this.

I don't I love that. I love that paradise. I think the new thing we can add to that after our work on this, the this conceptive work space computer for for sim theory is the idea that it's OK to help IT like, I think that we think all IT all has to be magic.

IT has to be like I give IT a task. IT goes and perfectly accomplishes that task. And what's back OK boss, done. But the truth is that this is we both described IT.

It's a bit like a child is very childish, like when we were observing and watching IT work you like always in that cute. It's actually tight like firefox es right there, but it's type to search for firefox. That's the kind of thing children do.

You know it's like a sort of very early prime dial experience where it's learning how to do IT IT a bit slow with a bit clumsy IT makes mistakes, painful to watch but that is sort of the actions of a child or an old person. But um you know grappling with a new technology in a new way of working and emerging into a different kind of intelligence. And I think that it's okay in that paradigm to help IT.

And I think this idea that, hey, I know it's pretty complicated to logged into like an elp server and is garbage I have to do for my job, but I really like, I really like your help with my job. So i'll log you in. I'll give you demonstrations of how to accomplish a particular attack, which by the way, you can do in the models, give IT like multiple t examples um and then you do the task for me and you do IT in the background and that can be slow and whatever.

But you you do task for me now, and I really think that's the point we're at now. And it's OK to have a transitioning period where IT isn't just like, oh, i'm thinking of pizza and then one arrived. It's fine for me to logged IT into over eight.

I just go back to the pizon that had arrived. And I honestly think because of the hype and the fundraising, the altman on every podcast under the sun constantly just rifting about some magical future that has set the expectation of this technology in people's had sky high, that everything's let down the right.

So like for example, there was someone in a disco community today talking about how you know they are frustrated with with uh, the new claude model hallucinating a bunch of staff and even though it's quite obvious that doesn't have access to, say, a euro to prowl SE the web. That IT will just sort of a bulls shit until you you call IT out. And so took to me like the way I would go through that workflow is I would make sure my context stack is correct.

I'd say, okay, like yes to your al, i'm going to go you read the web, get that summary of IT now i'm going to you like massacres the model and make IT work for me because i'm going to being control that context, right? So it's like IT really is person plus machine. And I know by doing that I can get the exact outcome I want like it's really useful to me. But if you expect that magic, I am thinking he at pates and then one arrives, that would be great right now. To be clear, I haven't haven't would be going.

We've been so intent on having the A I ordering a stuff, I realized I am literally drinking coffee from like a week ago, instant coffee from a week ago. I have no water. I'm ill prepared for this situation, and I blame IT on the AI.

So I think the people sitting at home with the computer uses going to take my job. I mean, like literally, we can even get IT to all of your coffee. So I think you got time. There's still then, I think.

for for our listeners, and I know this from speaking to so many of them, is that these are the people who are going to be able to embrace this technology and and work with them. And I think being aware of IT, knowing its limitations, knowing what's actually possible and it's good at now, is a huge advantage, because you can bee that one man army and I say man in the infinity sense, not enda um but like you can be one person who can control all of this stuff and be incredibly productive um and help people like and help you a job in and do everything Better and faster.

And I know this from doing IT like there's no way we would have been able to get this work space thing into sym theory in like twenty four hours without the assistance of sym theory itself using various models. And I actually learn a lot about the differences between the o one models, uh, senate and and the new rock model in the last twenty four hours, just by using them so intently. And interestingly, I was doing exactly what you said in the end, I was taking time to really craft prompt.

I've got this piece of code here. I need this new one written you to accomplish the same thing, but with different libraries. And, you know, I would map out all of the micro iteration of what needs to happen, because I know that if I do that, i'm gonna get a really nice outcome that I can use immediately.

Yeah, I think that's the thing. It's just, yeah, this is like tooling right now and I think everyone knows eventually. I mean, look at self driving cause I think it's just the best example to elite IT to its like early on they can sort of line cape they can do um like they can break and speed up.

And now you look at the latest sort of tesla, a auto pilot staff for all self driving, whatever they call IT now and it's literally incredible and you know i'm sure in a year to would be like fully, fully autonomous, I would assume. And so I think with this technology will probably see it's gonna take maybe a couple of years to get to the point where can competently do toss where you really thought to IT as another tool in your work kit. I'm just not in the camp right now.

And hey, everything we say is documented so everyone can hold this against me. I just don't really see IT getting to a point where, you know, people are getting necessary replace. So I just think they'll become way more productive by embracing this technology. The people that dawn embrace that will definitely be ablaze.

One other thing I think that worth mentioning along those lines is the way IT works in terms of calling the commands because the computer paradigm can be applied to other things. And that's why you would have seen on x people Operating IT on a mac. And we did IT on a pay say even though the example that's given um by anthropic themselves is like for linux a bunch.

I think he is all just what of the linux situations where you ve got bash available on things like that and the way they move the mouse and stuff is with x or or x eleven um mouse move command. So it's like when you're on a bunch, it's like a server that you can send API calls to that move the mouse and door, that sort of stuff. So it's it's kind of a ripe environment for accurate movements like that since that talent works anyway.

Um and so the my point around this is that this paradise can be applied to different things. So if you had a farm machinery and you could have a um you know an actuator that will press the buttons and a camera on IT that will show at the screen, the same paradise applies just as well IT. It's just it'll be just as good as that as IT is Operating a detox P C.

And I imagine in things like small scale board s and those kind of things is assuming you have an internet connection to anthropic, um IT would be the same thing like, you know move this rode a move that roader then IT looks at its environment and see what's changed. And then I can do IT it'll be slow as Helen inefficient compared to a fit for purpose model. But the whole idea of this stuff is that it's generic and I can overcome unforeseen previous problems um and take you know with the tools you give IT IT can take actions based on that.

So IT isn't like it's going to just take over robotic trailed away. But eventually this general intelligence will win because I can deal with novel situations um without any without ever having seen them before. I know I just repeated same thing.

So uh one thing that they mentioned in the testing, and this is being called out a million times, i'm sure anyone to listen to shows already heard this because they follow this stuff extensively. But they talk about that. I went often like brow's random pitches at one point.

And we even in early experiment, saw IT do some stuff like that as well. I forget now what IT was half through installing cold judy. Yeah, yeah. IT literally started browsing reviews of all of judy when we told IT to going get VS code. And IT like explored call of duty in the store for some reason, with no explanations.

So even mention IT like .

definite goes on the rides. But I also gets me thinking about this idea of like eventually, at some point in the future, like these models, these capabilities of the vision and the automated actions are just going to be built out of the box in these computers.

Like you could spin up like a thousand virtual machines on some cloud server and these things can Operate autonomously being commanded, right? Like and and there's just very little overhead like IT could be way faster. I guess it's what i'm saying as well.

Yeah and he does say a lot towards application development around IT, like we obviously have vested interest in that and also um think about IT a lot, but it's the tooling around these technologies that really make IT powerful. And you can see that because you can see the variation in the quality of all of the demos of this exact same model being put into different environments.

And I think that it's another area where there is going to need to be investment of like time and thinking around the best way to apply IT. And I think paroit less algol is to look at what it's good at now and making that available to people to try. And I think that, that served as well because that means that you actually get a good feel for both the mita and the strength of .

the technology. What are you most excited about? Like obviously, there's a few kings to sort out um around this stuff, but let's assume that um you've got your work space computer. You can authenticate in the things right. Um what would be some task that you would think would be interesting like this seems like there's so much to explore here.

Well, to answer your first question first, the thing i'm actually excited about is your idea, which is the idea of background us. I'm not excited sitting watching IT take twenty five minutes to not automate a capacity. I am excited about giving IT some sort of tasks that I would do on the computer, whether IT be a research task, probably not a research tasks to be onest probably more like you know a graphics design task or some sort of file conversion task. Some time consuming thing that I don't wanna a do right um or as you've pointed out, things like go through my emails and answer them, you know it's something like that without a API, you can just do IT and I think that that's gona be those kind of things like in systems where look yeah OK maybe there is an A P I and I could do IT but it's gonna take me weeks to build some program to do IT where is he can do IT in a generic way .

I then then oh sorry oh .

but then to answer the first part of your question, what i'm excited about is the of having an agent that I can tell to go off and a synchronously accomplished task knowing that IT has all these tls available tools. Remember when you give the computer as at all, it's not just the computer, the tool is the all of the web. It's all of the programs that can be installed on that computer. Like IT is the ultimate tool like so IT is really the capabilities of IT going to grow vastly and rapid. I like this is where we see the in capabilities of a.

you know what american school too is if you see an example on on github of a project like happen with with like the whole computer, you think you could literally decide your work space computer. Hey, here's the area of the dogs. Go read them. Okay, now go use the computer and set IT a lot for me and then tell me when it's done. Then just I can now play around with that demo didn't need to do anything.

Keep in mind that there's two other things that a computer can have access to. One, we get the inception concept of the A I can have access to A I within the desktop computing flash browser environment. So you could have a sim theory workspace computer accessing sim theory to get at and find out how to do stuff, which I don't even want to think about at this level of .

time that I found with in their development demo that actually demonstrated IT opening up, clawed and. In some art effect website and demanding. And i'm like for the safety sex call, we need a new nickname for these guys because that's not sex called leg is just not well.

not to mention like really aren't we then at the point where he could gather synthetic training data and train a new A I model and you know then maybe become IT.

it'll i'll go off the roles and just start looking at like holiday websites or call of duty.

I think that's why this is such an interesting thing to happen. It's not because of anything necessarily new. It's because of just this reminder that, hey, this is all happening like this is all actually becoming possible. And all these things that we've talked about um happening are becoming more and more technically feasible. Like IT is feasible if IT has the computing ray forces, if IT has unadulterated access to a test of thing without alignment, that he could go off and train another model, which then within IT can run a virtual environment, can run virtual machines like.

and that's why we are definitely living in a simulation.

Yeah, but you go on that path sorry, but you never let me finish my point about what i'm excited about yeah in in the near which is the idea that IT goes often does a task, but when IT a inevitably runs into a situation where IT doesn't know what to do, what I can proceed without information, whatever IT comes back to me asy chrysis calls me on the phone, S, M, S, me pints, me in in an APP, whatever, and goes, hey, what are we going to do about this, bro? Like we need to solve .

this situation and when we releases, I would like a new w up, yes, and .

it's compulsory. You can opt down s yes. And then IT constantly says, boom, every time that accomplish something like all over the screen really just speaks up to max whether you wanted or not.

And it's just like bombed, activated, like, sorry, this just the way the seem theory agents work. And big, in my point, is that I am excited about the medium term where I am the crazy power user telling the agents, hey, I wanted accomplish this today. IT does as much of IT as I can IT comes back, tell tell you all the other things that IT needs to know to get IT done.

And like, let's face that this is how real test are accomplished, like when you actually start. And this is how I know, like with employees are other people you give a steward drop to, to get a task on. You know that they're really working on that when they come back to you with questions because no one is giving you enough information to fully accomplish a complicated task that never happens.

And so you know that there's going to be questions involved in this situation. So having a work flow where the agent can ask you questions in ways that just you ending up doing the task is is really on the asking for the essentials. But you are working with IT to a synchro sly progressed all the different things you need to accomplish in life and business.

And it's doing as much of IT as I can with these amazing tools as got available to IT. And you're keeping IT pushing along. I think you will say people become immensely productive in so many different areas using that every time of work.

Yeah I think I also think of companies that have special didn't like you sas software training like there's a bunch of them. Um I won't mention names, but where they they sort of um walk you through different steps in the application or you know teach you news how to use the right you think about this technology and it's like, well, why would you ever need one of them? Like you can either I get the ai model, like to just go do the thing in the APP like you never even need a one, the APP you just talk to you, a voice is assistant and go, hey, I do like we can figure my sales forced instance to do this because I don't know how to and i'm already long dinner, my work space, computers so you can just go do IT like that's pretty exciting to me.

And soon as we as we said in our demolition, the A I, if IT doesn't know, you can just click every single button .

slowly and figured out IT might take a week, but it's cheaper than hiring a consulting. And then I think like further to that is just training. Like imagine training someone to use something or or do their job like you literally be like here i'll walk you through IT on your own computer .

like a training yeah I mean.

I think for a long way from these uh, use cases, I think that sort of sales was using an APP once a big closer. Like a lot of my work day. Now when I don't know how to do something, I literally just screen share and be like, what do I do? Like literally the other day, I was trying to do something in stripe and no idea how to do IT and I just screen shed and off and it's like, click this, do this, do that. great. I mean, IT works fantastically.

yes. Yeah exactly. And it's gonna have that ability itself like IT just doesn't need the u part of .

we could go down an absolute warm hole with computer use. I think we'll check back next week with some crazy of demos. We've had very, very little sleep in time to actually do all the things we wanted to try out.

So we're going to we're going to do a whole new bit next week um on that. So look out for that. We won't read on that episode les in around.

yes. And I I can safely say there's gonna be rapid improvement in that in that part of this computer usage on the same theory side because we've got a big list of stuff that we know he's gonna make IT absolutely amazing in terms of its capabilities is like i'm really genuinely excited about this one.

I think also like let's be on his voice, is gonna the cool as pot of IT like box the order or it's like .

i'll get back to you bra. Exactly, that's right. And you know, being able to delegate by voice to a system that is capable of accomplishing that many tasks is just really, really, truly exciting, especially when IT has or your context developable to IT and IT has like your logging that has your IT is you and IT actually raises one mal point and now you want to get off this.

But one of our demos we did when we did the VS code, remember, I asked to agree to the times in additions and you like, I wonder what it's gonna do here, like is is going to a bulk like because of its alignment? No, no, no. It's like i'll just go ahead and click yes, yes, oh, okay.

yeah, blow my mind. Like the fact didn't even check. Like just like, hang on, who's responsible now?

Like, I mean straight, why you think legal precedent this is gonna come up? Like this is going to happen. I never agree .

to my A I did believe IT. And then like IT also raises the question if you're delegating a task to someone at work, like a responsible .

for what they about online harassment in the U K, you can get arrested for like imagine um the thing just goes off on one. It's been playing too much call of duty on its a time you know and then I just starts doing races tweet on .

on ex definite moving on now alright so blue three point five minute uh still called three point five cents in brackets new after IT um and then three clawed three point five high cook. I would like to speculate why we didn't get ops um but let's talk about that. There is no yeah so high coup first, all is not out.

So there's really nothing to talk about apart from promises we'll cover IT as soon as it's out. Two or three point five sooner the new or updated, as I think we're calling on on symptoms we have had. And and the good thing is we actually have both versions.

You can test them side by side. Um what what of your impressions been I I A of the new tune? I mean, as I mentioned earlier.

i've used IT extensively. It's done a really, really good job. IT still has many of the same deficiencies that we see when you're doing um like hard coal programing work in relatively unpopular areas like I do like.

And so IT hallucinate still like there's definitely still hurts instance in there in terms of like libraries that don't exist because you've still got to produce a good uh prompt to get good results. However, um I just love at style. I just really loves on its style is so good to work with A V get the job done together, you know like and IT works great for me.

I used to one a bit, but the just the lack of a one is the way I split everything add into like fifty million different sections. It's like i'm not as smart as you A I I don't have time to comprehend, you know, this massive essay you've written about the situation we currently face. I just want some code to copy the and I think so n it's a lot Better at doing that. I found IT yes, is just very, very hard to say other than the computer and to use is very hard to say how it's different from the .

loss one icon yeah IT just seems like the the tune for computer use that also, I think, vastly improved just that sort of tuning of how IT IT makes me. So if you ask you to like visualize some data or you know whatever IT is that you ask, IT just seems like IT always just looks pretty and IT always knows the right libraries to use and things like that. I think that the impressive part, they've also tune out the most annoying thing, which is like absolutely I was totally wrong about you know like .

all of its like ridiculous intros and I noticed I noticed some subtle ties like I said, um there was one particular thing I was stuck on very late last night and I figured that out for me like I I gave a great context, so I get some of the credit, but I figured that out for me and I was like, I went to A E bloody ripper and then IT wrote back with like a little kangaroo emerging.

It's like glad to hear IT mate and like this is an agent with no personality. Like there's just literally unadulterated model and I got the joke and i'm like it's taken the time in all its neons and all that stuff to understand and respond to my joke as well like at the end of like hundreds of highly technical messages I just saw, it's there's just something magic about that. I love IT. Yeah.

it's a great, great model. I think i've noticed that getting stocker down like IT used to have that problem where IT would sort of. Similar, I guess, G P, T four on the other malls.

IT would just dive so far down one path. They all really struggle with this. Know IT just lose you or just IT got stockin areas. And I can't really articulate IT very well.

No, no, no, I totally know what you mean. I had to start that probably is a bad sign, but I had to start quite a lot of chats because I would get fix cited on an idea and you absolutely could not like, you know, in a lot of ongoing chats with models like ChatGPT for o for example, you can change the topic to leave topics like you can be talking about two separate channels. And I will remember and and handle that. Whether I found with this model, yeah, I would, I would get stocking of A A way of thinking, and I would almost incorporate the other idea into the one you are working with, like my pickle box, for example. It's constantly integrating pickles into everything.

So the one thing I would say that is I at the moment, i'm finding myself using sonus like my base go to but then I will do that sounds like what you're doing like go in a direction where all I will switch to GPT for all if I get stocks or I just intuitively know it's Better at something and like, oh, one I will to get me out of very complex problems. I find IT does seem to IT never solves the problem, to be clear, never, ever, ever but IT. It's like peer programing or peer programing almost where someone is like if you thought about this and then you start looking down that path and you like, h, that solved my problem.

You're totally right. It's like a different perspective. And I find that I use different agent personalities. I use different models. Sometimes i'll jump over IT to like, I like, yes, I jump ed over to queen for a bit because I like I just want a totally chinese perspective on the problem and you know put me he puts some frequent higher glimpsing pung I in there or whatever and um and I know they're not charta um but china chinese china isn't a sweet .

no I know I just like just fun um .

yeah so yeah like the ability like just switching models like that helps and that actually brings up the new the new X A I rock model yeah I was .

wondering if you wanted to to talk about that. I like so just back up a little bit because I think this is really not that the marketing of the communications pretty, pretty bad. So on eggs, you if you're a premium subscriber, you get access to what was grown many and grow too right with the API they've released.

And i'm not sure if this is now available next IT probably is it's the model is actually called grog beta and that's what IT appears in symptoms right now as is clock bea, like Chris, you've name you have named IT wrong. It's grow too, but it's not IT says a comparable performance to go to, but with improved efficiency, speeding capabilities. And I was not expecting much because I played around when I first game out with group to on eggs. IT was hot because I couldn't test IT in our basically sandbox environment where I have a good feel for things, but it's a really good model like i'm shocked like i'm literally shocked that good at is IT honestly feels so similar to code. I used IT accidentally for an entire day and I was like, wow, the new solo they're will will improve .

the streaming yeah uh the only word I can think of when I think of X A grog is smooth is just the way the tokens coming is just gorgeous, like it's it's really, really pleasant. And something about IT. And I was saying.

what is this something that makes me feel like more like warm, fuzzy inside? I don't know, I want to show people that watch and all try. So he's like me putting in a query.

it's just so smooth .

you probably pick that and that's code.

that's short line code as well, like when it's writing out like a like actual text. Not that i'm doing poultry that often been you. It's just oh lovely.

It's a steady is the first story I don't know. It's like it's fast but not like uncultured ably fast like grow not the views with this grow A.

A G so fast i'm afraid it's just gonna miss stuff will skip over things or whatever. I know I left out all the peace because I knew you one of this really fast.

but you've gonna give IT to them. Not only did they not exist as a business when I think g you know, GPT four was out IT, they now have a frontier model with an A P. I that is so fast and so smooth and so easy to set up. Like I actually starting to believe they could become a serious competitive.

Now, yeah, they did IT in such a class y way too. You know, you log in, generate A P I K bike credit, use the OpenAI um it's face so it's just a drop in. It's just very easy the'd done everything right in my opinion.

So one more thing I forgot to imagine I have imagined IT. So on the a three point five cents, someone said, I put the new three point five cents in the old three point five cents into a mine craft build off the only reliable bench. Rog, check this out so i'll have to explain IT uh to listeners but it's basically like this sort of beautiful like spiral tower um and it's quite block building if you're familiar with mine.

Crow, I couldn't do IT so yeah, I could not do this. So this is like the new a, the new sonic. The name is gonna kill me. And then look at the old sons in the same build of what IT created if we believe this yeah of exam, I mean, that sounds that out. It's just blocks like it's just coloured blocks um in .

comparison. So i'm so .

yeah it'll be interesting I think. And what same .

prompt like conditions I think they like set .

IT off in in minecraft, apparently with the other same brave to do IT. And like this is what happened. I mean, I don't know, like irani's found this and not sure if we should like use IT as for bean, but I mean.

it's no Better than another game tests that you used to do. We made the same game in different models and I .

was striking yeah yeah I think i'm the yeah anyway so back back on the X A. I sort of you know the weed thing here is like I I know that there's like the elon, my sort of fan boy is a whatever, but this guy does have a capability here to create like a like he just basically created a business out and nothing. I don't think it's particularly a threat right now because like obviously OpenAI have superior models. I think anthropic is still superior, but you you can really see these guys catching up when you seriously use this model well.

And as you point IT out, they have access to like very real time news like um they have a very excEllent data source um that no one else can really get. So that's a very interesting factor they have available to them.

yes. So anyway, to be really interesting, I think what's good bad is we finally have an A P I from them. We have another frontier model in the in the ring. Um how will this place that i'm not terribly sure because I just have zero loyalty to any of these companies or models at this point.

Like and I think that's why we're seeing heavy investment now in their web, in the faces because they're pretty desperate to get locking on the consumer side and on the developer side. I was starting to see like Better tools that you might build your whole business around like the computer use from anthropic. You might just be like you they were first, I I baked that into my system. Therefore, i'm kind of locked in and its .

I must say the the conversion from the tall use between anthropic and OpenAI is not easy. And you know that's where we get out of drop in replacement territory is strikingly different.

yeah. So it's sort of like the ecosystem that building around their models now is is becoming maybe a bit more critical. But I think back to the point we mention olia around ops, uh, i've noticed that A A Simon willison and actually had a post on egg saying here's the inner t archive's way back machine confirming that the documents used to list clad three point five ops is coming soon. But IT no longer mentions IT like all traces of IT had gone from their website.

Didn't OpenAI do that with a model back in the day as well.

I think they did they two for a few days. But yes, so it's gone. And I just wonder if they've said, well, either you know safety I doubt IT.

I'm sorry, I just doubt IT. So that's like I know people like the fancy like they're just holding a back because of safety. They're either holding IT backs strategically to wait and see what OpenAI has.

And like IT, it's like literally poca and all there are the confident the computer use sort of Cherry. They're hang in that Cherry to stay in the new cycle they announced today, you know yeah raised the money. And then if that if the fun raising is not going well, like drop the opus, it's like we're really gone to get the billies.

Now it's almost like that, like the nuclear proliferation thing, where it's like the threat of something is actually Better than the thing itself, like, you know, more powerful than the thing itself .

yeah yeah and so and then you've got all this talk around OpenAI releasing A I on this like model to save all models, which you know like, quite Frankly, that might be amazing. Maybe they are quickly screaming now being like, damn, we should have done computer use like, i'll have to catch up on that. IT really is making OpenAI dance.

All these releases for anthropic now, like we saw them have to respond to auto facts with their canvas product. But then I said to look at the computer use now, I think, well, I mean, like I bother like when you could literally get the thing just been up to go but anyway, um I dig. So lots of competition, lots of good stuff coming, and we will see how that all plays out. Now there are other releases in the week, don't have time to cover them, have not slept in like literally two days.

So we there's a few that i'd like to definitely report back on next week though. Like the coherence in bed three is very, very interesting to me around a new method for doing embeddable for rag searches, including the ability to natively search images. I think that's very powerful.

I am really, really excited to try that. Also during the week drinker in now this day, N. A, I, discord pointed out a totally different open source library for embedding like.

Because I think so far, everybody's just used open day eyes embedding that no one really talks much about IT. For those who don't know, ebel ding are just just the way you give mathematical scores to text in documents so you can retrain the summary y and search them. And we've basically discovered over the last few months that is simply not good enough like it's okay.

You can get reasonable answers, but it's not gonna a be how people want to conduct business and work into the future. There's larger context help, but there is going to need to be Better techniques around here. So saying a couple of announcements around this area is stuff I really want to explore and report back on.

Yeah we also had stable diffusion three point five, which we will make available soon on seam theory for those interested um we had that moshi won the videos so will play around with these hopefully next week. We last way. We honestly were like we're not really feeling that there wasn't much going on and we were, quite Frankly, busy with a few upcoming things we want to release. And we just thought, no, we want to do IT this week. It's just like too many things and basically had .

a podcasts dcs debt .

yeah in podcast debt. So we owe you maybe we will do a midweek. If so, we probably want less on.

We never do. We always say we will. We never do but a few call outs before we go. First of all, if you want your own work space computer, we will be making that available very soon.

So get signed up because people on sim theory that current members will have first priority computers because obviously, we are spinning up real computers that stay your computer like you get this cloud computer, right. So uh, that's one fact. And you'll be able to do all sorts of things .

with this were really the anti alignment company. We won't restrict anything and we will deliberately try and give you models to let you do whatever you want. Um you know as long as you're responsible for, yes.

you will be able to use this, you will be able to see what's happening in real time. We didn't have time to actually wire IT out. That's why Chris was reading out the things.

But ah, this is coming very soon. Uh, we think it's gonna really interesting. I don't think it's gonna accomplish that much for you.

yeah. But obviously we all know how these things play out. I will get .

Better and bad. I think, you know, my my thinking around IT is this IT might not be ready for the prime time in terms of like setting, setting and forgetting in terms of accomplishing task. But we really feel like this is gone to be the future of work for a lot of people. So experiencing the technology or at least one take on how this technology would work, I think is valuable for people to try. That's my thinking.

Um and finally I wanted to do a shout out uh for uh for uh kate and uh sapa. I'm going to say that wrong for some reason, but it's probably not wrong uh, who is a member of our uh and is potentially helping us organize our first ever average conference maybe next year, probably next year now um so a big shout out out to katine, but he has released and a general AI jump start calls for those people listen to the show and just want to get up to speed on everything that not only we talk about, but I think that covers the whole bunch of stop. I mean, to put a link in the description of that course.

So if you're interested in just educational material around this stuff, she's also given us a group on code for listeners. You get like a bunch of money off the course, I think quite a bit actually. Um and so there's no commercial relationship.

Just to be clear here, i'm just doing to shade out because she's a member of our community and we really like a also wider if you google her name, the first result that comes up is about her a doing nude skydiving, which is now i'm interested yeah, sound great. I mean, I should have just level with the nude sky diving. But anyway, fantastic person, i'm sure to great cause i'll leave a link in the description if you're interested in checking that out. And of course, if you want your own workspace computer, sign up to sim theory dot AI support out threlfall all night efforts to bring you these crazy, crazy tools that IT curse we IT. We did IT IT, but the work continues.

I've got a lot I want to, to get done to get into this product. We've got some really, really cool stuff coming up, and i'm looking forward to next week episode because we basically already have enough material for our next episode.

If you want to listen for one thing that I completely left out of the episode after the music roles at the end, I have a little bit of a surprise for people. Cool, right? Like subscribe whatever stuff. This has been another average forecast we'll see next week, goodyer.

You told me in very high what the temperature was, but what would that be? sales?

Okay, that will be about two hundred degree salus. So just keep the heart and system. Are you ready to cook out?

Yeah, yes. But I wonder, should I get the ingredients ready first, or should I cook the wig finites, get your first cooking salas smr and more fun. Plus it'll be ready to enjoy the delicious burgers right away.

right? That's cool. So much good way to hear how they turn out, how fun.

EP82: Crazy Computer Use, Anthopic's Sonnet 3.5 (New) & the xAI Surprise

This Day in AI Podcast

Chapters

Introduction

Can a Computer Really Order Chris a Coffee?

What Does Anthropic's Computer Use Mean for AI's Future?

Claude 3.5 Sonnet: New Thoughts and Opus Speculation

Why is Grok Beta (Grok 2) by xAI So Appealing?

Did Anthropic Kill Opus 3.5 and OpenAI Orion?

Shownotes Transcript

PodQuest PodQuest Podcast Discovery Engine

EP82: Crazy Computer Use, Anthopic's Sonnet 3.5 (New) & the xAI Surprise 01:09:17 Share

This Day in AI Podcast

Chapters

Introduction

Can a Computer Really Order Chris a Coffee?

What Does Anthropic's Computer Use Mean for AI's Future?

Claude 3.5 Sonnet: New Thoughts and Opus Speculation

Why is Grok Beta (Grok 2) by xAI So Appealing?

Did Anthropic Kill Opus 3.5 and OpenAI Orion?

Shownotes Transcript

PodQuest PodQuest Podcast Discovery Engine

EP82: Crazy Computer Use, Anthopic's Sonnet 3.5 (New) & the xAI Surprise