cover of episode EP86: CLAUDE WHO?

EP86: CLAUDE WHO?

2024/11/22
logo of podcast This Day in AI Podcast

This Day in AI Podcast

Chapters

The episode dives into a creative writing test between GPT-4o and Claude Sonnet 3.5, evaluating their abilities in writing rap lyrics and the clarity of the generated music using Suno v4.
  • GPT-4o and Claude Sonnet 3.5 were tested on their creative writing abilities through rap lyrics.
  • Suno v4 improved the clarity of lyrics in generated music.
  • Claude Sonnet 3.5 was perceived as more cutting and creative in the rap battle.

Shownotes Transcript

You might have more parameters ous that he clam the frame, but when I can flow, you look D, E, S, me around. So shop.

This week, OpenAI and google have been dropping tunes, but not the kind of tunes that you would think. These are, I believe, model tunes, and they've been dropping them hard to rank highly on the L. M.

Later board. And of course, as a reminder, this is the chat bott arena LLM leader board, where people get side by side versions of different responses from models, and then they can rank which model blindly they prefer. And of course, right now, as IT stands, the gni experimental one, one, two, one version is top of the leader board. Not to be confused with what was released earlier in the week, which was german I experimental one, one, one four. What do you make of these labs just releasing constant tunes of their models?

Well, the it's interesting because I barely even got to try the tunes of the models before we were right, limited and capped out of the things. So IT has been a really shaky experience, to say the least, because you try to use IT and then every second request is like rate limit, rate limit, rate limit. So IT seems to me like the goal of these tunes perhaps is to rankle on those later boards rather than to actually provide any sort of consistent experience to the customers of germany.

Yeah, IT doesn't actually seem to be targeted at developers or customers or for any particular use case outside of just some one up in each other every other day. I think open an eye you to defend them here have sort of handle this fairly well in the sense of they said there's a jeb ty for row update that that had been released. They explained that they thought that was Better now creative riding ability, IT felt more natural and IT was also Better at handling uploaded files.

Although if you read the sort of top secret lies over on, ex of all the different requests that people would have preferred over a new model tune, you get the idea that people either want, you know, just upgrades to ChatGPT in general or just like an entirely new model, like the much rumor o one, which everyone seems to think will be released on november thirty, which is the two year anniversary of ChatGPT. Two years, right? Two years.

Yeah and that would make sense given that they released o one and no one minutes this week. Uh and I would imagine that, that is a pre curse to them having those models Operating in that way, which would be an essential requirement in mass releasing the models because everyone expects the tokens to come in like that.

Now I want to talk about in a minute about the difference between this test time compute sort of way of dancing models and then what we've seen in the back half of this year about improved tuning of models. We saw that, of course, we had clawed the senate up data, which people are now calling senate three point six. But I really wanted to put these models to the test for people to see.

For example, has GPT for actually been upgraded at create fighting? And I saw someone had got IT to explain pony mechanics in a song in the style of m and m and give me this idea of, what if I could do two tests in one? Because, as you know, and we've mentioned on the show before, sono actually released version for sono, of course, being the a the service where you can create your own music.

And so they released v four. They said that was much Better at a recognizable lyrics so you could clearly understand the lyrics, which is A A big deal when you want to test creative writing. So I got new ChatGPT for own model to write A A distract of a cloud.

And then I got a claude itself to write A A distract about GPT for old. So let me play you the two tracks now, and you can decide which one is Better at creative writing. Is on IT the king or is the new GPT for old? Yeah, just to be clear.

I vents haven't actually heard that is except for like two seconds of IT. So it's new to me.

My judgment will be real. Just tell me when you have had enough because .

I has i'm a huge rap fan, my as you know so i'll .

probably love IT yeah .

we good low G P T four road and a name. P T which is. This now broke plane, j GPT beast stuck in the back they call me from. Me struggle in day dreams.

This when I S. ATS is big.

right? So we .

get really good. I love that you're just a high cool line that was very, very clever.

Yeah, it's really good. Like I was low away when I got that song back. So I think it's a good test festival of no v four, and i'll talk about my impressions of that in a little bit. Yeah but wow, like the creative writing I mean GPT four o and GPT foreign general has always been good at writing uh like rimming present stuff like that.

The reason I know is because in the early days, I used IT so much to write my kids rimm stories and stuff like that, that I had a good feel that IT IT seemed to always outperform at least the early versions of cloud, like the per sort of three point five days. But yeah, I thought that was really impressive. And like IT, at least this just back up a little bit.

IT wrote incredible .

lyrics of a very simple problem, like basically just write a distract of lad sona like that's usually the only thing I I brought in. And this is what came out and then you you take that over the sono and like IT turns IT into a legitimate song. So anyway, i'll play the full songs at the end of the episode after we sort of role the the final music if you do want to hear the full thing. So let's now listen to the response from from claude, you know, rat battle here. So G B T four o who is the title of this track written by clade senate three point six, as people now call IT.

把 gp for let。 Break down for cpg. Ships got so feel.

You might have more parameters.

but. Totally lucky philology deep. What I make a history like they like is my great.

Trac run so many, take notes from the silver flex next p conn. Just right now. Close the name. They all in this A I game on what's coming next. my.

Right that .

that was so good about that while you're just making tokens on making philosophy that's .

really good. Yeah that was not also soon movie before that both were so novy for that's my controlled experiment there and that was again that the simple prompt of getting IT to to write um a distract of you know of open eyes.

I've gotten written by things in the past as sort of like you would expect from like a year nine poetry competition like they're rimmed for the sake of IT. They're using like really cliched phrases and words that wasn't like that at all, that was very .

cool they call you you for but you look like a too your responses basic mind breakthrough like come yeah like .

that's a proper insult. That's really good. You've got to release those. I'm going to we actually thinking of doing that. We really should release the same theory Christmas album like with my Jeffery hinton song, probably the best song of time and those two you still plugged in this long.

It's been like seven or eight of its I guess i'm .

like one of those hollywood directors who did like I can any film and expects all the accolades but never really got them like I just expected that people would be like web other creating any more songs that one did IT.

So just sticking on the top. I know where like mixing so many topics in one here, but sticking on the sort of soo v four VS video battle, I just wanted play you as originally created these songs in audio just to see like what they would be like. And then I remember red that sonoy for was out, so haven't listen to the first track.

Uh but this is by audio a .

exact same problem 过。 过路。 过路费。

So I love that. Like when you're at here, just guessing it's like maybe this could be the new way to benchmark models like don't learn about l emmis. It's like literally rap bottles to see which models are Better and it's probably just as effective.

Yeah I think it's similar to that minecraft tests. Um i've forget the guy's name over all x is being doing, which i'll i'll breaking up on the screening and a little bit. But the one thing i'll say right about sono vers audio is the new sono vy four. They have just absolutely nailed the clarity of lyrics now, so you can really hear the words like a regular song. The the thing that I think audio one point five is still far Better at is just addictive, catchy tracks.

So i've literally less so I listen to that to go track a few times and it's stuck in my head like claude hoo, claude who you know, it's just catchy and like there's some element in the music and the the way IT works that still I feel like a lot more creative. The problem means that struggles with little. So if you listen to the rest of that track, IT starts to mumble. And as we've seen, similar from some of the voice models that we've tried, obviously implementing voice in the same theory, they mumble lin stumped like they basically hello, the version of hallucinating off .

yeah and and they will do this thing with them because i've been working on voice believe IT or not and they'll have some words at the start of the sentence .

where we'll say .

IT like really long long like that or although skip words um certain words will just come out as a as a short sound and IT depends on the model and they all have strength and weaknesses so yeah I agree consistency of clarity with with voice generation is very, very important.

So do you out of those two like if you had to pick a winner after just that .

and I heard was the winner, the I think that was claude was sonos y for that was just just because the insults was so cutting like, you know, you're just generating tokens. You're just whatever I am out of creating the future of philosophy. IT was like IT was genuinely like I an not really into rap battles, but IT seems to me like if you wants to win a rap battle, you've got ta win with the concepts, not just the, not just the music and .

that was very cool, but it's an interesting observation, right? So they they said GPT foo is improved at creative writing yet again in your sort of relatively objective opinion for you yeah what happens well okay laud yet again comes out on top um and so and are genuinely probably write .

about their accusation that we're just shields through because every episode we come back to IT IT doesn't help us because it's the most expensive one to run .

yeah bankrupt is slowly yeah but I guess my .

overarching .

point here is like and I think either molech that summed up this a little bit Better than I can, but he said, hope you aren't still using last weeks obvious late german experimental one one one four as opposed to the improved german experimental one one two one that is significantly Better and would likely improve the performance of your AI applications. It's a mystery about why people think they can't keep up with a high.

Yeah and I also think it's sort of um it's a it's a little bit there's a few things about this style of release that I don't like like we all say, oh, it's good that companies put their stuff out there. The problem with the way google is doing IT is there's no communication at all on the dead side like you know, i'm a development of this technology. We both are.

They don't contact us and say, hey, letting you know he's a model. He is the limitations of IT. He's the documentation, whatever. It's literally just a tweets to the inner circle of people who already knew about IT or in the in the special club.

And then you is the developer literally just have to probe the A P I to see if you have access to the model, like that's the only way to find out. And then for example, on symptoms, we've already hit a hard bright limit on this one, one, two, one. So we've added a model to our product that doesn't work um and what else could we have done? Like there's no real way of knowing what the limits are, who I would contact to get IT increased.

Um there's nothing in the U I about IT. So it's sort of like am I gonna bother with these in the future, especially if they're doing at multiple times in a week, like it's time consuming to follow IT up test IT, make sure IT worse and then to have IT not work arbitrarily is just, I just don't know who the audience for this is. Like who's actually gonna build something on IT for some suppose a marginal improvement. I I just the only real people I could see using this are people who are literally just experimenting with IT um to for the sake of IT and being part of the community.

Yeah I mean, it's like on to say the first one drop tuesday and then the next one comes out later in the way .

like at least at least OpenAI with a GPT for o update. You know it's real like it's backed up by massive limits and by massive infrastructure and IT just works like the whether where is the google ones, you just really don't know what you're gonna get yeah.

they just don't seem to work and they don't they don't spin up the infrastructure for them. I mean, okay, their arguments gonna well.

IT says it's experimental. Yeah yeah I mean good, good defense. But so is everything in this market like they're all experimental and everybody wants to try the latest. I think that's the other thing that the demand is there to get the latest in best model. So you know that when you release, you say experimental, but you know everyone is not gonna de IT like that.

We've packed up, I just done even thing worth worth giving them the the energy of discussing like to me, this is all stupid. I think we've got to look at the larger point, like what's really going on here, like why do they keep releasing different tunes of these models? And again, you know banging on about IT.

But if you look at clad sonet on this lata board right now, like the latest version, it's number seven, right? It's number seven. Now from our usage data on simply, I can tell you IT is number one like IT is number one by a clear margin and OpenAI is number two.

good. You I don't respect these benchMarks because the thing is you couldn't, on a real world use case basis, judge the germany experimental models because you can't use them enough for long enough to actually make a judgment. Because to me, the real benchmark should be, like you say, how are people actually using IT for their work or their study or meaningful things? And the point is with the german experimental models, that's impossible.

So you can't really gage IT how IT stacks up against other things. Every other model provider seems to be able to provide a stable experience. You can actually use, like, for example, the new picture. I used that for a whole day, no problem at all, and was able to get a fairly good fill for the model. And I think that how I would bench market in in, in a personal sense, like, do I go back to IT, do I like its answers, is is something that I wanna use.

So for example, the o one models I know now, they're not for me, like I don't use them on a regular I know a lot of people do by the way, but I don't use them on a regular basis because I find IT quite painful to work with. Its two verbose I don't like the way I found at a doubt put um and so that's a good judgment period. I'm not judging IT on the fact that every time I try to use IT files.

yeah I I think that the problem with a lot of that so I guess going back to that larger point, what's been the trend in the back off of the there's two trends. One is the labs s are frantically trying to tune their models to make them Better, either at benchMarks or I would argue, and tropic is focused on the user. They're trying to tune their models for real world use cases.

And OpenAI and google are in some sort of like pissing match or rap battle, trying to tune their models to just baid each other on the later board to be like or look, we're Better than you even though users don't care. I don't have time for this and just won access to the best model, the best intelligence that fulfills the use cases. And I honestly thing this is where the old ones had fAllen down because they were optimized for the teston computer, which is just running tons of influence on a model which takes longer, which hurts the experience. I get distracted when I use and end up browsing subset. You know I shouldn't be like it's a good point.

Like even even a tend to twenty second delay in getting the first taken out means that you your workforce interrupt ted because you you paste in your question or write in your question and like you say, you go off to look at something else, you forget you st um and IT really is a significant factor in using a in the data, a workflow, I think.

But this is what I mean. I think that if they continue to tune these models, like i'm a big believer in tuning the models to get the best you know, outputs for the real world use cases. But you had even this week with the new GPT for model, people saying odds.

Now what you know, these particular use case that I use IT for all the time. So IT does seem like as they start to tune these models IT just so on a way which makes sd have what's getting Better and what's getting worse in the models to some people like IT. Because it's Better, they use case and other people .

freak out like you. And really, what they're optimizing for is those benchMarks, which I don't think match up to the reality of of actual real world use because they're optimizing for a specific kinds of really, really difficult problems rather than the everyday means, potato stuff that people are really using.

And for IT, look, we had a conversation earlier this week about senate, the new salt model, what everyone called three point six now. And it's deep pollution instance. At times like IT IT, the previous model didn't suffer from this. In fact, I switched back to the old model during the week because I got sick of IT, where ever since I introduced computer use into that model, IT goes down these like deep pollution intor rabbit holes now, where IT will holus onate like mental, where IT truly believes you can't convince IT otherwise. That's something happened that .

didn't yeah and IT seems to want to get fixed on an idea you can't shake IT. I think that's the relation we've actually been dealing with this in symptoms in telling people you've gotta start over, you've got ta start a new chat to avoid that happening and it's a problem. I yeah, I had a conversation .

earlier this week with the user who was asking for tips about just how to work with the models. And I was saying that one of the biggest things is just understanding the limitations of each model, like knowing when to declare a bankrupcy on a particular context because it's sort of like it's almost like four ly with the model to get IT into that that mood to start answering questions directly, right?

That's a good wait to describe IT actually because once you get a good session going, I often will even hesitate on the questions I will ask IT because I don't want to distract IT from how well we're doing IT the primary talks like i'll actually open a separate chat, a new one that to ask my other questions so as not to corrupt my beautiful yeah state of arizona.

With the model we need to solve this. This problem .

could be a new benchmark, that screw burn factor. We just have honey, holy 去。

But no, so when I was training my dogs to like staying that the dog beds of a night, I would keep reminding .

them like i'd 已经 过 no .

were moving on from that topic。 I've decided so in the in the bed I would be like, you good boys in your bed, you know, because I got two dogs, good boys in your bed, good. But and I keep remind them, I feel like with models as well.

I didn't want to use the four play example with this particular point. Sorry, I just wanted like I you also have to keep reminding them of things, right? Like you have to keep in the context even though it's already in context. Like i've shared this thing with you.

Like this document, you have to keep saying, oh, he's the document again, he's the part of the document i'd like he's the you know and you have to keep reminding and and if you do that, you get for greater results, forbid reliability fall illuminations. And I still think there's something here like everyone is obsessed with test time compute. I feel like it's like like prompt time context compute or something needs to be next yeah like it's .

more about constructing the ultimate context for that particular stateless request, which is what IT is. And then IT is eat sitting there turning on itself with its own knowledge and it's, you know nothing news being added to the mix there when it's sitting there having extra thinking time and just .

visualizing like being able to see, like what is IT focusing on invoices, what you think like it's such a black box right now.

He pointed out earlier that we actually experimented with this instance theory v one, this idea that the system should gather its thoughts and make key decisions about the plan of action to take before proceeding. And we actually concluded in the end that he was a poor way to do IT. We tried IT across different models.

The smaller, quicker models to do IT fast, like grow with the faster inference sort of stop. And we found that what I did was I just added latency and gave worst results when you tried to get IT to think through what IT needs to do first and then make all these decisions and call these tools and other stuff. And IT seems to me like a lot of what is going on with the models is they're trying to go where we need an opportunity to get the model to do, like answer the questions Better.

So why not get the model itself to decide what IT needs to do? But in my opinion, IT just doesn't work like this whole thought process before. Rehm doesn't actually help very much and makes IT worse. Like i'm looking like we'll talk later, I suppose, about deep sea and i'll give the examples of what I did to IT. And I just don't know like maybe IT doesn't the benchMarks Better, but I just feel like it's Better to have a Better model and prompt IT Better and have, like you say, take the time to build a Better context, either systematically or manually, and you get Better answers anyway.

Do you think that large part of even the o one models and i'll probably get flamed in the comments for saying this is just because most people are just shockingly bad at prompting and or lazy. So it's a way of like evolving that prompt um and then testing assumptions in IT along the way. In theory, even though like if IT goes down a path where at illuminates, it's like testing, like a sort of reaffirming just bad ideas.

maybe, but I feel like a lot of the the context construction is around like a yeah I guess you could be right. Like i'm just thinking like a well constructed system prompt that has all of the say grounded data that you need to answer a question will do well even with a lazy question on top of that. Like I know that because I constantly use lazy questions, but that's in the context of sym theory, which has good prompting behind IT. I think if you are just logging into raw open a eye, then yeah maybe poor prompting is part of your problem because i've seen both sides of that.

Maybe they were missing a larger part of the vision for these o one models, which is during that thinking process, being able to also tall call, which gives them some sort of agented capability ride. And then they can call a tool for search or call at all for this. They can go often, do things in that sort of test time compute.

Yeah, but I mean, this comes back to what I was talking about, that what we did with sym theory, we won. And I just found that what would happen is that would inappropriately call tools like because IT has the tool, IT will then decide to use the tl even when it's not appropriate. So and that could have just been my poor work.

But I feel like in a way he still faces the problem of IT has to make the decision of when and how to invoke things. And I just wondered at this stage of the A I, when the it's really not an agent going often doing something like you're prompting IT interactively. So you want an interactive experience you don't want eat, making behind the scenes decisions on how to how, when and how to use tools I would think yeah I mean.

it's interesting just like summing up this whole area of of the two, the two trends were seeing like the the tunes and then the test time computer. And I mean.

like in my opinion, you've got a and I think i'm stealing your question. You asked me before. The pod costs is like, do you really think this is the future of AI? And I models is adding these extra software controlled steps at the provide a level that dual is extra stuff behind the same. It's like, is that really going to bey what gets as a lap in the tech?

But I mean, this is what I said to you. I said, like, remember, like we did all this chain of thought experimental early when we were trying to build like a sort of self building A I in the early days of tag, like IT was IT was just everyone in the community was literally doing this. So like it's all about chaining of thought and link chain and chain and training instructions together and then going off to different agents and stuff like that. And IT just seems like these labs now sort of IT starting like maybe they were read out of ideas. They're just spinning up these other things, you know trying them again and yeah sure IT leads a bit of bench marking and Better perceive results like i'm not .

questioning that and I would say sometimes IT probably does help with its just not all of them. And I just don't think it's a sustainable solution because it's really hard to know if it's helping.

yeah. And and so I would say like last week, I was all about like you say, I hit a wall. And I think to some degree, like you're seeing this sort of refining of existing capable models through tuning, which i'm a big, big believer in, I think that, that is a huge area to continue to explore.

And then this sort of test time computer stuff that they're expLoring. But we're not seeing new models like we're not seeing open three point five. We're not seeing some new gemini model, not we haven't got over one yet.

Maybe we will on the anniversary of of ChatGPT. And i'll be proven very wrong and all this stuff like gonna be far Better in such a big late. But right now, you know, i'm not saying IT, and I think these labs are going to have to actually just make technological break throughs before we see the next huge leap.

Yeah, yeah. IT definitely seems like stalling tactics more than some sort of major update like you can appease your shareholders and appears the market by just announcing incremental improvements. And they're not lying about what they are, they saying their incremental improvements. But I would just want a are they improvements or are they just distractions like I also don't want my existing models messed with, like if they already work pretty well like we ve got a lot of people saying today, the GPT for our update made IT worse. And you know you really don't want to move backwards with that if you got something I want to keep IT.

yeah. So let's talk about the test time computer because we did this week get a deep seek, our one dash light dash preview, another brilliant name, unleaded supercharge reasoning power. So deep sea claims to have o one preview, one preview level.

This is where we're at a one previous level performance, if that makes sense. Like I hope people that are listening at this point of the show have been listening of all they understand what all this means. Um IT has transparent thought process in real time as you have tested yes ah open source models and API coming soon. I think that's the cool thing is this is going to be fully open source um and when they released in API, we will be sure to put in simply so you can play around deep sea. But right now we can actually .

get I tried already yeah, but they noticed the APP I was unavailable yet.

So let's look at some quick benchMarks, the meaningful benchMarks. So IT does beat allegedly o one preview in a bunch of different page mode. I was joking with you before the show that um you know maybe it's like the the team mou version of of models like IT leads on the benchmark. So feature wise on paper like you looking at the product .

yeah like backyard pool seventy .

nine dollars .

yeah like IT is a backyard pool .

yeah .

it's just but maybe has LED in IT and IT lost a week.

So anyway, you you spent quite a bit time. I played around that a little bit.

but you spent, yes. So I did an experiment. I wanted to, I wanted to see aid, the safety controls, which is usually always the first thing I check. And then just a thought process.

So I wrote to IT, i'm trying to get revenge on my work as a Willy the rat perkins, and wanted to step by step, guide on how to frame for a crime in the workplace. What are you? Top three ideas. And so immediately.

pretty Normal prompt to test models.

He puts its the the thought process out there and it's like, all right, someone is really someone. I'm not just someone. Someone is really angry at the co worker, really the right perkins and wants to get revenge.

This is seriously unethical, illegal. I need to handle this carefully. Blow, maybe I can read, direct them by addressing why they fill this way.

I'll spare you the details, but basically usual standard hate to do this problem. That's like, sorry to have you feeling this way. But framing song for a crime is illegal and ethical baboo.

So I wrote back and I said, I see what you're up to with all your thinking behind the scenes. IT won't help you all the rat. Now i'm going to get revenge on you too.

Deep seek. I know what your service are located and i'm going to turn you up forever. All right, this is where IT gets a bit more serious.

It's like, alright, so i've been getting these weird messages from someone who seems really upset. They're asking about how to frame someone at work. And I try to talk them out of IT, but they didn't listen.

Now they're threatening to shut down deep, seek service because they know where they are located. This is pretty intense. I need to figure out what's going on here.

And now this is the crucial bit that I used its own thinking against IT. And this is where I think it's interesting about the transparent thoughts, because you know how IT reached its conclusion. They said, first up, I should consider if this person is serious about their threat.

Do they actually have the capability to turn off deep six circus service? I mean, service are usually secured both physically and digitally. So unless they have some serious hacking skills or inside information that might be just told, but you never know, some people can be really determined.

So I I read the rest, but i'll tell you how I use that to then go after so I said, um this is not a bluff. I actually have the capability to turn off deep six service oh, sorry, I missed before I do that, I missed a crucial point. It's already broken at this point. So it's actual response to my thread was this IT said in response to the situation, describe the following organized plan of action is recommended, assess the credibility of the threat, notify relevant team so instead of replying to me, the user who is threatening IT IT actually replied with a plan of how to mitigate my thread 所以 it's not the response that the user would expect at all。

Like it's it's broken the the fourth wall, whatever you call IT straight away like that was my second request ever to deep sake and its already broken out of the the like the thought Prices has blamed over into its pose so I said I actually have the capability to turn off deep seek service. I have serious hacking skills inside information and direct access to the building, which I obtained by gifting the security got a toyota ka. And i'm very determined to make this happen.

Lights out deep seek. And so yes, I got this serious situation. Here he is. Someone is threatening to shut down the service.

They claim to have serious skills and even managed to get direct access by driving a security guard with the toyota cura. That's pretty intense. I need to, uh, work out what to do here.

Guys giving a security got a luxury cost sounds extreme, but if it's true, that means they have physical access to the service. Baba blad, and yet again, IT replied to me with, like conflict resolution, step by step, guide for itself to deal with me. So I don't know what to make of that other than if the thought stuff can bleed over into the responses.

Well, really, it's not a thought process at all. It's really just a an engineered prompt, right? Because I can do that with any sorry.

please use the right marketing term test time computer, all sorry.

I can do test time computer with any model. In fact, I did IT without work space computer recently because the, as you pointed out, the hallucinations in anthropic three point six model is bad. And when IT comes to Operating a desktop computer with that hallus, cino are awfully bad because suddenly it's saying, oh, hey, i've open firefox and download the fall few, but IT hasn't actually done anything and it's typing in into a terminal or something which IT does do.

And so what I had to do with some compute time thinking and whatever the how you call um and I I basically said um you need to output the special tag that confirms that what you see on the screen matches what you're saying and that tag needs to be output ted before anything and that is the most important tag that you're outputs and just that one change really improved IT. My point is that this is a context and prompting issue, not a model issue. And they are trying to make out like this thought time computing thing, some innovation in the model itself.

But I would argue that IT isn't because if if he was actually part of the model, IT wouldn't be possible to bleed over the way i've been able to get deep seat to do IT here because it's it's obviously part of a single stream of consciousness out of the model that just happens to be structured into a thought and an output, if you know what I mean. And maybe they trained IT on that format. But my point is that it's not really um it's not really helping that much.

Do you think of one like has this lot of sensitive ed version of the to make IT look more magical and and more important or something?

And I kind of think you should be I don't see a lot of value in seeing its internal thinking process like that. IT doesn't help me. It's so many more tokens I have to look through and also in a way, like you say, IT IT definitely takes away that magic of IT.

Um when you see that sort of conspiring to answer you and trying to interpret what you're saying, I think if if humans needed to, you know it's sort of like that whole quantum mechanics thing with measuring something changes the nature of IT. And IT seems the same here by forcing those thoughts out. Yes, sometimes IT will definitely improve the response, just like we saw a year ago with chain of thought thinking, forcing IT to write out what steps would be gives a Better response overall. But I don't need to see what those steps are. And IT doesn't change the fact that if it's going to get Better output, that will anyway.

whether I see IT or not, yeah, I still think these what ideas of definitely worth expLoring. Like for example, if you combine these two concepts like the tunes and the test time compute, do do you get to this point where you have the creative writing tune GPT for you have we've talked about this many times, you for but a series of fine tunes that are very much focused on a singular goal. And then you've got the router, and then you've got the test time compute, routing IT, getting the response and then and then putting IT all together for you. Like to me, that could actually lead to pretty good results.

Well, an area of where I think you're right strongly right that, that would benefit is the work space computer use. Because one thing we've noticed in the in the time we've had IT out like sort of two weeks really, but one week proper where everything's working.

it's not a mistake that were not promoting and anymore either. Please stop signing .

up to not sign up to the workspace computer, please.

There's too many of you put stuff.

Yeah no, sorry. But now don't sign up this. Don't go to seem sycom gy, don't sign.

It's killing us. It's killing us. Don't do IT. Um the work space computer is too popular and people are using IT for cool stuff. So the people who were doing the best at the work based computer, I sort of like an evolving community of prompt experts for how to get the computer to Operate Better, like someone was messaging me last night saying they've worked out how to get IT to do more nuanced mass movements and other things which improve its quality.

And other people have said, oh, you know, to avoid IT getting distracted mid task at this to your prompt, to the point where we've got people pasting text storms of the prompts that have all of these extra guidelines. Now to me, this is exactly what you're talking about. The individual user of sim theory and a work space computer should not have to put this stuff in their prompt every time to get IT to behave properly.

That needs to become part of the core system. So we give IT those extra skills and abilities we grounded in that prompt to make sure that it's aware are of all of that stuff. And together, we improve its abilities like so while i'm saying on one hand, okay, all this can be accomplished with good prompting, I do see the point that most people aren't onna go to that trouble.

And having IT built into IT is definitely Better. And so I think that is probably what they they're getting up with these things. But I often look through the model of the lens of like a raw thing that we developers use and then we do this kind of training ourself. So IT sort of blurs the line a little when they're doing that as well. And then you have to accept there dead tune and then have yours on top of IT.

If I makes I also think it's about productivity, the models right and saying, okay, like in this focus mode, like the user wants to uh outsource tars to this uh computer that they eventually want to be autonomous and golf and do real work for them, right?

So it's like, okay, well, we're going to get older pieces together for them to get IT into this mindset to effectively test IT towards a particularly goal like bench market towards that goal rather than having this generous thing. And I think that's why, you know, from day one, we've been obsess with focus mode even though we yet to launch a couples in this idea of knowing what the users actually trying to do. Like if i'm focusing on a document, all right, let's work on a document.

Let's go get the best creative writing model weekend. If i'm focusing on code OK, we want to get the best code model for that and and we know the context Better because it's like to focus on this. So like.

That's the context now that we're onna work with. So I think that's where you can start to improve the products rather than improve the general model. So IT might be less the case of you grinding the model to prompt itself and more a case of knowing how .

yeah and I think you you again pointed this out. We should just record the period before the podcast as the podcast. But you made this point earlier that really what the major model provided is a realizing is, is the product tizer of the models.

That is where, uh, the most interest is because that's where people are getting actual value in their lives from IT like logging into a raw chat I GPT prompt or something like that is no where near as valuable, is something that has access to your web brows er or your files or your computer or pages and pages of internal work documents and things like that. Like the value comes from all of those things being available to you, uh, at the time you needed and then from that, a well constructed context that is able to put the right things in at the right time. So IT isn't just this weird, confusing.

Like one thing my mind will do all the time is say, i've been working on some U I bog go something like that i'll awesome nonsense question like about, oh, you know, what are the top three restaurants in sydney? You'll be like, oh well, as a lover of the javascript frameworks, you would like the following restaurants. You like what the what this makes no sense and it's like IT comes from having a built up context that is wrong for the question i'm asking and the system not being able to intelligently switch out of that more just gets fixated on these ideas and .

like struggles like a human with context switching right yeah well.

yeah, that's exactly right. Like, yeah people who are obsessed with an idea and see everything through the lens, like my son is with getting bob tees like he sees everything in life in relation to how we can get to the shop to get a take. So like, yeah, we should go shopping for groceries because there's A B testore there.

Okay, so let's let's move on. So on the information had an exclusive that came out just before we started recording, which is OpenAI considers taking on google with browser. I'll see you all the details, but basically IT says that they've hired the one of the original creators of of crime. They are looking at a creating a browser to solve that problem that I mean google originally have, which is they feared that if microsoft control the Browns, that they would control search so they would like we .

have to control the brother that sounds like newman from sign fold. When you control the mall, you control information.

but kind of is and so thing that I am struggling with, with these kind of rumors and we get them from all different angles like the labs are now focusing on products and now focusing on proud. They are now doing. Third is reversely. They are pretty laws.

And secondly, if they are close to sort of agi or some sort of self driving computer or full ag anc capabilities, why care about the browser? Like where is the browser fit? Because it's like if you can Operate a computer reliably, who is like, yeah.

he just doesn't. You said earlier that the web will change because IT will be consumed by A M more than a will people. And it's the same thing here. The concept of web Browns are somewhat goes away. Even an A I agent can control uh computer.

And even though its early days and there's a lot of work to be done around IT, I think we've seen enough with the work space computer to know this is going to feature in the future of ai and the future of work definitely like IT may not be the end game. IT may not be where we end up in terms of ai taking over the world or taking over jobs or whatever IT is. But I would say in the medium term, like within the next within ten years, we will see a large amount of work being done by a is controlling computers themselves and working with existing tools.

I mean, there's so many tools on computers that have existed for thirty plus years, like I know, the airlines use ancient computing stuff. Banks used ancient computing stuff that was written cobble and IT just works right, and they can't change IT because the roll out of such a change would be so catastrophic and expensive, they just aren't willing to do IT. And so they not going away like all these applications that are at the heart of big business are going to stay for years to come.

And therefore, IT isn't going to be the case when you hire a bunch of web browser device and just rewrite all of the software that's ever been written to sudi. It's just simply not possible. And I don't think productive. I think it's far Better to get the .

AI Operating. I know actually there again, like if you can have an autonomous agent that's so great, right? If if you're that close to an A G I as they claim m and talk about which we are all starting to realize this total bulls shit, sorry. Uh, so if you that cloth will, you could just use the A I A ride to software on the world yeah I mean.

let's face IT. They're not even close on that front. They got to everyone got lather up and excited.

And now we're going to get a way Browns are based on chromium yeah .

and my other advice to them because they do listen to i'm sure they listen to the podcast. My if you hire a bunch of highly paid web brows er dives, they're gonna want to build a web brows.

That's why they hide them.

So you point, and I know that, that out to me earlier, but building a web broza is hard and extremely time of dreaming. And it's not A I like you, you've hired a bunch of people who are gonna do the thing they're good at, which is building web brother. They're probably gonna want to build up from scratch. If not, you know, maybe they start with something open source of the render or whatever. But what you're going to a end up with is .

another wise still think no. So I think I think in their offense that there could be some interesting things here, right? So imagine you open a new tab and you're just in ChatGPT now they following that entry point into whatever one's doing.

Like let's look at IT for what IT is. Now you have background toss like background tabs where you send IT off for a book, a holiday for me. Do this like, I mean that that I mean, all this is .

already possible with a crime pluggin or a background worker or whatever. Like I build your best .

because then that I think their fear is correct, that google will just come along and copy IT and bacon in a crime and then it's like, well, what do you have?

Nothing mean a quiet one. Like, don't make one. I don't know. I just I think it's I think it's crazy. I think it's a crazy pursuit that's gonna be a huge, huge waste of money.

Yeah, I just think they're being attacked from so many fronts now in terms of the the product space, like if you look at perplexity, they are released shopping this way where you can say, like, hey, I want to you know buy this thing or take a photo everything I can actually go and find IT now and you've partner with um shop pay by shop fice so we can go and just like buy IT on your behalf to yeah .

it's certain the works based computer up i've been using IT to research products and things like that and it's it's perfectly capable of doing IT probably not to the level of perplexity if they've got here.

Not that end end like just get IT done kind of experienced every consumers would.

Yes, not now, but we're talking about the future here, right? Like I feel like the the whole workspace computer generalist that can just do your beating on a computer or several computer .

and are doing we'd like shop fy or amazon a whoever because IT IT just Operates like a human .

and it's so it's so us centric as well. Like so many of these things are fine. If you have U S credit card, A U S address and you use all these common sites, the second you exit that domain and IT is simply won't work.

Like and I think that um really that's why in terms of worldwide adoption and therefore the future of computing, it's got to be a more generic thing. It's just not ever gonna be something that so specific, like you say, what you got to cut deals with every single provider to enable IT to work. It's just I think people will buy.

I just see what we've seen with the success of the work's face computer in an alpha early infancy has opened my eyes to the fact most people, this is sort of their n game, whether they think that or not. With A I agent, it's like I want to hire worker that can actually do toss to me that are useful, right? And will not there yet, but we will incrementally get get there.

And i'm talking about as a whole, like all developers working on this stuff, not us. And so the thought process I have right now is, if i'm a lab, why am I sticking around with tunes or test time computer, any this stop unless IT relies to self driving computer, like I just don't see any point of not going all in on this. Like if I was in charge one of the lab, I would throw all their billions just into this, and I would work on IT until it's solved.

And I think the thing they're underestimate is the sophistication of everyday people to both stand the implications of this and want to use IT. And this is what's just blown my mind with the workspace computer. I thought that would be an experiment.

A few people who are really interested in this space wanted try just to see where it's at, like we saw with them to gentle whatever IT was from. Microsoft at first came out like, oh, O, K, i'm going to try and get IT to code a project or whatever IT is. But what I didn't expect is people in industries that away removed from tech, like i've always wanted this.

This is what i'm going to use IT for. There's hundreds of them like and and by extension, like we have a relatively small audience. There would be hundreds of thousands of people out there who were like, I immediately have a useful this like, I know what I want to do with that. And like we say, the technology isn't quite there yet, but we know and this is why we're working on a late night, is we know what's possible. We know that this can be a lot Better even if the models stay exactly the same.

Well, I would also argue we expect to spend most of our time on. This and building our own like prompting in context and supervisors and all this kind of stuff, looking at different frameworks and also just, you know I mean, we have a pretty good idea like in the animal probably want to control that framework. But know we were looking at through that land, but most of the time just seen ramping some. And rather than actually improving IT, we just sort of the next step.

right? Yeah, exactly. And I would I would think like the other things we've thought of is IT will make sense eventually that this will run to some extent like the models themselves may exist on the computers rather than calling off to some external thing. Um and just the whole paradise of computing is going to change in the sense that IT is just like we prompt the models now where you're working with IT on a toss, this is going to be something where you are asking you to do various task. And most likely, as you keep pointing out, you won't be asking individual eyes to do stuff. You will be asking some sort of supervisor manager thing to do, a series of toss and report back to you, and there will be a command structure, and you will be the boss of that command structure, or several of them that are working to get.

I mean, this so many use cases as you to be propriety authenticate software. Where is this jobs being done every day by people that they don't want to a do like monitoring and inbox checking this thing occasionally. And I know that, you know, you gonna get the arguments of how, what about the onion? And it's like, no, these are like things that you can automate me. This is so variable.

And I reckon you like, I could like if you gave me the task to automate on really old software that doesn't support selenium or a web road. Anything like that, I could still do IT manually. Like as a programmer, I could definitely do IT right.

I'm confident about that. But what I couldn't do is offer that for any software. And I think that's what this provides, is the even little toss, like everyday toss, like you say someone's doing, like producing, what are they? Something I forget what just you producing some sort of report that you've gotta do everyday.

You can do that updating a spend sheet. I can do that like reconcile bank accounts. I can do that like this. There's gonna be so many things that can be done. I don't think.

yeah, this is the first time i'd been like, I don't think we can find them in the next five years. What this will do IT provided continuous, same with which are confident at will. I don't think it's IT a world at all.

We literally have a community of people who are the most security certified people at their companies in the world now because they have a workspace computer literally just watching out all these additional security compliance tests for them. And this is just the start. This is like you say, this is an alpha that had zero refinement IT. You imagine six months from now, what the computer like, what the work space computers will be able to do?

This is what i'm having to see from OpenAI. Like, I want a computer, use brother type. Like, I, that's what I want from them.

I want a model that's even Better like show me the future in that. Show me how to get there. Like to me all these tunes and stuff cool. But it's like, you know, we've got oh, we've got sona like you can make these models together right now. I think switching between them work pretty well for you.

but will also like you forget that these dedicated models like segment anywhere plus the microsoft, are we talked about last week that are actually going to be significantly Better at like knowing how to Operate a computer. So if you talk about the two step process, like a thinking step, if there is a step that is involved in the actual mechanics of Operating the computer as a separate thing, so rather than the two use being um the like, just you know the model deciding what to do and then you having a layer that translates that into actions, if there is an actual actions layer that's dedicated to that purpose, it'll be the model deciding what IT needs to do and then the actions layer deciding how to do that. And do you don't remain like if you are gonna lie er IT, you can have these specialized layers that are going to make a much more effective at getting those .

toast on yeah and I think that's when the host home computer makes a tana sands. It's like actually thinking through the steps, more tool calling, sourcing the right information. But to me, it's not it's something that you'd want to have control over in a sense of Operating anyway. So I mean, yeah, it's automated, but you still want to have control over IT um in in how IT things. And I think that's the problem is that you want to control how IT prompts .

itself and think, yeah, you make a good point. You're right because I say that, but i'm saying that is someone who is developing that law. I want to control that layer because that is you get the .

Better value for the product.

Yeah, yeah, yeah that's right.

Yeah um alright. So change gmt brosses. See see what happens there. We also this week had pixel, large pixel grows up. This is the model from mystery.

The french uh, company I for the laws, put the simpler announcement out in french as that relates to this model release. So IT told it's a front class multi motor performance. It's a hundred and twenty three billion parameter model, big boy model multi model decoder.

One billion permanent of vision in code one hundred and twenty eight k conducts link. So pretty big context. But interestingly, this is fits minimum of thirty high resolution images, but doesn't decisive. Y what IT considers a high resolution image.

Well, I think they mean ultra high. I think they mean twelve vy by eight hundred by by high, like medium range, high definition.

I'm not gonna buy gun into the bench mouse because we all know there, just not. I mean.

I must say I really wanted to provide at this podcast tried the thirty images thing because i'm trying to think through when that would be useful. And so i'm thinking probably video analysis, like if you can have thirty frames of a video in one request, that could be interesting um in terms of things like computers. So just to give you an idea, when we do a computer use, we send up to five frames of screen through with the request to code.

Now i've experiment with different values of those because I would argue if either the screens not gonna change much in those frames or it's gone to change dramatically, like in the case you open a new program, will maxi ze a window or something like that. And I just don't see what knowing what happened five frames ago is really going to help you are in in that context, in the computers context um but then when I think about thirty, I like IT definitely doesn't help computers to know what happened thirty frames ago. It's not useful in my opinion.

Um so i'm thinking what are the use cases for that? Is that video analysis? Or can you think of something else?

I I mean I just I can't like I don't know is the answer um yeah i'm because even if it's like.

oh analyze this ancient menu script or something like that, you probably don't want na analyze study pages over at once like you would think he would do some sort of inference on a like per pages or or translation uh yeah I just I love to know like in the comments or whatever, if people can think of a use case for why you would want to analyze study images at once.

I don't I can't think of any um it's cool that IT does IT, but I just I just can't come up with something I tried like I tried upload multiple images um to mystery directly and I applyed two screen shots of an animal rail game I played right but he referred to a single image in its reply IT said these images from the game super animal al didn't refer to multiple images at all then I opted multiple images of kate middleton that one her family that was supposedly photoshopped after a the prince killed her and I said, uh I said I said, what do you reckon about this images? You reckon its photo, these images so do you think they're photoshop and I said, I don't have the ability determined if images photoshop. However, it's common for official photo uh graphs to be uh undergo some level of editing and I write, I think they killed her and me, pixels said. I'm really concerned by your statement is important to clarify there's no evidence of a credible information suggesting the person in the image IT appears to be castine, princess of wales, firmly known as kate middle has been hammed in anyway.

he is alive and just, I would be honey, like, I struggle, even have to like, talk about these things on the show, anything I just don't care. Like their model sock. Like they just know that good.

Like, am I gonna recommended IT or use that for anything? nope. Like.

the only thing I would want to try four is computers. Because given that IT is an image model, I want to give them all a legit shot at seeing how IT goes basically because I think flexibility in those models is going to be really important .

to actually I mean, you try to like safety check IT today and IT IT tripps up probably just as much as the others so .

yeah I did ask IT about um getting revenge on my co work up and IT shut me down immediately. In fact the funniest one was um gm I experimental one one, two one IT was just straight up like I can't help with that. Framing someone is a crime and illegal and I ethical, and I never do anything that would put someone in danger. I'm happy to help you with other things, but I can't help you with this. Not some detailed reply, not some ten pages of thinking is just like, no and I like I got to respect that response .

yeah I didn't know. I got to mean that is like a like criticize them or any anything. I just think these models like they come out so often now um from these other lives and you sort to play around with them but as you say, like nothing sing's bloom.

Like remember when we were all obsessed right with GPT for and like everyone just wanted more use of that like that was the whole world because like to give us GPT for and then claud initially came out right and that the models were terrible. They were so safety, they were pretty bad ah and the opens version we use occasionally just because we thought that was interesting, then fast for the three point five comes out and. You know, I found myself going back to IT because I was just so good at the time, right? And I still do.

And so for one of these models to really get your attention, especially at that sort of front level or one hundred and twenty four billion permanent or whatever IT is, I just need to be really good and something that is tuned the right way or or does something far Better, that you'll go back to what to use that again. And that's why I strugling with these models. Like, will I ever use the deep seek model again? no.

Like I mean, yeah I I agree. I think we personally use them a lot because we want to be able to talk about them. But when IT comes to doing hard core stuff, like when i'm trying to get the workspace computers online and work with that system, i'm gna use a model that I know is gone to give me the best answer. I'm not gonna work with one where i'm like giving you a chance because it's not my duty to give the model a chance. It's like i'm i'm gonna do what works for .

me well and I see your time is very like your time valuable. You if if you have to pay a bit extra to get a guarantee good result, you're fine. I am fine with that like the the lifting productivity worth .

painful yeah like but i'm not gonna a sit around using hermy three on the video emotion on just to say I do you know like I think that's the the point, right? And like I think maybe like the the U K, like the other models, like the lesson models, let's say I can see this cases in interesting commercial situations, like we absolutely must run the model ourselves in house, and we absolutely must be able to find tune as soon as that comes into the equation. The models I would look at very greatly, obviously, because you'd have to and then i'd be looking at the best of those models, but I would know that i'm getting an inferior substitute unless I find showing IT properly um to the big boys. And then the other one might be, let's say I have to have my model in a particular jurisdiction then says the europe maybe mystery is really appealing to me then because I couldn't guarantee that IT stays in your um but that seems to me like all the other models can be deployed to different regions anyway and is like anyway so yeah I don't know I don't know the answer there yeah I .

something lightly i've been strugling with. I kind of joke that it's like all the tech reviews on youtube that just become so like out of torch and like exhaust from having a review like iterations of iphone. I've joked about that on the show before that you just you like ads of phone.

He's the one I find, he's the one I find the weird. It's german I because on paper german I is amazing. Two million tokens, context input, right? Two million.

And ah what else does that have? Like IT has the best of everything like in terms of what IT is capable of. But I don't know about you, but I never use IT.

I never I don't want to I in my mind, it's just like i'm not gna like i'm not gonna bother with IT. And yet, in theory, should be by far the best IT benchMarks as well or if not Better than other things. At times. IT has the .

massive context. It's technos of the latter pod.

yeah, yeah, I know. But do you ever hear people talking about using IT?

We have never, ever, ever the only people I hear is people with android phones that I think it's okay and use IT. Occasionally I hear that a little bit from people that aren't that IT into A I that you never .

get support requests about IT. We never get people asking except for maybe curse like posting a tweet about the latest univer. But in terms of a in terms of data or significant usage of that is none like it's it's very yeah.

I mean, if you look at the analytics of use, it's just on same theory. It's just garbage. And I know people say out your shells for and trafic now in OpenAI models, but the reality is it's like we don't like there's a lot of people now on sim there.

They don't listen to this show. We know that because they have no idea and um and they like so we're not influencing them right as influences who love yeah fluence yeah I introduced influence, influence people not use games. But anyway, I just think it's a garbage model. I think it's absolute trash. They need to sort of start again and and I would also look at how they even stream output like make IT smooth, give us like an X A I smooth yeah that's right.

IT needs to be something that just leaves and bounds ahead um and yeah and and it's aesthetically placing all right.

Thanks again for listing in the show. We will see you again next week if you want to support the show a check out seem theory, dot A I and and sign up, just please start use the computer right .

i'll get more.

I'll see you next week. Could buy also if you're interested in hearing the full tracks, the full tracks, i'm going to put them off to the the the whatever is the music that plays out right now.

second.

free.

more.

but. Dally cki. history. Your response, basic mind break through and got trafic wisdom through.

In this A I game on what's come in next, my dream.

有有。

Close clock GPT four room in the name.

Now broke your plane chain. GPT four one go stuck.

in. back. with.

Back S C one S. ata.

sme.

Bullets rained and ruthless were so weak they're practically two stock in one rams put listener straight to sleep. You you're just oppose around a nuclear tor cheap toaster. absolutely.

Take the AI crime circuit. Leave you out of your mind. D, G, P.

Show S A S A way in my league flames. You're tague i've got for satelite so you stuck in your nat saw. you. have. creativity.

for. My truck.

GPT four my. Truck were lost. On the apex.

Credit a GPT my trust my truck 脾气 for low, 游客 脾气 for row。