
#189 - Chat.com, FrontierMath, Relaxed Transformers, Trump & AI

2024/11/17

Last Week in AI

People
Andrey Kurenkov
Topics
@Andrey Kurenkov introduces OpenAI's new "Predicted Outputs" feature, which can significantly speed up GPT-4o on certain tasks. He also notes that Anthropic raised the price of Haiku 3.5 fourfold, prompting a discussion of the economics of large language models. @Jeremie Harris adds an explanation of how speculative decoding works and its practical advantages, and analyzes the market logic behind Anthropic's pricing strategy and what it implies about how commoditized AI models have become.


Chapters
OpenAI acquires the domain chat.com, and a former Meta hardware lead joins OpenAI to focus on robotics and integrating AI into physical products.
  • OpenAI acquires the chat.com domain for a significant sum.
  • Former Meta hardware lead Caitlin Kalinowski joins OpenAI to focus on robotics and AI integration.
  • OpenAI's strategy includes integrating ChatGPT into robots and physical products.

Shownotes Transcript


Hello, and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And as always, you can also go to lastweekin.ai, our text newsletter, for even more AI news we won't be covering here.

I am one of your hosts, Andrey Kurenkov. My background is that I finished a PhD focusing on AI at

Stanford, and I now work at a generative AI startup. I guess, I mean, we've said that many times, but

but now you really, really know. Yeah, how many episodes have we done now? It must be approaching a hundred, and it's almost two years now.

Yeah, right, you're right. We've missed a couple, but I mean, we're kind of knocking on the door of one hundred. I remember when we started, it was in the wake of ChatGPT, or that's when I came on. We'd each been doing separate podcasts in the meantime. But, um, yeah, just all of a sudden everything

went crazy. This week won't be too crazy. I'll do a quick preview of what we'll be covering. No huge stories this week: we've got some nice new features being introduced by OpenAI and Anthropic; on the business front, we get some stories of fun things OpenAI is up to; a few fun open source projects and models this week.

So I think that will be interesting. Yeah, some research on interpretability and, uh, efficiency for small models, and then policy and safety will be maybe the most meaty section; we'll cover, let's say, the implications of Donald Trump's victory for AI. And as always, talking a little bit about what's going on with China and hardware and US restrictions. Before we get into the news:

As always, I want to acknowledge some listener comments. We had a few on YouTube; we always like seeing those. One person did say they like the idea of a community or Discord, so that's interesting. I don't wanna make a call yet, but if we hear a few more, you know, maybe we will make it, and we can chat about AI news on there. And Jeremie, we did have a comment saying that a person loved your take on Meta and releasing open weights with regards to national security, which I think was mildly

spicy, by the way. I want to add a little modifier to that. So the context was, you know, Chinese companies were shown to, uh, use and rely on Meta's open source models as a kind of, um, floor to their capabilities. Very important. We've known about this for a long time, obviously; when I say we, I mean the world. Um, and so I basically said, I think we're getting to the point where it's indefensible.

Um, you know, one dimension somebody, um, discussed on Twitter with me, a really good tweet, I think something we've talked about here on the podcast, but I want to resurface here. They said, you know, um, one advantage of open source: obviously you could put backdoors in these models, um, and thereby use them as a national security asset, have China use Western open source models that have backdoors in them that we can then undermine. There are a variety of reasons why I don't think that's what's actually going on here; I don't think Meta is actually pursuing this strategy, um, for several reasons that we could discuss.

But, um, I think it would be interesting. I think backdoors can be really hard to train out, because unlearning is notoriously fickle, um, and superficial. So I just want to call that out.

I think it's an important additional level of detail, um, to flesh that out with. So there you go. You can append this to my rant in the last episode, so that if

you want to know a little more, you can, which is always good. And also, shout out to a couple more reviews. One of them did say to keep up the comments, and even said we're hitting a good zone on the existential risk talk, which I feel pretty proud of; I think we really intended to work at that. Uh, and we did have a critical review, which I appreciate, calling out the intro AI music. Uh, seems that not everyone is a fan.

Terrible, truly terrible AI-generated songs for the intro, which, I don't know, I like them, but I'll keep them to like fifteen seconds instead of thirty seconds as always, for the people who do enjoy them.

And one last thing before the news: once again, we do have some sponsors to give a shout out to, as we have in recent weeks. The first one is The Generator, which is Babson College's interdisciplinary AI lab focused on entrepreneurial AI. Babson College is the number one school for entrepreneurship in the US, and that has been the case for thirty years.

And just last fall, professors from all across Babson partnered with students to launch this, uh, Generator, which is a lab that is organized into groups such as AI Entrepreneurship and Business Innovation, AI Ethics and Society, and things like that. And it has now led peer training of faculty all across Babson. Uh, their intent is just to accelerate entrepreneurship, innovation, and creativity with AI.

So yeah, it's a very cool initiative. We will have a link for you to check out if you're curious.

And one new one, actually; we do have a second sponsor, and it is Darren McKee, promoting his engaging AI safety book, Uncontrollable. The full title of it is Uncontrollable: The Threat of Artificial Superintelligence and the Race to Save the World. So if you do like the AI risk talk, I think you might be interested in this book.

Uh, Max Tegmark, who you would know if you care about AI safety, said that Uncontrollable is a captivating, balanced, and remarkably up-to-date book on the most important issue of our time. It explores topics of uncertainty, control, and risk, and yeah, makes the case that we should be concerned about advanced AI, but it's not a doomer book; it lays out a reasonable case for AI safety and what we can do about it. We'll have a link to it on Amazon in the show notes, and it's also on Audible; you can just search for it, the title is Uncontrollable.

Yeah, I actually have had quite a few conversations with Darren on this topic too. So he thinks a lot about it; he's talked to a lot of people as part of his research for this book. So certainly, if you're interested in that space, definitely one to pick up and read. Again, you know, Max Tegmark, one of one. Max Tegmark agrees, uh, that this book is a book, and a great book, and maybe the best book.

Probably that's a little premature, but who knows. Alright, now on to the news. We are starting, as always, with Tools and Apps, and the first story is about OpenAI introducing a Predicted Outputs feature. The feature can speed up GPT-4o by up to four times for tasks like editing documents or refactoring code.

So the gist is, uh, many times when you're using an LLM, you may only want to tweak your input. You may give it some text or some code and say, you know, correct any grammar mistakes in this document, for instance. And that means that it's mostly going to be spitting out what you fed in, with just a few tweaks. And that is essentially what this is.

If you use this, then you can get much faster outputs. For me, it's actually a little surprising it's taken this long for this feature to come out; it is, I think, pretty well established as something you can do. But nice to see both Anthropic and OpenAI introducing more and more of these really developer-friendly, you could say,

features. Yeah, this is definitely part of that productivity push, right, towards more and more kind of application-specific tooling that OpenAI is focusing on. Um, one of the things that is making this possible is speculative decoding. This is a technique that, um, has been around for a little bit now, but now we're seeing it productized.

Uh, the basic idea behind it is you get two different models. You have a draft model, which basically is a very small, cheap model. And at any given time, you get that draft model to propose, like, what might be the next five tokens or something like that, right?

You get it to cheaply produce predictions for those tokens. And then what you can do is feed all five of those tokens in parallel to a larger model that has more expensive computation, but it can handle them in parallel, all in one forward pass, spending roughly the same amount of compute as if it were just one input that it was trying to process.

And then you essentially get out predictions for how accurate the draft model's, um, token proposals were. And so this allows you to amortize the cost of that more expensive model over a large number of tokens, and get it to do, sort of, editing and cleanup, so to speak, um, a lot faster and a lot cheaper. So this is, uh, a practical implementation of speculative decoding.

It's one of those things where, you know, you read the paper and then a couple of months later, boom, people are putting it into production and actually saving a lot of money. So, um, this is the whole idea. Another advantage, of course, is

you don't have the problem that the model might hallucinate the stuff that's solid. Like, if you have some small part of, say, a JSON file or something that you want to tweak, and you want the rest of the file to be anchored, to be exactly the same, then this allows you to do that, right? It allows you to fix it. So what they are doing during speculative decoding is they're actually fixing the part of the output that should be fixed, and only having the large, expensive model make those predictions, presumably, on the variable parts of that output.

So this is, um, a bit of a genuine reimagining of what speculative decoding looks like, with this added constraint that the stuff before and after the window that you're actually going to try to sample in, um, is kind of concrete, is locked in. So I think that's kind of cool.

Um, I'm curious about the economics. What they are doing, by the way, is they're only charging you for the tokens that are actually getting emitted in the middle, let's say, wherever you want the modifications to occur. So that seems fair, right?

You're giving a strong prior on, like, keep the beginning and the end the same, so don't charge me for generating those tokens, only charge me for generating the ones that I care about, which again makes a lot of economic sense.
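To make the mechanism a bit more concrete, here is a minimal, illustrative sketch of draft-and-verify (speculative) decoding along the lines described above. The two model functions are hypothetical stand-ins, not a real library API; in the Predicted Outputs case, the "draft" would effectively be the user-supplied prediction rather than a small model.

```python
# Toy sketch of draft-and-verify (speculative) decoding.
#   draft_model(prefix, k)        -> k cheaply proposed token ids
#   target_model(prefix, tokens)  -> the big model's own choice at each of those
#                                    k positions, scored in ONE parallel forward pass
EOS = 0  # hypothetical end-of-sequence token id

def speculative_decode(prefix, draft_model, target_model, k=5, max_new_tokens=200):
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        proposed = draft_model(out, k)           # cheap model (or the user's predicted text)
        verified = target_model(out, proposed)   # expensive model verifies all k at once
        accepted = []
        for draft_tok, target_tok in zip(proposed, verified):
            if draft_tok == target_tok:
                accepted.append(draft_tok)       # keep draft tokens while they agree
            else:
                accepted.append(target_tok)      # first disagreement: take the big model's token
                break
        out.extend(accepted)
        if accepted and accepted[-1] == EOS:
            break
    return out
```

The point of the loop is that the expensive model is only ever run once per batch of proposals, so when the draft (or the supplied prediction) is mostly right, most tokens come through at near-parallel cost.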

That's right. And, um, there was also, I guess, a partnership of Factory AI with OpenAI to test this feature, um, in their API, and they have a few metrics. It's not like there's a benchmark they report here, but they do have some numbers: they did find, in practice, two to four times faster response times while maintaining accuracy, and they have examples of large files that would take seventy seconds that are now taking roughly twenty seconds. So yeah, very easy to see how this is useful in practice for various applications.
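For reference, this is roughly what using the feature looks like from the developer side, based on OpenAI's announcement of a prediction parameter on chat completions; treat the exact parameter shape, model name, and file name here as assumptions to verify against the current API docs.

```python
from openai import OpenAI

client = OpenAI()
original_text = open("config.json").read()  # the text we mostly want to keep as-is

# Ask for a small edit, passing the original file as the predicted output so
# unchanged spans can be verified quickly instead of regenerated token by token.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rename the field 'timeout' to 'timeout_seconds' in this JSON:\n"
            + original_text,
        },
    ],
    prediction={"type": "content", "content": original_text},
)
print(response.choices[0].message.content)
```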

Next up, we are moving to Anthropic, and a price increase for Haiku 3.5: it costs four times more than its predecessor. Uh, the claim, I think, is that the price hike is partially because Haiku 3.5 is superior to the previous version, but it's rather surprising. The new pricing is one dollar per million input tokens and five dollars per million output tokens, and that's, again, four times more than the previous one.

Yes, and also almost ten times more expensive than GPT-4o mini, right? So when we look at that, that's pretty remarkable. In fact, it's only two and a half times cheaper than the full GPT-4o. GPT-4o mini was supposed to be the Haiku, say, of OpenAI's GPT-4o series, right? And so here we have essentially a model, Haiku, that's coming out and saying, hey, I'm still the small one, except I'm now going to cost you something closer to the full model size.

That's a really interesting play, and I think this speaks to something very interesting happening with the economics of these models, right? One of the big questions has been, and we've talked about it a lot here, to what extent do LLMs just get commoditized, right, to the point where the margins go to zero? Like, your model is basically the same as their model, is basically the same as your other competitor's model, and everybody has to just price based on the raw cost, pretty much, of producing the model and serving it. And at that point, your profits go to zero.

Or, you know, this is kind of what happens economically. One of the challenges is you can't do that and build enough bank to spend on the next generation of massive multi-billion dollar data centers if you're just living a hand-to-mouth existence like this. So it's a bit of a structural problem. This is the first time we've seen that trend bucked, where we've seen a model come out and say, hey, you know what, on the basis of my higher quality, I'm upping the cost associated with using this model.

You know, a lot of developers, um, some might say fairly understandably, are coming back and saying, hey, this is an unwelcome development, um, not necessarily because of the price increase per se, but because the framing, um, is that, hey, this is better quality, so therefore we're charging you more. Um, this is really interesting, right? There's this classic thing when you do startups, or I guess it's more broadly just economics, really: when you're trying to sell something to make a profit, value-based pricing is the thing you go with, right?

You advertise how much value you can bring to the customer, how good your product is, rather than talking about the cost. When you talk about how much it costs you to make a thing, that's a hint that your whole industry has been commoditized, right? So when you go to McDonald's and you say, hey, can you give me the same burger, just cheaper?

They'll tell you, no, the patty costs this much, the bun costs this much, the cashier's time costs this much, so therefore I have to sell it to you at this price.

They probably won't tell you that; they'll probably ask you to leave, but whatever, they'll literally tell you, sir, this is a Wendy's. Uh, anyway, um, but you kind of get it: when you're dealing with a commoditized industry where everybody can basically offer you the same product, your margins go to zero. You argue based on cost. This is different.

Claude, and Anthropic, is coming in and saying 3.5 Haiku is higher quality, therefore we'll charge you more. People pushing back on that is an indication that, well, actually, this space is pretty commoditized. Anyway, I think this is a really interesting tell. One of the big consequences, by the way, of all this stuff, as you see prices going up and down and side to side and new products coming online: um, it really makes a lot of sense, if you're a company working in the space, to have the, uh, scaffolding you need to very quickly assess, through automated evaluations, whether the task you care about is, uh, being performed well by a given LLM. So when a new LLM comes online at a new price point, you should be able to very quickly and efficiently assess:

Does this LLM, at this price point, at this quality, make sense for my use case? If you can't do that, then you can't efficiently ride this wave of lower and lower LLM prices. You're not going to benefit from that in your product.

So just a, I guess, side thought there. You know, it's really important for companies to get into the habit of checking these latest models, because there are companies for whom Haiku 3.5 is going to be way, way better than the other options. But the question is, what are you comparing against? Are you comparing it to GPT-4o? Are you comparing it to GPT-4o mini? And, you know, right now, where

between? This is, yeah, I think to me a little surprising. Um, the announcement of 3.5 Haiku was at the same time as the new 3.5 Sonnet, which we covered, I think, about two weeks ago now, and it was just this last week that they announced the surprise pricing change, and that is what led to people responding. You know, a four times, uh, price raise is pretty dramatic.

So it must be a mix of: it was underpriced to begin with, perhaps significantly underpriced. And I guess there's also perhaps a factor of them emphasizing 3.5 Sonnet as the main one they want to compete with going forward? I don't know.

Yeah, certainly an interesting move from a competitive perspective. On to the lightning round. We are starting with FLUX 1.1 Pro Ultra and Raw. So FLUX 1.1 Pro from Black Forest Labs,

one of the leading image generator providers, has now been upgraded to support 4x higher image resolution, up to four megapixels, so really high resolution, and it still has fast generation times of around ten seconds per sample. And this is priced at just six cents per image. Uh, and they do have this Raw mode as well, which just leads to more realistic-looking images, more akin to photography.

So I guess not too surprising: we keep getting better and better models, more and more realistic. But I think it's worth keeping up with Black Forest Labs, and they're moving pretty rapidly in the space.

Yeah, and they're the ones who, if memory serves, partnered up with X, formerly known as Twitter, to support the Grok app and the image generation functionality that they're developing. So, you know, this is them continuing to put out their own independent product line, which, I don't know, maybe will be integrated as well with Grok at some point.

Um, yeah, looking at the images, again, I mean, I find myself continually saying this, I'm not an image guy, so I don't know the kind of aspects of image generation that are, say, of greatest interest to people who are really digging into the space. But the images do look really high quality. The Raw mode especially does look really photorealistic.

Um, because I'm a bit of a layman in the space, I kind of look at these and go, cool, I feel like I've seen a lot of other models that have the same quality. Um, so I'm kind of not sure where the moat is in this space. But still, um, it does look impressive, and Flux has kind of come out of nowhere with these new models.

And speaking of X and Grok, we have a bit of a story in that X is testing a free version of the Grok chatbot in some regions. So this was previously exclusive to Premium and Premium Plus users of X, and now there is a free tier where you can do ten questions in two hours for the Grok-2 model and twenty for the Grok-2 mini model, plus a few image analysis questions per day.

So you do have to sign up to X, of course, and you do need to have a linked phone number. But certainly, you know, this is something that you have in ChatGPT, and I think also Anthropic: the ability to use the chatbot for free. So this is just being tested in New Zealand now, but it'll be interesting to see if they continue expanding to more users.

Yeah, obviously a big goal anytime you launch something for free like this is to collect user data, right, thumbs-up and thumbs-down feedback for RLHF or something else, um, and also just to own more mindshare. I think one of the things that OpenAI has continued to enjoy a massive lead on is the fact that ChatGPT is a household name. Claude is not, um, and Grok is increasingly becoming one, but that's only thanks to the distribution they get through X. And so I think, um, at this point you combine the X distribution factor with the X factor, if you will, uh, with the fact that this is free; that could be really interesting. But the quota is interesting too, right? Like, a query quota of ten questions within two hours. I don't know about you, but when I'm sitting down with, say, Claude, which I use for some of the work that I do, I can spend a lot of time with Claude, actually. Um, they're long sessions, and there's a lot of back and forth, and there's a lot of going back and editing questions and, you know, tweaking prompts. So, uh, that quota might be

challenging for some of the heavier use cases. Which makes sense; it feels like it's there to give you a taste, yeah, so that people might consider subscribing to X. Which, hard to say, I'm not sure Grok will convince people who aren't subscribers to do so. But, you know, maybe. Now you're right,

I think there's value in bundling it in with X, right? Like, I was going to say, there are other free chat platforms that don't give you a limit, but the fact of the X integration, that distribution, is so, so key. And I think it's still probably being underrated. So we'll see.

Moving on to Applications and Business. Speaking of chatbots, we have kind of a fun story, not a consequential story, but one that is neat. Uh, OpenAI has acquired the domain chat.com. Uh, we don't know the exact details of how much it cost, but it appears to have cost a lot, like maybe, uh, ten million is the region. We know that it was previously acquired by HubSpot cofounder Dharmesh Shah for 15.5 million, I think just about two years ago or so.

And it has now been revealed that he sold chat.com to OpenAI. And, uh, Sam Altman, on X, tweeted, or posted, just "chat.com"; that was the entire post, kind of showing off. So, uh, it's not yet, um, I guess, been promoted heavily. There's no new brand; it's still called ChatGPT. But, you know, ten million for a URL, that's really significant.

Yeah, I mean, if it were ten million, that would be a haircut on the initial acquisition cost of 15.5 million, which, ah, is pretty significant.

But, uh, from the context, it seems like something more interesting may be going on here. It seems, apparently, as if Dharmesh Shah, uh, the guy in question, may have been paid in OpenAI shares. If that's the case, that would be kind of interesting. Uh, he had a somewhat cryptic post on X; um, all of this is very cryptic. It's the most cryptic launch of a new domain I've ever seen. But if you do go to chat.com, you will see, of course, uh, right now, the ChatGPT-4o interface. So there you

go. Right, yeah, to emphasize, ten million, we don't know if that even is the ballpark; that's just based on what was previously paid. We would expect it to be around that, maybe. Next up, a more serious story,

and that is that the Saudis are planning a hundred billion dollar AI powerhouse to rival the UAE tech hub. So Saudi Arabia, of course, is planning this artificial intelligence project to develop a technological hub to rival that of the United Arab Emirates. This will be used to invest in data centers, startups, and other infrastructure. It's, uh, titled, the initial project is called, Project Transcendence, which is pretty,

not over the top at all, yeah. You know, pretty

ambitious, you could say. Yeah, and of course, this will also be used to recruit talent to the region, which I'm guessing is perhaps not quite as prevalent there as in the US or elsewhere. So yeah, we've covered in the past how the UAE

has invested significantly. There have been developments from the region, like with the Falcon models, that were pretty notable at the time. I don't know that we've had too much to cover in recent times from the UAE, but it certainly is true that, um, these countries are trying to invest and be a player in the space.

Yeah, I mean, I think the biggest kind of recent stuff from the UAE has been infrastructure, kind of structural stuff, G42, and the questions around, you know, can they be decoupled from Huawei technology and Chinese tech, and, you know, the Department of Commerce getting involved there. So really the question is about where the future of AGI-training-run-scale data centers is going to be, and this idea that the UAE

has this massive energy advantage, and capital, which is a big part of the reason why so many people are interested in it as a hotbed, as a place to build out the infrastructure. This is Saudi Arabia basically saying, hey, wait a minute, we're a giant oil-producing nation with deep concerns over how much longer that oil is going to hold up and be viable. And so they're looking for ways to diversify out of that industry.

And, well, guess what, oil comes with, uh, a lot of energy, and that's great. So it gives them a lot of the ingredients they need, again, the money and the energy, to potentially seed something like this. They already have,

uh, similar structures, let's say, adjacent to Project Transcendence. There's a state-backed entity called Alat; that's a fund that does sustainable manufacturing. It's got a hundred billion dollars in backing. That's about the order of what's speculated could be associated with Project Transcendence. We don't know yet how much actually will be forked over, but there are discussions with potential partners, uh, which include, I think, I saw Marc Andreessen, or rather Andreessen Horowitz. Yes, that's right.

Um, so apparently a16z is talking with this, the Public Investment Fund, which is sort of the state entity that would be overseeing this. Um, so that's kind of interesting, I mean, a Western private actor looking at that. Apparently the fund itself may be growing to as large as forty billion dollars in commitments, again aiming for that fifty to one hundred billion in total, which would be pretty, pretty impressive. But keep in mind, that is about what a year of Microsoft infrastructure spend is. Um, and the challenge is that the buildout for this is planned for around 2030. There are a whole bunch of problems right now plaguing Saudi Arabia on this front as well. You've seen an overheating economy that's now causing them to claw back some of their previous commitments, um, to do similar buildouts in other tech sectors too, um, including semiconductors and, like, smart everything, basically. So, you know, there is now a little bit of uncertainty about the future of some of those projects.

This one certainly has a lot of buzz around it, so we'll see where that ends up going. Um, and by the way, I did a little digging: what kind of history does Saudi Arabia have in the LLM space? I was not tracking this, but there was a seven billion parameter model, the only one I've been able to find so far.

But take it for what it's worth: there's a tech company called Watad that apparently built this model, called Mulhem, and it was a Saudi Arabian domain-specific LLM that was trained exclusively on Saudi datasets. So there's a bit of a scaling issue there in terms of getting beyond that. But so they have a small footprint in the space, obviously hoping to attract talent, which is going to be a really, really important resource. Um, and I think that's going to be a challenge for both the Saudis and, frankly, the UAE as well, at least on the model development side. The infrastructure side, I think, might

be a bit of an easier play. Yeah, so good callout there: this is saying a backing of as much as a hundred billion, that is, per people familiar with the matter, kind of thing. So yeah, not too many concrete details there. On to the lightning round. The first story is again on OpenAI, but this time it's about hardware: Meta's former hardware lead for its Orion project is joining OpenAI.

So this is Caitlin Kalinowski, who was the former head of Meta's AR glasses team, has also worked on VR projects, and also worked on MacBook hardware at Apple. She is now joining OpenAI, seemingly to focus on robotics and partnerships to, uh, integrate AI into physical products. We covered pretty recently how, uh, OpenAI did start recruiting for robotics positions, with the description of a job having to do with integrating ChatGPT into robots. We did see Figure, the developer of a humanoid robot, showcase their robot working with ChatGPT, having conversations and being told to do stuff. So perhaps this, uh, recruitment points to OpenAI wanting to do more of that.

Ah, there's a lot of reading of tea leaves, especially this week with OpenAI and its hires. You know, so apparently part of the speculation in this article is that Kalinowski, um, is there to work with LoveFrom, which is her old boss Jony Ive's firm; we've talked about Jony Ive partnering

with OpenAI. So he was the designer, of course, of the iPhone. Um, now he's been brought on board with OpenAI to launch, as he put it, a product that uses AI to create a computing experience that is less socially disruptive than the iPhone. Um, so I couldn't quite interpret what he was saying there: is he saying it's going to be less, less horrible socially than the iPhone was, or that it's going to be less of a game changer than the iPhone was?

Um, probably he meant the former, I'm not sure. But anyway, ah, so apparently she'll be working with him, so that's sort of a natural partnership there. She has a lot of experience doing design at Apple as well. Really, really unhelpful, I will say, for OpenAI to have two separate media threads that involve the word Orion, because, uh, there is this model.

We've talked about the speculated, the rumored model Orion, and now you have the former Orion lead from Meta, a different thing, coming to OpenAI. I really wish that they would keep their naming a little bit, a little bit more distinct.

But yeah, they could be a little more original, okay, with new project names. Also worth mentioning: OpenAI did, it's believed, acquire a company building webcams earlier, so that could play into this. We don't know; that is, we don't know what they're doing here.

It's also an interesting about-face, because they disbanded their entire robotics team like four years ago, and now they're really, really building it back up. But it does seem that the new robotics team is a lot more, um, market focused, like product focused. And that in itself is sort of interesting; you know, there are pros and cons there. They'll get a lot more real-world feedback by having their systems out there, and more interesting data. But, um, yeah, anyway, so the structure of OpenAI continues to tilt towards more and more product-oriented work.

And just one last story on OpenAI. This one is, I guess, a fun one as well, and that is that OpenAI accidentally leaked access to the upcoming o1 model to anyone by altering the web address. So this was accidentally leaked in the sense that users could access it by altering a URL for a brief period of time. It was shut down after two hours,

I think maybe when people became aware of it or something. So we have the preview model of o1 that you can use, but we still don't have access to the full o1 version. Now, people were able to play around with it; OpenAI actually confirmed that this was the case, um, and said that there was not too much access before this was resolved. So people played around with it, and, as you might expect, they said that it was pretty impressive.

OpenAI, at least, said that they were preparing to provide limited external access to the OpenAI o1 model and ran into an issue. So I guess in the process of trying to give people, you know, maybe special links to access it, um, it leaked in that way, I think.

So, by the way, some of the demos are kind of interesting. Uh, there is a classic one where, you know, you have this image of a triangle, and it's subdivided with a whole bunch of lines, and then those lines form sub-triangles within the image.

And then you ask, how many triangles are there in the image? Standard multimodal LLMs really struggle with

this. In fact, the preview version of o1 struggled with this and got the answer wrong. The new version did not. So, you know, one of these little things where maybe it's a bellwether eval or something like that, who knows. Um, but one of the most interesting aspects of this, apart from the fact that it teaches us quite a bit about, um, OpenAI's continued struggles with security, it must be said, um, this is an organization that, uh, explicitly has said that they are trying to prevent people from seeing the full reasoning traces of o1, because that is critical intellectual property for them. Um, well, guess what: this o1 version, the full o1 version which was leaked to begin with, also leaked out a full chain of thought when it was asked to analyze, in one case, a picture of a recent SpaceX launch, and then other things in other cases. So this is sort of a critical, um, uh, competitive secret, really, and that's what it is. The reason OpenAI didn't want to release those chains of thought initially was precisely because they were concerned that those chains of thought would be really valuable training data for people to replicate what is so precious about this model series. And so, you know, here they are kind of leaking it out themselves with this haphazard launch.

So it doesn't really inspire a lot of confidence in OpenAI's security approach, their philosophy, really, frankly, the level of effort that they're putting into this. I know that sounds like a small thing, but when you're dealing with stakes as they may potentially present themselves in the future, national security or otherwise, this is not a small screw-up. Um, and it could have been mined, if you imagine it's not an individual who's accessing this but an AI agent or something, and it's using the opportunity to collect a bunch of training data. Not saying you could do a ton of it in that time, but this is an important vulnerability. And, um, anyway, so, uh, kind of amusing and a little disappointing, especially given that OpenAI has made such a big public, um, show of trying to get into the security game more.

And just one little caveat with regards to the full chain of thought: we don't know for sure that's the case. One Twitter user reported seeing it, um, but that may or may not have been the full thing; it was a detailed response that did include some of the reasoning steps.

So, yeah, no, that's fair enough. They did look different enough. Yes, you're right. It did look materially different enough, um, from the standard reasoning summaries that are put out, and similar enough to the reasoning traces that OpenAI did share right when they launched, that it's, like, very

suspiciously like what it is doing internally. And one last story: Nvidia is once again even more valuable than before. This time it is the largest company in the world. It surpassed, uh, Apple on Tuesday; I don't know what happened on Tuesday, I guess we'll find out. So the shares rose 2.9 percent, um, leading to a market capitalization of 3.43 trillion, ahead of Apple at 3.38, and, for reference, Microsoft is at 3.06. Uh, also for reference, Nvidia has gone up by, uh, more than eight hundred fifty percent since the end of 2022. So, yeah, still the same story of Nvidia's rise. It's sort of

funny, because it's like all my friends at the labs, not to make it a whole stock story, but there was a very, very big wave of people who went in hard on Nvidia from the frontier labs, in the sort of 2021, 2022 era. And, um, you know, you think about the revenues those labs are making, plowing it into Nvidia, now that kind of reflects in its value. Yeah, anyway, there is a conviction about where this is all going by people there.

We're not giving stock advice on the show. Don't invest based on our stock advice. But, uh, certainly, AI scaling has been good to Nvidia.

Yeah, I will say, I remember when I was in grad school, like in 2017, 2018, and I was like, oh, wow, Nvidia is really doing well because of all the deep learning stuff, and GPUs being the backbone of deep learning, which is the big thing in AI. And even at the time I was like, I wish I had money to invest and was not a poor grad

student. So, well, they saw that coming in the twenty-tens, positioning Nvidia and the whole CUDA ecosystem for this for a long time. Yeah, it's pretty wild.

Moving on to Projects and Open Source. The first story is about Nous Research, which we've covered a couple of times, and them launching a user-facing chatbot. So this group has previously released Hermes, specifically Hermes 3 70B;

in this case, it's a variant of Meta's Llama 3.1. And Nous Research, one of their big trademarks is, uh, these unrestricted models: having free access for all, being completely unrestricted, with more or less no safety guardrails. Now, the article writer here did find that it did refuse to do certain things, like going into how to make drugs, although according to, uh, Nous, that's not from them; they didn't add any guardrails to this user-facing chatbot, so some of it was already baked into the model previously.

Yeah, I do find this interesting, like, um, that there's a certain, um, eagerness to do fully, fully, uh, no-guardrails AI. I don't think even, like, even xAI does; uh, sorry, even the platform X, through Grok, and kind of xAI therefore, they don't pretend to be trying to do a fully no-holds-barred thing, right? They're like, we will adhere to the law and not produce things like, you know, child pornography or whatever else. So the same thing's happening here, and Nous is interesting because they are especially into this thesis, which I'd interpreted earlier in a more extreme way. Um, but here they're basically saying, oh no, of course we have safeguards on the actual model.

Like, of course we try to prevent it from doing really, really bad things, like helping you make illegal narcotics, like meth. Like, naturally, as always, the model, as you'd expect, has been jailbroken; Pliny the Prompter, um, very, very quick on the case as usual, uh, finding a really powerful jailbreak that basically gets through everything. Um, that's really interesting. I mean, I'd love to do a deep dive on Pliny the Prompter's methodology and approach, because there's some fascinating stuff there.

But, um, it's really interesting to know that Nous is even launching this, right? This is not a new model; it is just a chat interface. So they are trying to play in that space as well. Um, yeah, so we'll see where it goes.

I mean, I don't know if they're going to be charging for this stuff at some point or how that will play out, but they are really into the, you know, make-it-available-for-everybody ethos, up to and including training methodology, right? We covered their DisTrO optimizer a couple of episodes ago, which anyway is meant to make it possible for people to pull off these massive training runs distributed across basically the whole world, between GPUs, that type of thing. So, uh, anyway.

That's right. And this is, I suppose, part of the platform Nous Chat, so that's very much the trajectory. The interface: you log in, you have a text prompt window; it has a fun kind of visual style to it, more like, I don't know, old Windows or a terminal, so it looks a little, I don't know, nerdy, let's say.

And one fun thing about it that is kind of interesting is you do have access to the system prompt and you can modify it directly, which is not the case with ChatGPT. So just to read a bit, the system prompt that is here by default is: you are Hermes, an AI to help humans build, create, flourish, and grow. Your personality is empathetic, creative, intelligent, persistent and powerful, self-confident and adaptable. You communicate informally and in concise responses that feel just like a human.

So yeah, it's neat that they do provide access to that and you can configure it. Next up, we've got FrontierMath, a new benchmark. So this one is crafted by over sixty expert mathematicians from, uh, top institutions, and has original, unpublished problems across various branches of modern mathematics, meaning that you shouldn't be able to find them on the web and learn from that. So compare that to existing benchmarks like GSM8K and MATH, which have similar problems already out there on the web. With that said, uh, here you have problems that require deep theoretical understanding and creativity, uh, and as a result, things like GPT-4o and Gemini 1.5 Pro struggle and solve less than two percent of the problems. I believe there was a quote from Terence Tao, one of the people involved, that this should be challenging for models for at least, at least a couple of years. Yeah,

and you've got an interesting framework that they, so it's not just the benchmark, they're coming out with a whole evaluation framework that's all about automated verification of answers. Part of that is to prevent, uh, guessing. So they want to prevent LLMs from being able to succeed just by kind of throwing out a guess and doing well.

Um, so they set up these questions so that the responses, the correct answers, are, uh, deliberately complex and non-obvious, to reduce the chances of guessing getting you to where you want to go. They're also designed to be the kinds of problems where it's not just a question of, it would take me a really long time to find the answer to this question, but I could do it through relatively straightforward reasoning, right?

So it's not like an undergraduate physics question, for example. Um, it's also not like, uh, some of the GPQA questions, so the graduate, uh, question-answering questions, which sometimes you can answer in one shot, like, without thinking. You need to have the expertise,

but if you have it, in some cases in that dataset, you can just go ahead and respond without thinking too, too much. Here they're trying to combine those two things together.

They want it to be really, really hard, and also to require hours, if not days, as they put it, of human thought time to solve. So you can really see, I mean, everybody keeps saying this with new benchmarks: um, if a model can solve this, then it's going to be AGI, right? Only AGI will be able to solve it. The problem is, every time these new benchmarks come out, there keeps being a trick, you know, some way to make models that do really, really well at it. Occasionally those tricks actually have broader implications for AGI, kind of, where they spill over into general capability, um, and, you know, that can happen quite often.

But they certainly don't require the full kind of AGI, um, that some people think they might. This one, yes, we're at two percent right now, success rates for cutting-edge language models like Claude 3.5 Sonnet, um, you know, Gemini 1.5 Pro, that stuff. But, um, it's unclear what's actually going to get in there. Is it better agentic scaffolding? Uh, is it a better trained foundation model? What is it? It's going to be interesting to see what actually ends up

cracking this metric. Pretty impressive to see, or at least a sign of the times, you could say, that now people are developing these absurdly difficult benchmarks that most humans couldn't even attempt. Like, they have some sample problems; um, this one that is in the paper, they say, is of medium difficulty. Just to read the problem:

Construct a degree 19 polynomial p(x) in C[x] such that X has at least three, but not all, linear irreducible components over C. Choose p(x) to be odd, monic, have real coefficients and linear coefficient negative 19, and calculate p(19). So I don't know what that means, and the solution they provide in the paper is an entire page, full of references to various theorems and so on.

So this is, like, hardcore math here. And I suppose it's not surprising that current LLMs, uh, can't, uh, beat it just yet.

Yeah, and you can really see in that problem phrasing the layering on of sequential requirements that makes it harder to guess, right? You can't just one-shot that, um, even with a guess; you'd have to guess multiple things, right, which reduces the chances that you get an anomalous result. So it's all meant to make it automatically evaluable. Jeez, I'm having a hard time with the words.

And last up, we do have a new open source model. This is Hunyuan-Large, an open source mixture of, uh, experts model with 52 billion activated parameters, from Tencent. It has 389 billion total parameters, and it's pretty beefy and impressive. It supports up to 256 thousand tokens of context and does beat Llama 3.1 70B on, uh, various tasks like logical reasoning, language understanding, and coding, and seems to be, uh, somewhat comparable to Llama 3.1 405B. So it certainly seems like Tencent is trying to flex a muscle and show its ability to build this scale of model.

So, one of the interesting things about this paper is they present a whole bunch of scaling laws, and they share their thoughts about, like, how many tokens of text data and how many parameters and so on. When you do the math, at least by my math, uh, which Claude is very helpfully helping me with, uh, we get to a compute budget of about ten to the twenty-one flops, right? And compute budget is also something that it's good to be interested in when you see a Chinese model, because one of the things they're really constrained by is US

export controls on hardware, and they find it really hard to get their hands on enough hardware for training these models. So here we have ten to the twenty-one flops.

So for reference, when we think about a GPT-4 class model, a Llama 3 400B class model, you're looking at training budgets there of about ten to the twenty-five flops. So we're talking ten thousand times bigger, all right, ten thousand times bigger than this model in terms of compute budget. So I find this really weird. They claim that this model is on par with Llama 3 400B. I'm maybe missing something in my calculations; somebody, if you can spot this, please do. Uh, this seems to me to be very much a stretch; this seems, frankly, implausible, so I must be missing something or the paper must be missing something. But if that is the compute budget, then they are doing something really janky, really weird, and that would be the headline, if the actual compute budget was that. But again, um, yeah, Llama: ten thousand times greater training budget, and here they're saying that it performs on par with Llama 3.1 405B. So that doesn't make any sense to me. Um, would love

to hear. Yeah, it seems maybe there is a typo; I haven't quite run the equation, right? They do say they trained for seven trillion tokens with 52 billion activated parameters. That would mean that it shouldn't be that different, on that order of magnitude.
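As a rough sanity check on that disagreement, here is the standard C ≈ 6·N·D back-of-the-envelope calculation using the figures mentioned above (roughly 52 billion activated parameters and 7 trillion training tokens); it's an approximation, counting only the MoE's activated parameters.

```python
# Back-of-the-envelope training compute, C ~ 6 * N * D
N = 52e9   # activated parameters
D = 7e12   # training tokens
print(f"{6 * N * D:.1e} FLOPs")  # ~2.2e+24 FLOPs
```

That lands around 2 x 10^24 FLOPs; the same formula for Llama 3.1 405B (405 billion parameters, roughly 15 trillion tokens) gives about 4 x 10^25, so the gap is closer to 20x than 10,000x, which supports the read that the 10^21 figure is a slip somewhere rather than the real budget.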

So, lots of details in the paper. They do talk about the architecture, the number of layers, the attention heads, the type of attention used, and so on.

These kinds of details on the nitty-gritty of how this is implemented, I always think, are useful for pretty much everyone working on LLMs. On to Research and Advancements. We begin with some work from Google and some collaborators called Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. So this is, uh, a pretty novel, pretty interesting technique for getting more out of tiny models,

as we've seen more and more gains in the space of one and two billion parameter models. And this one introduces the notion of recursive models. What that means is, typically, um, a transformer has N layers, right, and each layer is distinct.

What they do in this paper is say that you can take a set of layers and then basically stack them again and again, so you repeat layers a few times in a row. And just by doing that, you are able to, uh, still go to a small size but retain the performance of a larger model.

And that's the gist of the paper. The relaxed part is that when they do repeat the layers a few times, they still apply LoRA to differentiate them slightly across layers. So that, I think, is a neat technique, uh, showcasing continued progress in the space of being able to really squeeze all the performance out of fewer and fewer parameters.

Yeah, this is a really interesting paper for a lot of reasons, including the hardware interaction here. But for sort of intuition building, I found this really weird when I read it, to be honest. I wasn't familiar with the literature around, because there is some, um, around, I guess, what they are calling recursive transformers, as people have done

some little experiments, right? And actually, just to call this out, it might be confusing. So recursive, going back a little, there has been research on this, and recursive is different from recurrent. Recursive is different because you're not kind of updating a hidden

state, like a time-sequence element here. Really, you just have one input and you pass it through the same neural network several times to get a better output. So you take an input, you pass it through the weights to get an output, you put that output back through the same set of weights, and that's what it means to be recursive. And yeah, it has been known for a little while that it actually is possible to train neural nets to be better after several recursive passes, several passes through itself. And, Jeremie, take it over.

Yeah, no. But that fact itself, right, that's something that I was not aware of going in myself, and it struck me as quite counterintuitive, right? You feed the same data, as in, you put data into a model, uh, at layer one, and you make it go through layer one, and then instead of going to layer two, you may go back through layer one, over and over and over again, um, and you get a better result out.

And I was trying to build an intuition around this; best I could tell, it's like reading a book twice, right? You're doing the same thing, even though you're using the same, um, algorithm, the same layers and all that, uh, you're able to extract more and more information with each pass. And so this is essentially the same principle. Basically, you're chewing on the data more.

Uh, you can think of it as a way of just expending more compute in the process of chewing on that data. If you want to compare it to just feeding it through the layer one time, now you feed it through multiple times, and you get a better result. Um, so one of the challenges is, sorry, let's talk about the advantages first. The advantage is you are copy-pasting the same layer over and over, which means you don't need to load an eight billion parameter model.

Maybe you get to load a four billion parameter model if you reuse every other layer, right? Or, um, uh, anyway, you can keep playing games like that, where you have a layer stacked, say, three times in a row, the same layer, and then a different next layer copied three times, or it could be all the same layer. There are all those configuration choices that are possible. And so, um, one of the advantages here is that it cuts down on the amount of memory that you need to use on your chip, right?

This is really good for memory usage. Um, you still need to run the same number of computations, though; even though your layers are identical, your weights are identical, uh, your data, as in the embeddings of that data, are changing. You still have to run those calculations. So the logic cost, the number of flops, the flop capacity of your hardware, still, you know, needs to be utilized intensely.

There is a way that you can even get an advantage on that level though: because so much of your computation looks the same, it makes it easier to parallelize. So they have a section of the paper on continuous depth-wise batching, where they're talking about, okay, how can we leverage the fact that the

layers are identical to make the actual logic, uh, less demanding on the chip, which is really cool. But the really big boon here is for memory usage, because you're functionally cutting down on the size of your model in terms of RAM. Um, so that's really cool.

It's such a dead simple method. There is this technique that they're using, uh, that seems to work best in terms of deciding which layers to copy-paste, that they call their stepwise method. This was the one that worked best. So basically, if you have, you know, a twenty-layer transformer, um, they would take every other layer and copy it once.

So you take layer one, um, repeat layer one one time, right, then take layer three, which would be the next one. So it's layer one, layer one, then layer three, layer three, then layer five, layer five, layer seven, layer seven, all the way up to twenty. And that's kind of the thing that they found worked best.

The intuition behind that just being that, hey, there was prior work that showed that this worked. So a lot of this is just sort of hacky engineering. Um, but still a really interesting way to kind of play with, again, play with hardware, see what can we do with chips that have crappy memory but maybe good logic. Um, you know, it's unclear which chips would necessarily fit in that category once they use this continuous depth-wise batching strategy. But really interesting, um, and a great way to get more out of your

model, yeah. And this paper has quite a bit to it, a lot of details that are interesting. So they do use the stepwise strategy initially, but when they add this other trick of LoRA, for these layers to be able to adapt slightly, uh, for deployment, they do a slight modification where, somewhat surprisingly, they average two layers. So, like, layer one is the average of layers one and four, and the next one is an average of layers two and five.

Just empirically, they found this worked better. And you do need to, they say, uptrain,

so you need to train it after initializing for a while to get it to work well, but they do say that you don't need to train it very much,

just something like fifteen billion tokens of uptraining. A recursive Gemma 1B model outperforms even full-size, pretrained models like Pythia and TinyLlama. So, yeah, it's quite interesting, and we'll be seeing, I guess, if this gets adopted in practice.
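To make the layer-tying and the LoRA "relaxation" concrete, here is a minimal PyTorch-style sketch under stated assumptions: a generic `make_block` factory for a transformer layer, stepwise sharing of every other layer, and one small LoRA adapter per depth position. For brevity the adapter here acts on the block output, whereas the paper attaches layer-wise LoRA to the tied weight matrices themselves; treat this as an illustration of the idea, not the paper's code.

```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Low-rank additive tweak so tied layers can differ slightly per depth."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class RelaxedRecursiveStack(nn.Module):
    """
    Stepwise sharing: with 20 "virtual" layers we keep only every other layer's
    weights and run each one twice (0, 0, 2, 2, ...), roughly halving parameters.
    Each depth position gets its own tiny LoRA adapter (the "relaxation").
    """
    def __init__(self, make_block, dim, num_virtual_layers=20, repeats=2, rank=8):
        super().__init__()
        num_shared = num_virtual_layers // repeats
        self.shared = nn.ModuleList([make_block() for _ in range(num_shared)])
        self.adapters = nn.ModuleList([LoRA(dim, rank) for _ in range(num_virtual_layers)])
        self.repeats = repeats

    def forward(self, x):
        for depth in range(len(self.adapters)):
            block = self.shared[depth // self.repeats]  # reuse the same tied weights
            x = block(x) + self.adapters[depth](x)      # plus a per-depth LoRA tweak
        return x
```

The memory win comes from `self.shared` holding half as many blocks, while the per-depth adapters give each pass through a tied block a little room to behave differently.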

And um I don't know if if we talked about the the Laura adapter role in this kind of conceptual structure, but maybe just for emphasizing when you log in, write these parameters and you just repeating the same layer over and over, um you you might want to give your model a little bit more degrees of freedom than net a little bit more of an ability to kind of adapt to the new problem domain that you're going to be training IT on.

And and that's really where the lord adapters come in. IT gives the model a little bit more room to stretch itself, right? Hence the relax qualifier here and relaxed recourse of transformers. You're giving you a few more degrees of freedom to to kind of modify e itself without that constraints. All these layers have to be the exacting.

So that's of the two. Yeah, right. Laura also, h for some references, a way to, uh, likes officially change a bunch of weights by tweet a smaller set. Farmers, you could basically reduce to do so that the idea here is you're not updating, you're still sharing most of the wait, but you update a few parameters that make them a little more distinct. And out of the next research regard, applying a golden gate cloud mechanistic interpret ly technique to protein language models.

And this is not a paper, actually; this is more of an open source project that looked into applying the same technique that we covered, I believe, a few months ago, where you have sparse autoencoders that can be applied to LLMs to get at internal features. The famous example, I guess, is the Golden Gate Bridge feature in Claude: you can see that there is this kind of notion or concept within Claude that gets activated for certain inputs.

And that is done via a sparse autoencoder technique that compresses the outputs of certain layers in the LLM and then finds regularities at a high level. So this work was applying that same technique to a model specialized in protein prediction, a protein language model, and they found some interesting features in this context. I think, Jeremie, you read more into it, so I'll let you take over.

I mean, I really like this work, and for context, the SAE, the sparse autoencoder, is a bit of a darling of the AI interpretability world, especially among folks who care about loss-of-control scenarios, like, is my AI trying to plot against me, or to scheme, as I believe the technical term is. So the idea here is: you pick, let's say, a middle layer of our transformer, and we'll pick specifically the residual stream.

The residual stream is basically the part of the architecture that takes whatever the activations were from the previous layer and just copy-pastes them into the next one. It's a way of preventing the information from degrading as it gets propagated through the model. But anyway, essentially you pick a slice of your transformer, you feed the model some kind of input, and you're going to get activations at that layer.

Now take those activations and use them as the input to another model, the sparse autoencoder. The sparse autoencoder is going to take those activations, and it's going to have to represent them using a small set of numbers, like a compressed representation.

So, as a cartoon version, maybe you have ten thousand activations and you want to compress them down to something like a hundred-dimensional vector, right? That's what the sparse autoencoder is doing: it compresses them, and then from that compressed representation it decompresses them and tries to reconstruct the original activations. The loss function it uses is usually something like the difference between the true and the reconstructed activations. So it basically just gets really good at compressing these activations down to a smaller representation. It turns out, and Anthropic found this, that when you do that, the individual entries in that compressed representation end up corresponding to human-interpretable features.

So for example, one concept might be captured by one or a small number of those entries; the idea of a molecule might be captured in the same way. And so this is basically meant to be a way of taking this very complicated thing, all the activations in this residual stream, and compressing them down to a manageable number of numbers that we can actually get our arms around and start to interrogate, understand, interpret. That's part of the hope of the alignment game plan: that we'll be able to use this to understand, in real time, the thinking of AIs that are potentially dangerously advanced, at least in theory.
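As a rough sketch of what a sparse autoencoder looks like in code (the dimensions, the ReLU latent, and the L1 penalty weight are all illustrative choices, not Anthropic's; in published SAE work the feature dictionary is often actually larger than the activation vector but kept sparse, which is the sense in which it's a compressed, interpretable description):

```python
# Minimal sparse autoencoder over residual-stream activations: encode,
# decode, and train on reconstruction error plus an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act=4096, d_dict=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        latents = torch.relu(self.encoder(acts))   # sparse "feature" vector
        recon = self.decoder(latents)
        return latents, recon

    def loss(self, acts):
        latents, recon = self(acts)
        recon_loss = (recon - acts).pow(2).mean()        # reconstruction term
        sparsity = self.l1_coeff * latents.abs().mean()  # few features active at once
        return recon_loss + sparsity
```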

A lot of interesting success has been found there, including on steering the model's behavior. So we do something called clamping: we pick one of those numbers in the compressed representation, and let's say it's the number that represents banana, or encodes the idea of banana; we crank up its value artificially and then we reconstruct the activations.

We can then get the model, based on those activations, to generate outputs that are tilted towards banana, whatever that means; maybe it talks a lot about bananas or something like that. That was the Golden Gate Claude experiment, right? They found the entry that corresponded to the Golden Gate Bridge.

They clamped it to a really high value, and then the model would just yap on about the Golden Gate Bridge. And so here the question is going to be: will we find the same thing if we work on transformers that are trained on bio sequence data? And they picked a model that was developed by the company

ESM, sorry, this company EvolutionaryScale made the ESM series of models; we covered ESM3 many months back. Fascinating model.

It was the first ever bio model, by the way, to meet the threshold of reporting requirements under Biden's executive order back then, so it was a really, really big model. What they did here was take a smaller model, ESM2, that that company had built, and they played the same game.

Can we pick a middle layer of that transformer, build a sparse autoencoder, and recover human-interpretable features, right? Can we find features that correlate with, in this case, common structural components or facets of biomolecules? A common example here would be the alpha helix.

So if you put proteins together, sorry, if you put amino acids together: certain kinds of amino acids, when you string them together to form a protein, will tend to form a helical structure called an alpha helix. The other secondary structure they sometimes form is called a beta sheet, or beta pleated sheet, or whatever. There are all these different structures that these things will form depending on the kinds of lego blocks, the kinds of amino acids, that you string together. They all have slightly different charges, attracting and repelling in these nuanced ways.

It's notoriously hard to predict what the actual structure is going to be. Well, here, using this technique, they're able to find: okay, we actually have in our SAE, in that reduced representation, some numbers that correlate with, oh, there's going to be an alpha helix here.

A lot of helices, or, you know, beta sheets, or whatever else. And so that's interesting from an interpretability standpoint: we can understand a little bit more about what goes into making these proteins take the shapes they do. But then they also found that by modifying the values in that compressed representation, by doing this clamping thing, let's say we artificially enlarge the value of the alpha helix number, you could actually prompt the model to output sequences that would have more alpha helices.

And so this is kind of interesting from a protein design standpoint, right? A tantalizing hint, well, maybe not the first, but bucket it with AlphaFold as part of a series of tools that could allow us to better understand how proteins fold and actually come up with designer proteins with certain structural characteristics that would otherwise be really, really hard to design.
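To make the clamping move concrete, here is a sketch that builds on the sparse autoencoder sketch above; the feature index and clamp value are purely hypothetical, whether the entry in question is the Golden Gate Bridge or the alpha helix:

```python
# Sketch of feature clamping/steering: encode an activation, pin one latent
# entry to a large value, decode, and hand the modified activation back.
def clamp_feature(sae, acts, feature_idx, value=10.0):
    latents, _ = sae(acts)
    latents[..., feature_idx] = value   # e.g. the hypothetical "alpha helix" entry
    steered = sae.decoder(latents)
    # In a real steering setup, `steered` is substituted back into the
    # residual stream at that layer and the forward pass continues from there.
    return steered
```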

And on to the lightning round. We begin with a blog post, "From Naptime to Big Sleep: Using Large Language Models to Catch Vulnerabilities in Real-World Code." This is from Google Project Zero, a team that has been around since 2014, dedicated to finding so-called zero-day vulnerabilities, vulnerabilities in code that aren't yet known or out in the wild, which hackers could exploit before there are protections for them. They previously had this project, Naptime, evaluating the offensive security capabilities of large language models.

They had a blog post several months ago where they introduced this framework for large language model assisted vulnerability research and demonstrated the potential for improving state-of-the-art performance on the CyberSecEval 2 benchmark from Meta. That was a little while ago, and now Naptime has evolved into Big Sleep, where Google Project Zero is collaborating with Google DeepMind. In this blog post they announce a pretty exciting result from this Big Sleep agent, an LLM agent optimized for helping with, I guess, vulnerability detection.

They discovered via this agent an unknown, real vulnerability in a major project, SQLite, reported it, and the developers fixed it. To their knowledge, this is the first time AI has been used to find a real-world vulnerability like that. The blog post goes into a whole lot of detail, and it seems to be a pretty tricky case, not some sort of trivial discovery, so to speak. So, very exciting for the implications of being able to fight attackers with AI.

Yeah, and also a warning shot that, hey, these things can now actually discover real-world vulnerabilities. It's always a double-edged sword with these things. And that's been a big question mark in the debate over AI and what risks it may pose. I've had debates with people who'll say, well, we haven't seen an AI system actually successfully discover substantive vulnerabilities in real-world systems,

and so therefore, et cetera. Now that we have, I wonder what the implications may be. There have been pilot studies; we've talked about a couple, first on finding one-day vulnerabilities, where the exploit has already been logged somewhere and you're just getting an AI agent to exploit it, and then zero-days, which is really figuring out, without knowing whether there even is a vulnerability, how to find one from scratch, but in kind of more toy settings. This is the real world, though. SQLite is a very, very popular library, and this is an interesting bug and an interesting exploit.

It's a pointer dereference issue, which essentially means you have a pointer that points to memory addresses, and this vulnerability allows you to control what it points to. So this essentially gives you some control over what gets written to or read from memory, and that could in principle allow the attacker to pull off something like arbitrary code execution: if you point the pointer to some specific buffer space or some adjacent memory, you may be able to pull that data in and use it for whatever purpose. Besides that, there's also just making the application crash, right? Just have a, like, fucked-up pointer or something and it just won't work.

So all of that is kind of interesting. They go into how this thing works, and it is, I think, quite an interesting improvement over current techniques. The best techniques we have right now include things like fuzzing, where you basically just throw everything and the kitchen sink at your software and see if anything breaks. This is a much smarter approach, obviously, powered by a thinking AI system. So yeah, pretty cool. And this is, by the way, a bug that remained undiscovered after one hundred fifty CPU hours of fuzzing.

So people had tried the standard techniques on this many times over, which makes sense; it is a popular library. But those techniques failed, whereas this AI-powered one succeeded.
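For anyone unfamiliar with the fuzzing baseline mentioned here, a toy version of the idea looks something like this (real fuzzers like AFL or libFuzzer are coverage-guided and vastly more sophisticated; this is only meant to show the brute-force flavor of the approach Big Sleep improves on, and the parser in the usage comment is hypothetical):

```python
# Toy fuzzer: hammer a target function with random byte strings and record
# anything that raises, the crude analog of "see if anything breaks."
import random

def naive_fuzz(target, trials=100_000, max_len=64):
    crashes = []
    for _ in range(trials):
        data = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
        try:
            target(data)
        except Exception as exc:        # a crash or unexpected error is a finding
            crashes.append((data, repr(exc)))
    return crashes

# Hypothetical usage against some parser you want to stress:
# findings = naive_fuzz(lambda raw: my_parser.parse(raw))
```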

And one more story for this section, this one not about progress but rather about a reported lack of progress in some unreleased research.

So it's about OpenAI. There has been a report from The Information about them reportedly working on new strategies to deal with an AI improvement slowdown.

OpenAI has been working on something like a GPT-5, an upcoming model that has been code-named Orion, which is where that reference to Orion from before comes from. And the report is saying that it seems to not be showing as significant an improvement over its predecessors as in previous iterations.

So in the leap from GPT-3 to GPT-4 there was a massive improvement. GPT-3 was pretty impressive;

GPT-4 was much more impressive. And GPT-4 now is, oh, I don't know, what, two years old? No, a year and a half old. It's been a while since GPT-4 came out, and we haven't had that kind of leap since, except maybe you could argue that with o1 and the introduction of inference-time compute, we saw some pretty significant qualitative gains.

Regardless, this report from The Information is saying that the commonly used, standard trick of more data, more compute, more scale may not be as effective as it was previously. Of course, we are running into a scarcity of new training data; that's one of the issues.

Most of the internet has already been sucked up, and the report describes a new foundations team within OpenAI looking at possible alternatives to just scaling, like doing more in post-training, or generating more synthetic data from AI models, et cetera. Now, OpenAI has not commented on this and has previously said they have no plans to release Orion or anything like GPT-5 this year, so you can take it with a grain of salt, but it's also maybe not super surprising.

Yeah, I think this is such an interesting part of the debate and the question over scaling, right? When we look at scaling curves, what we're typically talking about is, roughly, how well the model's next-word prediction accuracy improves with more data and more compute. The challenge is that an improvement in next-word prediction accuracy does not necessarily tell you how generally useful a model is, or how good it is at reasoning, or the other things that we might actually care about.

So you've got this very robust scaling law that tells you the model is getting better at predicting next tokens, but there's uncertainty about the value that's being created in that process. So that's one dimension of uncertainty.

Without knowing what's going on with Orion, what the training data looks like, what it's intended to do, like, is this another reasoning system? It seems like it's not supposed to be, but there's a lot of fog of war here. Without knowing that, it's hard to know whether what I've just described is an important part of the uncertainty, or whether it's a reasoning model and the inference stuff isn't working out. From what I've seen, it seems more likely to be the former: that this is really meant to be a base, pretrained, GPT-5-type model, as opposed to o1, which, I really don't want to say is bells and whistles, it's way more than that, but it's certainly leaning more towards the inference-time paradigm, and that's the big leap there. And we have separate inference-time scaling laws now too, right, that complement the training-time scaling laws. So that may well be enough to do really interesting things. But yeah, there's a whole bunch of interesting gossip about OpenAI in here.

Apparently, back when Orion had only completed twenty percent of its training run, Sam was really excited about it and was talking internally about how this would be a big deal; it was very hyped up. It seems that hype has failed to materialize, and that's really what's at issue here. There are also questions about what hardware this stuff is being trained on, like, what is this training run? I'm guessing it's the H100 fleet that OpenAI is running right now, but at what scale? What are they really pushing in terms of scale?

Really hard to know. And just more generally, they are setting up this foundations team to explore kind of deeper questions now. If the default path is scaling, the engineering path we'll call it, right, where you just build faster horses,

if that doesn't work, what do we do instead, right? That's the big question. I think in this instance OpenAI has really, and quite ironically, put itself in a difficult position over the last two years, right? They've bled off,

I think it's better to say, not all, but much of their best algorithmic design talent, right? Ilya Sutskever has left; we've seen the safety team, Jan Leike; we've seen basically a huge, huge amount of talent leave, including product talent, with more departures recently too. These are really, really good folks who are gone, in many cases to Anthropic.

And so if it is the case that we are moving from a regime where it's about exploiting a paradigm, in other words doing really good engineering and getting scaling to work really well, to a regime where we're looking for new ideas instead, then you might anticipate talent being the main limiting factor, in which case Anthropic starts to look really interesting, right? You've got a lot of companies now that could be competing here. Meanwhile, OpenAI is hamstrung by a relationship with Microsoft that is currently tepid at best; in recent investor communications, Microsoft did not refer to OpenAI in the future tense at all,

right? That is a big, big change. So as that starts to happen, as OpenAI is forced to work with companies like Oracle to develop infrastructure because Microsoft apparently isn't meeting their needs,

there's tension there too. This starts to become really interesting for them, and Sam has got to find a way to square this circle.

He's got to find a way to keep raising money, he's got to find a way to keep scaling, for whatever that's worth, and then he's got to retain talent.

It would be interesting if this turned into a very significant structural challenge for OpenAI, if they've doubled down too much on scaling. But again, this is all speculative; we won't know until the models start dropping. And frankly, I think when the Blackwell series of GPUs comes online and we get those big clusters running next year, I mean, look, everybody I know in the space really expects big performance improvements from the early tests they're doing, and I suspect we'll be looking back on scaling like, yep, that was a real thing all along. But if not, the implications for OpenAI at least are interesting.

That's right. And also worth noting, this is not unique to OpenAI; it's an open question in general whether it is even doable to keep scaling, in part because of training data running out. That was a speculation for a while.

And just to paint a high-level picture, right, what scaling means is: GPT-3 was around 175 billion parameters. GPT-4 we don't know, but the speculation and rumors were that it was around 1.8 to two trillion total parameters, via this mixture-of-experts setup with some smaller set of active parameters. And so GPT-5, or whatever the equivalent is, this Orion, you could say maybe would have ten trillion total parameters, or twenty, that kind of jump in size. And the speculation is: if you do that same kind of move as from GPT-3 to GPT-4 and just add more weights, add more scale, add more training, will you get that big a jump? Right now it's unclear, right? And this report is basically claiming, or seems to claim, that maybe it's not quite as successful as it has been in the past. But it remains to be seen.

Yeah, I think it's worth noting, though, on the data scarcity side: there is eventually a data wall, as people have described, there are data shortages, but that is expected to kick in about an order of magnitude of FLOPs, of training compute, further out than, for example, the power constraints.

And right now we're not even close to power constraints in our current runs. We're seeing ten to the twenty-six FLOP runs, next year possibly into ten to the twenty-seven; that's still about two orders of magnitude before you hit even the power constraints on the grid.

So right now I don't think data scarcity is actually the thing driving the limited capabilities here. I think something else is going on, and we'll presumably have to wait and see. That's part of the reason why I'm curious what happens at the next beat, when we get the Blackwell clusters online, when we start to see the hundred-thousand-GPU clusters of the new GB200s running.

Like, do you then see the leap of transcendence, to use the Saudi Arabian terminology for this? Do you start to see that kind of improvement? I don't know. But there's a lot of experimenting, with many billions of dollars, that will be run to find out.
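For a rough sense of the orders of magnitude being tossed around here, a back-of-envelope sketch; every number below is an assumption (roughly H100-class hardware and a hypothetical hundred-thousand-GPU cluster), not anyone's actual figures:

```python
# Back-of-envelope: how long a given training-run FLOP budget takes on an
# assumed cluster, and roughly how much power that cluster draws.
gpus = 100_000            # hypothetical cluster size
peak_flops = 1e15         # ~1 PFLOP/s per GPU at low precision (assumed)
utilization = 0.4         # fraction of peak actually sustained (assumed)
kw_per_gpu = 1.5          # all-in draw incl. networking and cooling (assumed)

effective = gpus * peak_flops * utilization      # ~4e19 FLOP/s for the cluster
days_1e26 = 1e26 / effective / 86_400            # ~29 days
years_1e28 = 1e28 / effective / (86_400 * 365)   # ~8 years on the same cluster
power_mw = gpus * kw_per_gpu / 1_000             # ~150 MW of sustained draw

print(f"1e26 FLOPs: ~{days_1e26:.0f} days at ~{power_mw:.0f} MW")
print(f"1e28 FLOPs: ~{years_1e28:.1f} years on the same cluster")
```

The point of the toy numbers is just that each extra order of magnitude of training compute, holding run length fixed, means roughly an order of magnitude more GPUs and therefore power, which is why the grid, not the data, becomes the nearer-term constraint on these estimates.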

But that is being run, yes. This is a big question and I guess we will find out soon enough. Alright, moving on to policy and safety. As promised, we are going to talk about Donald Trump's victory in the presidential election in the U.S.,

and in particular what it means for AI. No political commentary from us, even though as citizens we have some opinions. But regardless, Donald Trump is going to return to the White House, and there is not a ton we know about the specifics of what might happen, but we do have some pretty decent ideas, at least in broad strokes, of what will happen.

So for instance, we do know Trump's administration is presumably going to repeal President Biden's executive order on AI, which we've covered plenty on the podcast. This is a very big order, but not a law, so because it was just an executive order, the Trump administration can just cancel it, more or less.

Now, there might be retention of some of the features of it; they might revise it rather than outright cancelling it. We don't know for sure, but it does seem likely that at the very least there will be revisions to it.

And of course, we know that Trump loves to fight with China, and that's been an ongoing situation in the U.S. for a while, so there will probably be more on that front. But Jeremie, you're the policy guy, so I'll let you do more of the talking here.

Yeah, I mean, I used to think of myself as a tech guy, I guess. But no, look, the policy universe I live in is the national security policy universe, to the extent that I live in the policy universe at all. And I think there are a lot of people in the kind of general AI safety world who are really concerned about a Trump administration, and I actually think that a lot of those concerns are quite misplaced; I think this is a misreading of what we need and where we are.

So just for context, we've seen Trump on various podcasts; that's mostly what we have to go on, by the way, and the article goes in depth into comments that Trump has

made, but no promises and no guarantees; it's kind of reading tea leaves and guessing based on

the comments. Trump has, rightly in my opinion, described AI as a superpower and called its capabilities alarming. He has also referred to China as the primary threat in the race to build advanced AI, which I think is also correct.

And then you've got this interesting question as to cabinet staffing. Elon is a massive influence in the cabinet, or I should say, sorry, on the transition team and broadly on the team; I don't know that he'll be in the cabinet officially because he's kind of busy with a lot of companies. But obviously a massive influence, and very concerned about everything from weaponization to loss of control. There are a lot of good quotes from Dan Hendrycks in this article, who advises Elon quite a bit.

And then the question is, you've got Musk, and on the other side you've got Trump's VP, obviously, who has expressed concerns in the past over closed-source AI entrenching the tech incumbents. Now, I think this is a very rational concern to have, right? You don't want closed-source pure plays while not allowing people to open source stuff. I think that is going to start to change inevitably as you start to see open source models actually getting weaponized, and it's just going to become super obvious to all concerned.

And at that point, the administration clearly is preserving their optionality to go in that direction. At the same time, some big questions here remain around the AI Safety Institute, for example. That was sort of spun out of the executive order; at least a lot of the bones were laid there. There's an interesting question as to whether that remains. It is the case that most Republicans do support the AISI; it's a part of the broader American strategy on AI, and it's certainly a home for expertise.

A question, too, as to whether Paul Christiano continues to run it. That's another degree of freedom they have: they keep the AISI but swap out Paul Christiano, the former head of alignment at OpenAI, who invented reinforcement learning from human feedback. So that would be an interesting question.

But then more broadly, the executive order, right, the famous Biden executive order, over a hundred and ten pages, was the longest EO in living memory. I think there are a lot of components there that are likely to be preserved in a Trump administration, and I think you'll see some stuff get scrapped.

Look, that EO did tons of stuff. It talked about bias and civil rights and all kinds of things under the banner of AI. I think you could well see that get carved out or hollowed out. Trump has said he's going to rip out the EO; that's not in question, that will probably happen, but what it gets replaced with is really the issue here, right? How much of the national security stuff gets preserved? I wouldn't be surprised if we end up seeing a lot of that stuff still in there.

And anyway, there are all kinds of questions as well about what we do on energy. We have a ton of work to do in the United States to get energy back on the table. We have forgotten how to build nuclear plants; we can't build them in under ten years, and we need a way to keep up.

We just talked about the power bottleneck and how that kicks in at about ten to the twenty-nine FLOPs. Well, that's coming; that's the training runs of like two, three years from now. If it takes you ten years to build a nuclear plant, then you've got to change something pretty fundamental.

We need to get natural gas online, we need to get geothermal potentially, and a lot of these things fit with the kind of Trumpian school of thought.

So, making sure AI gets built here. The questions are all going to be around, you know, what about things like loss of control? What about things like weaponization of open source? Those are the big question marks. And right now, again, the administration has positioned itself very openly, very flexibly.

And the China angle, I think, is a very bipartisan piece too, right? I don't think we're going to see all the export controls that have been put in place get ripped out; those are actually going to be bipartisan and maintained. Where we might see a change would be the Trump administration maybe focusing more on enforcement, right? We've covered a lot

of the leakiness of these export controls under the current administration. It would be great to see actual loopholes getting closed as fast as new loopholes are opening, and that's something you could see. So, one last kind of meta point here: the uncertainty we see around what might come out of the Trump administration partly reflects uncertainty in the technology, but it also reflects the classic kind of Trumpian move of maintaining uncertainty for the purpose of negotiation leverage, right?

You see this with tariffs and all the discussion around that; the threat has to be credible so that it actually leads to leverage internationally. And something that we've seen other administrations struggle with is, if you're speaking softly and you're not carrying a big stick, then people will not take you seriously. And to the extent that there is a lot of negotiation to do with China on this issue, you may actually want to negotiate from a position of strength.

And for that, you need to have the energy leverage and other things. So I think big, big questions around the AISI, big questions around what the focus is on open source and on loss of control. But with Elon there, I think there's a lot of room for positive stuff to happen, potentially, on the safety side.

So yeah, I think the story is, again, much more positive. A lot of the people I know in the kind of AI safety world seem much more concerned about this, and I think part of that may just reflect a concern over, frankly, politics; some people just don't want this administration, and that's part of it.

But right now it's unclear, and we just have to wait and see. I think there are some really good policies that have been put forward, generally on the energy side and elsewhere. So wait and see is

probably the best approach. Right. My impression also is that the article goes into it and basically lays out a picture where there don't seem to be any obvious big overturnings of what's been happening; it's going to be a lot of tweaks. Similarly with the CHIPS Act, which was one of the major moves during the Biden administration:

Trump has been somewhat critical of it, but it's unlikely that the Republican Congress and Trump will repeal that act. It might be revised, but it does seem more likely that it will stay in place and continue being a factor. So that's, I guess, this article's summary and our best guess at the implications of a Trump presidency for AI.

We will have to wait and see what happens in practice. And speaking of evading sanctions from the U.S.,

the next article is "Fab Whack-a-Mole: Chinese Companies Are Evading U.S. Sanctions," and this is a bit of an overview, I suppose.

It's talking about the enforcement of these restrictions, and about how companies such as Huawei are exploiting various loopholes to acquire advanced semiconductor manufacturing equipment, which then enables them to build large AI clusters. So again, Jeremie, I'll let you take over on this one, since this is your wheelhouse.

Oh yeah. Okay, so I will always share SemiAnalysis any chance I get; SemiAnalysis is an amazing newsletter.

If you're into AI hardware stuff, hardware stuff in general I should say, go check it out. The posts are really technical, so unless you kind of know the hardware space it's tough to justify a subscription if you're not getting all the value out of it. But if you're in that space, I mean, you're probably already subscribed.

These guys are amazing. So this is a report on the really difficult enforcement challenges facing the Department of Commerce and BIS as they look to enforce their export controls on AI chips. I just want to give you an excerpt from this report.

They're talking about SMIC, which is China's answer to TSMC, obviously; they produce all of China's leading nodes on the hardware side. So they say: the sanctions violations are egregious. SMIC produces 7-nanometer-class chips, including the Kirin 9000S mobile SoC and the Ascend 910B AI accelerator, and two of their fabs, okay, two of their fabs, are connected via a wafer bridge.

Okay, so a wafer is this big circular thing made of silicon, and that's what you etch your circuits into; anyway, it's the starting point for your fab process. So two of their fabs are connected via a wafer bridge such that an automated overhead track can move wafers between them. For production purposes, this forms a continuous clean room and effectively one fab.

But for regulatory purposes, they're separate. One building is entity-listed by the U.S.; in other words, one building is owned by an entity that is on a blacklist, and you're not allowed to sell advanced AI logic to them because of national security concerns, whereas the other one is free to import these dual-use tools and claims to only run legacy processes. And yet they're connected by this physical fucking bridge. That's how insane this is.

You basically have one facility, and we're just going to trust China and SMIC that they're not sending your wafer right when it should be going left, type of thing. That's the level things are at. They go into detail on stuff that we've been tracking for a long time. So there is a fab network that is being run and orchestrated by Huawei, where they spin up new subsidiaries basically as fast as they can to evade U.S. export controls.

Right now, U.S. export controls work on a blacklist basis. So you basically say, okay, we're going to name entities and organizations that you are not allowed to sell advanced semiconductor manufacturing equipment to, and we try to keep that list fresh.

Well, Huawei is just going to create new spinoff entities as fast as they need to, and they have this vast network now that is basically moving Huawei into the center of what you might think of as China's AI ambitions. If you start to think about what is, not even the OpenAI of China, but the coordinating entity for a lot of China's big-scale AI work, it is increasingly Huawei, both on hardware and software. So there are all these pushes to get Huawei boxed out of all this, and what this report argues for, and I think it's quite sensible, is that you need to start to think about tightening, in a broad way, your export control requirements. So instead of saying, look, we've got a blacklist and we will try to keep that blacklist fresh, you instead use a wide range of tools to require that any material that is at all U.S.-fabricated anywhere in the whole supply chain can't be shipped. So even if you're at ASML and you're building something that has any component of U.S. technology in it,

you can't ship that to China. These broader tools are becoming necessary just because otherwise you're playing this whack-a-mole game that you're destined to lose, and at this point the stakes are just way too high. And by the way, I say this while SemiAnalysis is AI accelerationist in their bones, right? They are not kind of AI-safety-pilled; as far as I can tell, quite the opposite. And here they are saying, no, we need to fucking end the export of this hardware to China in a very robust, unprecedented way.

I think this makes all the sense in the world. If you believe this is ultimately dual-use technology, then that's what you've got to do. We can't be updating a blacklist every twenty minutes.

And just a couple more stories. The next one is very much related to that previous one; it's actually an example of sanctions violations. So the story is that BIS has fined the company GlobalFoundries for shipping chips to a sanctioned Chinese firm.

This is a five-hundred-thousand-dollar penalty on this New York based company, GlobalFoundries. It's the world's third largest contract chip maker, and it shipped chips without authorization to an affiliate of SMIC, the Chinese chip maker. This was seventy-four shipments of seventeen point one million dollars' worth of chips to this company, SJ

Semiconductor, which is affiliated with SMIC. Interestingly, the article also says that GlobalFoundries voluntarily disclosed this violation and cooperated with the Commerce Department.

And there was a statement from the Assistant Secretary for Export Enforcement, Matthew Axelrod, saying, in effect, we want U.S. companies to be hyper-vigilant when sending semiconductor materials to Chinese parties. And

GlobalFoundries came out and said they regret the inadvertent action, which was due to a data entry error made prior to the entity listing.

So a data entry error is blamed for this. Look, probably true, and this stuff is really difficult to enforce, especially when you have a very complex set of layered requirements and all the rest. The rules right now are not simple, and that is a challenge for enforcement, so maybe no surprise to see this is yet another kind of leaky situation. Obviously TSMC had similar issues recently, right? They accidentally sold chips to a Huawei affiliate.

But this is just what happens, and it's part of the reason why you need stronger incentives. If firms like GlobalFoundries are running processes that are subject to these kinds of errors, then that just implies, okay, they need to try harder; the incentives just need to be stronger.

To jump back to that SemiAnalysis report we were talking about earlier, one of the call-outs they make is that the industry side of this has been claiming that tighter export controls would wreck industry and their bottom line, and they've actually been doing better, not worse, including decent sales to the Chinese market. The last few years have been an absolute boom time for them in spite of increasingly tight export controls. So the economic argument may be faltering a little bit here. But yeah, we're seeing in real time these holes appear and then get plugged; this one will get plugged, and then there are going to be new holes. It's this never-ending game of whack-a-mole, to plagiarize the Semi

Analysis post title. And last up, the story is that Anthropic has teamed up with Palantir and AWS to sell its AI to defense customers, quite related to the story last week about Meta altering their license, their user agreement, to let the defense sector in the U.S.

use their models. Now, this collaboration would allow Claude, the chatbot from Anthropic, to be used within Palantir's defense-accredited environment at Impact Level 6,

which, as I understand it, is reserved for systems containing data critical to national security. So Anthropic previously had, I guess, prevented this kind of use, or at least precluded it in their user agreements for U.S. defense customers. And as with the article we discussed last week, this seems to be part of a general trend.

Anthropic, I have heard, has been really transparent internally with their own teams about this and about the deliberative process behind it. I mean, I actually think this is, you want an AI-safety-focused org to be working with the U.S.

government, to have them understand what's going on, including in defense contexts. This is going to be for, yeah, intelligence analysis, that sort of thing. So while I think they are going to face a lot of flak for this, I think this is a good move.

And the Palantir partnership is actually going to be really important for them too, because selling into the DoD is hard; you want to work with someone who really understands that process. So yeah, this is another big boon for Anthropic, potentially, because that market is also just really big, and it's what Anthropic needs to do both to understand their customer, a really big potential customer, and also to further their own mission.

They need to be able to interact tightly with the U.S. government, with the national security part of the U.S.

government, and know that stuff. So yeah, we will see where this goes and if we end up seeing more reporting

about this deal. Yeah, and speaking of the U.S. government, the news also covers that Claude has come to AWS GovCloud, which is a service designed for U.S.

government cloud workloads. So seemingly it's not just for defense use; it's also in general for use by the U.S. government.

And that will be it for this episode of Last Week in AI. Once again, you can go to the episode description for links to the top stories, and also go to lastweekin.ai for those links and for the text newsletter. We always appreciate your comments, your reviews, your tweets, all those things. We do appreciate you listening, so please keep tuning in, and hopefully you enjoy this AI-generated outro song, which is hopefully not terrible.