Hello and welcome to the Last Week in AI podcast where you can hear a chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news. And as always, you can also go to lastweekin.ai for the text newsletter.
with even more news and also the links for the episode. I am one of your hosts, Andrey Kurenkov. I studied AI in grad school and now work in the startup world.
And I'm your other host, Jeremy Harris, co-founder of Gladstone AI, an AI national security company. And I don't have my usual recording rig. I, anyway, left it at the other location that I tend to record from. And anyway, we were just talking about earlier as well, how Andrey is, it's like 6.45am where he is. He got up at six or before 6.30am, which is our scheduled recording time. I had baby things that took over for like 15 minutes. And he's like, I don't know, I don't know.
And so he is getting up bright and early. So hats off to you, Andrey, for this extra bit of commitment that you're showing to our listeners due to your love for the sport. So thank you. Yeah, that's my excuse in case I make any mistakes or say anything wrong this episode. But I think the last few weeks, it seems like you've had a rough time of it in terms of dealing with all the news coming out of
the administration. It seems like you've had a really busy time at work. So no, I was much worse. No, I mean, it's been just busy with travel and stuff. So looking forward to it maybe slowing down a bit. But yeah, this morning it was more so the baby stuff. I got handed a baby because my wife needed to do something. And anyway, she's normally so great at
at supporting me and all these things. I felt it was the bare minimum I could do. But now the baby is handed off. She will not be making a cameo this episode. But who knows? Someday. Someday.
Well, let us preview what we'll be talking about in this episode. We start this time with tools and apps and have sort of a follow-up to DeepSeek R1. You could say a lot of companies are jumping on the thinking AI bandwagon. And there's also some other non-LLM news, which is going to be refreshing. A lot of funding and hardware in the
business stories. We do have some interesting open source releases that maybe have gone a bit under the radar. Then quite a few papers this episode, which will be exciting. Some on reasoning, some on scaling, that sort of thing. And then in policy and safety, actually also a couple of papers and more technical approaches to alignment. So yeah,
Yeah, kind of not a super heavy episode in terms of big, big stories, but a good variety of things. And I guess nothing as deep as DeepSeek R1 from last week where we spent like half an hour on it. But before we get there, do want to acknowledge some listener comments and reviews. We had a very useful one, good feedback over at Apple Podcasts.
Great content, good host, horrendous music. And I think I've seen this take on and off for like a little while now. I've been doing this AI music thing where I take the usual song and make variations of it. Not everyone's a fan. Some people seem to like it. So I'll consider maybe rolling it back.
It also takes a while, surprisingly. It actually takes some work for me to do it. Oh, interesting. Maybe I'll just stop. Yeah. And then we did have one question on the Discord that I want to forward.
There was Mike C. asking that it seems like you made a comment a while ago where you didn't think RAG... Oh, Retrieval Augmented Generation, yeah. Retrieval Augmented Generation. And yeah, you maybe were not too big a fan of it. And so this comment is basically following up, seeming to ask what is the state of RAG? Is it going to be around?
Yeah, yeah. So I wouldn't say I'm not a fan of it. I am more that I view it as a transient. So yeah, it's not something that I think will be with us forever. So one of the things with RAG is that, and so I'm just flipping to the comment now, it says, I've certainly experienced diminished need for RAG with, yeah, with larger context windows with Gemini, but that doesn't address large data sets that might be hundreds of millions of tokens or more. And this is actually exactly the case that I've been making to myself, where
So, yeah, I expect that you will continue to see diminished need for RAG with larger context windows. And ultimately, that would include, we'd have to assume, large data sets, regardless of size. And so one of the really counterintuitive things about the pace of progress and the way that progress is happening in the space is it is exponential. And so...
Today, it might make sense to say, well, you know, hundreds of millions of tokens, surely that's inaccessible to the, you know, the context windows of non-RAG systems. And the only thing I would say to that is like, yeah, but just give it time, right? We're riding so many compounding exponentials right now that ultimately, yeah, expect everything to reduce to just zero.
you're querying within context. I should be clear. That's like one mode of interaction that I expect to be quite important, but you could still have specialized RAG systems that are just designed for low cost, right? Again, that's more of a transient thing. Like you can imagine a situation where like, sure, the full version of GPT-5 can just like, you know, one-shot queries just in context without any RAG stuff.
but maybe that costs a lot because you're using this model with a larger context window. It's having to actually use the full context window for its reasoning and all that. Whereas maybe you can have a smaller purpose-built system for RAG. That's already sort of the not microscopic or mesoscopic middle ground. That's sort of where we are right now in a lot of applications already. There are still other applications that are totally impossible. And so it's not just a price thing. They're totally impossible with just using what's in context. But
But anyway, that's kind of the way I see it. I mean, I think we're heading inexorably towards a world where the cost of compute and the size of the context windows kind of tend towards zero and infinity respectively, right? So cost of compute, if you're kind of way over, if it's way overpriced for your application right now, wait a couple of years. If the context window is way too small for your application right now, wait a couple of years. Both those problems get solved with scale. That's kind of my prediction. Yeah.
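To make the trade-off being discussed concrete, here is a minimal sketch contrasting the two modes; the names here (embed, vector_index, llm) are hypothetical stand-ins, not any particular library's API.

```python
# Minimal illustrative sketch (not a real library): contrasting a RAG pipeline with
# plain long-context prompting. embed, vector_index, and llm are hypothetical stand-ins.

def rag_answer(question, embed, vector_index, llm, k=5):
    # The "finicky" path: embed the query, retrieve top-k chunks, prompt with just those.
    query_vec = embed(question)                        # choice of embedding model matters
    chunks = vector_index.search(query_vec, top_k=k)   # choice of chunking/indexing matters
    return llm("Context:\n" + "\n\n".join(chunks) + f"\n\nQuestion: {question}")

def long_context_answer(question, documents, llm):
    # The "just throw it all in" path: no retrieval step, pay for a huge prompt instead.
    return llm("Context:\n" + "\n\n".join(documents) + f"\n\nQuestion: {question}")
```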
Yeah, so at the limit, it seems like you might as well just throw it all into the input and not worry about RAG. One of the problems with RAG and the reason you might see it go away is that there's these
finicky extra things to worry about, like how do you embed your query, how do you create your data set, and so on and so on. So if you can just put it all in input context and the LLM just does well, then why even bother, aside from engineering considerations. So yeah, in that sense, it's a fair comment. Certainly it'll become less important for smaller data sets, and even beyond that.
It used to be you couldn't do a million token input and expect it to be useful. Now, at least in some cases, you certainly can. All righty. Well, thank you for the question. I'd love to go into some of our discussion. And now we shall move to the news, starting with tools and apps. And we begin with O3 Mini. So just after our last recorded episode at the end of January,
OpenAI did announce the release of O3 Mini, which is the latest entry in their reasoning LLM series, starting with O1. I don't remember, O2 was trademarked or something, so they skipped to O3. And now we have also O3 Mini, which does really well at
smaller scales, smaller costs, faster inference, all of that sort of thing. It's actually, to me, quite impressive the kind of jump in performance we've seen. So on things like FrontierMath that we've covered, which is a very, very challenging math benchmark, introduced with actual mathematicians creating these problems, whose answers aren't published anywhere.
O3 Mini is outperforming O1 and is able to answer roughly 9% of the questions in one try and one fifth of them in eight. So still, you know, a ton of repetitions can crack some tough problems for it. But the reasoning models are improving fast. It used to be you could only get like 6% or 12% with O1.
And then on the app front, everyone can start using it. That was kind of a wide release. We did see a preview earlier. Mimicking R1 to some extent, OpenAI did give a more detailed peek into the reasoning process. So as you do the query, now you'll see
the longer, more detailed summaries of intermediate steps that it takes, which certainly you could argue is in part in response to R1 and how that displays reasoning for you.
Yeah, and I think we've talked about O3 in terms of some of the benchmarks that were published early before the model was released. You can go back and check out that episode. But one of the things we will go into a little bit today, there's a paper with a kind of new benchmark probing at just the pure reasoning side of things. So trying to abstract away the world knowledge piece from the reasoning piece, which is really hard to do.
And just look at like, you know, how well can this thing solve puzzles that don't require it to know necessarily, you know, what the capital of Mongolia is, that sort of thing. And on that, one of the interesting things is they got some indications that maybe O3 Mini is pretty well on par with O1 and maybe a little bit behind O1 in terms of
general knowledge reasoning, which was kind of interesting. But in any case, we'll dive into that a little bit. Certainly it's a big, big advance. And of course, that is the mini version relative to the O1, the full O1. So that's maybe not so surprising in that sense. Really, really powerful model, really impressive and very quick and cheap for the amount of intelligence that it packs. We are getting, as you said, like this announcement that, hey, we're going to let you see into what they call O3 Mini's thought process.
This is a reversal, right, of OpenAI's previous position. So famously, when O1 was first launched, you couldn't see the full chain of thought; you were seeing some kind of distillation of it, but a very short one. Now OpenAI is coming out and saying that they've, quote, found a balance,
and that O3 Mini can think freely and then organize its thoughts into more detailed summaries than they'd been showing before. You might recall the reason that they weren't sharing those summaries previously, they cited part of it at least was competitive reasons, right? Presumably, they don't want people distilling their own models on the chains of thought that O1 was generating. Now, maybe taking a bit of a step back, but not completely. What they are saying is, look, the full chain of thought is still going to be obfuscated.
That is, by the way, really difficult to understand. So a lot of the writing there is garbled, maybe a bit incoherent to humans, which we talked about, I think, on a previous episode, along with the safety risks for a model like that, where increasingly, yeah, you should expect it to be reasoning in kind of opaque, non-human-interpretable ways, as we saw with R1-Zero.
That's sort of a convergent thing we should expect. But in any case, they're generating summaries on top of that. Those summaries are just longer than the summaries that they'd previously been producing with O1. So presumably this means that to some degree they're willing to relax a little bit on sharing those chains of thought. They presumably think, you know, that there's
kind of a diminishing value in protecting those chains of thought from a replication and competition standpoint. And one other thing that this points to, and this is increasingly becoming clear, you could have fully open source frontier AI models today. And because of inference time compute, the win would still go to the companies that have the most inference time hardware to throw at the problem, right?
right? Like if you have a copy of R1 and I have a copy of R1, but you have 10 times more inference flops to throw at it because you own way more data centers, way more hardware, well, you're going to be able to get way better performance out of your model. And so there's a sense in which even if you open source these models, the competitiveness is now rooted more than ever in AI hardware infrastructure. And
And that's kind of, I think what's actually behind a lot of Sam Altman's seemingly kind of highly generous statements about we're on the wrong side of history with open source. It's like, it's awfully convenient that those are coming just at the moment where we're learning that inference time compute means, you know, it's not like you have a gun and I have a gun. If I have a bigger data center, my gun is just bigger. So I can open source that model and I don't give away my entire advantage. I don't want to,
like kind of lean into that too hard. There is absolutely value in the algorithmic insights. There's value in the models themselves, huge value, really. But it's just that we have a bit of a rebalancing in the strategic landscape. It's not decisive. It's still absolutely critical from a national security standpoint that we lock these models down, that the CCP not get access to them, blah, blah, blah. But there is this kind of structural infrastructure component where, you know, if I can't steal your data center, but I can steal your model, I
haven't stolen your whole advantage in the way that might have been the case looking back at, you know, GPT-4o or, you know, the days of just the pre-training paradigm. So kind of interesting and maybe reflected somewhat in this announcement. Yeah, that's an interesting point. We've seen, I guess, a shift now, right? Where going back to ChatGPT when it first hit the scene, a lot of these companies, Meta and Google, were kind of caught a bit off
guard, right, not prepared for this era of trying to train these models, first of all, and also compete on serving them.
And it seems like now they've started to catch up, or at least the competition is now definitely on who can lead the pack in terms of infrastructure. On the model side, yeah, OpenAI and Anthropic are still the best ones. But again, the actual differences between the models, in many cases, on many problems, are not that significant. And the difference now, as you said, is with the reasoning models, where
being able to just throw raw compute at it is a big part of it. So O3 Mini, I would say, is pretty exciting for ChatGPT users. Competition as ever is good for the consumer. So that's nice. As ChatGPT users, I'm excited.
And the next story also about a reasoning model, but this one from Google. And so this is part of their Gemini 2.0 rollout. There's been several models. We have Gemini 2 Flash, which is their smaller, faster model. They also have Gemini 2.0 Pro, which is
their big model, which actually isn't that impressive, surprisingly, not kind of as performant or as big a splash as you might think. And then they also have Gemini 2.0 Flash Thinking as part of the rollout. So that is their answer to O1. Seemingly it's the one that's better at reasoning, better at the kind of tasks you might throw at O1.
From what I've seen, I haven't seen too many kind of evaluations. I don't see it competing too much in kind of the landscape of this whole, you know, we have a really smart LLM that can do intermediate thinking, but...
Yeah, a whole bunch of models, including also 2.0 Flash Lite from Google. So they're certainly rolling out quite a few models for developers and for users of Gemini. Yeah, this is really consistent with...
Just at a macro level, what's happening with reasoning models, right? They're so inference heavy that it just makes so much more economic sense to focus on making a model that's cheap to inference, right? So you see this emphasis on the Flash models, the Flash-Lite models. So, you know, Flash was supposed to already be the kind of the quick, cheap model. Now there's Flash-Lite, so extra cheap, extra, extra quick.
If you're taking one model and you're going to be running inference on it a lot, if it's a big model, it's going to cost you a lot of money. That's just like the larger, you know, the number of parameters, the more compute has to go into inferencing. And that's where you're also starting to see a lot of heavy duty overtraining, right? So
By overtraining, what we generally mean is for a given model size, there's a certain amount of compute that is optimal to invest in the training of that model to get it to perform optimally, right? Now, if you make that model smaller, the amount of compute and the amount of data that you should train it on, according to the scaling laws, to get the best performance for your buck is going to drop, right? Because anyway, those three things are typically correlated. But
If you know that you're going to be inferencing the crap out of that model, you might actually be fine with making it extra small and then over-training it, training it on more data with more compute than you otherwise would. It's not going to perform as well as it would have if the model were bigger with more parameters, given the compute budget.
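To put rough numbers on that, here is a back-of-envelope sketch using the commonly cited Chinchilla rules of thumb (training compute of roughly 6 x parameters x tokens, around 20 tokens per parameter for compute-optimal training, and about 2 x parameters FLOPs per generated token at inference); the model sizes and token counts below are purely illustrative, not any lab's actual recipe.

```python
# Back-of-envelope sketch of "overtraining": illustrative numbers only.
# Rules of thumb: compute-optimal tokens ~= 20 * params, training FLOPs ~= 6 * params * tokens,
# inference FLOPs per generated token ~= 2 * params.

def training_flops(params, tokens):
    return 6 * params * tokens

def inference_flops_per_token(params):
    return 2 * params

small, big = 3e9, 30e9                  # hypothetical 3B "overtrained" model vs 30B model
chinchilla_tokens = 20 * small          # ~60B tokens would be compute-optimal for the 3B model
overtrained_tokens = 3e12               # but you might train it on trillions of tokens anyway

print(f"3B compute-optimal training: {training_flops(small, chinchilla_tokens):.2e} FLOPs")
print(f"3B overtrained training:     {training_flops(small, overtrained_tokens):.2e} FLOPs")
# The payoff is at serving time: every generated token is ~10x cheaper on the 3B model.
print(f"Per-token inference: 3B={inference_flops_per_token(small):.1e}, 30B={inference_flops_per_token(big):.1e}")
```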
But that's not the point. The point is to make a really compact, overtrained model that's really cheap to run, but has as much intelligence as you can pack into a small number of parameters. That's really what you're starting to see with, you know, Flash-Lite and so on, just because of the insane inferencing workloads that these things are being called upon to run. And so part of Google diving into this, obviously, Google's also had the release of their deep research model, too, which came out the same day as OpenAI's Deep Research, you know, we'll talk about that. But
It's sort of interesting. Google's actually had some pretty decent models. And I will say I use Google Gemini quite a bit, Gemini 2.0.
partly because, pro tip by the way, if you're looking to write anything that you're going to get paid for or that you need to be able to use professionally with a permissive license, Gemini 2.0, as far as I know, is the only model I was able to find where you have licensing terms that allow you to do that, to just use the outputs commercially without attribution and so on and so forth. And so you're sort of forced into it. So even though I find Claude works best for that sort of
that sort of work, Gemini 2.0 has done the trick and improved a lot. So Google has great products in this direction. The interesting challenge is people don't seem to know about them, right? OpenAI has the far kind of greater advantage on the splashiness. And then when you get into the niche area of like, you know, people who really know their LLMs, they tend to gravitate towards Claude for a lot of things. And so, yeah, they're sort of between a rock and a hard place. And I don't know if the solution is quite marketing, but I
I don't know. I've been surprised at sort of how little the Gemini 2.0 model, at least I've seen used, given how useful I've found it to be.
Yeah, I think their current approach seems to be competing on the pricing front. They do seem quite a bit cheaper. And of course, they do have a bunch of infrastructure. I'm sure some people are using their cloud offerings that are in this sort of Google dev environment. So yeah, I think another example where, you
know, we hear about them less, or certainly they seem to be getting less credit as a competitor with Gemini. But they're continuing to kind of chip away at it. And from what I've seen, a lot of people who use Gemini 2.0 Flash are pretty positive on it. And moving on, as you mentioned, there was another kind of exciting thing that came out recently,
a bit, you know, not quite as big a deal as O1 and R1, but to me also very interesting. And that was Deep Research. So this is a new feature that came out in both Gemini and OpenAI's ChatGPT at about the same time, the idea being that you can enter a query and the
LLM will then take some time to compile the inputs to be able to produce a more detailed, more informed output. So in OpenAI, for example, you can input a query and then it can do its thing for like five to 30 minutes and then
You just have to leave it running in the background, so to speak, and then eventually it'll come back to you with quite long reports, pretty much, about your question. And so this is, yeah, it's a bit of a different paradigm of work, of
reasoning, of output, that we haven't seen up to now. And a lot of people seem to be saying that this is very significant or very impressive in terms of this being actually something new, something that agents are capable of doing that otherwise you would have had consultants or professionals doing for you.
Yeah, this is actually a pretty wild one. So I watched the live launch when it came out because OpenAI posted a tweet saying, hey, we're launching it right now and saw the demo. It is pretty wild. It's also the kind of thing that...
I've seen people talk about as like being that, you know, the demos kind of live up to the hype. So the idea here is because you're having the model take five to 30 minutes to do its work, you actually are going to step away. So you'll get a notification when the research is done. And then you end up, as you said, with this report. So it's kind of this new user experience, that problem they have to solve where, you know, you're just letting the thing run. And then like a microwave that's finished nuking the food, it'll, you know, kind of let you know.
It is pretty impressive. So just to give you some examples of the kinds of research queries that you might ask, these are taken from OpenAI's Introducing Deep Research page. But these are like potentially highly technical things. So for example, help me find iOS and Android adoption rates for
The percent of people who want to learn another language and change in mobile penetration over the last 10 years for top 10 developed and top 10 developing countries by GDP. Lay this info out in a table and separate stats into columns and include recommendations on markets to target for a new iOS translation app from ChatGPT focusing on markets ChatGPT is currently active in.
So one thing to notice about at least that query is you're looking at something that is pretty detailed. You yourself have a general sense of how you want to tackle the problem. So it's not
fully open-ended in that sense. But this does reflect the way that a lot of people operate in their jobs, right? You already have the full context you need, and if you could hand it off to a very competent graduate student or intern or somebody with like a year or two of experience and have it executed, this is the sort of thing you might do, right? I mean, you know the problem, you don't know the solution. And
Anyway, it's quite impressive. They give examples from medical research queries, UX design, shopping and general knowledge. The question is as simple as what's the average retirement age for NFL kickers, right? Which you wouldn't necessarily expect that number as such to appear anywhere on the internet. It does require a couple of independent searches and the aggregation of data and it's kind of processing a little bit. So that's pretty cool. And I'll just read out the answer to that question too, because it is pretty indicative of the level of detail you get here.
So it says, you know, determining the exact retirement age for NFL kickers is challenging, blah, blah, blah, blah, blah. However, kickers generally enjoy longer careers compared to other positions in the NFL. The average career length for kickers and punters is approximately 4.87 years, which is notably higher than the league wide average of 3.3 years. So it's doing some ancillary work there, right? That's a little bit off target to give you context. There's a lot going on here.
So anyway, one of the last, actually the last thing I'll just mention here, other than to say this is a really qualitatively impressive thing: quantitatively, on Humanity's Last Exam, the sort of like famous benchmark that Dan Hendrycks has put out recently. I think it's part of CAIS, the Center for AI Safety, maybe, or anyway, it's something that Dan Hendrycks is working on. So this is looking at essentially
a very wide range of expert level questions. You know, we talked about this benchmark previously, but essentially it's meant to be really, really fucking hard, right? Think of this as like GPQA on steroids. In fact, what you do see with older models, you know, GPT-4o just scores 3% on it. Grok 2 scores 4%, Claude 3.5 Sonnet scores 4% and so on.
O1 scored 9.1%. So that had people going like, okay, you know, maybe we're getting liftoff. OpenAI Deep Research: 26.6%. I mean, it is genuinely getting hard to come up with benchmarks right now that these models don't just shatter right out of the box. The half-life of benchmarks in this space is getting shorter and shorter and shorter. And I think that's just a reflection of, you know, how far along we are on the path to AGI and superintelligence. And so, you know, you expect this kind of hyperbolic progress.
It's just interesting to play it out on the level of these quantitative benchmarks. Right. So, yeah, a lot going on with this. Maybe once you peel back the layers, I think it's interesting that this is, I would say, the first demonstration of agents actually being an important thing. So the idea of agents, of course, is...
basically that you, like, tell it to do a thing and then it does it for you on its own, sort of; you don't need to tell it to do every single step. And then it comes back to you with a solution, right? And so with O1, et cetera, right, that was kind of thinking through it and arguably doing kind of a series of steps. But here,
It's browsing the web, it's looking at websites, and then in response to whatever it's seeing, it can then go off and do other searches and find other information. So it very much is kind of getting this agent to go and do its own thing and then eventually come back to you. And so, you know, in the past, we've seen a lot of kind of
examples of, hey, book a ticket for me to go to X and Y and Z, which never seemed that promising. This actually is showing a very impressive demonstration of how agents could be a game changer. Unfortunately, you do have to be paying for the $200 a month tier of ChatGPT to be able to try it. Not many people, I guess, are going to be able to. And
Just to compare between ChatGPT and Gemini, my impression is ChatGPT goes a bit deeper, does get you kind of a more well-researched and more thorough response. But I've also seen some co-workers of mine that have used it have found that Gemini Deep Research is able to answer questions quite well, better than things like web search.
Yeah, it is pretty remarkable. It also invites us to think about new kinds of scaling curves. There is one I think that is worth calling out. So they look at max tool calls, so the maximum number of tool calls that they have the model perform as it's doing its research, versus the pass rate on the tasks that it's working on. And
What's really interesting is you see a kind of S curve. So early on, if it does relatively few tool calls, performance is really bad. Its performance starts to improve quite quickly as you increase from there the maximum number of tool calls, so calls to different APIs. But then it starts to saturate and flatten out towards the end, right? So as you get around to like 80 or 100 max tool calls, no longer a very steep improvement. And this itself is kind of like
I mean, assuming that the problems are solvable, this is the curve to watch, or at least a curve to watch, right? The more the model's calling its tools, essentially the more inference time compute it's applying, the more times it's going through this thought process of like, okay, what's the next step? What tool do I now need to use to kind of get the next piece of information I need to move along in this problem solving process? And...
You know, you can imagine a key measurement of progress towards agentic AI systems would be how long can you keep that curve steepening? Can you keep that curve going up before it starts to plateau? That really is going to be directly correlated with the length of the tasks that you can have these models go out and solve, right? So they talk here about five to 30 minute tasks.
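A toy sketch of the knob being described, an agent loop with a hard cap on tool calls; the structure and names here (llm_decide_next_step, tools, the action fields) are hypothetical, not OpenAI's implementation.

```python
# Toy sketch of an agent loop with a cap on tool calls. Everything here is hypothetical:
# llm_decide_next_step is assumed to return a dict describing the next action.

def run_research_agent(task, tools, llm_decide_next_step, max_tool_calls=100):
    notes = []
    for _ in range(max_tool_calls):
        action = llm_decide_next_step(task, notes)   # model picks its next step from the notes so far
        if action["type"] == "finish":
            return action["report"]                  # enough evidence gathered; write the report
        result = tools[action["tool"]](**action["args"])
        notes.append(result)                         # each tool call adds a bit more evidence
    # Budget exhausted: roughly where the pass-rate curve flattens out.
    return llm_decide_next_step(task, notes, force_finish=True)["report"]
```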
What's the next beat? How do you get these systems to go off and think for two hours, five hours? And I think at that point, you're already really getting into the territory where this is absolutely already accelerating AI research at OpenAI itself. I guarantee you these kinds of systems are now being used to accelerate, and better versions of these systems, with fewer caps on inference time compute obviously, are being used to accelerate, their internal research. It's something that we've absolutely heard, and it makes all the sense in the world economically today.
So this could yield compounding returns surprisingly fast. This is, I think, behind as well a lot of Sam Altman statements about, you know, I think fast takeoff is actually more likely than I once thought and so on and so forth. So, you know, this is quite interesting. It's got a lot of implications for national security. It's got a lot of implications for the ability to control these systems, which is dubious at best right now.
And so we'll see where these curves go, but I think it's definitely a kind of KPI that people need to be tracking very closely. It's got a lot of implications. Well, you've talked about OpenAI, you've talked about Google, let's go to some other companies, starting with Mistral, the... Maldi? Maldi.
The LLM developer out of Europe, you can say, and one that is trying to compete at least on the frontier model training front with models like Mistral Large. And as of a few months ago, they're also throwing their hat into the product kind of consumer arena with Le Chat. And then they have a separate thing for assistants.
Anyway, they have released a mobile app now on iOS and Android, and also introduced a paid tier at $15 a month. So they're continuing to push into trying to get people to, I guess, adopt Le Chat as an alternative to ChatGPT. As we've kind of said in the past when discussing it, I'm not sure how
easy it'll be for them to compete on this front. And it's interesting that they're trying to fight in this area that's very much like pricing and speed and so on that Anthropic and OpenAI have, let's say, some advantages in. Yeah, I'll kind of pre-register my standard and by now very tired prediction that Mistral is going to be in as much trouble as Cohere. These are like mid-cap companies that are going to go ultimately, I think, to the...
They may yet prove me wrong. And I may sound very stupid in retrospect. Yeah, I mean, look, these are mid-cap companies that are competing with the big boys, right? Companies that are raising like tens of billions of dollars
for AI infrastructure and that have brand recognition coming up their butts, right? And so the advantage here is if you have really, really good brand recognition, if you are just the default choice for more people, then you get to amortize the cost of inference across a much larger fleet of AI hardware. It lets you build at scale in a way that you simply can't if you're Mistral or any other company.
And that just means that competing with you on inference costs is a bad idea. And this is doubly true for Mistral because so many of their products are open source, which means they're literally just pricing based on the convenience that you gain by not having to deploy the model yourself. So their margins come entirely from how much easier it is for them to just run the hardware rather than you. That turns them basically into an engineering company.
It makes it really difficult to develop the margins you need to reinvest. Now, with Mistral, they are kind of French national champions. So there's maybe some amount of subsidy that's up there for them to grab. It's not going to be that much. And France doesn't have the resources to throw at this that the capital base of the United States has. So I think ultimately this ends in tears. But in the meantime, some VC dollars is going to be lit on fire. And I think there are going to be some people who are really excited about the implications for kind of having another capital
competitor in the race. But I think this is, again, it's kind of like Cohere: you have startups that sound like great ideas and are easy to get funded because of the pedigree of the founders. But when you actually step back and look at kind of the fundamentals, the infrastructure race, that's really underpinning all this. Again, there's a reason that Sam Altman started to talk about like, hey, we're on the wrong side of history with open source. It's not because he thinks companies like Mistral have some sort of advantage. It's because he thinks he has an advantage with his giant fleet of inference-time compute. So
Yeah, I mean, we'll see. I mean, I fully expect to sound like an idiot for any number of reasons, but I do like to make these predictions explicitly because at the very least, it keeps me honest if I end up being wrong. A similar development in a long line of these developments from Mistral, at least from where I'm standing. Right. We should acknowledge they do have one differentiator they are highlighting, which is the ability to get
really fast inference. So they're claiming they can get 1,100 tokens per second, which is roughly 10 times faster than Sonnet 3.5, GPT-4o, R1, etc. Apparently, they partnered with Cerebras for some cutting-edge hardware specifically for this very, very fast inference. The article actually says words per second, so I'm not clear if it's tokens or words, but regardless,
That could be, if you need that, could be one reason you might adopt Mistral. Yeah, it is unclear to me how lasting of an advantage that is, given that if you look over at OpenAI, for example, literally has partnerships going to build out custom ASICs for their systems, right? So like, you know, expect any advantages like these to wear out pretty quickly, but still,
It'll be cool to have at the very least lightning fast inference. It's not the first time we've seen numbers like this as well, right? The Groq chips running similar models have shown similar results. There's always controversy around how much that actually means given, anyway, details like the number of queries you can serve simultaneously given the way these chips are set up and so on. There's always asterisks. You never get
a giant leap like this quite for free. It'll be interesting, and hopefully for them, they prove me wrong and I'm eating my words in a few weeks. Moving away from LLMs, a couple of stories on other types of AI, starting with AI music generation and the startup Riffusion, which is now
entering that realm. They have launched their service in public beta and, similar to Udio and Suno, they allow you to generate full-length songs from text prompts as well as audio and visual prompts.
So interesting to see another competitor in this space, because Suno and Udio, to my knowledge, have been the two players pretty much, and have made it so you can get very close to sort of human-indistinguishable music generation, at least if you kind of use it a bit. So Riffusion is offering something
along those lines. They do say that they are collaborating with human artists through this trusted artist agreement, which is allowing early access to artists. So another entry in the AI song generation space, I guess.
Yeah. So it seems like this trusted artist agreement is kind of one of the most interesting parts of this, right? I mean, what precedent are we setting for, for the exchange of value here, right? When you're setting this up, a big challenge too, is the vast majority of highly talented artists just aren't discovered. And so they make next to no money, which means they're very easy to pick off. If you're an AI company looking for people to, you know, for even a
modest amount of money help you to support the development of your system. So you don't necessarily have to have a deal with Taylor Swift to train your model to produce really, really good quality music, I guess. So the deal here apparently is the Trusted Artist Agreement gives
artists early access to new features and products in return for their feedback. So unclear to me, like, how much value there is there. It's great to have more tools in your arsenal, but certainly, from a kind of tragedy of the commons standpoint, basically, you're helping to automate away this whole space. So it's kind of an interesting trade-off.
Right. And I will say, you know, we won't get into a whole new story about it, but I have seen some coverage of Spotify apparently starting to fill out some of their popular playlists like lo-fi, chill, hip hop, etc. with seemingly some generated music not coming from human artists and
as a result, human artists are starting to lose some money. So it seems like AI music generation hasn't had a huge impact on the industry yet, but it also seems like it is definitely coming, given that you can now do some very high-quality generations. And last story, going to video generation and Pika Labs: they have now introduced a new fun feature called
Pikadditions, I guess that's how you call it. So this is coming within the Pika turbo model, and it allows you to insert objects from images into videos. So they had some fun examples of it where you might have a video of you just doing some normal stuff, and then you can insert some animal or some other actor or whatever,
insert a mascot, and it looks pretty, you know, yeah, at least in some cases, realistic, and makes it much easier to alter your video in some interesting ways.
Yeah, this is the kind of thing I could imagine being quite useful for, you know, like making commercials and stuff like that, because some of the demos that they have are really impressive. Obviously demos are demos and all that, but we're certainly seeing things that approach anything we've seen with Sora. So yeah, really cool. And Pika Labs seems to keep pumping out relevant stuff despite all the competition in the space. So kind of cool.
Yeah, I think one of the big questions with video generation has always been how do you actually make it useful in practice, right? Sure, you can get a short clip from some text description, but is that something people need? And Pika is introducing more and more of these different ways to use video generation that aren't just generate a clip.
And examples like this where you are taking a clip you have and then essentially doing some VFX additions to it as you would with CGI previously. Personally, I think that is a much more promising approach
way to commercialize video generation. Basically, cheaper, easier VFX and CGI. Well, not, I guess it's computer generated. Sorry, you could call it CGI. But yeah, the clips are fun, so you should check them out.
And moving on to applications and business, we begin yet again with OpenAI and OpenAI fundraising news. And it's about how SoftBank wants to give OpenAI even more money. So SoftBank is saying they will, or at least seeming like they're planning to, maybe invest
$40 billion in OpenAI over the next couple of years. That's at a very high valuation of $260 billion pre-money. This is also in the midst of an agreement to bring OpenAI tech to Japan, which is where SoftBank came out of. So...
Yes, I guess SoftBank is now the biggest backer of OpenAI. They're, of course, one of the players in the Stargate venture, and they seem to really be banking on OpenAI to continue being a top player.
Yeah, for sure. And I think, you know, yeah, the article says they're the biggest backer. It's a little unclear. They must mean in dollar terms and not equity terms, because Microsoft famously owns around 49 percent of OpenAI, or had.
So there's no way that the fundraisers that they've put in so far, the dollars they put in so far add up to more than that equity. But on a dollar denominated basis, certainly that has to be the case, right? If they're putting in 40 billion here. So yeah, really interesting. SoftBank, Masayoshi Son in particular seems to be a really big Sam Altman fan, which is also interesting because he's not particularly dialed in on the
even the most basic kind of like AI control issues. There's a
a panel he was on with Sam Altman where he was like, he turns to Sam at one point and he's like, yeah. So like, obviously this like a concern over losing control of these systems makes no sense because we're made of protein. And like, why would an AI system ever want to like, you know, they're, they're not made of protein. They don't eat protein. So, and Sam Altman's like forced to sheepishly look back at him. And obviously he doesn't want to, he doesn't want to contradict him in the moment because he's raising $40 billion from this guy.
But he also knows that the entire world, the technical field at least, knows that the real answer is a lot more nuanced than that. But that was very kind of clarifying and illustrative, I think, for a lot of people, that there's some just embarrassingly fundamental things that SoftBank is missing out on here. It's also worth flagging that SoftBank is not just a kind of Japanese fund. So there's a lot of sovereign wealth money there, in particular Saudi money
that makes up, it's hard to know, like, I don't know the ratio off the top of my head, but it could easily make up the lion's share, as we've talked about, of the funds that they put in. So it's a very weird fund and an interesting choice for Sam Altman to partner so deeply with somebody who misses a lot of the technical fundamentals behind the technology. Not hard to get the big scaling story, of course, and backing OpenAI is an obviously good move if you believe in that scaling story, and I certainly do.
But it's an interesting choice, especially on the back of the partnerships with Satya and Microsoft, which are very, very kind of knowledgeable, technically knowledgeable and technically capable investors. So that's very interesting. It's possibly you could view it as Sam starting to build optionality. This moves him away right from Microsoft, from dependency on them.
and gives him two people now to sort of play off each other between Satya and Masayoshi-san, and so a little bit more leverage for him. We do know that a tranche of this $40 billion investment is being filed under the Stargate investment. And so in a sense, it's kind of like there's $15 billion to Stargate and then maybe the balance over to OpenAI itself. A little unclear how that shakes out. But the one last thing to flag here too is,
at this point, OpenAI is raising funds from sovereign wealth funds, right? That's what this is. And, you know, for the reasons we talked about, that is the last stage. There's no more like
giant pot of money waiting for you. If you're trying to fundraise and remain a privately held company, the sovereign wealth funds are the last stop on the giant kind of money train when you're raising the tens of billions of dollars. And so what this tells us is that either OpenAI expects to go public and be able to raise somehow even more money, or
they expect that they're going to start to generate a positive ROI from their research or hit superintelligence fairly soon. Like this is very consistent with a short-timelines view of the world, because, again, there's nothing more to be done. This is it, right? If you're raising from sovereign wealth funds, maybe they can go directly to, you know, Saudi Arabia, the UAE or something else, but ultimately you're kind of tapped out.
And this tells us, I think, a lot about AI timelines in a way that I'm not sure the kind of ecosystem has fully processed. So interesting for a lot of different reasons. And definitely, again, gives Sam another arrow in his quiver in terms of the relationship with Satya and how to manage that.
Just a few more details on this business story, I guess, of SoftBank. There are some additional aspects to it. They are saying they'll develop something called Crystal Intelligence together, SoftBank and OpenAI partnering for it. Kind of ambiguous what it actually is; the general concept pitch is it's customized AI for enterprise companies. And that's kind of all there is in the announcement for it. There's also another aspect to this, which is SB OpenAI Japan, which will be half owned by OpenAI and half owned by SoftBank. SoftBank is also going to be paying $3 billion annually to deploy OpenAI solutions across its companies, and that's separate from the investment. So a lot of initiatives going on here. SoftBank, I guess, really banking on OpenAI and them working together on various ventures.
Next up, yet again, talking about data centers, this time in France, with the UAE planning to invest billions of euros, actually tens of billions of euros, on building a massive AI data center in France. There's an agreement that was signed between France's Minister for Europe and Foreign Affairs,
and the CEO of Mubadala Investment Company. So yeah, it's already planned for and I think we haven't covered too many stories along the front. We've seen a lot of movements across the US of companies trying to make these massive data center moves. And now it seems to be starting to happen in Europe as well.
Yeah, this is interesting, right? And it is a relevant scale. So we're talking here about one gigawatt, or up to one gigawatt, of capacity. They anticipate the spend to be about 30 to 50 billion euros, which tracks. Just for context, though, on what one gigawatt is:
you know, the big Amazon data center that was announced a few months ago is like 960 megawatts. So basically already at that scale. Meta is dipping into one and two gigawatt scale sites. You have likewise, you know, OpenAI with plans for multiple gigawatt level sites. And that's all for like the 2027 era.
So this is going to be something, but you have individual hyperscalers in just the United States tapping into multiple of that scale, like low multiples, but one to two gigawatts, say. So it's important. It's the bare minimum, I would say, of what France would need to remain competitive or relevant here. But not clear what that really buys you in the long run, unless you can continue to attract 10x that scale of hyperscalers.
down the line for the next beat of scale, at least five or two X or something, but that's it. And one thing to flag too is this is actually coming in from, because it's the UAE,
You better believe Sheikh Tahnoun bin Zayed Al Nahyan is going to be part of this. So he is the head of G42. We've talked about them an awful lot. They're also kind of, basically, MGX is G42 in a trench coat. MGX is the fund that invested in Stargate. So this is really this guy, sort of like a big national security figure in the UAE, I think the national security advisor to their leader. That's
That's certainly his role. And these investments are happening all over the West, not just the United States now. And these are big dollar figures. So kind of interesting, again, remains to be seen how relevant this will actually be. But we are being told that it is the largest cluster in Europe dedicated to AI. So that's something. But it is, again, it's Europe. Europe struggles with kind of like R&D and CapEx spend. And so it's interesting to see them keep up at least at the one gigawatt scale.
And speaking of big money, the next story is about another fundraise from, let's say, an OpenAI-affiliated person, formerly the chief scientist, the former chief scientist Ilya Sutskever. We saw him leave and
start Safe Superintelligence last year. They had raised one billion dollars at the time with kind of nothing out there, no products as such, and now they are in talks to raise more money at four times the valuation. That's all we know. So no stories as to, I guess,
any work going on at SSI, but somehow it seems they are set to get more funding. Yeah, this is more of a reminder here, but they have some wicked good investors, right? Sequoia and Andreessen, like pretty much top of the line.
And then Daniel Gross, anyway, famous for doing AI stuff at Apple and then being a partner at Y Combinator. There's a lot of secrecy around this one. I've got to say, it sort of reminds me of Mira Murati's startup. And like, you know, there are a couple of these where we have no clue really what the direction is.
So I'm curious to see. A straight shot to superintelligence with no products is an interesting challenge to sell, but it's at least plausible given the state of the scaling curves. Next up, covering some hardware, we have ASML set to ship its first second-gen high-NA EUV machines in the coming months. And Jeremy, I'll just let you take over on that one.
Oh, yeah. I mean, so we talked about this in the hardware episode, but you've got so at first there was the DUV, deep ultraviolet lithography machines. These are the machines that generate and kind of shine and collimate the laser beams that ultimately get shot onto wafers during the semiconductor fabrication process.
These beams etch, or don't etch themselves, but these beams essentially shine and lock in the pattern of the chip onto that substrate. And they're super, super expensive. Deep UV lithography machines are the machines that China currently can access. It allows you to get down to around seven nanometers, maybe five nanometers of effective resolution, let's say. EUV machines, the next generation after that. The next generation after EUV, though, is high numerical aperture machines.
So basically, these are EUV machines with bigger lenses. And you might think that that doesn't do much like, oh, big deal, there's a bigger lens. But actually, when you increase the size of a lens in one of these machines, they are so crazy optimized in terms of space allocation and orientation geometry that you fuck a ton of things up. And so anyway, these are super expensive machines.
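A quick back-of-envelope way to see why the bigger lens matters: printable feature size in optical lithography scales roughly with the Rayleigh criterion, so at the same 13.5 nm EUV wavelength, going from today's 0.33 NA optics to 0.55 NA high-NA optics shrinks the minimum feature size by roughly 40% (ignoring k1 process tricks like multi-patterning).

```latex
CD \approx k_1 \, \frac{\lambda}{\mathrm{NA}},
\qquad \lambda_{\mathrm{EUV}} = 13.5\,\mathrm{nm},
\qquad \frac{CD_{0.55\,\mathrm{NA}}}{CD_{0.33\,\mathrm{NA}}} = \frac{0.33}{0.55} = 0.6
```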
Intel famously was the first company to buy essentially the entire stock that ASML planned to produce of high NA EUV machines. And it remains to be seen what they'll do with that now that they're sort of fumbling the ball a little bit on their fab work. But yeah, so apparently Intel will actually be receiving this first high NA EUV machine in the coming months.
and actually to be used in mass production only by TSMC in 2028. So there's this big lag between Intel and TSMC
This is quite defensible. Historically, we've seen companies like Samsung fall behind by generations in this tech because they moved too fast to adopt the next generation of photolithography. And so this is going to go right into Intel's 14A node, their 14-angstrom node, or if you're thinking in TSMC terms, I guess that would be 1.4 nanometers effectively.
That's the next beat for them. But anyway, so there's a lot of interesting technical detail. But bottom line is these things are now shipping. We'll start to see some early indications of whether they're working for Intel in the coming year or two. And final story, and it is
about a projection from Morgan Stanley, which was revising its projection of NVIDIA GB200 NVL72 shipments downward due to Microsoft seemingly planning to focus on efficiency and performance and a
lessening of capex. But it appears that Microsoft, Google, Meta, Tesla, etc. are still investing a lot in hardware, and so, despite, uh, kind of the projections, NVIDIA is still going strong. Yeah, I mean, pretty much just refer to a couple episodes back when
I don't know when DeepSeek, I guess, what was it? Not DeepSeek. Yeah, I guess it was R1 when people really started talking about this. I think we touched on it with V3. But when R1 came out and everyone was like, oh shit, like NVIDIA stock's gonna do really badly because now we found ways to reach the same level of intelligence using a 30th of the compute. And...
at the time, and repeatedly since, we have said over and over, this is not right. This is exactly backwards, right? What this really shows is, okay, well, suddenly NVIDIA GPUs can pump out 30 times more effective compute, at least at inference time, than we thought they could initially. That sounds not like a bearish case for NVIDIA. That sounds like a freaking bullish case for NVIDIA. And
I don't mean to throw shade on Morgan Stanley, full respect for the Morgan Stanley folks, but this was, I think, a pretty obvious call.
And we're seeing it play out now. So never bet against Jevons paradox in this space. A good model to have in your head is that there is a close to infinite market demand for intelligence. And so, you know, if you make a system that is more efficient at delivering intelligence, demand for that system will tend to go up. That's at least the way the frontier labs that are racing to superintelligence are thinking. And right now, they're the ones buying up all
the GPUs. So that's the way I've been thinking about it. And feel free to throw shade and cast doubts if you disagree.
And on to projects and open source, we begin with AI2 releasing Tulu, I don't know how to pronounce this, Tulu 3 405B. So this is a post-trained version of Llama 3.1 with a lot of enhancements for scalability and performance, and AI2 is saying that this is on par with DeepSeek V3 and GPT-4o,
And as usual with AI2, releasing it very openly and releasing a lot of details about it. So again, another demonstration that it seems more and more of a case that open source is going to be essentially on par or close to on par with frontier models, which hadn't been the case just until a few months ago, really, until now.
These 405B gigantic models started to come out.
One of the interesting kind of breakthroughs here is their reinforcement learning with verifiable rewards structure, the RLVR technique. It focuses on cases where you have verifiable outputs, where you can assess objectively whether they're correct or not. They kind of have a little feedback loop within their training loop that factors in that kind of base reality. And the other thing is scaling that technique up to this 405B size.
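A minimal sketch of what a verifiable reward means in practice; this is illustrative only (the extract_final_answer parser is a made-up stand-in), not AI2's actual RLVR implementation, which is documented in the Tulu 3 paper.

```python
# Illustrative sketch of a "verifiable reward": for tasks with a checkable answer
# (math problems, exact-match QA, code with tests), the reward comes from ground
# truth rather than from a learned reward model.

def extract_final_answer(model_output: str) -> str:
    # Naive hypothetical parser: take whatever follows the last "Answer:" marker.
    return model_output.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(model_output) == ground_truth else 0.0

# The scalar reward is then plugged into a standard RL update (e.g. PPO) on the policy model.
print(verifiable_reward("3 + 4 = 7, so Answer: 7", "7"))  # -> 1.0
```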
They say that they deployed the model using 16-way tensor parallelism. So like essentially chopping up the model, not just chopping it up at the level of like, say, transformer blocks or layers, but even chopping up individual layers and shipping those off to different GPUs. So there's a lot of scaling optimization going on here.
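And a tiny illustration of what "chopping up individual layers" means, column-wise tensor parallelism on a single linear layer; this is a generic toy example, not AI2's deployment code.

```python
import numpy as np

# Toy tensor parallelism on one linear layer (y = x @ W): the weight matrix is split
# column-wise across "devices", each device computes its slice, and the slices are
# concatenated (an all-gather collective in a real multi-GPU setup).

def tensor_parallel_linear(x, W, num_devices=16):
    shards = np.split(W, num_devices, axis=1)          # each device holds a column slice of W
    partials = [x @ shard for shard in shards]         # computed in parallel on separate GPUs
    return np.concatenate(partials, axis=-1)           # gather the pieces back together

x = np.random.randn(1, 4096)
W = np.random.randn(4096, 4096)
assert np.allclose(tensor_parallel_linear(x, W), x @ W)
```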
Yeah, it's obviously a very, very highly engineered model. And so yeah, pretty cool and a good one to dig into as well on the hardware side and kind of engineering optimization side.
Exactly. The report or paper they put out is titled Tulu 3: Pushing Frontiers in Open Language Model Post-Training. The initial version of it came out a couple of months ago and was just revised, and it goes into a lot of detail as to post-training, 50 pages worth of documentation. So in addition to having quite performant LLMs, we have more and more of these very, very detailed
reports on how the training happens. Like, the black magic, how the sausage gets made, is now very clear. So it's more and more the case that there aren't really any secrets, there's no secret sauce.
It used to be maybe there was a bit of secret sauce in how to get reasoning to work, but that is increasingly not the case also, as we'll get into. So open source, yeah, don't bet against it as far as developing good models. Yeah, I do think, like, there remains secret sauce. Well, there's all the secret sauce at the algorithmic efficiency level and other things, but I think this just, like, raises the floor, yeah, on where that secret sauce is.
Or, yeah, a lot of that secret sauce is no longer secret. I don't know, I'm losing the metaphor, man. A lot of the secret sauce is less secret. And next up, we have SmolLM2, When Smol Goes Big. That is a paper, so...
Yeah, it's about a continuation of SmolLM, a small language model, meaning a few billion parameters at most, small large language models. And they are
focusing on training with highly curated data. So basically, with the highly curated data, they're working with FineMath, Stack-Edu, and SmolTalk, these high-quality kind of data sets, and that leads to their ability to get
even better performance at the scale and yet another kind of notch in the ability to get small models to be pre-performant if you invest a lot in kind of optimization of that size. Yeah, it's kind of ambiguously argued in the article itself that there might be implications for sort of like scaling laws here. They don't directly compare the scaling they see here to the kind of Hoffman model
They're chinchilla scaling law paper. But you can expect like if yeah, if your data quality is better, you ought to see, you know, faster scaling. And if you're more thoughtful about as well, the order in which you layer in your training data. So one of the things they do here, and you're seeing this basically become the norm already, but just to call it out explicitly, is that they're not doing it the way that they're supposed to.
So rather than using a fixed data set mix, they have a training program that kind of dynamically adjusts the composition of their training set over time. And so in the early stage, you see this sort of lower quality, less refined data, general knowledge, web text stuff that they train the model on. And you can think of that as being the stuff that you're using to just build general knowledge to get the model to learn even earlier, like the basic rules of grammar and, you know,
you know, bigrams and trigrams and so on. So just the basic things, why waste your high quality data on that stage of training, which is pretty intuitive. So just get it to, you know, train on Wikipedia or basic web text. And then this middle stage where they're, they introduce code and math data that's added to the mix. And then late stage, they really focus in on like the kind of refined high quality math and code data. And so you can see the kind of the, the instrument getting sharper and sharper, the quality of the data going up and
up and up as you progress through that training pipeline. They claim that it was 10 to the 23 flops of total compute, that's about a quarter million dollars worth of compute. And that's pretty cheap for the level of performance they're getting out of the model. So
pretty cool. By the way, they do say they trained it on 11 trillion tokens. And so that's way, way higher than what Chinchilla would say is compute optimal for a model this small; the Chinchilla rule of thumb is roughly 20 training tokens per parameter, so 11 trillion tokens is orders of magnitude beyond that. So, you know, it's over-trained, nothing too shocking there, but again, they're focusing on the data quality and layering side of things, and other optimizations, to get more bang for their buck here.
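To give a flavor of the staged data mixture described above, here's a minimal sketch of a curriculum-style sampler. The stage boundaries, source names, and weights are invented for illustration and are not SmolLM2's actual recipe.

```python
import random

# Three-stage curriculum: the mixture shifts from broad web text toward
# curated math and code as training progresses. All numbers are made up.
STAGES = [
    # (fraction of training, {source: sampling weight})
    (0.6, {"web_text": 0.9, "code": 0.05, "math": 0.05}),
    (0.3, {"web_text": 0.5, "code": 0.3,  "math": 0.2}),
    (0.1, {"web_text": 0.2, "code": 0.4,  "math": 0.4}),
]

def sample_source(progress: float) -> str:
    """Pick a data source given training progress in [0, 1]."""
    cumulative = 0.0
    for fraction, mix in STAGES:
        cumulative += fraction
        if progress <= cumulative:
            sources, weights = zip(*mix.items())
            return random.choices(sources, weights=weights, k=1)[0]
    return "math"  # final-stage fallback

print(sample_source(0.1), sample_source(0.95))  # early vs. late in training
```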
Maybe an argument that says that smaller language models are more data hungry than previously thought and that overtraining can be disproportionately beneficial. But it's not clear if that's actually true when you account for the quality of the data that you're putting in. I mean, there are always asterisks with scaling laws. Data is not just data. So you can't just like, you know, plot data along a curve and say, okay, you know, that's, you know, more data is more scale.
You do have to care about the quality. But this is a great example of that. And it's a powerful model you're getting out the other end. Exactly. And by the way, this is coming from Hugging Face, sort of GitHub for models. So open source, Apache 2.0, you can use it for whatever you want. And a pretty detailed paper also releasing these datasets that they are primarily highlighting as the contribution. So, you know, open source, that's always exciting.
And a couple more stories. Next is not a model but a new benchmark, titled "PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models." That's the paper. So they are arguing that
some of the benchmarks for reasoning require very specialized knowledge that most people don't have, and as a result they end up measuring specialized knowledge as much as reasoning. So this benchmark is instead built from roughly 600 puzzles from the NPR Sunday Puzzle Challenge, which are meant to be understandable with general knowledge but still difficult to solve.
And they are saying that o1, for instance, achieved a 59% success rate, better than R1, but still not, I guess, as good as you might get with top humans. One of the interesting findings here, too, is the relative performance of o3-mini, o1, and R1: they all kind of perform about the same, or at least similarly. And this is sort of
taken as a potential argument that current LLM benchmarks might overestimate some models' general reasoning abilities; maybe o1 has reasoning-specific optimizations that o3-mini lacks, but more likely than not, it's just sort of a general knowledge thing. So kind of interesting. It is interesting to parse out what is general knowledge and what is pure reasoning. To the extent you can do it,
It allows you to hill climb maybe on a metric that's more detached from textbook knowledge, which consumes an awful lot of flops during training. So if you could actually separate those out, kind of an interesting way to maybe get a lot more efficient. Some of the findings, by the way, so they probed this question of like how much reasoning is actually enough, like how reasoning length or number of tokens generated while thinking, if you will, affects accuracy.
DeepSeek R1 performs better after about 3,000 reasoning tokens. But when you go beyond 10,000 tokens, you kind of plateau, right? So we're seeing this a lot with these models where, if you will, the inference-time scaling works well up to a point, and then you do get saturation. It's almost as if the base model can only effectively use so much context.
And the failure mode with DeepSeek R1 is especially interesting. It will explicitly output "I give up" in a large fraction of cases, about a quarter to a third of cases. And anyway, this idea of kind of prematurely conceding the problem is something that previous benchmarks just hadn't exposed. So sort of interesting to see that quantified and demonstrated in that way.
And then it's also the case that R1 and some of the other models too, but R1 especially will sometimes just generate an incorrect final answer that never appeared anywhere in its reasoning process. It'll go A then B, C then D, D then E, and then it'll suddenly go F, you know, and it'll give you something that's like completely untethered. So kind of interesting.
An indication that there is a bit of a fundamental challenge going on, at least at the level of R1, where if you can suddenly get a response that is untethered to the reasoning stream, like that's a problem for the robustness of these systems. So exciting to see a paper that dives into, if you will, pure reasoning.
Which is something that we definitely haven't seen before. Every measurement of reasoning is always to some degree entangled with a measurement of world knowledge, right? Even when you're asking a model to just like solve a puzzle or something, somehow that model has to understand the query, which requires it to know something about the real world and language and all that stuff. So it is really tough to tease these things apart and kind of interesting to see this probed at directly.
Right. And just to give a taste of the things in the benchmark, here's one question.
Think of a common greeting in a country that is not the U.S. You can rearrange its letters to get the capital of a country that neighbors the country where the greeting is commonly spoken. What greeting is it? The answer there is "ni hao" and Hanoi: "ni hao" being the greeting in China, and Hanoi being the capital of Vietnam, which neighbors China. So for a lot of these kinds of questions, you do need some knowledge about the world, but nothing too specialized.
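A tiny sanity check of that example, just to show how mechanical the verification step is once you have a candidate answer:

```python
# The letters of the greeting "ni hao" rearrange to "Hanoi",
# the capital of Vietnam, which neighbors China.
def is_anagram(a: str, b: str) -> bool:
    def normalize(s: str) -> list[str]:
        return sorted(s.replace(" ", "").lower())
    return normalize(a) == normalize(b)

print(is_anagram("ni hao", "Hanoi"))  # True
```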
And you need to kind of think through a lot of possible answers and see what matches the criteria specified. Last story: OpenEuroLLM, which is an initiative across 20 European research institutes, companies, and centers, with the plan to develop an open source multilingual LLM.
They have an initial budget of 56 million euros, which I guess is not too bad to start training, at least given DeepSeek V3, right? It may be doable, although DeepSeek V3 is a v3, so they had been working on it for a while. They are also claiming the open source models will be compliant with the EU AI Act.
They are beginning their work starting February 1st. So we will have to see where it goes. Yeah, I think that 56 million euro budget is going to be a real challenge. I mean, like, you know, that'll be good for what, a couple thousand GPUs, maybe? But that's just the GPUs. Like, that doesn't even
address the data center; it's a nothing budget, this is a nothing burger. It's going to have to pay as well for, like, the researchers and everything. Again, I'm sorry, I know it's not always nice to say, but Europe has a problem funding CapEx. This is just an issue. The one thing that makes this interesting, I guess if you're into European regulation and the way they do it, is that it is explicitly trying to hit all the key
European regulatory requirements. So at least you know that that box is ticked if you use these models. But I mean, expect them to kind of suck, that's all I'm going to say. They may attract more investment as they hit more proof points; hopefully that happens. But I think it's not necessarily the best idea, to be blunt about my perspective. I think it doesn't track the scaling laws; it doesn't particularly, yeah, account for the real costs of building this tech.
And on to research and advancements, we begin with a paper on reasoning titled "LIMO: Less Is More for Reasoning."
So last year, we saw a similar paper, LIMA, "Less Is More for Alignment," with the highlight being that if you curate the examples very carefully, you can align a large language model with, let's say, hundreds of examples, as opposed to a massive number.
Here they're saying that you can curate just 817 training samples and, with those alone, get very high performance, comparable to DeepSeek R1, if you fine-tune an existing model; here they're using Qwen 2.5 32B Instruct. They get that, but you do need to very, very carefully curate the dataset, and that's what the paper goes into. They highlight the need for each sample to be challenging and to have a detailed reasoning trace; so that's kind of the reasoning aspect of this. You need specifically the kinds of outputs you would get from R1 in response to challenging queries.
So one demonstration, and I think increasingly kind of the realization, is that LLMs may already be mostly capable of reasoning, and we just need something like RL or very carefully tuned supervised fine-tuning to draw more of that out of them.
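As a rough sketch of what that kind of curated dataset looks like, here's the shape of the idea. The field names and filtering heuristics below are ours, not LIMO's actual curation pipeline, which is more involved.

```python
# Each sample: a hard problem, a long R1-style reasoning trace, a final answer.
limo_style_sample = {
    "problem": "A hard competition-style math problem goes here.",
    "reasoning_trace": "Step 1: ... Step 2: ... (a long, detailed chain of thought)",
    "answer": "42",
}

def keep_sample(sample: dict, min_trace_chars: int = 2000) -> bool:
    """Crude stand-ins for the two criteria discussed above:
    the problem should be challenging and the trace should be detailed."""
    is_detailed = len(sample["reasoning_trace"]) >= min_trace_chars
    is_challenging = "competition" in sample["problem"].lower()  # placeholder check
    return is_detailed and is_challenging

# Toy threshold so the toy sample passes; the curated set (817 samples in the
# paper) is then used for standard supervised fine-tuning of the base model.
curated = [s for s in [limo_style_sample] if keep_sample(s, min_trace_chars=10)]
print(len(curated))
```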
Yeah, it's an interesting question as to why this also seems to be happening all of a sudden, whether, you know, base models have always been this good and we're just realizing it now, or whether something has fundamentally changed, in particular the availability of the right kind of data, you know, these reasoning traces online,
that makes pre-trained models just better at reasoning generally than they used to be. I think that's a pretty plausible hypothesis, because you have to imagine that the very first thing OpenAI would have tried back in the day after making their base models would have been straight reinforcement learning, right? They probably would have tried that before trying fancier things like process reward models and stuff like that, which kind of got us stuck in a bit of a rut, a very temporary rut, but a rut nonetheless, for a year or two.
So, you know, it wouldn't be surprising if this is just a phase transition as the corpus of data available for this sort of thing has shifted. It is noteworthy, I mean, 817 training samples giving you almost 95% on the MATH benchmark and 57% on AIME. Like, that's really, really impressive.
I think one of the things that this does is it just it shows you that this whole idea of just pre-training plus reinforcement learning with very minimal supervised fine tuning in between is probably the direction things end up going.
That's consistent, of course, with R1-Zero, right? That paper that we've talked about quite a bit. But another key thing here is that when you look at the consequences of doing fine-tuning on a tiny dataset like this, one consequence is you don't run the same risk of overfitting
to the fine-tuning data set, or at least overfitting to the kinds of solutions that are offered in those reasoning traces. And so what they actually find is that this model performs better out of distribution. They find that compared to traditional models that are trained on like 100 times more data, this actually does better on new kinds of problems that require a bit of more novel general purpose reasoning.
And so the idea, the intuition here may be that by doing supervised fine tuning on these relatively large data sets, where we're trying to force the model to learn how to reason in a certain way, right, with a certain chain of thought,
structure or whatever, what we're kind of doing at a certain point is just causing it to memorize that approach rather than thinking more generally. And so by reducing the size of that dataset, we're not forcing the model to pore over that reasoning structure over and over in the same way, and we're allowing it to use more of its natural, general understanding of the world, to be more creative out of distribution. So that, to me, was an interesting little update. Yeah, hard to know
how far all this goes, but it's just kind of an early sign that this may be interesting. I will say that many of their out-of-distribution tests are still within the broad domain of mathematical reasoning. So there's always this question when you're making claims about out-of-distribution results, like how out-of-distribution is it really?
You train the model in geometry, you're applying it in calculus. Does that count? Is it still in the broader math umbrella and all that stuff? But anyway, interesting paper and something we might find ourselves turning back to in the future if it checks out. Yeah, and I think that's a good caveat in general with a lot of research on reasoning.
Like R1, as another example, you know, there's a lot of training on math and on coding because that's the stuff where we have ground truth labels and you can do reinforcement learning there too. I'm not so sure about how much that translates to other forms of reasoning. So you do want to also see things like ARC and this new data set we just talked about.
But regardless, as you said, a lot of insights being unlocked. And in fact, the next paper is very similar to the previous one: titled "s1: Simple Test-Time Scaling," it takes a somewhat different approach. They are introducing this concept of having
an inference time budget. So you're only allowed to use some amount of tokens and you basically get cut off or you get told to keep thinking if you have more budget to spend. And with that, they are able to curate 1000 samples. So slightly more than the previous paper, but around the same. Also fine tuning the
model and also achieving pretty good performance with inference, or test-time, scaling. So yeah, very much in line with the LIMO results.
The previous one was less is more. Here, it's simple test time scaling. Yeah, this is really one that I think Rich Sutton would be saying, see, I told you so about. I mean, it's an embarrassingly simple idea that just works, right? So when you're thinking about the bitter lesson of just like apply more compute, this is maybe the dumbest way that I could think of. Not that I did, but it's the dumbest way you could think of applying more compute to a problem. You have the model, just try to solve the problem. It
If it solves it right out of the gate in 30 tokens, then you say, hey, I'm going to add a token
to my string here; I'm just going to write the word "wait," right? So you put in the word "wait." So maybe you ask it a question like, how many airplanes are in the sky right now? It starts to work through reasonable assumptions about what the answer is, and then it gives you the answer. If it answers too quickly, you just append to its answer the word "wait," and that triggers it to go, wait, okay, I'm going to try another strategy, and then it continues.
And you just keep doing this until you hit whatever your token budget is. The flip side is, if your model is going on for too long, then you just insert the tokens "Final Answer:" to kind of force it to come out with its solution. That's kind of how they do this in practice.
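Here's a minimal sketch of that budget forcing loop. The model_generate function and the </think> marker are hypothetical stand-ins for whatever model and chat template you're using; this is not the s1 codebase, just the shape of the trick as described above.

```python
MIN_THINKING_TOKENS = 3_000
MAX_THINKING_TOKENS = 10_000

def model_generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical stub: returns the model's continuation of `prompt`."""
    raise NotImplementedError

def budget_forced_answer(question: str) -> str:
    trace = question + "\n<think>\n"
    thinking_tokens = 0
    while thinking_tokens < MAX_THINKING_TOKENS:
        chunk = model_generate(trace, max_new_tokens=512)
        trace += chunk
        thinking_tokens += len(chunk.split())          # rough token count
        if "</think>" in chunk:                        # model tried to stop thinking
            if thinking_tokens < MIN_THINKING_TOKENS:
                trace = trace.replace("</think>", "")  # suppress the early stop...
                trace += "\nWait"                      # ...and force more thinking
            else:
                break                                  # enough thinking, let it stop
    trace += "\nFinal Answer:"                         # force it to commit to an answer
    return model_generate(trace, max_new_tokens=64)
```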
And dead simple, works really well. And it is the first time that I've seen that we're getting an actual kind of inference-time scaling plot that looks like what we had with OpenAI's o1, right? Even DeepSeek, when they put out their paper, what they showed was, we can match the performance of o1.
That's great and really impressive. But if you look at the paper carefully, you never actually saw this kind of scaling curve. You never actually saw the actual sort of like flops or tokens during inference versus the accuracy or performance on some task.
What you saw was, anyway, different curves for like reinforcement learning performance during training or whatever, but you didn't see this at test time. And that's what we're actually recovering here. So this is the first time I've seen something that sort of credibly replicates a lot of the curves that OpenAI put out with their inference time scaling laws, which is really interesting. I mean, I'm not saying OpenAI is literally doing this, but it seems to be a legitimate option if you want to, you know, find a way to get your system to kind of pump in more inference time compute and pump out more performance.
And let's keep going with the general trend of this theme of scaling. Here, the next one is titled "ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning." So I think you previewed this a little bit earlier in the episode. The idea of this paper is:
what if we can have a benchmark that sort of separates the reasoning from the knowledge? And the way they do that is by setting up these grids, essentially setting a number of constraints and requiring the LLM to be able to infer the locations of different things in a grid. So here's an example: there are three houses, numbered one, two, three, from left to right.
Each house is occupied by a different person. Each house has a unique value for each of the following attributes. Each person has a unique name, Eric, Peter, Arnold. And then there's a bunch of clues. Arnold is not in the first house. The person who likes milk is Eric. Blah, blah, blah, blah, blah, blah. So we set up
a set of constraints, and then you have to figure out what goes where. That allows them to build different sizes of puzzles, with different numbers of variables and different numbers of clues.
And so, per the title, the idea then is you can get bigger and bigger puzzles and eventually kind of hit a wall where the model can no longer answer these kinds of questions, due to things like the combinatorial complexity; eventually the dimensionality of the problem is such that throwing more compute at it isn't going to solve it.
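To show how mechanical these puzzles are to verify, here's a brute-force sketch over the toy example above, using just the two clues that were read out; a real ZebraLogic instance has enough clues to pin down a unique solution, and many more attributes.

```python
from itertools import permutations

# Enumerate every assignment of names and drinks to houses 1-3 that is
# consistent with the two clues from the example.
names = ["Eric", "Peter", "Arnold"]
drinks = ["milk", "tea", "coffee"]

solutions = []
for name_order in permutations(names):        # who lives in house 1, 2, 3
    for drink_order in permutations(drinks):  # which drink goes with which house
        # Clue: Arnold is not in the first house.
        if name_order[0] == "Arnold":
            continue
        # Clue: the person who likes milk is Eric.
        if name_order[drink_order.index("milk")] != "Eric":
            continue
        solutions.append((name_order, drink_order))

print(len(solutions), "assignments satisfy these two clues")
```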
So again, another benchmark and another way to evaluate reasoning. And on this one, they do find something along the lines of other benchmarks, with o1 performing quite well and DeepSeek R1 also performing really well, but in general, all these reasoning models performing much better than typical LLMs. Yeah, I think one thing that has broken a lot of people's cognition when it comes to scaling is that
So you actually should expect inference time scaling to saturate, to kind of level off, just like this for any fixed base model, right? And the reason just is that the context window can only be so big or it can only effectively use so much context. And so what you want to be doing is scaling up
the base model at the same time as you're scaling up the inference time compute budget. It's kind of like saying if you have 30 hours total to either study for or write an exam, it's up to you how you want to, you know, trade off that time. You could spend, you know, 25 hours studying for the exam and five hours writing it. You could spend, you know, 29 hours and 30 minutes studying for the exam and then just 30 minutes writing the exam. Right.
But there's kind of an optimal balance between those two. And you'll find that often you're more bottlenecked by the time writing the exam than by the time studying, or vice versa. And that's exactly what we're seeing here. Epoch AI has a great breakdown of this. I think it's a sort of under-recognized fact about scaling that Epoch has pointed out:
you really do want to increase these two things at the same time. And no discussion of scaling is complete when you just fix a base model, look at the inference-time scaling laws, and then whine at them about how they're flattening out. This was the very trap the media fell into, and a lot of technical analysts fell into, when they were talking about, oh, we're
saturating pre-training, the ROI isn't there anymore. It's like, no, you actually have these two different modalities of thinking. It's like, if you spent 30 hours studying for an exam and five minutes writing it, yeah, your performance at a certain point will plateau; another 10 hours studying isn't going to help. So anyway, that's really what you're seeing reflected here.
It's, you know, reflected as well, whether it's best-of-n sampling or other approaches that you use; you will see that saturation if you use a fixed base model. Just something to keep in mind as you look at scaling laws.
Exactly. At the end of the day, it kind of makes sense that you will saturate, and essentially also, as you grow the dimensionality of the problem, eventually, for any size of LLM and any number of inference tokens, you're still going to be incapable of doing well. The paper does go into some analysis of the best way to do it and so on. So some
cool insights here. Also, this is actually from the Allen Institute for AI, which we covered earlier. So yeah, lots of research, lots of insights on reasoning, which is, of course, pretty exciting.
And now to the last paper, this one not on reasoning but instead on distributed training, which I think, Jeremy, you'll be able to comment on much more. The title of it is "Streaming DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch." And so the basic story is they are trying to get to a point where you can train in a more distributed fashion without
incurring kind of the cost, or worse performance, relative to more localized training. And I'll stop there and let you take over. Yeah, I almost wish that we'd covered this in the hardware episode, but it's so hard to figure out where to draw the line between hardware and software.
There's maybe something we could do with, like, optimized training or, you know, anyway, it doesn't matter. So yeah, DiLoCo is this increasingly popular way of training in a decentralized way. The big problem right now is, if you do federated learning, you'll have, imagine, like, one, let's call it, I don't know, to grossly
oversimplify, let's have like a data center that's chewing on one part of your data set and another data center that's chewing on another part of a data set and a third and so on. Every so many steps, those data centers, they're going to pull together their gradient updates and then kind of update one global version of the model that they're training, right? So you have, you know, data center one has a bunch of gradients that it's learned based on training. Essentially, these are the changes to model parameters that are required in order to
cause the model to learn the lessons it should have learned from the data that that data center was training it on. And so those gradient updates, which are known as pseudo-gradients, pseudo because each data center is only working on a subset of the data, you're going to pool together, average together, or otherwise combine to update the global model in one step, which then gets redistributed back to all those data centers. Now, the problem with this is it requires a giant
burst of communication. Like, every data center needs to fire off at once this big wave of gradient updates and associated metaparameters, and that just clogs up your bandwidth. And so the question is, can we find a way to manage this update so that you're maybe only sharing a small fraction of the gradient updates at a time? And that's what they're going to do. They're going to say, okay, let's take our model, let's
essentially break it up into chunks of parameters. And let's only update together one sort of chunk, one fraction of the model's parameters, one fraction of the gradient updates that we want of the pseudo-gradients
and kind of pool them together and then redistribute, so that those bursts of information sharing don't involve information pertaining to the entire model at once, but only subcomponents. And anyway, there's more detail in terms of how specifically DiLoCo itself works; we covered that in a previous episode.
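To make the chunked-sync idea concrete, here's a toy NumPy sketch, not the paper's actual implementation: each "data center" produces a pseudo-gradient, but only one chunk of the parameter vector is averaged and redistributed per outer step, so each communication burst covers a small fraction of the model.

```python
import numpy as np

# Toy streaming sync: random numbers stand in for real pseudo-gradients.
n_params, n_workers, n_chunks = 1_000_000, 4, 10
rng = np.random.default_rng(0)

global_model = rng.standard_normal(n_params)
chunks = np.array_split(np.arange(n_params), n_chunks)   # parameter index groups

for outer_step in range(n_chunks):
    # Each worker's pseudo-gradient after several local training steps.
    pseudo_grads = [rng.standard_normal(n_params) * 1e-3 for _ in range(n_workers)]

    # Stream only one chunk this round instead of the full parameter vector.
    idx = chunks[outer_step % n_chunks]
    averaged_chunk = np.mean([g[idx] for g in pseudo_grads], axis=0)
    global_model[idx] -= averaged_chunk   # outer optimizer step (plain SGD here)

    # Only len(idx) values crossed the network, not n_params.
    print(f"step {outer_step}: synced {len(idx):,} of {n_params:,} parameters")
```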
So, moving on to policy and safety, the first story is regarding AI safety within the US seemingly kind of not going so well. With Trump having taken office, we are getting the news that several government agencies tasked with enforcing AI regulation have been instructed to halt their work, and the director of the U.S. AI Safety Institute, AISI, has resigned. This is, of course, following Trump's
repeal, or whatever you want to call it, of the executive order on AI from the Biden administration, which we had previously commented on. It's actually not too surprising. And I think that the accurate frame on this is not that they don't care about safety in the sense that we would kind of traditionally recognize it, national security, kind of public safety, but
part of the challenge has been that in the original Biden executive order, there was basically so much stuff kind of shoved in: consumer protection stuff, privacy stuff, there was
social justice stuff, AI ethics stuff, bias stuff. It was literally the longest, it may still be the longest, executive order in the history of the United States. And it really was that they were trying to give something to everyone. Their coalition was so broad and so incoherent that they had to stuff all this stuff in together. That's something that at the time
I'd mentioned was probably going to be a problem. Here we are, you know, we have a new administration and no surprise, like that's not the way that they're going with this. I think on national security grounds, yeah, you're going to see, you know, some thoughtful work on what would otherwise be called safety. The problem is the word safety is taken on a dual meaning, right? That's almost politicized. So in terms of the actual kind of loss of control of these systems, in terms of the actual weaponization of these systems, that's something that from a concrete level, I think this administration is going to be concerned about.
But there's all this kind of fog of war; frankly, they're still trying to figure out what the problem with this tech is and how they're going to try to approach it, especially given that we have a competition with China. The departure of Elizabeth Kelly, this is the AISI director, is not super, super shocking either. It's the kind of role you would expect to shift over, you know, the head of a fairly prominent institution
that is linked to this executive order; maybe less surprising that she'd be departing. It's unclear to what extent the Trump administration is actually going to use the AISI or find some other mechanism. But there certainly is interest within the administration in tracking down a lot of the big problems associated with the tech. They're just trying to figure out what they take seriously and what they don't, which is more or less what you'd expect in the first few weeks of an administration. And next, going back to
research, we have a paper on inference-time computation, but this time for alignment. So the title of the paper is "Almost Surely Safe Alignment of LLMs at Inference Time." "Almost surely" is a bit of a technical term, actually; it's a theoretical guarantee that, with probability approaching one, some safety metric is satisfied.
This gets pretty theoretical, so I'm not going to even try to explain it. Broadly speaking, they train a critic and solve a constrained Markov decision process. So again, there's a lot of theory around these kinds of things, but the end story is they find a way to do this purely through inference-time decoding, without retraining the base model at all.
At inference time, you can get safety guarantees if you train this critic to guide the decoding.
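As a very loose illustration of what critic-guided decoding can look like at inference time, here's a toy sketch. This is not the paper's constrained-MDP algorithm and carries none of its guarantees; both model functions below are hypothetical stubs that you would replace with a real LLM and a trained safety critic.

```python
import math
import random

def lm_next_token_logprobs(prefix: str) -> dict[str, float]:
    """Hypothetical stub: the base LLM's log-probabilities for candidate next tokens."""
    raise NotImplementedError

def safety_critic(prefix: str, token: str) -> float:
    """Hypothetical stub: estimated log-probability that continuing with `token`
    still leads to a safe completion."""
    raise NotImplementedError

def safe_decode_step(prefix: str, temperature: float = 1.0) -> str:
    """Sample the next token after re-weighting the LM's proposals by the critic."""
    scores = {
        tok: lp + safety_critic(prefix, tok)   # combine fluency and safety scores
        for tok, lp in lm_next_token_logprobs(prefix).items()
    }
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores.keys()), weights=weights, k=1)[0]
```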
Yeah, the cool implication here of it being inference time is that you can basically retrofit or slap this onto an already trained model and it'll work right out of the box in that sense. If you're getting excited because the title really sounds like a very promising advance, it's not that it's not promising, but there's a big giant caveat right in the middle of this thing, which is it's almost surely safe with respect to some particular model or metric that defines safety. So
So traditionally, the challenge with AI has been that no one knows how to define a metric that would be safe to optimize for by a sufficiently intelligent system. There are always ways of optimizing for any metric you can think of. This is called Goodhart's law, where you end up with very, very undesirable consequences when you push that metric to its limit.
right? So when you tell teachers, for example, they'll be rewarded for the scores of their students on standardized tests. Don't be surprised when the teacher teaches to the test, right? They find cheats, they game the system. That is in no way addressed remotely by this scheme. It leaves us with all our work still ahead of us in terms of defining the actual metrics that will quantify safety. So that's a bit of a distinction here. What they're saying is once you have that metric,
then we can make certain guarantees about the safety of the system as measured by that metric. But it doesn't guarantee that that metric will not be hacked. That's kind of the caveat here.
And we have another paper on that general topic of alignment, this one coming from Anthropic, titled "Constitutional Classifiers: Defending Against Universal Jailbreaks." So the story is, they have their constitutional alignment approach where
you write down a set of rules, a constitution, and then you can generate many examples of things that do or don't align with that constitution. So this, in a way, is similar: you have, again, your constitution, and you use it to train classifiers that make the system
hard to jailbreak, so it's not going to give in to a demand to go against its constitution. And the highlight here is that they had over 3,000 hours of red teaming, red teaming meaning that people tried to break this approach, tried to find
a universal jailbreak that could reliably get the LLM to disclose information it's not supposed to. This approach did basically
successfully make it so you were not able to do that. One more thing to note here, Anthropic also offered $20,000 to be able to jailbreak this new system. So they are still trying to get even more red teaming to see if they can do it. So the challenge is closing today. So I guess we'll see if
anyone is able to beat it. Yeah, I will say 20 grand is a pretty small amount of money for a skill set that is extremely valuable, right? So the best jailbreakers in the world are people who are making a huge, huge amount of money every year; 20 grand, probably not that compelling, especially given the amount of value that it would offer to Anthropic in terms of plugging these holes. That's something that Anthropic was criticized for online. You know, Pliny the Prompter,
or Pliny the Liberator, I guess, as he's now known, kind of got into this semi-heated exchange with, I'm trying to remember if it was Jan Leike, I think it was Jan, on X, saying he's not even going to bother trying to do this; he has a whole issue with the way Anthropic is approaching safety fine-tuning and the extent to which they may be hobbling their models' ability to, I guess, talk about themselves. Now, I'm trying to remember if Pliny the Liberator is one of these, like,
consciousness concern people or if it's more of an open source thing, I think it is both. A whole bunch of people anyway in that ecosystem who are complaining about
about Anthropic potentially using this to prevent the models from expressing the full range of their consciousness and all that jazz. So it's kind of this interesting, I mentioned it because Pliny is actually possibly the world's most talented jailbreaker, right? So like, this is the guy who every time there's a new flashy model that's announced, supposedly has the best
safety and security characteristics, he goes out and announces, like, oh yeah, and five minutes later it's like, I figured out how to jailbreak it, here's the jailbreak, you know, check it for yourself. So anyway, interesting that he has not been engaged in these red teaming exercises by any of the labs. And that's an ideological thing, but that makes this ideology worth tracking: it actually is affecting these frontier labs' ability to recruit jailbreakers at the top tier.
And with that, we are done with this episode. Thank you for listening. Once again, you can subscribe to the newsletter, where you'll get emails with all the links for the podcast; you can go to lastweekin.ai.
As always, we appreciate you subscribing. We appreciate you sharing, reviewing, chatting on the Discord, all those sorts of things. And we try to actually take your feedback into account. Maybe there won't be an AI-generated song on this episode. We'll see. But regardless, please do keep tuning in and we'll keep trying to put these out every week.
I'm reaching high.
New tech emerging, watching surging fly From the labs to the streets, AI's reaching high Algorithms shaping up the future seas Tune in, tune in, get the latest with ease Last week in AI, come and take a ride Hit the lowdown on tech and let it slide Last week in AI, come and take a ride From the labs to the streets, AI's reaching high
From neural nets to robots, the headlines pop Data-driven dreams, they just don't stop Every breakthrough, every code unwritten On the edge of change
With excitement we're smitten From machine learning marvels to coding kings Futures unfolding, see what it brings