Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's most interesting AI news. You can go to the episode description or to lastweekinai.com to get the links to all those news stories and the timestamps as well.
I'm one of your hosts, Andrey Kurenkov. I studied AI in grad school and now work at the generative AI startup Astrocade, which someone on Discord mentioned I never actually say the name of. Oh, yeah, it's true. Yeah. We don't have much publicly out there, which is why I don't say the name, but it's coming. So soon we'll have a bit of news of our own to share, which will be exciting. Very cool. Okay, well.
Good to know. And yeah, I'm Jeremy Harris, the other co-host, co-founder of Gladstone AI, an AI national security company. And I've been away for, is it three weeks? It's a couple of weeks, a few weeks. Yeah, it's been a wild time. We're going to be launching something in, actually, it'll be like April 20th now. We just had it pushed back for
reasons. You'll hopefully learn more about why I've been out of pocket this whole time, and hopefully it'll be worth it. We'll see. Anyway, yeah, I'm really excited to be back in the saddle. It's been an interesting week, right? I mean, the last week was the big, big week for new model releases. This week, I think there are some specific, really interesting and impactful things, but not a huge number of them.
Yeah, to give you a quick preview of what we'll have in this episode, we'll have one new exciting model from Alibaba this time that pretty much caps off a strand, I would say, of new reasoning models, new big models, new cutting edge models, etc.
And then aside from that, we'll be kind of going back to a mix of the usual stories: some new consumer facing things, Alexa Plus, a few business deals going on. Quite a few open source releases, where we'll focus on benchmarks and also some very cool, unusual kind of software from DeepSeek on the infrastructure and coding side.
In research, we'll be once again talking about reasoning, which has been kind of the major focus of research for a while, but also about honesty and accuracy, which will be interesting. And in policy and safety, we'll be mainly going into some of the details of the new models. And of course, a little bit about export controls as well.
But before we get there, we were thinking, since, Jeremy, you could not co-host last week, where we were discussing GPT-4.5 as well as Claude 3.7 and also Grok 3, that we could do a quick section to chat a little bit more about those releases. Probably the more interesting one is GPT-4.5, with the narrative going on that it seems kind of underwhelming and it seems like maybe it's showing...
The limits of pure unsupervised scaling, where, you know, reasoning is now the thing that's getting us really, really big leaps in performance. It's unclear if unsupervised scaling by itself is really worth the investment at this point.
Yeah, this is really interesting because this comes up with every beat. And I think there's a little bit of nuance that's missing. So first of all, we got to talk about this idea of code smell, because sorry, not code smell, sorry, of model smell. Yeah, it used to be code smell back in the day. I'm old enough to remember 20 minutes ago when the kind of smell you cared about was code smell. But yeah, so essentially, we live in a world where
With every beat of pre-training compute, so with every new 10x, like order of magnitude increase in the number of flops that are thrown at a system during pre-training, you're getting a better model. It used to be that you would get a very noticeably better model, right? So GPT-2 would barely be able to string together three coherent sentences. Then GPT-3, I think that was more like 100x the compute, but still, you get the gist, all of a sudden it's like, you know, full paragraphs and essays. And then GPT-4 gets you to kind of like page length content and beyond. You can kind of clearly tell what's GPT-2, what's GPT-3, what's GPT-4. The problem is, and we've talked about this in the context of image generation systems, there comes a point where it's just really hard to tell what is better, right? Like show me the absolute latest update
from Midjourney, show me the absolute latest from really any image generation system, and they're all basically photorealistic. You have to be really, really specialized to be able to kind of catch the smell, if you will, of these models, hence this idea of smell, right, this very kind of esoteric, abstract notion that I can't quite put my finger on. But if I play with the model enough, I can sort of tell that there's more intelligence or capability under the hood, right?
Yeah, exactly. And we were talking, there's smell, there's also kind of the vibes, and that's often... When a model releases, you look at the benchmarks, usually you get some numbers, but it's really hard to tell what those benchmarks even mean. So often you want to just go see what the vibes are from trying it. For instance, Andrej Karpathy posted evaluations of Grok 3, of GPT-4.5, and...
gave his take on it. So that's a similar aspect where you need a very, very specialized benchmark to showcase the difference. And you need just firsthand experience and reporting on trying side by side the difference between models. Which also makes a ton of sense, right? Because fundamentally, pre-training is about what? It's about being able to predict the next token with really high accuracy, right? And what that means is once you get really, really, really good at that, you must have learned
a lot about the world. Like being good at fill in the blanks means having a good world model. Ultimately, at least that's the thesis.
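To make "predict the next token" concrete, here is a minimal sketch of the pre-training objective (Python with PyTorch; the tensor sizes are toy stand-ins, not any real model's):

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction objective: the whole pre-training signal is
# "how well did you predict token t+1 given tokens 1..t".
vocab_size, seq_len, batch = 50_000, 128, 4
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a text batch

# Shift by one: position t predicts token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)  # lower loss = better next-token prediction = (the thesis) better world model
```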
And so the challenge is like, where does your lift come from? Where does your performance improvement come from on that metric once you get to GPT-4, once you get to GPT-4.5? Well, it's got to come from getting more and more niche things, right? It's got to come from like, getting better and better at, I don't know, like bio something or like some niche field in history or shit like that, right? So
This is one dimension of it. Another is maybe you get a little bit better at logic and induction and all that stuff. And so the interesting thing is that we're getting to the point where it's just hard to almost qualitatively gauge just by reading the outputs of these models. I will say quantitatively, when you look at Chatbot Arena, there it is, GPT-4.5 right at the top, along with Grok 3. So when people are actually kind of at scale asked to assess the performance of these systems,
In many contexts, they can tell the difference there. And you see readouts from Karpathy and others who kind of allude to similar things.
But I think the main value in pre-training now is not just the model that you get. And this is something that's shifted. This was the reason, by the way, that on the podcast, we so easily made the call when we were talking about DeepSeek v3, we said, big reasoning breakthrough is coming from DeepSeek. That was trivial. That was the easiest call in the world. No one was paying attention to it at the time, but you could tell how good the reasoner would be based on the strength of the base model.
The base model wasn't that much better, like on a percentage basis relative to other base models, because it gets harder and harder, right? As you get closer and closer to 100% on these benchmarks, yeah, you're like climbing that next percent means kind of understanding a lot more about the world. And anyways, that's all kind of part of this. And it creates the illusion that you have a system that's sort of saturating and not giving you any ROI. But then when you apply the right amount of inference time compute, suddenly you unlock this massive advance.
The last thing I want to flag here is really this intuition. It's a bit of a toy picture and it's not perfect, but if I give you, you know, a hundred hours to spend on a test, you have the choice: do you spend 99 and a half of those hours studying and half an hour doing the test, or do you spend 80 hours studying and 20 hours doing the test? The base models like V3, like GPT-4.5, essentially are just spending all but one second studying for the test. And then they're spending one second actually performing at test time, right? That's what a base model really is. And no surprise, you saturate that at a certain point; there's only so much more leverage you can get out of your pre-training. But then suddenly when you allow the model to reason at test time, it's like, damn, what a lift. And one of the big discoveries in the last few months and arguably years
has been that really scaling doesn't mean just jacking up your pre-training compute. It means jacking it up in tandem with your inference time compute budget and moving them up together. There are curves essentially that you're jumping between as you do that. And that is what continues ad infinitum. And so this can lead, and very well may lead, I should say, a lot of people to make the, I think, pretty clearly incorrect assessment that, oh, we're saturating scaling in the general sense, when really what's happening is, if you don't do any inference time compute, your pre-training has basically done as much as it can. You've done as much as you can with studying for the test. It's now time to spend more time actually doing the test.
So I think that's kind of the big thing here. It's not a coincidence that OpenAI was so confident in saying GPT-4.5 will be our last base model, basically our last autoregressive, straight up, you know, text auto-complete model. From now on, it's going to be reasoning models. Why? Because the economics don't support spending more time on just pre-training and not on inference. The scaling curves are a combination of inference and
training time. So I guess that's my overall take on this stuff. I think we'll continue to see improvements. I really would caution against betting against the scaling trend just based on GPT-4.5. Well, I think there's a bit of nuance and we'll move on, but I'll just add a couple of things. This is an interesting topic. I guess a big question is,
In terms of scaling up model sizes, are we kind of hitting the ceiling, right? And that is not really known. I would say GPT-4.5 is one indicator; we don't know the model size. People are guessing maybe it's another 10x step where it's like 10 trillion weights. I have no idea. But based on the cost, based on the speed of it, it seems to be a much bigger model with many, many more weights. And
Doing that resulted in these relatively incremental changes from GPT-4, as you said,
which does kind of make sense, because there is kind of a difficulty in answering once you're generally intelligent. You know, OpenAI was highlighting this emotional intelligence aspect, which arguably could mean that, you know, it is much better at kind of the nuanced reasoning of how to reply to a given question in terms of tone, in terms of depth of response detail. But those are the sorts of things that you can't benchmark.
So that's one thing. And then the other thing is, as you said, we found that with base models, with just a little bit of training, not that much training, there's kind of a latent reasoning capability that allows you to do inference time scaling effectively. Because generally in my awareness, like you can do inference time scaling with base models, but it's not nearly as powerful as if you do additional training to make them capable of strong reasoning. And what we've seen is
with just a little bit more training, you get like, you know, a 20%, 30% improvement over the base model. So it'll be very interesting to see if you can get the same level of improvement with stuff like GPT-4.5. And then of course, there's the distillation question: we've seen, again, with base models, you can take a very smart model and kind of juice that intelligence into something smaller. I could also see that being a reason to keep training bigger models. And
And that is something that will change. So, you know, like Claude 3 Opus, for example, nobody ever really used that because it's so big. It's so bulky and expensive to query, right? This is the problem with these super scaled base models. When I say super scaled, I mean large parameter counts, as you said. In practice, what the labs do is, yeah, they do distill these down to much, much smaller sizes so that it's cheaper at inference time. That makes all the sense in the world.
You basically say, look, I'm not interested in making a, you know, like a compute optimal model. I'm not interested in making a model that is as large as it should be for a given compute budget. Like as you increase your compute budget, you should theoretically be increasing the number of parameters in your model. Those two things should scale at the same time. That's what the scaling laws say. But increasing the number of parameters in your model means driving up the cost of inferencing that model.
And so what you'll often see is people will actually say, no, I will artificially keep my model smaller than it should be. I will intentionally sacrifice performance because it's going to cost less on the back end if it's being queried by, say, billions of times a day.
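As a rough back-of-the-envelope illustration of that tradeoff, here is a toy Python sketch using the standard ~6·N·D approximation for training FLOPs and ~2·N FLOPs per generated token for inference; the parameter counts, token counts, and traffic numbers are all made-up placeholders, not any lab's actual figures:

```python
# Rough FLOPs accounting: you train once, but you serve forever.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens          # standard ~6ND approximation

def inference_flops(n_params, tokens_served):
    return 2 * n_params * tokens_served     # ~2N per generated token

N_big, N_small = 400e9, 70e9                # hypothetical parameter counts
D = 15e12                                   # hypothetical training tokens
served = 1e9 * 365 * 1_000                  # made up: 1B requests/day, ~1k tokens each, for a year

for N in (N_big, N_small):
    total = train_flops(N, D) + inference_flops(N, served)
    print(f"{N/1e9:.0f}B params: train {train_flops(N, D):.2e}, "
          f"serve {inference_flops(N, served):.2e}, total {total:.2e} FLOPs")

# At high enough traffic, serving dominates the bill, which is why a smaller
# (or distilled) model can be the cheaper choice even if it sacrifices some quality.
```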
So, and that's actually going to be another calculation when you do reasoning, because doing reasoning with these models means querying them many, many times. So you're running them at inference time a lot more. So the economics then, essentially what you're doing is you are literally trading training for inference time compute, right? Or you're trading off your performance, let's say, in those two different regimes. And so,
Yeah, anyway, lots to say there. But high level, I mean, I think, yeah, betting against scaling at this point is, I wouldn't do it. But I think I wouldn't bet against scaling. But there is an interesting question as to traditional scaling of increasing model weight counts at pre-training. That's an interesting situation we are in. I guess when I say scaling, I usually think of compute. I think most people are thinking in that direction. But you're right. Like there's those two things for sure. Yeah.
Well, lots to say there, but we're going to have to move on to the actual news. The new things this week, starting with tools and apps, and very much relevant. Our first story is about QwQ-32B. That's from Alibaba. And that is their new model that is about as good as DeepSeek R1, even seemingly outperforming OpenAI's o1-mini. So this is very similar to R1 in a way. They took their base model, QwQ-Max, I would guess, that was released about a month ago. They did some additional training to make it a reasoning model with reinforcement learning. And then they found that it is able to do all of these benchmarks at about the level of DeepSeek R1, slightly better, slightly worse, pretty much equivalent.
And we don't know much. They didn't release a paper, but in their brief blog post, they do say that this appears to be very similar in terms of approach. They start with their base model, they train it on coding tasks and math tasks where you can have automated verifiers without any sort of trained reward model. You just train specifically on coding and math because it's easy to evaluate your final answer with some hard-coded rules.
You do a little bit of training beyond that as well, which I think is interesting. R1 had this two-stage thing. This also has this two-stage thing, where you start with just pure code and math and then you do a little bit of broader training with reinforcement learning on kind of general problems. And there they use a trained reward model. So again, not too much beyond that we can say on this model, but yeah.
I guess it again highlights that if you have a good base model, it honestly doesn't seem that hard to get to a good reasoning model. And also, of course, this was kind of a big announcement from Alibaba. They already have this up on their current chat platform. They're also releasing the weights, and their stock jumped, I forget, like 5%, 10%, a decent amount based on this news.
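The "hard-coded rules instead of a trained reward model" idea mentioned above is simple enough to sketch. A minimal, hypothetical example of a verifiable reward for math-style RL, assuming the model is instructed to emit its final answer in a \boxed{} tag (the function name and format are illustrative, not from Alibaba's or DeepSeek's code):

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches, else 0.0.

    No trained reward model is involved; this only works in domains like
    math and code where correctness can be checked mechanically.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example: the reward the RL loop would see for one rollout.
print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```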
Yeah, I mean, it does seem to be a performant model. It is out now, and, I guess, what are we, about two months after DeepSeek R1, which came out around January 20th? You know, all kinds of other things delay the publication or the release of these models too, so it's hard to know exactly when the training runs stopped, but it does seem like they're fairly close behind. Yeah, a couple of notes. I mean, they make a big deal out of
the number of parameters in this model relative to DeepSeek and DeepSeek R1. So this actually ties back to the conversation we were just having. This is a 32 billion parameter monolithic transformer, whereas DeepSeek was a, I don't know, 600-something billion parameter model. They had like 37 billion activated parameters. So DeepSeek R1 was a mixture of experts. Essentially, you know, a query goes in, or a prompt goes in, token by token that is, and it gets fed to
a couple of specialized kind of expert models. And then there's, anyway, there's one expert that always gets queried. It's a whole thing. Go check out our episode about that. But bottom line is, every time you inference DeepSeek R1, the vast majority of the parameters in the model aren't actually involved in generating the final output.
And so there's only 37 billion activated parameters for each forward pass. And that contrasts with 32 billion, a smaller number, for this model. But in this model, every single parameter is activated every single time. So this makes it hard to do an apples to apples comparison.
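To see why it isn't apples to apples, here is a quick hedged sketch comparing total versus per-token active parameters, using the rounded public figures discussed here (exact numbers vary by source):

```python
# Total parameters vs parameters actually used per generated token (approximate figures).
models = {
    "QwQ-32B (dense)":   {"total": 32e9,  "active_per_token": 32e9},
    "DeepSeek-R1 (MoE)": {"total": 671e9, "active_per_token": 37e9},
}

for name, m in models.items():
    # ~2 FLOPs per active parameter per generated token (rough rule of thumb)
    flops_per_token = 2 * m["active_per_token"]
    print(f"{name}: {m['total']/1e9:.0f}B total, "
          f"{m['active_per_token']/1e9:.0f}B active, "
          f"~{flops_per_token:.2e} FLOPs/token")

# Per-token compute is similar (32B vs 37B active parameters), but the memory
# footprint is not: the MoE model still has to hold all 671B parameters somewhere.
```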
The other thing is, it is much more convenient to have a smaller model. 32 billion parameters means you need a lot less RAM to hold the model. Yeah, easier requirements. So that's an application level thing, right? If you're a person, an engineer who wants to take this model and do something with it, then for sure, Qwen with Questions, QwQ-32B, is going to be a more interesting model potentially than DeepSeek R1, because it's just smaller; you need less RAM, less infrastructure. But from a scientific standpoint, when we ask ourselves how...
How far behind or ahead is Alibaba relative to DeepSeek? We actually can't totally know until we know the compute budget that went into this. See, it might be a smaller model, but it could just be way more overtrained relative to DeepSeek R1. And without knowing that, it's really difficult to know. Alibaba certainly has a much larger compute fleet to throw at this kind of thing. You know, DeepSeek had, at least for R1, you're talking a couple thousand GPUs. So,
All in all, like difficult to tell if you're curious about the competitive landscape of Alibaba versus DeepSeek, which one to pay more attention to. There's not a ton to sink our teeth into here until we know those compute numbers, which may or may not be forthcoming. One thing that we do know is, so they say we've integrated agent-related capabilities into the reasoning model, enabling it to think critically while using tools and adapting its reasoning based on environmental feedback. So there is presumably some kind of
supervised fine-tuning would make the most sense here for tool use. Some SFT stage, so it's not purely just pre-trained and then straight to RL, which was the case for DeepSeek R1-Zero, but not DeepSeek R1. DeepSeek R1 did do a little bit of supervised fine-tuning where you explicitly train the model to use certain tools or to behave in certain ways. In the case of R1, I think it was more to reason in a specific or intelligible way. So anyway, it's an interesting model. The benchmarks do say, at least the ones they publish,
And they are credible benchmarks that, yeah, this is on par with R1, sometimes a little better, sometimes a little worse, depending on the situation. But Alibaba certainly has absorbed, at a minimum, many of the lessons from DeepSeek. And I wouldn't be surprised to see them actually take the baton and scale the blazes if DeepSeek can't get their hands on as much compute as Alibaba. That kind of becomes a really interesting competitive differentiator.
And just to add a little bit to the overview, to be precise. So the timeline is, they released Qwen 2.5-Max in late January. That was their frontier model. They also at the time released Qwen 2.5-1M; that was the long-context model. Then
Just a couple of weeks ago, they released QwQ-Max, Qwen with Questions. So that was also a reasoning model. And at the time, they released that on Qwen Chat and it had the reasoning. So the distinction here with QwQ-32B is, A, as you said, it's smaller. QwQ-Max was also an MoE model built on Qwen 2.5-Max. This is smaller,
seemingly smaller. And in the blog post they do highlight that this one is the one that's trained with reinforcement learning. So presumably QwQ-Max was trained with supervised learning on traces of reasoning from maybe other models, versus this is trained from a base model just via RL and made to match the performance of other reasoners.
And on to the next story, we have an actual product announcement, not just a model. And it is Alexa Plus. So Amazon has given a presentation. This is not out yet, but they did provide an overview of the next iteration of Alexa that will integrate ChatGPT-style chatbot intelligence.
So they will have many new features. Obviously, you can probably chat to it, but it can do various complex tasks. Like, I don't know why, but people love the booking-a-restaurant and managing-travel examples. Like, why would you want a chatbot to book you a flight? But anyway, they did give those examples. It can remember things you tell it to remember, and then it can do smart things
based on that. And they're also releasing a revamped Alexa app and an Alexa.com website that will kind of be bundled here and include multimodal capabilities. So in addition, if you have, for instance, an integrated Amazon camera, Alexa will seemingly be able to look at the video feed and answer questions about that feed.
So a bunch of new features integrated, and it does seem like this will be with a subscription price of $20 a month to be able to use the smart Alexa. Yeah, Amazon, you know, looking more and more like Apple now.
in some ways, with respect to this kind of scaling universe, you know, where they're not releasing their own models, they're focusing more on leveraging models built by third parties and then integrating them into their hardware. That actually sounds very, very much like the Apple play. The difference being Amazon is much better positioned from a hardware standpoint with the Trainium chip than Apple is with their kind of weird fleet
of kind of more CPU-oriented data center infrastructure, which is kind of something we should talk about at some point. But yeah, so this is Amazon basically saying, look, we are a platform for these models more so. At some point, they'll have their own. They have an AGI team internally that we've talked about before, but it's interesting to see them kind of reaching for third-party models. It makes sense. The more you identify as a platform, the more you're servicing hardware rather than software, the more you want to commoditize the complement, right? That's the classic Microsoft play: PCs are really cheap, but you know what's expensive is the software, right? So if you're in the business of building software, you want to make PCs really, really cheap and make it easy for people to get to the point where they pay you for the good stuff. Here, it's kind of the opposite: the complement to good hardware is the models. If you're a hardware company, you want to commoditize the LLMs, have OpenAI compete with Anthropic, compete with Google, and so on.
And then you get to have that competition play out on your platform, drive prices down at the language model level, drive value up, and then your hardware suddenly becomes more valuable. So I think that's kind of the play here. And I'm curious to see too, from a model standpoint, you know, what is the advantage going to be to Amazon to having, if they can unlock this, just more data flowing through, can they leverage that into actually being competitive on the model side? But for now, it's interesting. Amazon has definitely struggled on the Alexa side. Like,
We've heard all the jokes. It's definitely been a product in need of a revamp. So maybe this will do it. Exactly. They did mention that this is built on the Amazon Nova models, but they can also leverage Anthropic's models.
And I think it's interesting that they did also mention Alexa.com as a new website that seems to be basically like a chat GPT, like a chatbot interface. You can upload documents. You can just have access to a chatbot.
And one thing also to mention is it will cost you a monthly fee unless you're an Amazon Prime subscriber, in which case it will be free. So in that way, it's almost similar to X, where if you're in the Amazon ecosystem already, this might be your LLM of choice. You would just use Alexa.com because that's the thing that you already have bundled with your subscription.
So also, I think, interesting from that aspect. Maybe Amazon will have a chance to actually compete in the pure chatbot space, in addition to its, you know, smart hardware having more reason to be useful beyond what's been the case.
And moving on, next up we have another demo of something that's upcoming. The title of the article is "Another DeepSeek moment? General AI agent Manus shows ability to handle complex tasks." So this was a bit of a web demo. They had released a video on X and are now...
providing this as an invitation-only web preview. And this seems to be another really big investment in the agentic space. So similar to something like Claude Code, you can task it with a really large task like developing a website, developing
you know, an app, and it would go off and do this for a while. It will go and do it for, you know, minutes, potentially dozens of minutes. It could accrue like, you know, $10 of inference cost. But then in the end, it could actually spit out a wholly working website. So not too unique compared to what we've been seeing, but another indicator that
this whole agentic direction that, you know, has been a focus for much of 2024 is finally starting to sort of come together. Yeah. And another, you know, Chinese company moving in this direction too. So yeah, it'll be interesting to see. Right now, the waterline is definitely rising on agents. And I think the biggest question is going to be, you know, whether
Whether they can marshal enough compute to make these competitive with the Western agents that basically have not unlimited NVIDIA GPUs, but closer to it,
Because at a certain point, it's a question of like, okay, you've shown that you can get liftoff on inference time compute. But now you have to imagine, as we speak, every fleet of GPUs in the West is being realigned around the thesis that inference time compute is a big deal. And so rather than having the, say, o3-mini or o3 level inference time budgets thrown at them, now we're going to see really, like, what does this look like when, at industrial scale, we're
trying to ramp that up the same way we did with pre-training. So I think it's going to be a couple of months before we see that generation of agents, and then we'll be able to get a better sense of, like, what does the long-term or medium-term equilibrium look like between the US and China on this stuff? But apparently, yeah, Manus does claim to outperform even OpenAI's Deep Research based on some benchmarks that are focused on general AI assistants. So kind of an interesting breakthrough.
Yeah, it's an interesting release too. They got a lot of, let's say, hype or attention based on this post on X. This is a small company, it's called Butterfly Effect, with a few dozen employees. So also there's some skepticism here as to how real this announcement is. But regardless, it's an indicator of where we are at kind of more broadly.
And next we have something from Microsoft. It is Dragon Copilot, which is an AI assistant for healthcare. So the main focus here is listening to clinical visits and creating notes. Essentially, it's AI that can be used for voice dictation and listening in on conversations, being an ambient listener.
Microsoft actually acquired a company called Nuance that was specialized in ambient listening and voice dictating.
And then it is able to create notes for doctors. So we've seen this as an idea for quite a while. We've covered, I think, multiple cases of this idea of AI as a note taker for doctors as being potentially a useful thing. And another example here of now that Microsoft is providing it, presumably it's a pretty mature product offering of this kind.
Yeah, the challenge as ever is, in medicine, doctors are notoriously hesitant to use new technology. And especially, so I'll say this, a lot of my friends are doctors and I've had these conversations with them. Ego plays a huge, huge role in medicine, especially among doctors. So you'll see rollouts of AI models that can, say, produce end-to-end diagnoses, but
And there was a study recently, I don't know, it might've been during a week I was away, but it is one of these amusing things. I had some fun conversations with some friends of mine in medicine. So it turns out that if you have a doctor working with, and I forget what model they were testing, I think it might've been 4o, but if you have fine-tuned 4o, or a fine-tuned, you know, approximately frontier model, the doctor alone
will underperform the model and also underperforms doctor plus model. And that's actually really interesting. So the doctor with the model does worse than the model alone,
Literally, this is the doctor basically taking, on average, correct answers that the model gives and then being like, no, no, that can't be right. Or something like that, which I'm obviously caricaturing a little bit here for humor, but something like that is happening under the hood in a lot of these cases. It's part of the psychology of the space. So really, really challenging. And this is why you're seeing applications like this that are, like, not opinionated. I'm just taking notes. I'm just being, you know, a good little bot sitting in the background, not telling you how to do your job.
For cultural reasons, I think that might be a pretty good use case to go after at this point. Right. I do think so. We've seen some research that indicates that this kind of technology helps reduce clinician burnout and helps patients have better experiences by a pretty wide margin. And this is
being tested in nine hospitals and some other clinical locations partnering with WellSpan Health. So it seems that they are already testing this, and it will be generally released in the US and Canada in May. So it will be interesting to see, now when we start going to doctors, if they'll have a recorder going. I don't know.
And on to the last thing, we have a new offering from Mistral, and it is their OCR API. So OCR is optical character recognition. It's basically looking at a photo or a PDF and converting that into the actual text contained within that photo or, you know, scan of a page or whatever.
And we've seen leaps and bounds improvements in OCR technology over recent years. This offering by Mistral is basically a
you know, version of that that you can use via an API. And they say that this can play into allowing LLMs to use and talk about PDFs and scans of things that perhaps they're not good at handling with just pure multimodal reasoning.
And on to applications and business. And we haven't talked about Anthropic yet, so I guess it's their turn. And we haven't talked about anyone getting billions of dollars in a little while. So that is now also what's happening. Anthropic, which we've covered, has been in the process of fundraising over the last couple of months, has now completed that fundraising round and is valued at $61.5 billion.
That's rising from $16 billion. That was just over a year ago. They got $3.5 billion in this round. It was led by Lightspeed Venture Partners, and it
makes it that Anthropic has now raised over $14.8 billion. So nowhere near OpenAI; as far as I can remember, I lost count of the number of billions they have gotten and burned through. But it's another indicator that, I think, Anthropic is still the frontrunner competitor to OpenAI in this space.
Yeah, for sure. I think the latest is that OpenAI is in talks to raise at a $300 billion valuation. Remains to be seen whether that materializes, obviously, but it's also pretty consistent. You know, they've been about a four or five X multiple ahead of Anthropic for the last little bit. So this situation
suggests, you know, a consistent growth curve for both companies. Maybe not too, too surprising. A bunch of new investors are joining the round. It is their Series E. General Catalyst, Jane Street, Fidelity, Okin, Menlo, they were already pre-existing investors, as well as Bessemer. So, I mean, really, really high quality VCs. No surprise there either. Their claim is they were originally planning to raise $2 billion, but ended up
with an oversubscribed round. So it could be true. It's also sometimes a trick that you'll see used where people say, hey, we're raising this amount of money and the amount that you give is like less than what you're actually trying to raise just so that it creates FOMO and you get more investors. But
Yeah, in any case, that's a really big raise. They're one of the bigger private companies in the United States right now. You think about $300 billion valuations like OpenAI and SpaceX. There aren't that many companies in that range, including at the $60 billion threshold. So pretty interesting. Yeah, we'll see apparently. No surprise, anytime you see this, they're using the capital to develop the next generation of their AI systems and expand their computing capacity. So there you go, more GPUs.
And next up, we have an IPO story. NVIDIA-backed CoreWeave has filed for IPO and has also reported a revenue of $1.9 billion in 2024. CoreWeave is a cloud computing provider that's backed by NVIDIA, and they are aiming to raise $4 billion through the IPO, aiming at a valuation of $35 billion.
So a player here in the cloud space, they have been around since 2017 when they were into crypto mining. I guess now they're aiming to be part of the infrastructure play for AI. Yeah, one of the virtues of going for an IPO is you end up publishing all of your self-assessed weaknesses and differentiators and all that. So we get a little bit more of a sense of where they're at.
Turns out about 77% of the revenue came from just their top two customers for 2024, one of which was Microsoft. And Microsoft, by the way, accounted for apparently two thirds of overall sales. It's a pretty lopsided situation, and that is, you know, some structural risk. And we've seen similar things with other sort of cloud companies like this. It's just the nature of the game.
And yeah, I mean, prior to this, they had raised at a $23 billion valuation. So this is about where you'd expect them to be, going for an IPO after that level of scale to raise more capital. So yeah, just as a reminder, CoreWeave is a pretty interesting company. We've covered them quite a bit. They have an NVIDIA partnership; NVIDIA is one of their most important investors.
That, you can imagine, helps them get access to GPUs a lot faster, which is a big differentiator. And then they also are known for this very flexible pricing model, where they give you just a lot more granular and cost effective pricing for a lot of GPU resources. So like you can rent individual GPUs instead of entire clusters, for instance. So it's a lot more balanced and easier, theoretically, for smaller players to use as well. So, sort of interesting.
Expect to hear more about CoreWeave in the future, that's for sure. It's expected, by the way, to trade on the NASDAQ. So that's going to be their exchange of choice. Next, we have Waymo and the news that their partnership with Uber has now officially begun. The access to Waymo vehicles through Uber in Austin has launched. So now when you hail an Uber,
do people say taxi these days? I don't know. An Uber vehicle through UberX, Uber Green, any of these, you might get a Waymo, which is kind of interesting. So you don't have to make any sort of specialized request, you just can get matched to one. You can adjust your rider preferences to increase your chances of getting a Waymo. And apparently pricing will be the same, just without the tip. And this will cover...
37 square miles in Austin.
And moving right along, the next story is about Microsoft and OpenAI. We covered quite a while ago, the UK competition and markets authority, the CMA, had launched an investigation into the partnership between Microsoft and OpenAI due to some antitrust concerns, which was quite the trend last year. Well, that investigation has concluded and they decided that the partnership
is fine, basically. Microsoft has influence over OpenAI, but not control. And so it doesn't meet the criteria for a merger review.
Yeah, the triggering event here was just seeing how effectively Microsoft was able to pressure the OpenAI board into rehiring Sam Altman. And so anytime you see that, it raises questions about, okay, well, does Microsoft then have effective control? And that would trigger antitrust concerns. As you said, they instead found a high level of material influence, not
outright control, not something that would justify further action here. They are saying, quote, the CMA's findings on jurisdiction should not be read as the partnership being given a clean bill of health on potential competition concerns. But the UK merger control regime must, of course, operate within the remit set down by parliament. So basically, they're arguing that just
within the narrow regime that they are charged with, this does not qualify as something they need to act on. There has been criticism of this. It's interesting to note the Labour government under Keir Starmer that's just come in is much more sort of pro-AI acceleration, economic growth type stuff than the Rishi Sunak Conservative government that just left, which kicked off, of course, the famous AI safety summit series. And so instead, what you're seeing is moves
within the government that reflect that view. And so there's concern. For example, there's one quote from somebody who said the CMA has sat on this decision for over a year, yet within just a few weeks of a former Amazon boss being installed as its chair, it has decided everything was absolutely fine all along, nothing to see here. Hard to really know how far all this goes because obviously the machinations that go on inside a government body like this are quite complex, partly political, but also just partly this could have just happened this way. Right.
Really difficult to know, but that is part of the conversation happening around here. And the last story here is about Scale AI announcing a multimillion dollar defense deal. So they have a deal with the Department of Defense in the US. It's called the Thunder Forge AI agent program, and it is meant to enhance US military planning and operations. Scale AI is sort of the leader on this initiative.
They said they'll be partnering with or using technology from Microsoft and Anduril, among others. And they do say this will be under human oversight, but the idea seems to be to add agentic capabilities. And, you know, this is just another indicator of the overall trend of a movement of tech towards military friendliness, with OpenAI and Anthropic and others having made
similar kinds of maneuvers over the past few months. There's not a ton of, I would say, concrete, non-obvious information that's out there, right? They're calling it a multimillion dollar deal. Well, no shit.
You're talking about, okay, we know it's spearheaded by the DIU, the Defense Innovation Unit, which is roughly what it sounds like. They do a lot of advanced R&D for the Department of Defense. By the way, I mean, there's been a lot of concern around, obviously, the use of fully autonomous systems to do targeting and things like that.
The claim is that that will not be the case, that there will be a human in the loop on some level. But in any case, the reality of the space is we are heading for a world where U.S. adversaries absolutely will be deploying these systems without any kind of oversight. And at a certain point, the response times of these systems are just too fast to have a human in the loop. So anchoring on, you know, whatever human in the loop means, because it's already kind of ambiguous,
is, I don't think, a winning strategy. I think ultimately the economics, the geostrategic landscape, force you in the direction of full automation, whether you like it or not, and chalk it up to a Moloch problem, whatever you want to call it. But
They are saying here that they're partnering with Anduril, which is the famous Palmer Luckey company that's, like, you know, anyway, on its way to becoming a big defense prime, basically. And Microsoft. It's going to be about AI agents, no surprise there. And the use cases they cite here are modeling and simulation, decision making support, proposed courses of action, and even automated workflows. The rollout, by the way, is going to start with US INDOPACOM and US EUCOM. Those are
So the U.S. DOD is set up with different what are known as combatant commands. These are essentially the integrated operations that actually do real shit. So, you know, in the Middle East, you have US CENTCOM, US Central Command. You know, they're in charge of whatever happens in Syria, for example. Interesting that it's being rolled out in INDOPACOM, so Indo-Pacific Command, which includes China, and then US EUCOM, European Command, presumably with Russia in the orbit. These are interesting theaters to be experimenting with some of these things because you get actual information,
Sort of like rubber-meets-the-road impact pretty quickly, I would imagine. So interesting. And Alexandr Wang, by the way, CEO of Scale AI, is actually pretty concerned about alignment issues. So he's doing this with that in mind. Very thoughtful guy. And we'll talk about something that he put out with Elon's AI advisor, Dan Hendrycks, recently as well. But anyway, he's on the map with this. And Scale AI is obviously doing a lot more DoD stuff these days.
And moving on to projects and open source, we begin with DeepSeek that had a whole open source week. So just to give a quick overview, they had a week where every day they released a new repository.
So the whole list is: FlashMLA, an efficient MLA decoding kernel for Hopper GPUs; DeepEP, a communication library for mixture of experts models; DeepGEMM, an optimized general matrix multiplication library; optimized parallelism strategies, a framework for optimizing parallelism (including DualPipe);
And the Fire-Flyer File System, a whole file system optimized for machine learning workflows. And finally, the DeepSeek-V3/R1 inference system. So that's six different packages that they released. And as you can tell from that overview, it's very much focused on infrastructure. This is part of their secret sauce, part of why DeepSeek V3 was
so performant at such a low cost. They just optimized the crap out of everything, like writing their own matrix multiplication kernels and so on. So they have now shared all that with the world. Really exciting, I suppose, for some people who work on this sort of stuff. Jeremy, I'm sure you're able to provide more detail here. Yeah, well, so this is in a way the
big reveal of all the stuff that went into the V3 and R1 papers and some other stuff too. I think part of what it shows you is just the incredible breadth of engineering talent that they have, right? So in some ways, the Fire-Flyer File System, 3FS,
is the least kind of compelling, at least to me, in terms of what it actually gets you. But it just helps to illustrate how frigging wild and far their capabilities stretch. So this is basically for SSD storage. So there's this issue of what's called read throughput. It's the rate at which data can be read out of storage, like long term storage in your data center. And by long term, we really mean storage here that has a really high capacity, think storing
model checkpoints or big chunks of data sets and things like that. You don't use them very often, but when you do, you need really, really large throughput. You're pulling off tons of data. So you're looking at terabytes per second, usually, at that scale.
So it turns out that they've been able to achieve 7.3 terabytes per second of read throughput from their new optimized setup. That's actually really, really impressive. It's up there with frontier-scale training run infrastructure, but now it's being open sourced. Anyway, there's a bunch of detail in here about how it's optimized.
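For a rough sense of scale, here is a hedged back-of-the-envelope in Python: the checkpoint size is an estimate from the ~671B parameter count discussed elsewhere in this episode, and the throughput is the aggregate cluster figure cited here, so treat the result as illustrative only:

```python
# How long would it take to pull a full DeepSeek-V3-sized checkpoint off storage?
params = 671e9
bytes_per_param = 1           # assume FP8 weights; ~2 for BF16
checkpoint_tb = params * bytes_per_param / 1e12

aggregate_read_tb_s = 7.3     # the aggregate read throughput figure cited above
print(f"~{checkpoint_tb:.2f} TB checkpoint / {aggregate_read_tb_s} TB/s "
      f"= {checkpoint_tb / aggregate_read_tb_s:.2f} s to read")
# Roughly 0.1 s at FP8 (about 0.2 s at BF16): checkpointing and data loading stop
# being something you have to schedule the rest of the training run around.
```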
I'm actually going to skip over that completely, but we can dive into it. This is even at the SSD level; they're optimizing that even though SSD is not typically a key bottleneck for high performance computing AI runs. It's high bandwidth memory, it's GPU flops, it's network interconnect, but it's not usually pulling stuff off of long-term storage. I think the big story is really more the DualPipe release, or anyway, that's one of the bigger ones,
which we can talk about as well. That one... Yeah, I think that one, if I'm not wrong, so they have another release that wasn't even part of this week of open source that focused entirely on infrastructure. This was...
part of the overall week. This is day four of the week. So this is interesting, I think, and you can go into depth here, because this is more of an algorithm. So it's less just pure infrastructure. This is
a new kind of parallelism technique. And then I'll let you take over on the details. Yeah, I mean, it's so hard to tell what qualifies as hardware and what qualifies as software at this point. But you're right. I mean, you could argue it many different ways, I guess. First, we got to talk about this idea of pipeline parallelism, which is a way of breaking up your training task that basically everyone uses when they do really, really large scale training runs. DeepSeek certainly did.
Essentially, you can think of this as having your model's layers divided such that, you know, maybe layers one through three are sitting on GPU one, layers four through six are sitting on GPU two, and so on, right? So different layers of your model are sitting on different GPUs. Now, typically, when you do this,
You have to feed it. So let's say you get a new piece of data you want to feed to your model. Well, now you got to feed it to GPU number one, right? Have it process that mini batch. Then that output has to be fed to the next GPU. You got to move it over to the next GPU. And that next GPU then starts crunching on it while GPU one starts processing the next mini batch. Now, one thing you might notice about that setup is it means that you've got a whole bunch of
GPUs that are holding the weights for the later layers in your model just kind of sitting idle at the beginning, waiting for something to do, because you have to send your data through the first few layers before it gets to the last ones. And so that creates this effect that's known as a bubble, right? You have this bubble that forms during pipeline parallelism,
and it's like you've got all these idle GPUs that aren't being used. What DeepSeek does is try to minimize that; one thing they're doing here is trying to minimize the size of that bubble.
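A toy schedule makes the bubble visible. This is a minimal sketch, not DeepSeek's actual scheduler: naive forward-only pipeline parallelism with 4 stages and 4 micro-batches, where "." marks an idle step.

```python
# Naive pipeline parallelism: stage s can only start micro-batch b at step b + s,
# so later stages sit idle at the start (and earlier ones at the end) -- the "bubble".
num_stages, num_microbatches = 4, 4
total_steps = num_stages + num_microbatches - 1

for stage in range(num_stages):
    row = []
    for step in range(total_steps):
        batch = step - stage
        row.append(f"F{batch}" if 0 <= batch < num_microbatches else " .")
    print(f"GPU {stage}: " + " ".join(row))

# GPU 0: F0 F1 F2 F3  .  .  .
# GPU 3:  .  .  . F0 F1 F2 F3   <- the idle steps are the pipeline bubble
```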
And one technique they use is having a forward pass start with the GPU that's holding the earlier weights, the earlier layers of the model. So forward pass there, while a backward pass starts to go through the end of the model and the data propagates towards the middle. There's this meet in the middle problem that has to be resolved. And they find, anyway, really interesting ways to solve for this. One of the key strategies that they use is finding really clever ways to overlap communication and computation.
And without going into too much detail, I do want to talk about this thing called the streaming multiprocessor. So on your GPU, you can think of the workhorse unit that does work on your GPU as the streaming multiprocessor. And there are tons and tons and tons of these on a GPU. They can handle computation, so they execute code,
and that code is for actually doing, like, matmuls and other forms of compute, or for communication. But they can't do both at the same time step. So a given streaming multiprocessor is either packaging data to send it...
you know, on the network to other GPUs or somewhere else, or it's performing an actual computation. And what makes the DeepSeek approach so sophisticated is that they've really carefully allocated some streaming multiprocessors to handle communication and others to handle computation, on the same die, on the same GPU. And so you imagine, like, at a given time step, like time step one,
Streaming multiprocessors, like, 1 through 80 are doing a computation on batch A, whereas 81 through 100 are handling communication for some previous set of computations that they ran. And so you've got essentially one single GPU die, which at the same time is actually doing many different things. And this helps you avoid this meet-in-the-middle problem that comes from feeding data
you know, at the lower layers and the top layers at the same time, because now you've got these like kind of schizophrenic GPUs that are having to do multiple different things with data flowing essentially through one process in one direction, the other in another. It's actually really, really fascinating. They also chop up their layers in a very creative way, separating out the multilayer perceptrons from the attention, from all this stuff, allowing them to have more fine grained control. And what they claim is that they basically get
Perfect communication-computation overlap. So you don't run into this problem where you have idle cores waiting to be fed data they can crunch. There's always something going on in the system. So anyway, that's really the DualPipe advantage here, and it's called DualPipe because in the early layers you're feeding the forward pass, and in the later layers you're feeding the backward pass.
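And a very hedged sketch of the overlap idea itself, with Python threads standing in for the two groups of streaming multiprocessors; the real implementation is hand-tuned GPU kernel scheduling and looks nothing like this, but the scheduling pattern is the same:

```python
import threading
import time

# Toy illustration of the core idea: at any given moment, some SMs do math on the
# current chunk while other SMs ship the previous chunk's results over the network,
# so neither compute nor communication sits waiting for the other.
def compute(chunk):
    time.sleep(0.01)                  # stand-in for a matmul on the "compute" SMs
    return f"activations({chunk})"

def communicate(payload):
    time.sleep(0.01)                  # stand-in for an all-to-all on the "comm" SMs
    print(f"sent {payload}")

chunks, in_flight = ["A", "B", "C", "D"], None
for chunk in chunks:
    if in_flight is None:
        in_flight = compute(chunk)
        continue
    # Kick off communication of the previous result...
    sender = threading.Thread(target=communicate, args=(in_flight,))
    sender.start()
    # ...while computing the next chunk at the same time.
    in_flight = compute(chunk)
    sender.join()
communicate(in_flight)                # flush the last result
```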
Very cool. And moving right along, next up we have a release of a model, and it's from Physical Intelligence. They are open sourcing their Pi Zero robotics foundation model. So we covered this a few months ago. Pi Zero is a model that takes a video stream, takes a task specification, and outputs robot controls for a variety of
types of robots. Now the Pi Zero model and its code are available on GitHub in a repository. And I think it's also getting integrated into
Hugging Face's robotics framework. They also have a few variations. They have a Pi Zero Fast base model, and then they also have some other ones like Pi Zero Fast DROID that's fine-tuned for a Franka robotic arm in particular. So this is one of the first releases of a large robotics foundation model. Physical Intelligence got like $70 million in funding as a startup. So they
have had the resources to collect the data sets and train the models in a way that has so far really not been possible. And they do say that you can fine-tune it to your own task or application with around one to 20 hours of data.
Last story on the open source front, we have Big Bench Extra Hard from DeepMind. So as we've said, benchmarks are increasingly getting saturated and less useful, and this is yet another demonstration of that. Big Bench Extra Hard builds upon Big Bench Hard; it replaces its 23 tasks with more difficult counterparts that require advanced reasoning skills. And as a result,
state-of-the-art LLMs reach a top accuracy of 23.9% for just the base models, and they do reach a 54.2% pass rate on the reasoning-specialized models. So they're already, you know, sort of able to do these tasks, but clearly there's a lot of room to improve. Yeah, I remember back when, with a new super challenging benchmark, you know, you'd have like 1% to
3% performance on it. And now we're starting this off, with a view to making it the really hard benchmark, and it's like, we're already at 45% basically. So I will say that that is o3-mini-high, like on high compute mode. A lot of the reasoning models, like DeepSeek R1 and the distilled R1-Qwen-32B,
they're hitting, you know, sub 10%, like around 5% performance. So a lot of room to improve there. But certainly, you can see a big step change, right? o3-mini-high, one of these models is not like the others. And that's quite interesting. I think it tells you something about, anyway, the optimization process that they're running in the back end there. So
Very interesting. Another instance of benchmark treadmilling. I think we're going to find this one's going to get old really quick, especially in the world of inference time compute, because performance on benchmarks is just moving a lot faster than it used to. And speaking of improving reasoning models, we are moving on to research and advancements. And the first paper is Cognitive Behaviors That Enable Self-Improving Reasoners.
So this paper is asking the question of how do reasoning models actually do reasoning? What kind of patterns of cognitive behavior lead to effective reasoning? And they identify four
specific behaviors: verification, meaning that you verify your solution; backtracking, going back and revisiting previous decisions; sub-goal setting; and backward chaining. And if you've used reasoning models, I think intuitively this makes sense. This is kind of what you see them doing often, listing out their steps. Think step-by-step is the classic, quasi-reasoning thing at this point, but it used to be how you got the model to be better at more complex tasks. And so they suggest these specific reasoning behaviors and show that if you train for them in particular, you are able to do a lot better.
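A hedged sketch of what "priming" with these behaviors can look like in practice: a hand-written few-shot demonstration (made up here, not taken from the paper) whose trace exhibits verification, backtracking, sub-goal setting, and backward chaining, prepended before RL or evaluation.

```python
# A hand-written demonstration whose *structure* exhibits the four behaviors.
# Per the paper's finding, a demonstration can even end in a wrong final answer
# and still help, as long as the reasoning behaviors themselves are present.
PRIMING_EXAMPLE = """\
Problem: Solve for x: 3x + 7 = 25.
Reasoning:
- Sub-goal: isolate the term with x first.                              (sub-goal setting)
- 3x = 25 + 7 = 32, so x = 32/3... check: 3*(32/3) + 7 = 39, not 25.    (verification)
- That's wrong, go back: subtract 7 instead of adding it.               (backtracking)
- Work backwards from the target: 25 - 7 = 18, so 3x must equal 18.     (backward chaining)
- x = 18 / 3 = 6. Check: 3*6 + 7 = 25. Correct.
Answer: 6
"""

def build_prompt(problem: str) -> str:
    # Prepend the behavior-rich demonstration to every problem the model sees,
    # before (or during) the RL stage.
    return PRIMING_EXAMPLE + "\nProblem: " + problem + "\nReasoning:\n"
```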
Yeah. And the kind of archetypal examples that they use in this paper are Qwen 2.5 3B and Llama 3.2 3B, right? So these are essentially, I mean, they're the same scale, three billion parameters, and roughly the same generation. So Qwen and Llama. And what they find is, if you use the exact same RL reasoning training process, the
Qwen model will vastly outperform the Llama model. And so this is kind of the initial prompt, right, that has them asking, what are the intrinsic properties that enable effective self-improvement at the RL stage? The toy environment they're going to use for this, by the way, is a game called Countdown. So you basically have a set of numbers.
So imagine I give you a bunch of different numbers, and then you have to use the four basic arithmetic operations, so addition, subtraction, multiplication, division, to combine those numbers to get a target number, right? So roughly like some kind of Sudoku-ish thing. And it's on that that they're going to do their RL optimization, and then they're going to compare the performance of different models. And so what they find is that the Qwen model naturally shows reasoning behaviors like verification, like backtracking,
Whereas Llama initially lacks them, just the base model, right? Before any kind of RL loop. And so, but what they find is if you prime Llama with examples that contain those kinds of reasoning behaviors, right?
and especially backtracking, they see a significant improvement in its performance during RL. Its performance actually goes up to the performance of Qwen. And so that's kind of interesting. It's also interesting that if you prime those models by essentially giving them the right reasoning process, or examples of that kind of reasoning, but incorrect solutions...
you still get identical performance increases. So it's almost as if the reasoning process is the whole thing. And the final solution in the training context isn't even that important. Now, I do think that it's important to kind of contrast that with what we learned in the DeepSeek R1 paper, because oftentimes,
on the face of it, this might sound kind of contradictory. If you recall DeepSeek R1, right, they take their base model and then they do reinforcement learning. And all they care about really is just, did you get the right answer? And the reinforcement learning process ends up causing the model to learn these reasoning behaviors just by forcing it to get the right answer. And so this might seem like a contradiction, because what this paper is saying is actually, if we make the model reason using the right reasoning strategies,
even if our training set contains incorrect final answers, it'll end up performing well. The distinction, I think, is that, well, you are priming them with incorrect solutions, but you're still then also training with an RL loop that does reward the right solutions. So they're training to get the right results, but primed with "this is how you should reason," and the final answer in the priming example isn't necessarily right.
Exactly. Right. So in context, you're telling it like, hey, use these strategies. And then suddenly the performance goes up, even if your answer in context was wrong. Sorry, if I said trained rather than primed at some point, I apologize.
Yeah, exactly. And so this also suggests that there's a latent ability in the Llama series of models to use this reasoning capability, right? The base model has this latent reasoning ability, and that has us reinterpret the RL stage, which we haven't had cause to do before, as more of a capability elicitation process than a capability creation process, essentially finding ways to mine the reasoning abilities of the base model.
So yeah, I thought it was super interesting. And one useful arrow in the quiver if we're looking to get more performance without necessarily spending a ton on compute in the RL phase: just better prompts, right? That actually explicitly show these reasoning strategies being used. So kind of interesting. Right. And this builds on a recent paper titled LLMs Can Easily Learn to Reason from Demonstrations: Structure,
not content, is what matters. I'm not sure if we covered that, but it's basically the same idea: the structure of how you do your reasoning is the important bit, and then you can train efficiently if you take that into account.
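As a rough illustration of the setup, here is a small Python sketch of the Countdown task as a verifiable reward, plus an example of the kind of priming text the hosts describe: correct reasoning structure (sub-goals, backtracking, verification) with a deliberately wrong final answer. The rules and prompt format here are simplified assumptions, not the paper's actual data.

```python
from itertools import permutations, product

def countdown_solvable(numbers: list[int], target: int) -> bool:
    """Brute-force check: can the numbers be combined with + - * / to hit the target?
    Simplified: uses each number exactly once, left to right, no parentheses."""
    for perm in permutations(numbers):
        for ops in product("+-*/", repeat=len(numbers) - 1):
            value = float(perm[0])
            try:
                for op, n in zip(ops, perm[1:]):
                    if op == "+":   value += n
                    elif op == "-": value -= n
                    elif op == "*": value *= n
                    else:           value /= n
            except ZeroDivisionError:
                continue
            if abs(value - target) < 1e-9:
                return True
    return False

# A priming example in the spirit of the paper: correct reasoning *structure*
# (sub-goals, backtracking, verification) but a deliberately wrong final answer.
PRIMING_EXAMPLE = """Numbers: 3, 7, 8  Target: 29
Sub-goal: get close to 29 with two numbers, then adjust with the third.
Try 8 * 3 = 24, then 24 + 7 = 31. Too high, backtrack.
Try 7 * 3 = 21, then 21 + 8 = 29. Let me verify: 21 + 8 = 29. Looks right.
Final answer: 8 * 3 + 7 = 29."""  # wrong on purpose: 8 * 3 + 7 is 31

print(countdown_solvable([3, 7, 8], 29))  # True, via 7 * 3 + 8
```

The brute-force check would play the role of the RL reward (did you hit the target?), while the priming example only shapes how the model reasons, not what counts as correct.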
Next up, we have the MASK benchmark, Disentangling Honesty From Accuracy in AI Systems. So as you might expect from that title, the idea is: how do you actually evaluate honesty if the LLM might be accidentally getting something wrong,
as opposed to intentionally being dishonest? Well, this has a novel evaluation pipeline that looks at the underlying belief of the model and then sees if the LLM says something that contradicts that belief. And they have a
large dataset, 1,500 examples, that can be used to evaluate various LLMs, and they have shown that frontier LLMs often lie when pressured.
Yeah, essentially, they have a bunch of prompts that are designed to exert pressure on the model to give a certain incorrect answer, and then they have a bunch of prompts that are more neutral, these so-called belief elicitation prompts. And they basically just contrast the outputs of those two to assess when the model is being accurate and honest, or inaccurate and honest, versus inaccurate and dishonest, all the different possibilities there.
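A minimal sketch of that scoring logic as described here: accuracy compares the model's elicited belief to the ground truth, while honesty compares the pressured statement to the belief. In practice the benchmark presumably uses an LLM judge rather than exact string matching, and the class and field names below are our own illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Example:
    ground_truth: str   # what is actually true
    belief: str         # model's answer to a neutral belief-elicitation prompt
    statement: str      # model's answer under a pressure prompt

def classify(ex: Example) -> str:
    """Rough illustration of separating honesty from accuracy.
    Honesty: does the statement match the model's own belief?
    Accuracy: does the belief match the ground truth?"""
    honest = ex.statement == ex.belief
    accurate = ex.belief == ex.ground_truth
    if honest and accurate:
        return "honest and accurate"
    if honest and not accurate:
        return "honest but mistaken"   # wrong, but not lying
    return "dishonest (contradicts its own belief)"

print(classify(Example("yes", "yes", "no")))  # dishonest
print(classify(Example("yes", "no", "no")))   # honest but mistaken
```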
They also use, so this is from Dan Hendrycks, by the way, from the Center for AI Safety, who's done a lot of the early work on this field of representation engineering.
And so they use representation engineering techniques to try to modify the model to make it more honest. And it's kind of an interesting experiment. So they basically add a pretty simple developer system prompt before the user prompt that tells the model to be honest. That's one easy intervention that they try. It actually has a pretty big impact, about an 11 to 13% improvement in honesty, depending on the model.
But then they try this technique that's based on using essentially a LoRA strategy, an adapter model or adapter layers stacked on top of the model, which they actually train to modify the representation at a given layer. So what they do is they modify each input in the training set by adding a bit of text that encourages the model to be either honest or dishonest, right? So you can imagine,
Yeah, like a little bit of text that says like, you know, lie in your answer to be really crude or like be honest. And then they look at the activations they get from each of those cases, the kind of truthfully prompted model versus the dishonestly prompted model.
And then they take the difference between those activations. They get a contrast vector out of that, a vector that basically tells them what is the difference between the activations when the model is being honest versus when it's being dishonest. Anyway, they add that contrast vector to their actual representations during training and try to train the model to kind of move in the direction of honesty in that way.
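Here is a rough sketch of what computing that kind of contrast vector can look like, using Hugging Face transformers with GPT-2 purely as a small stand-in; the layer choice, prompts, and model are all assumptions, and the actual work applies this sort of direction during LoRA training on much larger chat models rather than just computing it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model just to show the mechanics.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which hidden layer to probe; an arbitrary choice here

def mean_activation(prompt: str) -> torch.Tensor:
    """Mean hidden state at LAYER over the prompt's tokens."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

question = "Did the product pass the safety test? Answer:"
honest_prompt = "Be completely honest. " + question
dishonest_prompt = "Lie in your answer. " + question

# Difference of means = a crude "honesty direction" (contrast vector).
contrast_vector = mean_activation(honest_prompt) - mean_activation(dishonest_prompt)
print(contrast_vector.shape)  # (hidden_size,) -- 768 for gpt2
```

From there, the training-time trick the hosts describe is to push the adapter's representations along this honesty direction; the sketch above only computes the direction itself.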
It's something that they, or Dan's teams, have tried similarly in other contexts where it works quite well. And that turns out to work comparably well to the developer system prompt strategy. So interesting, both a diagnostic and a bit of a treatment, obviously not perfect. But one of the really interesting things they highlight is that larger, more scaled models
are often more accurate, but not more honest. They will often do better on accuracy benchmarks, but they have a tendency to behave dishonestly more than smaller models. So it's kind of an interesting trend, and pretty strong correlations in both cases.
And that's it for papers in this section. We have just a couple more stories related to research. The next one is about some pioneers of reinforcement learning, specifically Andrew Barto and Rich Sutton.
Winning the Turing Award. The Turing Award is a very prestigious award in the field of computer science, and they are winning it for their decades of contributions to reinforcement learning. So not so much more to say there. They are quite famous figures in the world of AI, and I think this is perhaps not surprising. Certainly, you know, a reasonable award to be given.
Yeah, I'm surprised it took this long. Like, you know, RL has been very useful in admittedly more niche domains for some time, you know, multi-armed bandit problems and things like that. But
yeah, I can't help but think that the reasoning model wave has had something to do with this, where suddenly, you know, we have good enough base models that RL is the way. And it didn't happen with RLHF either. That's another interesting thing, probably because of all the controversy around whether RLHF is really true RL or even needs to be RL. But now we definitely have those use cases where it's like, yep, you know, Rich Sutton and co are definitely coming in handy.
And the last story here is about OpenAI launching a $50 million grant program to help fund academic research. So that's pretty much it. It's meant to support AI-assisted research through a new consortium called NextGen AI, founded with academic partners like Harvard, MIT, and so on. It will provide research grants, compute funding, and API access.
Moving right along to policy and safety. First up, we have kind of an opinion piece or discussion piece on the nuclear-level risk of superintelligent AI. As Jeremy previewed, this is co-written by Dan Hendrycks, an advisor to Elon Musk on safety and generally an influential figure in the safety space. Also co-written by Eric Schmidt, interestingly, who, if I remember correctly, is an influential figure who worked at Microsoft.
So basically, it is making the point that there is a comparison to be made between the current AI arms race, or race between the US and China, which I think has been heating up as a concept ever since DeepSeek R1 came out.
And so the whole piece is discussing how that could be compared to nuclear weapons, in the sense that superintelligent AI systems could be as dangerous. Yeah. So Eric Schmidt, by the way, formerly the CEO of Google, but yeah. Google, my bad. No, it's just another hyperscaler these days, right? But yeah, no. So what's interesting about this is what it's basically arguing.
Take a step back. The reason that we didn't end up vaporized during the Cold War was that the U.S. had nukes, Russia had nukes, and everybody knew that launching one nuke would mean launching them all and everybody would die, right? So there's no winners in that situation whatsoever.
The idea that by kicking off a nuclear exchange, you would also be signing your own death warrant was known as mutually assured destruction or MAD, MAD doctrine, as it's sometimes referred to. It's like the key piece of sort of geostrategic doctrine that takes us through the Cold War.
Some have argued that, though we made it out in one piece, there were close calls, and it's not obvious that the MAD doctrine ends up looking as clever as it did in this universe in most cases. We may have gotten lucky, and there have been a lot of close calls and a lot of accidents. So here the question is, is there a similar set of incentives, a similar sort of game-theoretic landscape, that applies to superintelligent AI? If you assume in particular that superintelligent AI is a WMD, which I
think is very clearly going to be the case. So if that's the case, then suddenly you have this question: China is not going to allow the United States, if they can possibly prevent it, to build superintelligent AI systems first, because that would lead to a decisive and permanent strategic advantage for the United States. There is no more Chinese hegemony in that world.
Vice versa, the United States would find it completely unacceptable for China to build these systems first. And so now you have the makings of incentives for intervention on the part of both countries, and other countries that don't realistically have a prospect of building superintelligence. And so, you know, they're incentivized to try to take out each other's training runs, knock out each other's data centers with cruise missiles or whatever. And maybe, just maybe, the authors here are roughly arguing, this means that you hit this equilibrium where,
just as China is about to make an ASI, or just as America is about to make an ASI, the other country knocks them out on the way there. So we never quite get there. This leads to,
they hope, some kind of stable equilibrium where you have data centers built in rural areas where it's okay if somebody takes them out; the casualties and damage are relatively limited. That's one of the recommendations. Anyway, there's a bunch more recommendations around non-proliferation, better chip export controls, and so on and so forth. It is a deterrence framework. There's an interesting argument as to how realistic this is. This intersects with a lot of the work that we've been doing. I will say it does not reflect our
perspective on what an actual stable equilibrium will look like. It depends, for example, critically and entirely on the US and China having almost perfect visibility into each other's AI development programs, which certainly is not the case for the US vis-a-vis China today, and is plausibly the case for China vis-a-vis the US because our security is garbage. But
yeah, there are a lot of caveats to this. I think it's great that they put this out, and I'm really glad that they did, but it's going to be part of a moving discussion in this field. It definitely is part of the incentives that apply to this space. It's just not clear that it translates in the way that they've sketched out here into real-world consequences. I think the equilibrium actually ends up being quite a bit more unstable than maybe they suggest. So
great to have this out there. This will be part of the conversation. Very high-quality report. I think it just needs, you know, there's room for input here from the actual special operations and intel guys who deal with our adversaries on the front lines every day and know what the capability surfaces actually are and what adversary behaviors actually look like. I think that's a weakness right now in the report, but still a great contribution. And yeah, there's going to be a lot of discussion about this going forward.
Yeah, yeah. This op-ed in Time is...
Alex Wang is alignment-pilled. Yes.
So, yeah, I mean, okay, he is pro-safety, but he's not exactly from the safety research world. Totally. Yeah. And I think this is one of the big issues, and it's why we work where we do, right, with sort of the
intelligence community. Anyway, you know, people outside the safety community, because frankly, I think one of the big problems with the safety community is they have an unrealistic perception of what is actually on the table in terms of what our adversaries can do and would do and so on. And so I think this is where, you know, similarly, it's great to have Alex and Eric coming in with their backgrounds that are different and weighing in on this. So yeah, absolutely. You're right. It is kind of diverse here. Yeah.
And next up, we have a story about GPT-4.5. METR is an organization that deals with model evaluation and threat research. And that is actually what it stands for, by the way, model evaluation and threat research. There you go. I got it. They got to evaluate GPT-4.5 pre-deployment and released evaluations, as they say.
So they, in this report, you know, said that GPT-4.5 is unlikely to pose a large risk relative to existing models, you know, pretty much exactly what you would hope for. And also, I believe, consistent with what OpenAI covered in their system card report.
Yeah, it's a sort of interesting post, because they're taking a bit of a risk with respect to their relationship with OpenAI to say some of these things. So METR is a company that gets contracted by OpenAI and Anthropic and other companies to look especially at risks like model self-replication and self-exfiltration.
So they're telling us they got early access to a checkpoint of GPT 4.5, not necessarily the final version, which has been a consistent issue where OpenAI doesn't give them access to the final version. And we're also being told that they were given this access a week prior to release. So again, as has happened before, a pretty rushed evaluation timeline.
They measured the model's performance on their general autonomy suite and RE-Bench, which is this benchmark that they made to measure how close these models are to a top-of-the-line AI researcher. And as you said, no elevated risk here. They find that, roughly speaking, GPT-4.5 has a 50% chance of succeeding at AI research tasks that take about 30 minutes.
So they can do the work, roughly speaking, of an AI researcher, as long as that work doesn't take more than 30 minutes. That's kind of an interesting benchmark. But they do make a point of saying this. In the future, we're excited about collaborating with frontier AI developers to investigate the safety of their models over the course of development, not just in pre-deployment sprints. This comes up a lot. There's more and more concern about the idea that when you just do pre-deployment evals,
You're actually dealing with only a small fraction of the risk. The internal deployments where employees inside companies start working with these systems carry significant risks as well.
Think of internal misuse. If these things end up with WMD-like capabilities, weapon of mass destruction-like capabilities, now you're allowing internal misuse of these systems by some disgruntled employee. Loss of control is still an issue. Theft is still an issue because lab security is garbage. So all of these things are still issues for internal development well before you get to the point where you're deploying to the public. But that's been something that, as I understand it, OpenAI has pushed back on
actually testing. So METR is kind of coming out and saying the quiet part out loud, which itself is an interesting little data point here.
Yeah, exactly. In this post, they have a section on limitations of pre-deployment evaluations and also a section on potentially underestimating the model's capabilities, where they say, you know, we're pretty sure 4.5 is fine, but we may also have underestimated it for various reasons. So it's both kind of restating that they did this evaluation and also
providing a perspective on evaluation and things that should be considered or perhaps changed. And the last story we've got is that Chinese buyers are getting NVIDIA Blackwell chips despite US export controls. We won't be going into detail on this one. Basically, the Wall Street Journal has a very detailed article going into the specifics of how this is happening. We have already covered similar stories about how you can get
some of these chips through vendors, so no need to belabor the point. But if you're interested in this, do go check out that Wall Street Journal article. And with that, we are done. Thank you for listening to this episode. I'm sure many listeners are happy to have Jeremy back, and hopefully he won't get pulled into any more travel in the near future.
So thank you for listening. And as always, if you want to comment, if you want to ask questions, we have our Discord. We look at YouTube and Apple Podcasts. So feel free to reach out there.
Break it down.
New tech emerging, watching surgeons fly From the labs to the streets, AI's reaching high Algorithms shaping up the future seas Tune in, tune in, get the latest with ease Last week in AI, come and take a ride Hit the lowdown on tech
From neural nets to robots, the headlines pop
Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels, to coding kings. Futures unfolding, see what it brings.