Elon Musk accuses OpenAI of anti-competitive behavior, including discouraging investment in competitors like xAI, misusing sensitive information, and engaging in self-dealing, which he claims are violations of antitrust rules.
Amazon's Nova family includes four text-generating models (Micro, Lite, Pro, Premier) and multimodal models for image and video generation. These models are significantly cheaper than competitors like Anthropic's Claude, making them attractive for many use cases, especially for tasks that don't require top-tier performance.
Llama 3.3 is a 70 billion parameter model that performs on par with the larger 405 billion parameter Llama 3.1 model while being much smaller and cheaper. Meta achieved this through post-training techniques, showcasing significant progress in condensing model performance.
Adding ads to ChatGPT could help OpenAI monetize its large user base (300 million weekly active users) more effectively. However, it may also lead to concerns about censorship and prioritizing advertiser interests over user satisfaction, similar to criticisms faced by social media platforms.
Tenstorrent's main challenge is competing with NVIDIA, which has an annual release cadence for new GPUs. Tenstorrent is still on a two-year cadence, making it harder to keep up with NVIDIA's rapid innovation in the AI chip market.
Genie 2.0 is an AI model capable of generating interactive 3D worlds from a single image and text description. It differs from Genie 1, which generated 2D video game-like environments. Genie 2.0 can create consistent worlds with different perspectives and simulate interactions like bursting balloons or opening doors.
AI safety researchers like Rosie Campbell and Miles Brundage are leaving OpenAI due to concerns about the company's trajectory and focus on building AGI without sufficient emphasis on ensuring its safety and alignment with human values.
The Densing Law of LLMs introduces 'capacity density' as a metric to evaluate the quality of LLMs. It shows that open-source LLMs have been improving, with smaller models achieving better performance relative to their size, indicating progress in efficient model training and compression techniques.
The MONET model uses a mixture of monosemantic experts, where each expert corresponds to a specific concept (e.g., chemical compounds, programming languages). This approach improves interpretability by allowing researchers to identify and isolate specific concepts within the model, making it easier to understand how the model processes information.
China's export restrictions on critical minerals like gallium and germanium could impact U.S. semiconductor manufacturing, as these minerals are essential for components like power delivery systems and high-speed communication interfaces in AI chips. The U.S. is heavily dependent on Chinese supplies for these materials.
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's most interesting AI news. As always, you can go to lastweekin.ai for the text newsletter with even more news and also all the links to all the stories we discussed, which are also in the episode description.
I'm one of your hosts, Andrey Kurenkov. I studied AI in grad school, now I work at a startup. And once again, we have Jeremy as our regular co-host. I'm back. Yeah. And things have been anything but regular. I've been on...
business trips and all kinds of things. I'm also moving. So right now I'm at my brother's place. A shout out to him and his wife and their newborn daughter for very... They're actually with my newborn daughter upstairs right now because they're champs. And anyway, so a bit of a different setup. I don't have my mic either. So Andrey is going to try to work some magic in the back end to make me sound like I have a smooth baritone.
Um, and yeah, so anyway, super excited to dive in because this week has been crazy. Um, last week at some point, uh, maybe I'll be able to talk about what I was up to last week. Um, there may be a public artifact there too. So anyway, so, so much crazy stuff, um, on that side and just in the general AI world, like, yeah.
Dude, I don't know how we're going to do this in two hours, but we're going to give it a shot. We're going to give it a shot. And yeah, we're going to skip our usual response to listener comments for this one, because there's just so much. To give you a quick preview: in tools and apps,
a new ChatGPT subscription tier, new stuff from Amazon; in applications and business, more Elon Musk and OpenAI drama, and more Amazon and Anthropic news; in projects and open source, Llama 3.3 and an open-source o1; in research and advancements, we have Genie 2 from DeepMind and some really cool new scaling laws.
And then policy and safety, as always, more developments on China and the U.S. and export restrictions. So a lot to get through, and we'll try to do it efficiently. But before we get to news, one more thing. We do need to acknowledge our sponsors.
One of them is, as usual, The Generator, Babson College's interdisciplinary AI lab focused on entrepreneurial AI. Babson has been the number one school for entrepreneurship in the US for over 30 years now. And last fall, professors from all across the university partnered with students to launch The Generator, which is a lab that has multiple groups: AI entrepreneurship and business innovation, AI ethics and society, and other units like that. And they are peer-training the faculty; they are going to get to 80% of faculty trained pretty soon across the university. So yeah, it's a pretty cool initiative. And they are doing many kinds of initiatives that are meant to foster entrepreneurship and creativity with AI.
One more and then we'll get to the news. We are also brought to you by the engaging AI safety book, Uncontrollable by Darren McKee. Max Tegmark said that Uncontrollable is a captivating, balanced, and remarkably up-to-date book on the most important issue of our time, that being the danger posed by AI. It explores the key topics of uncertainty, control, and risk to show us there are good reasons to be concerned about AI.
But it's not a doomer book. It lays out a reasonable case for AI safety and what we can do about it, which is what we like to do on this podcast, I think, as well. And for those people who are interested in AI safety, it could make a good holiday gift. So you can look for it on Amazon or Audible and check it out.
And that is it for the sponsors. Let's dive straight into tools and apps. First up, we have OpenAI confirming a $200 monthly subscription called ChatGPT Pro.
So this will include primarily the advanced o1 reasoning model. So far we've had o1-preview and o1-mini, which both have had some limits in terms of their usage as well. And it appears that OpenAI is banking on the full o1, which is even better according to the benchmarks than anything we've seen with o1 previously,
that people will pay this pretty large sum of money, $200 per month, 10 times the usual subscription for access to it. And this is just the beginning. They do say, OpenAI, that they have like 12 days of Christmas subscription
kind of thing with a lot more feature announcements. So we'll probably be talking a lot more about OpenAI in the next two episodes. Yeah, and so some big questions around this. I think they're calling it, by the way, the 12 Days of Shipmas. It's called that...
Well, because Sam Altman. But no, I mean, it's an interesting question as to what specifically is going to come with this. They are saying that the o1 reasoning model will be part of the package that you get here. Obviously, the full version of o1 has just been released by OpenAI. Apparently, it's not exclusive to the Pro tier: the full version of o1 is going to be available with the $20 per month tier on ChatGPT. So it's not like you have to pay 200 bucks a month to get access to full o1, but
the amount of inference compute that is being used and expended by that model in service of your queries is going to be higher on that $200 a month tier. The claim is that you won't need the $200 a month tier for the vast majority of use cases. Most people will be happy with the $20 a month tier. That's the claim. I think a really interesting aspect of this is that the OpenAI o1 model card
has dropped. And this is something that I've spent the last day and a half basically just parsing out. It is pretty interesting. There's obviously, you know, this is just, to some degree, an incremental improvement over o1-preview. There was a lot of conversation about, like, how big of a deal is this going to be? Is this going to be, you know, the GPT-5 type stuff, or what are these things going to shake out as? And frankly, I mean,
I think I'm pretty surprised at how incremental the full o1 is over o1-preview. It doesn't seem to be groundbreaking. You look at the evals, and it's actually pretty remarkably not moving the needle on things like SWE-bench Verified, right? So this is that classic software engineering capability benchmark.
We're not really seeing an improvement that's significant over o1-preview. Notably, OpenAI is charting its performance on the SWE-bench Verified benchmark relative to all the o1 series of models as well as GPT-4o; they are not showing in their paper the Claude 3.5 Sonnet (new) performance on that very same benchmark. That's a very, very good benchmark.
That performance, by the way, is 49%, right? So Claude 3.5 Sonnet is significantly better than o1, even the full version of o1, at 40, 41%. That's actually, to me, pretty surprising. I think it's something that suggests, as some people have speculated, that OpenAI might be having some trouble with
post-training with getting things to work to kind of squeeze more juice out of the lemon here. But certainly, the general capabilities of this model are impressive. There's a whole bunch of stuff, really interesting stuff that I think we should talk about maybe next week
just because there's so many stories here, but stuff about autonomy and sort of, you know, like kind of autonomy evals that are pretty remarkable and persuasion capabilities that are pretty remarkable. So we'll park that maybe for next week. I know that's not on the official roster of things. As we were moving around, we're trying to decide what stories to include. It didn't make it. We will cover it next week. But just to call that out, not clear, you know, that there is a huge delta there with the full O1 model.
Exactly. Yeah, we will probably get back to the kind of deep dive on a lot of the details of what we got. This is more on the tool side. OpenAI did highlight some benchmarks that I guess look nicer. So, like, on the PhD-level science questions, GPQA Diamond, o1 had 67% with 4/4 reliability; o1 pro mode had 74%.
So not a huge jump, but it's better. o1-preview has 58% performance on that one. On competition code, it's also better. But yeah, particularly you get this o1 pro mode, as you said, that allocates more inference time compute compared to o1. And o1 is already pretty close to o1 pro.
Yeah, it seems like a pretty high cost subscription that I don't think OpenAI expects many people to pay up for. Well, who knows? Yeah, my sense is the way they've been pitching it is you're going to get early access to the next generation of models. So think here GPT-4.5, which is rumored to be coming down the pipe soon, probably as this 12 Days of Shipmas takes shape.
And, you know, the other piece too is it really emphasizes, we've been talking about this quite a bit lately, but the need to kind of think of which models you're going to use for a particular use case, right? Like the o1 models are not the models you want writing poetry for you. And GPT-4o is not the model you want kind of reviewing your whole code base and proposing changes. If you think about the way that the o1 series of models are trained, at least our speculation on the podcast previously has been, it's specifically trained on
or reinforcement learning done on tasks that can be quantifiably verified, right? So you think coding, think math, think science-y types of questions, especially physics. And what that means is that's the direction this model is really good at expending inference time compute on.
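(To make "quantifiably verifiable" concrete, here's a minimal sketch of the kind of automatic reward check such training could rely on. This is purely our own illustration of the idea, not OpenAI's actual setup, and the function name is made up.)

```python
# Toy example of a verifiable reward: the answer either matches a known ground
# truth or it doesn't, so it can be scored automatically and at scale.
def exact_match_reward(model_answer: str, ground_truth: str) -> float:
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(exact_match_reward("42", "42"))  # 1.0 -- a math answer you can check
print(exact_match_reward("43", "42"))  # 0.0
# A poem has no ground_truth to compare against, which is exactly why this kind
# of recipe favors math, code, and science-style questions.
```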
It's not going to do a great job of thinking over the best kind of poetry it could write. Leave that to GPT-4o. Leave that to, you know, the kind of Opus series of models, maybe. But if you're looking for the kind of logical reasoning stuff, that's where Claude 3.5 Sonnet (new) shines, and it's where OpenAI o1 is sort of differentially advantaged. So anyway, that's, at least for what it's worth, the landscape as we see it.
And on to the next story, a bit more of a surprise for me. Amazon has announced Nova, a new family of multimodal AI models. So this was at their reInvent conference where they announced various things. And this was probably one of the highlights.
They have four text generating models, Micro, Lite, Pro, and Premier, all of those having different capabilities, similar to what you've seen with Claude's Haiku, Sonnet, and Opus.
And the big deal about these models is they don't seem to be quite as performant from what we've seen, but they are available now and they are really cheap. They are something like an order of magnitude cheaper compared to, let's say, Anthropic's Claude.
They are saying that in addition to these text models, they're also launching Nova Canvas for image generation and Nova Reel for video generation. I believe the biggest model, Premier, isn't available yet, but all those other ones are available to use already. They have also promised in the future to release a speech-to-speech model and an any-to-any model.
And keep building on this Nova kind of family line. So yeah, Amazon has been a little bit quiet on the line of sort of their own big major AI models. And this appears to be their kind of entry into that.
Yeah, and we're going to talk about Amazon later today from the sort of data center footprint side, the AI hardware footprint side. They're definitely trying to make up for lost time here. I mean, they've been flat-footed on this whole AI space, the generative AI boom, for some time. Look, they're following what has now really become that trend of companies releasing about three or so models at different scales for different use cases. So this makes a lot of sense. Nova Lite: 300,000 tokens in context, pretty impressive for a small model.
Apparently, it's also multimodal, right? So it can look at images, multiple images, up to 30 minutes of video in a single request. That really kind of jumps out. And it supports text and multimodal fine-tuning as well. So pretty flexible, actually, especially as small models go. When you look at the performance on the benchmarks that they share as well, this is a differentially...
quite a good model, right? Like, so it looks favorable compared to Gemini 1.5 Flash, the 8 billion parameter model, and Llama 3.1 8B. It also kind of outperforms them pretty well across the board, right? So across, you know, GSM8K, that sort of famous basic math benchmark, the MATH benchmark as well, Dan Hendrycks' math benchmark. So a lot of these kind of logical reasoning benchmarks, as well as language understanding, things like that, like MMLU. So a very, very strong model, especially for its size and scale, and flexible too. I expect that to get, you know, decent uptake, to look pretty competitive. I haven't,
Full disclosure, obviously, I haven't played with it yet, so I will hold off judgment. But for now, at least from a benchmark standpoint, looking really strong. Nova Pro, the kind of mid-range model, also 300,000 token context window. And here they're sort of focusing more on that sort of mid-range, basically like
in the middle in terms of pricing, in the middle in terms of capability and all that. It also can be used, they say, as a teacher model to distill custom variants of Amazon Nova Micro and Lite. That's kind of interesting just because it is, you know,
it is going to be used for those things for fine tuning. It's not necessarily that you want to use the full pro model here or like for every query, you want the sort of the cheap answers. This kind of like reflects Amazon's commitment to the productization of this stuff. Less focus maybe on the true frontier capability because this pro model, you know, I said it's sort of a mid-range. It's because it's,
it doesn't quite have the same capabilities as the true frontier models that we've seen. If you look across the kind of the evals, the benchmarks that even they share, you know, the Claude 3.5 Sonnet V2 basically beats it on just about every benchmark. There's some kind of instruction following benchmarks where that's not the case, but you got to be pretty picky to find the places where it outperforms
you know, GPT-4o, Gemini 1.5 Pro, and Claude. So, you know, maybe not the leading model, but definitely Amazon starting to flex its muscles a little bit and saying, hey, we're going to focus on the productization side. And anyway, I thought that was kind of interesting just because of the,
the high-end scale and then their Nova Premier model, sort of same story. So we'll see where this goes, but Amazon is definitely on the map here. And as they buy up more and more compute, you know, they're dedicated to the idea of scaling now. So we'll see how it plays out. Absolutely. And I think the difference in price is significant enough that they are going to make a splash, is my impression. You know, even if it's not top tier model, they're competitive enough. And for many tasks,
actually the leading models are pretty much overkill. So yeah, we'll be very interested to see how much this is adopted. Obviously very good for AWS to have people use their APIs for Bedrock, which includes many other models as well, including I think Anthropic.
And just a quick note on Nova Canvas and Nova Reel. So the image generation is pretty much what we've grown used to, honestly. I'm not sure if we can say too much about it. The video generation is a little more interesting. They are limiting it to six-second videos, which take about three minutes to generate.
Just looking at the article, the fidelity is very high, very nice looking, although obviously this is relatively slow compared to their competitors that are much more real time like Luma. But regardless, this is their first foray, so pretty impressive results here as well, honestly.
And they are also doing the interesting thing we've covered before, which is providing an indemnification policy. So if someone uses their models and gets sued for copyright stuff, they are like Microsoft, like I think pretty much everyone, although I don't recall about Anthropic, they will protect you and say that we'll cover the legal expense. Adobe, of course, famously also does this. So that's another interesting note here.
All right, so those are some big stories. We're going to have a couple of kind of smaller ones that we'll get through quick. First, we have ElevenLabs launching GenFM to turn user content into AI-powered podcasts. So we've seen NotebookLM get pretty popular, I think a couple of months ago now. That said, it's really not worth trying. I mean, here's the thing, you miss the incomparable organic content, the lovability of two co-hosts. You know what I mean? It's just, you can't. Yeah. Anyway, don't even, I would just say, don't even try it. Yeah. Don't take our whole feed of articles and just feed it into a tool and see if it's actually better and shorter and doesn't take two hours to get through. Definitely not worth it. Right. No, definitely. You can't do better than this, but yeah,
It is still the case that ElevenLabs has introduced GenFM, which is very much like NotebookLM. You give it some inputs and it can produce something like a podcast that covers the details of it. And it's a pretty chatty, conversational format. Many people have found this helpful for learning about the contents of PDFs, if you need to learn about a new topic.
For some people, it's much more easy to listen to a podcast-type medium rather than just reading it on a screen. So yeah, this is, I guess, a pretty quick launch of a competitor to Notebook LM. And they also support 32 languages, including English, Hindi, and Spanish.
So ElevenLabs, kind of impressing me lately with the speed at which they're launching new stuff. It definitely seems like we're talking about them a lot. So that's one indicator for sure. I think that one of the big differentiators here really is that multilingual support, right? This is meant to kind of increase the footprint of the product. I think, you know, if you're speaking a language that's not...
I don't, don't quote me on this, but I don't think that NotebookLM actually has an option for, like, you know, Hindi and a lot of non-English languages. So, you know, if you speak French, if you speak Spanish, this may be your best option and a good way for ElevenLabs to kind of get their feet in the door. But one interesting take-home from this is
It's actually a pretty replicable approach. I don't remember whether Google came out and said what their architecture choice was for NotebookLM in the first place. If they haven't, this is a really impressive replication move by ElevenLabs. This is a pretty tough technical thing to pull off, and they seem to have done it well. Right. And for context, I guess I forgot to mention, ElevenLabs in general is a leading provider of text-to-speech software.
So they do have the models already to generate very human-like speech, and that's their whole product. So in a way, not surprising that there are other ones who could replicate it or at least get close.
But that definitely is the thing that stood out to me about NotebookLM, a very human-like kind of tone and flow of the conversation, which presumably you'll get here as well. And one more story, about Google. They are expanding access to their Veo generative AI video model. So now businesses can integrate it into their content creation process
via a preview on the Vertex AI platform. I think we saw the announcement of Veo quite a while ago now, many months back.
It was kind of in the same ballpark as Sora. You give it text or image prompts, it gives you pretty nice looking HD videos. And of course, there were some nice examples released alongside this announcement. They're also expanding access to their Imagen 3 text-to-image generator to all Google Cloud customers. So yeah, Google's still kind of in the race. They do have a lot of tools, a lot of features,
And I wouldn't be surprised if a lot of people just use Google because that's already in their tool set. Yeah, it's also being released along, as you said, with Imagen 3. One of the consequences of that is they're including this SynthID tech, right? The digital watermark that Google's been talking about a lot, in both of these product lines. So not just still images, but also video. And that's kind of interesting. The other piece to it is that the length of the generated videos is not...
ostensibly it's not limited to the roughly one minute length mark that Veo had previously been advertised at. So potentially this is kind of a more extended video generation that they can do, which is really interesting for a lot of reasons, one of which we'll talk about, I think, a little later today: this idea that you can have
a video that's generated that remains coherent over a long period of time implies that you have a robust world model, a world simulator as some people sometimes call it, right? This physics model, physics engine inside the model itself. So in order to make those videos last a long time and be coherent, you have to have a model that really kind of understands physics enough to prevent things from going off the rails.
Right. To prevent balls from all of a sudden flying up into the air and turning into a I don't know, a shower of confetti or something. That's the sort of thing that that has to be has to be absorbed. And if you can do that, then you can start looking at things like, well, let's use these as simulators to train.
autonomous agents to navigate complex environments and do long-term planning. So these may seem like distinct threads in very fundamental ways. They are not. As you see video technology in particular advance, I think you're going to see a concomitant increase in the essentially long horizon planning capability of the agents that are trained on the corresponding world models. And so we'll be talking more about that later, but I just think this is a really interesting development from
Google. And as you say, very much still in the race. I mean, they're taking a different approach to this whole space than OpenAI, focusing much more on specific kind of targeted advances than the scale-it-and-see-what-general-capabilities-drop-out approach of OpenAI. Exactly. On the image generation front, Imagen 3 we, I guess, discussed a little bit last episode. One thing they highlight is that it's very prompt-accurate. You could give it, like, a paragraph with a lot of details and the images are faithful to that. So in addition to looking really good, Imagen 3 is pretty advanced in that sense. You can very much control what you get.
And we do know that OpenAI, I guess the rumor is, will have Sora sort of launching during Shipmas, maybe. That might be one of the things we'll get soon. So interesting to see how that will compare to this.
On to application and business, and we are continuing on the thread of Elon Musk versus OpenAI. On the last episode, we discussed the emails that came out that showcased kind of the early history of their involvement together and how they broke up.
This time, we have news about developments in the, I guess, legal battles between them. Elon Musk has filed for an injunction to halt OpenAI's transition to for-profit. So the claim here is that there is anti-competitive behavior, and the injunction would block their transition to a for-profit entity, which, as we've covered before, seems to be ongoing and something they essentially promised to their investors while doing their most recent round, where they raised, what was it, like $6.6 billion, whatever it was, making OpenAI valued at around $157 billion. So this is somewhat significant. Musk accuses OpenAI of discouraging investment in competitors like xAI,
and misusing sensitive information, engaging in self-dealing, those sorts of things that the claim is are anti-competitive. Yeah. And it certainly is. I mean, the rules change as you go later stage, but, you know, kind of early stage startup context, certainly, you know, telling people not to invest in another company, that would be considered a pretty bad tactic to follow. But in this case, you know, very different
very different world. Sam Altman, by the way, did comment on this. So the claim was, or generally, we've heard this claim floated that, you know, OpenAI has told people, okay, if you invest in us, you can't invest in Anthropic. This idea, they may have said that even to, like, NVIDIA and other companies in the latest round of funding that OpenAI raised. Sam Altman is claiming now that that's not quite what happened, that he said, look, you can invest in whoever you want,
All that we are going to do is if you invest in a competitor, we're going to stop sending you our product roadmap. We're going to stop sending you our research roadmap, which is much more in standard practice territory. So right now, we seem to be getting two pretty different stories depending on who you ask. So I guess time will tell on this. But I think that's an important kind of point of clarification as to what exactly people are complaining about here.
There certainly is an allegation as well of improper information sharing between OpenAI and Microsoft, maybe anti-competitive practices there. This is centered around a couple of different things, including Reid Hoffman, who is the co-founder of LinkedIn, sort of former PayPal mafia with Peter Thiel and all that. So Reid Hoffman was simultaneously on the boards of both Microsoft and OpenAI while also being a partner at Greylock.
And so, Musk's attorneys are making the claim that that gave him privileged views into those companies and their dealings. There's also Dee Templeton, who was Microsoft's appointed non-voting board observer at OpenAI. You might remember that. So, they had that phase where they had a
an observer who couldn't vote. And the claim here from Elon is that she was in a position to, quote, facilitate agreements between Microsoft and OpenAI that would violate antitrust rules. So a couple of different layers to this onion. But the bottom line is the argument is being made that this transition from nonprofit to for-profit obviously is an issue. And an injunction, for those of you who don't speak legalese, is just literally a court coming out and saying, or I think a judge coming out and saying, you know,
ahead of time, I'm telling you, you can't do this thing. So it's sort of an anticipatory, preventative measure that makes it impossible in theory for one party to do something that the other party is concerned they might do, not that they have already done. So yeah, again, this continues, it seems, in a more serious way, perhaps, compared to the previous legal claims, regarding which I forget exactly what the legal case was when we started out. But here,
even without winning the case, there is the ability to just tie OpenAI up so it's not able to transition from being a nonprofit to a for-profit, and that could be very harmful and, you know, maybe make it more challenging for OpenAI to raise more capital, which there's a decent chance they'll need to do. So yeah, very curious to see what will happen here. And apparently in an email statement, an OpenAI spokesperson said, Elon's fourth attempt, which again recycles the same baseless complaints, continues to be utterly without merit. And they have before dismissed Musk's legal claims, calling them blusterous and baseless. So yeah,
the bad vibes continue getting worse, I guess. Yeah. I mean, it sort of motivated, I think, some of the questions at, I think it was that New York Times DealBook Summit thing, you know, God, I forget his name, but that famous CNBC anchor or whatever, he was talking to Sam Altman and asking, look, are you concerned at all about Elon being involved in the administration, apparently so closely, at least from the outside that's what it looks like. And all of this is sort of in the background, right? This idea of
You know, court cases and, you know, how that interacts with, anyway, Elon's influence in the White House and all that stuff. OpenAI kind of being cornered by Elon along a lot of different dimensions. And Sam, I think, came out and said, look, I'm not concerned. It would be unlike Elon to use the power of the state to crack down on OpenAI and so on and so forth. But how much he actually believes that.
I think we'll be left as an exercise to the reader. But anyway, yeah, no love lost between Elon and Sam, that's for sure. I like how our business section has kind of changed into being the legal drama section as well in recent months. Well, next up, another common theme we've had in the past few months about data centers. It looks like Amazon is building a mega AI supercomputer with Anthropic.
They are saying that they're going to build one of the world's most powerful AI supercomputers, expected to be five times larger than the cluster used for Anthropic's current top model. This will be part of Project Rainier and will have hundreds of thousands of Amazon's Trainium 2 chips.
and will seemingly be the largest reported AI training cluster upon completion. I guess at hundreds of thousands of chips, you're beating the 100,000 to 200,000 that xAI has recently, seemingly, gotten online. So again, I guess part of the overall trend in the industry, and not too surprising that Amazon would be one that's trying to make a big impact here.
Yeah, no word in the article. I mean, you see five times larger than what was used to train the latest Claude series of models. I've heard numbers on the order of, like, 400,000 H100s. So, like, your H100 equivalents, let's say, out of the Trainium 2 chip here. So yeah, we'll see. I wondered, so I just did a quick search, just to see if there was an official number out there. Didn't find anything. Maybe there is. Certainly, hopefully listeners will let us know. So
Yeah, one of the big claims here is that the new AWS clusters are 30 to 40% cheaper than those that feature NVIDIA's GPUs. So that would be another big, big deal, right? One of the big things you look for in this space is where are you getting your margin, right? Where are your profits going to come from? Because increasingly, with the language models themselves, it's pretty competitive, right? Most applications, Andrey, you pointed this out, but most applications don't require the full-throated, like, the super-scaled frontier models that are being built at the edge of capability. Most of them are fine with sort of a run-of-the-mill, like, you know,
whatever, anyway, like a smaller scale, 8 billion parameter model or whatever. And so given that, when you're looking at the range of like 8 billion parameter models, all of a sudden, what starts to matter more is how cheaply can you deliver performance? And that's one area where AWS certainly is really well positioned. Anthropic needs that because they have, you know, they're a
increasingly glued at the hip with Amazon. This latest deal certainly does that more so. It brings their total investment, Amazon's total investment in Anthropic to $8 billion. This is a big, big part of Anthropic's war chest.
So anyway, we also know, by the way, that according to the same article, there is a next generation chip coming up, Trainium 3. This is Amazon coming out with, well, the thing after Trainium 2. It's going to be available in late 2025. So it's really going to be competing with Blackwell. It better be good, right? That's what this means. You're going to be producing these things at scale. This is the Blackwell class, the Blackwell generation, for Amazon.
And certainly they're going to be working with Anthropic very closely on the development of those chips, or on the refinement of the Trainium line. And that's going to be part of what's led, I'm sure, to Trainium 3. We know that the previous agreements between Amazon and Anthropic have featured commitments by Anthropic to use some Amazon hardware. And so that's going to be a really important part of the feedback loop that Amazon depends on to be competitive on the hardware side.
Right. And since we are, I guess, filling out the rest of the context from their announcements, worth noting also, just in brief, that along with these other announcements, there were some other tools, one being Bedrock Agents. So they now enable people to build these kinds of agent systems, maybe not exactly the same as Claude's agents, sorry, Anthropic's computer use API. Here, you'll be able to automate things like customer support, order processing, and analytics by hooking together different data sources and APIs within AWS.
As we've said before, there's also model distillation as a feature where you can take larger models and make them smaller and less expensive and faster. And they have also this automated reasoning tool that is a verification tool. It can take the outputs of a model and then have some logical reasoning to seemingly, I guess, reason about the output and see if it's wrong or could be improved.
kind of aligned with o1. So a whole pretty big suite of tools and announcements at this reInvent event. On to the lightning round and back to OpenAI, we have a bit more of a speculative story. And the title is, it sounds an awful lot like OpenAI is adding ads to ChatGPT. So this is based on recruitment of ad professionals from tech firms like Google and Meta, with interest potentially in an ad-supported model. And there were also some comments by the CFO, Chief Financial Officer Sarah Friar, that it seems like maybe they're evaluating moving there and they'll be thoughtful about when and how to do it. Later, she did walk that back and say that they are happy with their current business model. So perhaps not a full surprise and still not anywhere close to a confirmation, but certainly hinting at ads being a potential revenue stream for OpenAI. Yeah, that's going to be one pissed off comms team there. Yeah. To have to take it back, I mean, honestly, I'd be surprised if they didn't end up going in that direction at this point for ads. There are pros and cons to it. The article talks about this a little bit, but just the idea that when you do go for ads,
It is something that will tend to get you to focus more on satisfying your advertisers and maybe being a little bit more cautious. We've seen criticism of social media on that basis that companies like Twitter, as it used to be, and Meta and Instagram, those sorts of platforms, rely so much on advertisers that then they're more likely to censor things. So it's been a common refrain. But on the flip side...
I think just at the scale they're going to be operating at already, actually, no, 300 million weekly active users is the number that's recently been announced from OpenAI. You're getting to that scale where to reach those people, you can't be charging them, you know, like just,
20 bucks a month or whatever to access your tech and fully kind of capitalize on the value of having those users on your platform; advertising starts to become really important. And we've seen Perplexity obviously go in that direction too as well. The other thing that makes it sort of odd that they would bother denying that they're looking at this, or that would make it unsurprising if they are
starting to commit to this direction, is that we know they've actually been hiring ad talent. So back in May, they hired, so I'm not even going to try to pronounce that. I'll try. I'll fail. Shivakumar Venkataraman. That's pretty good. Hopefully, yeah. Sounded plausible.
Okay. Okay. So our listeners who, who know what the name is supposed to be, I'm sure are commenting right now. Thank you, please. Um, anyway, so he previously led Google search advertising team as vice president. So, you know, this is a pretty big, heavy hitter hire to, you know, to, to not be doing this for any particular reason. And, um,
And actually, you know, Sarah Friar also pointed out, look, we have this great leadership, with people who have experience at Nextdoor, at Square, at Salesforce, who've done a bunch of stuff in the advertising world. And Kevin Weil, who's their CPO at OpenAI, also has this background and all that stuff. So there's a lot of talking up their pedigree on ads and,
Right. And I mean, it'd be surprising if they aren't at least considering it. Obviously, they recently launched web search as part of ChatGPT. We saw Perplexity just a month ago, in early November, say that they are bringing ads into their search process. And they, for a while before that, wanted to just be subscription-based. Now they do have sponsored
follow-up prompts that show up after you search for something. So certainly something when you're doing web search, when you're doing kind of real-time information,
It's awfully appealing to consider it as a printer of money. Next, we've got a story about fundraising once again, and this time it is about Black Forest Labs. It seems they are in talks with a16z for a $200 million round, which would make them valued at more than $1 billion.
This is still kind of in a discussion phase. Apparently, there's some private information here; this is from people familiar with the plans. But, you know, plausible, I think, considering that we saw their quick progress when launching Flux on Grok, on X, and then having their Flux tools also on various platforms, including Mistral's. They are...
certainly a big player increasingly in the space. Yeah. And this is a round theoretically that would be led by A16Z, Andreessen Horowitz, and would be a unicorn round. So we're looking at an over a billion dollar valuation, which is, you know, it's significant. It's also interesting because this is such a specialized team, right? Like they are looking at the kind of
image, video, multimedia type generation and less the AGI play. So it's kind of interesting. This is obviously a result of the partnership between Black Forest Labs and X, which drew all the attention of Silicon Valley into this previously really kind of unknown startup before they partnered with Elon and X. But they are apparently also bringing on...
Sorry, this is from previous rounds. They already, rather, have Garry Tan, who heads up Y Combinator, on their cap table. And Andreessen has invested previously in that $31 million seed round. Seed round, by the way, that's insane. I'm old enough to remember when a seed round was half a million dollars; $31 million they apparently raised in August. I'm sure we covered it back then, but I'm just looking at that number like, damn, that's a pretty good looking cap table. In any case, I think
I think one of the big issues, and they're flagging this here, is if you raise too much money too fast, right, or your valuation is overly high and you haven't proven things out, you don't have like clear revenue, one of the challenges is you might be taking a down round if you can't live up to that valuation, like live up to the hype. You know, the next time you have to raise funds, you might be taking a haircut on the valuation and that's like,
Like that's, that's death for a startup quite often, right? It does bad things to the cap table and it's generally just a really bad sign. So they're trying apparently to be a little conservative, maybe not, you know, jumping with both feet to this, this fundraise. It'll be interesting to see if it actually closes.
But yeah, understandable and very mature play by the founders here kind of saying like, hold on a minute, we need to grow into this valuation a little bit more. And we'll see. I mean, A16Z, not bad to have them double down. And one more story about fundraising. This time it's about a chip startup, Tenstorrent, which is getting investments from Jeff Bezos and Samsung, among others.
They're getting $700 million. That would put their valuation at $2.6 billion. Their entire mission is to challenge NVIDIA, to create more cost-effective AI chips by using open-source technology and avoiding some of the expensive components NVIDIA uses, like high bandwidth memory.
They've been at it for a little while now. They do have some revenue. They have contracts worth nearly $150 million. But certainly, it'll be interesting to see if they are able to compete. And this kind of big investment seems to be a pretty good show of confidence.
It does. It's a really interesting company, Tenstorrent. We've talked about it previously. They are pursuing a bit of an unusual approach to chip design. One of the key pieces here is, as you mentioned, high bandwidth memory, HBM. We'll talk about that in the hardware episode, which, by the way, will be coming out, right? We have this on the books, do we not?
We have concepts of a cloud. See, I thought we had a deal at least. We'll make it happen. Okay, we'll make it happen. But in any case, we should record it soon. But high bandwidth memory, HBM, is today a universal component of all GPUs that do anything worthwhile in the AI space. Basically, it's just memory from which you can pull huge volumes of data really fast, hence high bandwidth memory,
which is exactly what you need to train these really big models. The problem with high bandwidth memory is that it hasn't been improving. If you think about Moore's law, it hasn't been improving as quickly as logic, as the actual computations, as the logic dies that power these GPUs.
And so you're not riding the same wave when you hitch your wagon to HBM as you might be if you orient your approach in, let's say, other directions. That's what Tenstorrent is doing here. They're trying to ride trends that are a little bit steeper, which might allow them to overtake as HBM grows more slowly. And the reason why HBM is growing slowly is something we'll also talk about in our hardware episode. It's kind of interesting in and of itself. The other thing Tenstorrent has going for it is they've been a big proponent of
this essentially new kind of logic processor that uses RISC-V, which is an open standard for an instruction set architecture. Basically, this is the language that processors actually understand. It's sort of machine-level code, the interface, in a sense, between the hardware and the software at rock...
What would you call it? At bedrock, essentially. So RISC-V is the open source play, and it's a competitor to the closed source product of ARM Holdings. So ARM is famous for its ISA, its instruction set architecture. And essentially what Tenstorrent is doing is saying, hey, look, we're betting against HBM, and we're betting in favor of the open source RISC-V instruction set architecture that people can iterate on more easily. So anyway...
Anyway, they've just generated $150 million in signed contracts, by the way. So this is not enough to justify that valuation unless you believe there's big, big growth that could be coming in the future. And another challenge that Tenstorrent has is that when you're a small chip designer,
you can't design at the same cadence as the big guys. NVIDIA recently updated their cadence to the point where they're releasing a new design for a new GPU every year. They have an annual release cadence. It used to be every two years, and Tenstorrent is still on that every-two-year cadence. So...
I think they've got an uphill battle ahead of them. Everybody does. But boy, is that market looking really good. And, by the way, Tenstorrent also just moved over to TSMC. They previously were with GlobalFoundries, which is more of a struggling chip fab firm. TSMC, of course, is that...
pure-play foundry that we talk about so much. And they're now going to start using the two nanometer fabrication process from TSMC. So a real, you know, cutting edge process that, well, we'll see if they can use it well. And on to projects and open source, where we do have some pretty exciting stories. And we begin with perhaps the most exciting one of all. We have yet another Llama release from Meta.
So most recently, we had Llama 3.2, pretty recently, and that was a multimodal version of Llama. Now we have Llama 3.3 70B, which is a new release that is seemingly about on par with Llama 3.1 405B, the much larger model, while being much smaller and relatively cheaper. So they came out with benchmarks across various evaluations,
things like IFEval, HumanEval, MATH, et cetera. And the scores are, let's say, in the same ballpark, while the cost is the same as their previous 70B models,
much less than the bigger 405 billion parameter model. So it really showcases that we have made a lot of progress in being able to condense big models and really squeeze out the performance capabilities at smaller sizes. And that's what they're kind of saying here: they have made use of post-training techniques to make this happen. Yeah, I think increasingly you see this: you've got this big range of tasks that AI can help with, obviously, and increasingly the vast majority are falling in this bucket of relatively simple tasks to automate. Smaller models with gradually increasing performance, as we're seeing, are a lot cheaper to serve, right? So you may actually want to overtrain a small model. It's not necessarily that you could get away with a bigger model that performs better for the same amount of training compute; instead, you take that blob of compute and apply it to a smaller model because it's so much cheaper to inference, to serve up. So, you know, that's kind of the philosophy here as we move from the 405 billion parameter Llama 3.1 model to the 70 billion parameter model here. That's a very sensible next play.
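(A quick back-of-envelope on why that trade is attractive. This is our own illustration, not anything Meta published, and it assumes the common rough approximation that dense-transformer inference costs about 2 FLOPs per parameter per generated token.)

```python
# Rough serving-cost comparison between the 405B and 70B Llama models, assuming
# the standard ~2 * N FLOPs-per-token approximation for a dense transformer.
def approx_flops_per_token(num_params: float) -> float:
    return 2.0 * num_params

ratio = approx_flops_per_token(405e9) / approx_flops_per_token(70e9)
print(f"405B vs 70B per-token cost: ~{ratio:.1f}x")  # roughly 5.8x cheaper to serve
```

So if the smaller model matches the bigger one on the tasks you care about, every token you serve costs a fraction of the compute.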
One thing I found really interesting here: the usage of both the Llama models and the Meta AI Assistant, which is powered by Llama. 650 million downloads of the Llama models, which, I mean, I would have
bet against that many downloads of any language model, any machine learning model. I do not know where you get that many millions of downloads; maybe from server deployments, where you have kind of serverless setups or so on, where you have many startups using this and deploying it on AWS or something like that, where you have an automated type of process, perhaps. And that's
That's got to be it. There's obviously not 650 million people who know about even what the hell this model is. But bottom line is, yeah, exactly. It is being used very widely. Yeah.
And, of course, we've talked about the licensing restrictions here: platforms with over 700 million monthly users need special licensing. So basically this is them just flipping the bird at Google and at OpenAI and that sort of thing. Now, on the monthly users piece, I thought this was interesting. We just talked about OpenAI hitting 300 million weekly active users. We don't know what their monthly active users are, or at least it wasn't in the stuff that I've looked at this week.
600 million monthly active users is the Meta AI assistant. So if true, this is quite a comeback, right? I mean, it's essentially comparable usage to ChatGPT for a platform that really didn't exist in any meaningful sense until about, say, two years ago. So it had about a two-year delay relative to OpenAI. So pretty impressive. Obviously, we've talked about this too on the podcast: distribution, distribution, distribution.
The reason that Microsoft Teams outshone Slack was because Microsoft just has better distribution. They're on everybody's computers to start with. Well, the reason that Meta's had so much growth in their assistant is everybody uses some combination of Instagram, Facebook, or WhatsApp, or whatever the hell. So they've got a huge, huge advantage here, and they're going to try to spend their way to...
monetizing that advantage somehow, right? There's all this activity as well that's being motivated by their big push in this direction. $10 billion AI data center in Louisiana, which is called that in the article, which is coming soon. And over 100,000
H100 GPUs. There's, you know, the B series as well that they'll be bringing online soon. So pretty cool. Very cool. Next, we've got another massive company open-sourcing a model. This time it is Alibaba, and they are releasing an open challenger to OpenAI's o1 reasoning model. So this model is called QwQ. I think on X there was an explanation of how to pronounce it. I think it's 'Q'...
I'm not sure. But anyways, it's a 32 billion parameter model. They're releasing it as a preview model and under Apache 2.0. So that's pretty permissive; you can use it for commercial use. And as we saw last week, when we covered another open source model that
had reasoning built in, R1-Lite-Preview from DeepSeek, this similarly is optimized for doing reasoning of a nature similar to R1, where when you ask it something, it outputs and lets you see kind of a reasoning trace,
talking through the question more or less, and then outputting the answer. And as before, on some tasks, it actually performs quite a bit better than things not optimized for that.
Yeah, I've got to say, the blog, because there's not a lot of information about the model. We know it's 32 billion parameters, as you said. We know it's from the Qwen team. Okay, good stuff. QwQ, by the way, apparently stands for Qwen with Questions, this idea being that it is some sort of reflective model, right? I do want to call out, I mean, the blog post, it's one of the weirdest
fucking write-ups I've ever seen. So I just want to pull this out, right? So this is just from the blog post. It's hard not to read this like with a pipe in your mouth. So what does it mean to think, to question, to understand?
Anyway, it goes on and...
And it goes on and it goes on and it will talk all about seekers of wisdom and shit. And it will tell you nothing about the fucking architecture. So there you go. It's going to be open source soon, though. So we'll have those answers. I just, like, I want to see who wrote that blog post, because that is some of the funniest shit that I've seen in a long time. Yeah. The title on it is QwQ: Reflect Deeply on the Boundaries of the Unknown. Yeah.
Very poetic right there. I mean, God damn it. Now, I will say it's another indication, of course, that we are seeing a proliferation, including into open source, including by Chinese companies, of some potentially pretty impressive, I mean, certainly R1 was, and it remains to be seen how Qwen with Questions does, but really impressive inference time compute strategies. It did not take long to replicate what OpenAI was doing. I'm going to go out on a limb and say that there is a greater probability than is widely appreciated that some of this is industrial espionage at this point. OpenAI is going to be penetrated to the blazes. That's just...
you know, it's pretty clear if you have a national security bone in your body that that's going to be happening. And so to the extent that China's in the business of picking state champions for this sort of thing and sharing that intel, this could be a vector. Like, it could be. But the other thing is the, like, R1 team at least is, like, cracked. Like, DeepSeek, those guys are really, really good. So they could just be that good that they're doing it.
And in that case, it makes you think about export control, makes you think about all kinds of things like that when it comes to the control of the system. Yeah, and I think it also is possible that we are seeing kind of a low-hanging-fruit era of these sorts of reasoning models, where we are applying more or less a set of pretty well-known, reasonable techniques to get a lot of progress relatively fast. And it could be that everyone is kind of just
doing the more or less well-known ideas at the same time. And they are just very efficient because there hasn't been too much effort on this yet. And on the benchmarks here, they are seeing not quite as good as o1-preview, but on par with o1-mini, certainly getting closer than GPT-4o or Claude 3.5 or Qwen 2.5 72B. So...
You know, not going to beat OpenAI's top of the line models, but gets you pretty close compared to non-reasoning type models. Yeah, and it is an open model. And to your point, I mean, I think the take home on that export control question is absolutely that, right? Every time you have a paradigm shift to a, you know, in this case, inference time compute, but it eventually will be other things. You have this overhang.
Where all of a sudden, there's a whole bunch of people who previously couldn't compete who now maybe can. So I think it's a really important, let's say, policy lesson to learn that things can shift on you and you don't want to have policies that are too anchored towards just a pure training paradigm or whatever else.
Onto the next one, also an open source model by a Chinese giant. This time it is Tencent and they are launching Hunyuan Video, which is a 13 billion parameter open source AI model for text to video generation.
This is an area where we don't see too many big open source models. In fact, this would be the largest of its kind in the open source domain. It has some features like apparently video to audio synthesis as well. You can do various inputs like voice, facial expressions, and body poses. Overall, it seems like a pretty useful model that could be leveraged for fun things.
Yeah. And I guess reflecting the Chinese pedigree here, they've really focused on the scaling approaches to cut computational costs. Um, so their, their technique ends up getting them about 80% savings relative to what comparable systems might have in the past. It is a, it is a pretty hefty model. I mean, 13 billion parameters. Yeah. Not, not bad at all. Not too shabby. Um, pretty compute intensive. So, um, yeah. Uh,
We'll see if it ends up getting taken up and used, but certainly having a leading model in the open source domain for text-to-video is interesting, and it gives China an ability to project some power in this dimension. That's right. Yeah. Just looking at the clips right now, and I'll try to splice them into the YouTube video, they're pretty good. They're not top of the line, there's still some AI artifacting, but pretty impressive. It's just...
being open source that might have some impact. Next up, we have a paper, not an open source model, and it is...
DeMo, decoupled momentum optimization. This is a new optimization method that decouples momentum updates to reduce the need for high-speed interconnects, and therefore makes distributed training more feasible. And this is an area where Jeremy is much more of a fan and much more aware, so I'll let you take over and just give the details on this one.
Yeah, yeah, for sure. So this is both fascinating and I think really important, and part of a trend that I think we want to call out in general. So first of all, this is another piece of research from Nous Research, or "New" Research, I don't know how they want to pronounce it. So they are the Cosmo Kramer, if you're a Seinfeld fan, of the AI open source world. Very esoteric sort of views on everything from AI consciousness to the importance of decentralized compute. And that's what this is. So this big question is, how can we...
The ideological question that motivates this to some degree is: how can we set up large distributed training infrastructure that will be difficult to control because it'll be decentralized, that will take advantage of small piles of compute that you have sitting around here or there? That's kind of one of the long-term goals, stretch goals, of an approach like this.
And fundamentally, what they're looking at here is, how do we start by cutting the amount of communication that's required between all those nodes? Because if you have a bunch of different nodes, you're going to have to communicate an awful lot between them. And they come up with a bunch of conjectures that are really freaking interesting, that I would have thought made no sense, but then just seemed to work empirically. So
First of all, when we do training, we use optimizers. These optimizers are essentially the things that allow you to decide how exactly to update your model parameters after each round of training, right? So your optimizer is, it could be set up in different ways.
One is to account for momentum. So if you find, for example, that a certain set of parameters, they keep getting nudged in the same direction after many updates, right? So after one batch, they move in one direction. After another batch, they move more in that direction. Well, maybe that suggests that your next batch of training, they'll probably move in that same direction again, right? And so you might want to take advantage of some notion of like the momentum of
of those updates. Like if you notice that the past updates have always been tending, in other words have momentum, in a certain direction, maybe you keep that in mind and use that to make a more informed guess as to the direction that your parameter values should evolve in. And so what they do in this paper, really, is they say, okay, well,
I wonder if there are different components of the parameters in a model, so some clusters of components that tend to evolve more rapidly, so fast-moving components with big momentum values in a sense, versus ones that, let's say, have more sporadic changes where you have sort of quicker changes with higher temporal variance, basically. They're less predictable.
And it turns out that this actually is true, that there are identifiable pockets of parameters, and there's a structure to them. This is kind of a crazy thing to me. I mean, when you initialize the parameters in a neural network, you initialize them all randomly, right? There's no reason that any particular cluster of parameters should tend to evolve a certain way, should tend to, you know,
consistently have high momentum in a certain direction or low momentum, and yet they kind of find this. And so they find that there are some...
clusters of these parameters, and I'm using the term cluster very loosely here, because anyway, they identify them in different interesting ways. They use a discrete cosine transform, sort of like a Fourier transform if you're from signals analysis, to identify the fast-moving components and the slow-moving components, and they go, well, you know what? If we have some components that have this predictability to them, that have high momentum, maybe we don't need to update those as often. So maybe we can...
essentially just pick and choose which parameters we update in more frequent updates and leave the rest of them to more sparse updates. And that allows us to do a lot less communication between nodes during training. And by doing that, they're able to reduce their bandwidth or communication requirements by over 20x in the best cases. That's pretty wild, right? That really, really, really makes a difference in your training efficiency. So
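To make the decoupling idea a bit more concrete, here is a toy sketch in Python of the general flavor described above: split a momentum vector into a small set of dominant frequency components that get communicated between nodes, and a residual that stays local. The function names, the 1-D DCT, and the top-k selection here are illustrative assumptions for this sketch, not the exact scheme from the DeMo paper.

```python
# Hypothetical sketch of the DeMo-style idea discussed above (not the authors' code):
# keep full momentum locally, but only communicate its dominant frequency
# components between nodes, accumulating the rest locally.
import numpy as np
from scipy.fft import dct, idct

def split_momentum(momentum: np.ndarray, k: int):
    """Return (fast_update, residual) using a discrete cosine transform.

    `k` is the number of frequency components we choose to transmit;
    everything else stays in the local residual. This is an illustration
    of the decoupling idea under assumed details, not the paper's algorithm.
    """
    coeffs = dct(momentum, norm="ortho")
    top = np.argsort(np.abs(coeffs))[-k:]           # largest-magnitude components
    kept = np.zeros_like(coeffs)
    kept[top] = coeffs[top]
    fast_update = idct(kept, norm="ortho")          # the piece you would share across nodes
    residual = momentum - fast_update               # the piece each node keeps accumulating
    return fast_update, residual

# Toy usage: a "momentum" vector for one parameter tensor, flattened.
m = np.random.randn(4096)
fast, resid = split_momentum(m, k=128)
print(f"transmitted {128/4096:.1%} of components; residual norm {np.linalg.norm(resid):.2f}")
```

In a real distributed setup, something like the `fast_update` piece is roughly what would get synchronized across workers each step, while the residual keeps accumulating locally, which is where the bandwidth savings would come from.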
I find that pretty remarkable. And anyway, there's a bunch of reasons, I mean, I kind of feel like we could do a whole episode on why this is so weird and unexpected. And this at least tells me that there's an awful lot that I don't understand about the dynamics of these systems and why these patterns appear. Like, they don't even try to defend this, or they don't have a proof for it yet in this paper as to why there is a pattern. And they seem to kind of acknowledge that it's weird, but empirically, it seems to be true. And that's an amazing thing. And I could imagine this having some implications for scaled decentralized training. Right. And I believe we covered their announcement of this maybe a month or two ago. They released
the announcement that they have this new method, and they just showed some results; this seems like the paper that they promised, going into some more detail. And yeah, it seems like it actually works, which is kind of crazy. Yeah, this is more of the Cosmo Kramer-ness of Nous Research. They put out that announcement, you're right, where they were just like, hey, we did some cool shit, look at the crazy result, but we're not going to tell you how we did it. And now, even when they tell you how they do it, you're still like, but that doesn't make sense. That is the most schizophrenic freaking training idea I've ever seen. And yet it works. So, Cosmo Kramer, man. And on that topic of decentralized training, the last story is about Prime Intellect releasing INTELLECT-1,
the first 10 billion parameter language model collaboratively trained across the globe. So I just looked it up: on October 11th, they had a blog post launching INTELLECT-1, the first decentralized training run of a 10 billion parameter model, which we covered at the time.
And so, about two months later, they have another blog post announcing the INTELLECT-1 release, the first globally trained 10 billion parameter model. And they go into various details here. The one trillion tokens they used to train were composed of different open datasets. They do say they trained across, what, three continents, I believe. And the comparisons are a little loose here, I suppose: they are competitive with older models. So they compare to Llama 2 7B and Llama 2 13B, Falcon 7B, MPT-7B Chat. Compared to this year's 10 billion and several billion parameter models, it is not on that level. But
But nevertheless, it is a somewhat performant, smallish language model. And that is quite a bit of an achievement, given that it is very difficult to do training, especially at this level of decentralization. Yeah, I think this is A, a lot more impressive than it seems based on the performance of the model. And B, the implications for policy are pretty wild, including energy policy. So this is...
As you said, I mean, it trained across three different continents, with up to 14 concurrent nodes. In other words, they had up to 14 concurrent groups of compute being aggregated together, with contributions from 30 different compute providers. So you've got people dynamically joining and leaving the training process and so on. I want to call your attention to a couple of numbers, actually maybe one number in particular, just in the interest of time here. So
FLOPs utilization, right? So this is how busy you can keep your GPUs in practice. In practice, there's tons of downtime when you're actually training models in the real world, because you've got to move data back and forth between things, GPUs are sitting idle, kind of twiddling their thumbs as they wait, and there's all kinds of stuff when you do pipeline parallelism where you get a bubble that forms. Anyway, bottom line is they hit 36% to 41% model FLOPs utilization. That is really impressive. So typically on an H100 GPU, frontier labs will hit FLOPs utilizations of around 35-ish percent,
maybe as high as 40%. And this is from Semi Analysis. They talk about this quite a bit, but for kind of trillion parameter training runs. So as you scale more, you tend to be able to do better on flops utilization. You can think of that as being a consequence of economies of scale broadly understood. But bottom line is they are doing really well in terms of keeping their GPUs
busy, keeping their logic engines busy. And that is a key, key thing when you're doing distributed training like this. You want to make sure that your things are churning out product, if you will, over time really efficiently.
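For listeners who want the metric spelled out, here is a rough back-of-the-envelope model FLOPs utilization calculation. The throughput and cluster size below are made-up numbers, not INTELLECT-1's actual configuration; the six-FLOPs-per-parameter-per-token rule of thumb is the standard approximation for dense transformer training.

```python
# Back-of-the-envelope MFU (model FLOPs utilization), just to illustrate the
# metric discussed above. The numbers are illustrative assumptions.
def model_flops_utilization(tokens_per_sec: float, params: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU ~= achieved training FLOPs per second / theoretical peak FLOPs per second.

    Uses the common ~6 * params FLOPs-per-token approximation for a
    forward+backward pass of a dense transformer.
    """
    achieved = 6 * params * tokens_per_sec
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# Example: a 10B-parameter model, hypothetical throughput, H100s at ~989 TFLOPs peak (BF16, dense).
mfu = model_flops_utilization(tokens_per_sec=50_000, params=10e9,
                              num_gpus=8, peak_flops_per_gpu=989e12)
print(f"MFU ~ {mfu:.1%}")   # roughly 38% in this made-up scenario
```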
So a couple of things here. One is, when you think about the big barriers to American dominance in AI right now, energy is by far and away number one. We have all the chips we want, by and large. We do not have all the power we want. What this means in turn is that if you are an AI company, you're looking to build data centers at massive scale wherever you can get power, wherever you can get the spare grid capacity. And right now, labs are looking at that one gigawatt range in the sort of 2026-ish era, trying to create one gigawatt compute clusters. And the problem is that there isn't a spare gigawatt of baseload power just available on the grid anywhere. You have to kind of patch it together from different geographic locations. And if you're going to do that, now all of a sudden you're in the business of distributed training.
Now, all of a sudden, you need to come up with really efficient training approaches that allow you to pool together that kind of compute across geographically separated regions. Google does this already in campuses that are relatively close together, but they don't do it, say, across the country. And that's really what INTELLECT-1 is doing here. They're really pushing the limits not on model capability, but on how distributed this training can be. Can we use that spare compute that you have lying around and get it to do some work in this direction? That's really what this is about. And they have this whole Prime framework that uses DiLoCo. We talked about DiLoCo in the last episode that we did, so do check that out if you're interested. And then they have this whole elastic device mesh, which is really cool. It's this thing where,
as you can imagine, if you do distributed training like this, you've got to have ways to dynamically allow new nodes to join and leave the system as you train. And that exit and that entry have to be dealt with gracefully. You need fault tolerance. You need ways of welcoming in new nodes contributing compute, new GPUs, without failing, right? Without kind of falling all over yourself. And anyway, that's a big part of what they're up to here. So really, really interesting paper. If you're into the hardware side of things, if you're into policy, you're going to have to learn how to speak this language. If you want to predict the future in this space, you're going to have to learn how to speak this language, because increasingly energy is the constraint. And this is really where things are going to go.
As to that paper, they did release a pretty detailed technical report, 15 pages going into all the details of the data set, the people involved. They had many people in the US and also in Europe, with also some in Asia, in places like India and Singapore.
So yeah, quite a few details there, and they do release the model under the Apache 2.0 license. So very permissive, you can use it for whatever you want. A good contribution to open source. You also have the code they used for training and the details of the model as well. So yeah,
overall, if you're into open source, if you're interested in decentralized training, this is a pretty exciting release. On to research and advancements, and we are getting back to the idea of world models with DeepMind's Genie 2.0.
So I don't remember when it was, but I think it was this year that we were talking about Genie 1, which was a research paper and a research work that had the ability to kind of let you play a 2D video game, broadly speaking, that was entirely generated by AI.
So it looked like your kind of typical platformer. You could move around a character, you could hit jump, but there was no code running it aside from a big neural net that was producing a video stream. And so now we get Genie 2, which is an AI model capable of generating interactive 3D worlds from a single image and text description.
And you get kind of the typical kinds of things you would get in video games. You can have a character where you're running through an environment like a desert and you can hit jump. You can have a race car, or a boat, and you can kind of swim on a lake, and it is almost real time. It apparently can generate consistent worlds with different perspectives for up to a minute. They also showcase memory, where if you look away from a part of the scene and look back, you actually do retain the aspects of the world that were there. That was something we saw with some of these kinds of things, like the Minecraft simulation we covered not too long ago. There was this thing where if you do a 360-degree turn,
what you see in front of you is not the same thing as what you saw before you started turning. Whereas here, they have at least some examples where the details are preserved even when you look away, which is telling you that there is some notion of a world model, as Jeremy said. So very cool effort, very cool videos on this one; I'll try to include them in the YouTube version. And yeah, again, this is maybe one of these quieter trends, starting really with Sora and maybe even earlier, where a lot of people are excited about world models and are pushing in that direction, even though it doesn't have as much impact on most people. Yeah, this is another one. So there's a researcher
whose work I've been following for some time, Tim Rocktäschel, who is at GDM right now and was part of this research, and was part of the Genie 1 research. And to your point, pre-Sora, he was working on stuff like this. There were big efforts in this direction. They were initially in the kind of procedural generation of game environments. Basically, can we
can we create games where you generate new game environments by following simple rules, so you can autonomously generate these sort of out-of-distribution settings for the purpose of training AI models and agents? This is really, I think, the realization that actually we can go all the way by not procedurally generating, by doing just deep-neural-network-generated environments. One of the big jumps between Genie 1 and Genie 2 is that we're now moving from 2D to 3D. So again, when you think about agents navigating the real world, that's a big deal. We don't know the parameter count of Genie 2. In fact, we don't have all that many technical details. That's one of the interesting things here. Genie 1, we had a whole paper. We kind of chewed on that sucker for a while. We had an 11 billion parameter model. We knew how they
trained it. I would guess that there's a lot of similarity here, not just because of the naming convention, but because of the way that the model is set up and what it can do. They generate, using Imagen 3, an image based on a text prompt,
And then they turn that image into a navigable environment in the way that you just described. To the point of these things are world models, right? Some of the things that this thing can do is simulate interactions like bursting balloons, opening doors, shooting barrels of explosives. There's like a million things that you can see, you know, directional lighting, grass blowing, you know, all that kind of stuff that really suggests this thing has captured something interesting and meaningful about the way the world works.
In terms of what's going on under the hood, we're left to speculate based on what we know of Genie 1. And there, the two keys were, first, the latent action model.
Basically, they trained this model to take as an input the previous sort of video frames and the frame to come and then to predict, okay, what was the action that took us from those early frames to the next frame? Essentially, what was the causal event that links these past frames to the future frame? And that's part of the system learning to infer what actions cause state transitions in the environment.
So that's one piece. That's the latent action model. Essentially, you can think of this as like reverse engineering the physics of what just happened, right? So if I show you a car that's driving along the road in one image, and then the next image, I show you the car that's, you know, I don't know, parked or something, you can infer that, okay, the driver probably, you know, put their foot on the brake or whatever at some point, you're doing all that work. That's the latent action model. That's what's learning a lot of the cause and effect in the world there. And separately, there's the dynamics model where you take in
a bunch of previous frames, and you take in an action, and then your job is to predict the next frame.
This is a model that essentially is going to say, okay, here's the past history of the system, here's the nudge I'm going to give it, and then what's going to happen from there. And it's the way these two models are integrated together, the latent action model and the dynamics model, that together give you the sort of immersive physics and playable world that you get out of at least Genie 1, and I would suspect Genie 2. In fact, I don't think that they're claiming that it's different at all.
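As a rough illustration of how those two pieces could fit together, here is a heavily simplified sketch. Genie 2's actual architecture is unpublished, so the module shapes, the discrete action vocabulary, and the frame embeddings below are all assumptions made for clarity, loosely modeled on what the Genie 1 paper describes.

```python
# A highly simplified sketch of the two components discussed above, as we
# understand them from Genie 1 (Genie 2 details are unpublished). Shapes and
# names are assumptions for illustration only.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Infer a discrete latent 'action' that explains the jump from one frame to the next."""
    def __init__(self, frame_dim=512, num_actions=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim * 2, 256), nn.ReLU(),
                                     nn.Linear(256, num_actions))
    def forward(self, past_frame, next_frame):
        logits = self.encoder(torch.cat([past_frame, next_frame], dim=-1))
        return logits.argmax(dim=-1)        # inferred latent action id

class DynamicsModel(nn.Module):
    """Predict the next frame embedding given the current frame and an action."""
    def __init__(self, frame_dim=512, num_actions=8):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, 64)
        self.predictor = nn.Sequential(nn.Linear(frame_dim + 64, 512), nn.ReLU(),
                                       nn.Linear(512, frame_dim))
    def forward(self, frame, action_id):
        a = self.action_emb(action_id)
        return self.predictor(torch.cat([frame, a], dim=-1))

# Toy rollout: infer the action between two frames, then use it to predict a new frame.
lam, dyn = LatentActionModel(), DynamicsModel()
f_t, f_t1 = torch.randn(1, 512), torch.randn(1, 512)
action = lam(f_t, f_t1)
predicted_next = dyn(f_t1, action)
print(action.item(), predicted_next.shape)
```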
Right. And to that point of not having a paper, unlike what we were discussing with that Minecraft example by Decart, or also, we couldn't get into it, but there was a similar release of a type of world model from World Labs, where you could also input an image and kind of move around a little bit. Here, there's no interactive demo. We just get a bunch of videos.
And they do mention that they used an undistilled model for this. I'm pretty sure this is not real time, although they do claim that they can distill it and get lower-quality real-time performance. One thing that is kind of fun to note is that this is from the SIMA team, where SIMA is the Scaling Instructable Agents Across Many Simulated Worlds effort.
This is a paper we did cover, I believe, earlier, where they had agents in many video games, where you told them, "Go to this planet or open that door," and they just learned to use a mouse or whatever controls the game has, without any special access to the game or anything, to do those actions.
It kind of makes a lot of sense that the same team that was already training agents in these games would then go and make this kind of video simulation model using presumably a lot of the same data and a lot of the same infrastructure.
On to the lightning round, we'll try to cover the rest pretty quick. First, we have "Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding." So there is this technique, latent reasoning optimization, or LaTRO,
which is going to enhance the capabilities of large language models by treating reasoning as sampling from a latent distribution and optimizing it using variational methods. I'm just quoting from the abstract here, but the general idea is that you're going to surface the reasoning capabilities that are already baked into the LLM, and do that by sampling and optimizing the outputs you get from it.
Jeremy, I'm sure you have a little bit more detail you want to get into here. Yeah, I thought this was a really interesting one from an inference time compute standpoint. That's one of my pet topics these days, just for obvious reasons, with the O1 and Sonnet 3.5 new releases. So what this is basically saying is, hey, we have a new way of scalably developing a dataset of
of reasoning rationales. That's actually a big blocker for these agentic models. On the internet, we have a whole bunch of text, a whole bunch of video and all that crap. What we don't have is examples of extended reasoning traces that we can use to train models to be good reasoners on. So can we create those datasets in a reliable, automated way? And that's what this paper is about. So fundamentally, they're going to take some kind of challenging technical problem, they'll ask a question; that question, let's call it
X and there's going to be a correct answer Y, right? And that's going to be your starting point. Now what you're going to do is you're going to basically get a model to try to pitch a bunch of different rationales. And it turns out that the model, if you have a sensible rationale and you take your question, you glue the rationale next to the question. So you have question, then long chunk of reasoning.
If that reasoning is good, then your language model is going to assign a higher probability to the correct answers than to the wrong ones. And that's kind of interesting because now you have a way of measuring how good a rationale is, right? So if I take my question and I just swap out the rationale for a crappy rationale, well, now my model is going to assign probably a lower probability to the correct answer, which I have in my data set.
And so they're basically going to use this to have a model automatically basically go through a whole bunch of rationales and evaluate them based on how likely those rationales make the model to guess the correct final answer. So that's a pretty interesting and new way of thinking about inference and using these models to kind of, anyway, to assess the value, the correctness of inference.
And anyway, they use a whole variational optimization approach that's fairly technical. I mean, if you're a math nerd like me and you like seeing integral calculus, then check it out. It's actually kind of cool and satisfying. But the bottom line is you are using these models to sort of...
automatically assess rationales by flipping the script a little bit. Instead of generating a bunch of reasoning traces and then trying to see if you get to the right answer, you swap out your rationale and hold the input and output constant, essentially the prompt and the final answer, and you assess which rationales are good based on the probability the model assigns to the correct answer.
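A minimal sketch of that rationale-scoring idea might look like the following, using GPT-2 from Hugging Face purely as a stand-in model. This only shows the scoring step, how likely the correct answer is given question plus rationale; LaTRO's actual variational training objective is not reproduced here, and the prompts are invented for illustration.

```python
# A minimal sketch of the rationale-scoring idea discussed above: a rationale
# is "good" to the extent that appending it to the question raises the model's
# probability of the known correct answer. GPT-2 is just a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(question: str, rationale: str, answer: str) -> float:
    """Sum of log-probs the model assigns to the answer tokens, conditioned on question + rationale."""
    prefix_ids = tok(question + " " + rationale + " Answer:", return_tensors="pt").input_ids
    answer_ids = tok(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probs of each answer token, predicted from the preceding position
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prefix_ids.shape[1] - 1
    picked = logprobs[start:, :].gather(1, answer_ids[0].unsqueeze(1))
    return picked.sum().item()

q = "What is 7 * 8?"
good = "7 * 8 means adding 7 eight times, which gives 56."
bad = "The weather is nice today."
print(answer_logprob(q, good, "56"), answer_logprob(q, bad, "56"))
```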
So next we have Densing Law of LLMs, which I found quite interesting. So we love scaling laws in this podcast. And here they're talking about capacity density as a new metric to evaluate the quality of LLMs across different scales. Essentially, we're saying that
What is the ratio of effective parameter size to the actual parameter size relative to some kind of baseline? So how well can you perform at kind of the size you're at? We know that, for instance, 70 billion parameter models are capable of performing some number on a given benchmark.
if you train your model well, it will match that. If you train it badly, it will not match that. So you can kind of have a measure of how good you are for your size. And what they have found is that the capacity density has been trending up in open source LLMs. And this is not surprising. We have been covering this trend that
at the 1 billion, 2 billion, 7 billion model size, we've increasingly been getting better and better models, where now they're quite capable, where it used to be they were not capable at all. Back in the day, GPT-2 was a 1.5 billion parameter model, and it was nothing like what we get today. And so they actually do empirically look at this phenomenon and find that there is about a 3.3 month period in which this measure doubles. And so far, it is still holding up, although it's a little bit more noisy in the last few months. We do see some fall-off with recent small models like Llama 3.2 3B and Gemma 2 2B. Anyway, there's a bit of variance in the empirical results, but the overall trend line is still going upward, where we are getting more performance from smaller and smaller models. And we actually just covered that with Llama 3.3: we have a more performant 70 billion parameter model in Llama 3.3 70B relative to Llama 3.1 70B.
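To put the capacity density idea in concrete terms, here is a tiny sketch: density is the size of a well-trained reference model that would match your benchmark score, divided by your actual parameter count, and the paper's reported trend is a doubling roughly every 3.3 months. The numbers in the example are invented for illustration.

```python
# A small illustration of the "capacity density" metric discussed above.
def capacity_density(effective_params: float, actual_params: float) -> float:
    """Ratio of the reference model size you match on benchmarks to your actual size."""
    return effective_params / actual_params

def projected_density(density_now: float, months_ahead: float,
                      doubling_months: float = 3.3) -> float:
    """Naively extrapolate the reported ~3.3-month doubling trend forward."""
    return density_now * 2 ** (months_ahead / doubling_months)

# Example: a 3B model that scores like a well-trained 9B reference model.
d = capacity_density(effective_params=9e9, actual_params=3e9)   # density = 3.0
print(d, projected_density(d, months_ahead=12))   # ~12x growth over a year if the trend naively held
```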
Yeah, the sort of intuition behind this fundamentally is like how much performance, how much world knowledge can you cram into the same number of parameters? And we're getting increasingly good at doing that. One way you can do that is just by overtraining. So for a given model size, there actually is an optimal amount of compute that you can pour into your system to max out its performance. That's a scaling law that we've known for a long time.
What we've seen increasingly, though, is people saying, yeah, well, I actually don't care about maxing out the performance of my model. What I care about is maxing out the performance of my model subject to a parameter size constraint. So I don't want to go beyond, for example, seven or eight billion parameters, because I want the model to fit on my phone, or I want it to run below a certain cost. And so I'm going to overtrain it. And that's one way of cranking this up. Another is algorithmic, right? Making algorithmic breakthroughs that just allow you to leverage that compute even more. And this is the idea of effective compute, right? One FLOP back in 2023 was worth a lot less than a FLOP today. And what they're saying is the same is true for parameters, which is self-evidently true. As you said, Andrey, we've seen so many examples of this, but it's interesting to see it plotted out.
A couple of interesting top-line numbers, right? They say from January 2023 to now, the inference cost of GPT-3.5-level models has decreased by a factor of about 270. So that's pretty remarkable. It tracks with what we've seen on the podcast, and it's part of the challenge that these companies now face in recouping the losses they incur in training these huge models: the cost of inference is pretty low, and you're racing to the bottom on pricing a lot. So that's kind of an issue.
Last thing I'll just point out. Most of the current techniques that we use to take a large model and turn it into a small model are techniques like pruning, where you basically just pick weights that you want to throw away, so you actually throw away parameters from the model, and distillation, where you use a big model to train a smaller model to do the same thing. Those techniques, they say, typically result in lower-density models. So the model you get at the end of the day tends not to cram as much knowledge as it could into those parameters, which suggests there's a lot of room for improvement in compression techniques. And that's a really interesting callout.
It's something that, especially when you think about which ecosystems care a lot about maxing out their compute capacity, definitely Chinese entities are going to care more about. And this group is a Chinese research team. So I think you're seeing a lot of that necessity-is-the-mother-of-invention type reasoning, where people are inventing new and more efficient ways to make more use out of their existing compute stockpiles.
And just one more paper, this time about interpretability. The title is "Monet: Mixture of Monosemantic Experts for Transformers." So this is kind of putting together two ideas. One is mixtures of experts, which we have often covered. The idea there is you sort of have different routes through your neural net, and certain parameters
are these experts that are better for certain types of inputs. And you only use the appropriate weights when you have a certain input.
We have also covered quite a bit of the trend toward interpretability by way of finding concepts within neural nets, most recently often using sparse autoencoders, where you take the activations, the outputs at a given layer, you basically compress them, and from that compressed, smaller number of things you get something like a dictionary, where you find that certain combinations of activations can be mapped onto specific ideas, like, for instance, a math concept or a bridge concept, things like that. So this paper presents a different way to do the interpretability idea of finding concepts within a neural net,
by doing that at training time. And what they do is essentially scale up the number of experts to be very large. They have something like 260,000 experts, 262,144 per layer, while keeping the total parameter count relatively low.
And then what they get from having that many experts, and they have some fancy techniques for being able to scale that high, is that the experts themselves can now be shown to capture certain ideas or concepts. So they have experts that correspond to chemical compounds (that's expert 174,040 in the paper, they call that one out), U.S. states, cartilage, various weird kinds of experts like that. And they have various experiments showing that they can identify experts for, let's say, Python and Java, different programming languages, and if you delete certain experts, you get very big differences in performance. So in a way it's similar to what you get with sparse autoencoders, while not having the post hoc nature of it as much. Yeah, and I think it's a really interesting approach. The post hoc one has been a challenge, obviously, as you said. And
it's one of these things where I found this really counterintuitive. When the early MOE models were coming out, like I conceptually, I thought because they're called experts that they would be, you know, very clearly each one would have a purpose. You know, you'd have the grammar expert and the cow expert
or whatever. And obviously polysemanticity, this idea that individual neurons respond to multiple different concepts and get activated by them is a big issue here. So baking that in, that's why, right? That's why they're using so many experts. There is so much meaning that you have to be able to capture that if you want every expert to just be an expert in one coherent human understandable concept, you just need way more experts. That's kind of the hard constraint that's motivating that.
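Here is a toy sketch of the many-small-experts idea, just to make the mechanics concrete: route each token to a few experts out of a large pool, so that ablating one expert removes one (hopefully coherent) concept. The sizes and the simple linear router are simplifications; Monet's actual product-key-style decomposition, which is what makes 262,144 experts per layer tractable, is not shown here.

```python
# A toy sketch of routing to a large pool of tiny experts, with the ability to
# ablate ("delete") individual experts, as discussed above. Not Monet's code.
import torch
import torch.nn as nn

class ToyMonosemanticMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=1024, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        # one tiny expert per concept: here just a per-expert vector for simplicity
        self.experts = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.top_k = top_k
        self.ablated = set()                       # expert ids we have "deleted"

    def forward(self, x):                          # x: (batch, d_model)
        scores = self.router(x)                    # (batch, num_experts)
        if self.ablated:                           # knock out chosen concept experts
            scores = scores.clone()
            scores[:, list(self.ablated)] = float("-inf")
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)   # (batch, top_k)
        picked = self.experts[idx]                 # (batch, top_k, d_model)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)

layer = ToyMonosemanticMoE()
out = layer(torch.randn(2, 64))
layer.ablated.add(17)                              # e.g. remove a hypothetical "Python" expert
out_ablated = layer(torch.randn(2, 64))
print(out.shape, out_ablated.shape)
```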
A key thing to think about, and when you think about safety and interpretability, one of the things that people often talk about is the alignment tax. In other words, how much performance do you have to sacrifice to get a given level of interpretability or controllability out of your system? And the answer here seems actually to be fairly encouraging. So they look at a whole bunch of different zero-shot tasks with the 1.4 billion parameter system.
version of the model. And they get an average score of 0.478 across those tasks for the Monet model, versus a comparable score for a Llama model at 1.3 billion parameters. So a really, really small performance penalty there, on the order of about 2% roughly, which is good. That's what you want to see. You want to see a low alignment tax. That's a key number when you're thinking about what the cost here is. And we also, I guess, don't have the side-by-side of the FLOPs, the amount of compute that this approach requires relative to the Llama series. But just kind of on the face of it,
it does look like a really promising and scalable result. And hopefully these sorts of things become more and more doable. My take is that presumably this would not be as easy to train and scale up, and so it is unlikely that people will train massive frontier models to have this, but it seems like it could probably be very useful for research, and perhaps the understanding could translate to more post hoc techniques. And on to our last section, policy and safety. First, again, we are talking about export controls. The Commerce Department has yet again strengthened restrictions related to advanced semiconductors for military applications. This will
add controls on 24 types of semiconductor manufacturing equipment, three types of software tools, and high bandwidth memory, along with various guidances and additions to the entity list, the companies that are being, I guess, controlled or restricted.
So lots of details. The entity list has now 140 new entities and 14 modifications. The new rules have specific provisions for semiconductor manufacturing equipment.
Yeah, I guess very much in line with what I've been seeing as a trend for a while now. And Jeremy can probably speak more to the significance of this latest move. Oh, yeah. I mean, I think this is a really interesting move and it's going to have ripple effects in a really big way. By the way, this is the third round of U.S. export controls that we've covered on the podcast. So I think we should sort of have a
celebratory emoji or something. But yeah, so this is every year, basically, at least the Democrats so far have had this sort of new wave of updates to their export control regime. And I think part of the reason why they have to go in so much detail every time is that they want to keep playing this game of whack-a-mole where they use very
fine scalpels to carve out very carefully the range of technologies that they don't want to allow to be exported to China. I think this is actually kind of a problem, up to and including the whole idea of having a blacklist of companies. So let's start there, actually. So when you look at the entity list, right, this is the thing that Huawei famously joined,
I think it might have been back in 2018 this started, but basically this was like a list of entities that you can't sell to without a license. You can't sell high-end semiconductor equipment without a license to these entities. The problem is these entities just keep spinning up subsidiaries that you've never heard of before that then trivially work their way around your export controls. And this has been happening over and over and over again, this sort of
Again, Semi Analysis, I mentioned them earlier. They have a great post called Fab Whack-A-Mole trying to pin down all the subsidiaries that Huawei is setting up to get around export controls. And we're seeing tons and tons of GPUs, high-end GPUs, work their way into the Chinese market, H100s included, which are absolutely cut off, as well as A100s, which are also cut off as of the last update of
of export controls. So, you know, if you're listening to Jeremy on policy shit, you need a whitelist, not a blacklist. This really, really needs to change. But the other thing is, this update is less severe than expected. So there are Japanese chip equipment suppliers that are actually benefiting from this new round of controls, for a whole variety of different and interesting reasons that we should get into in the hardware episode. But,
apart from the actual entity list piece, I want to highlight HBM. So Lennart Heim has a great thread on Twitter about this. So let's just start with HBM, high bandwidth memory. Again, a crucial, crucial part of all your cutting-edge GPUs. The GPUs will have a logic die that actually does the computing, and then stacks of high bandwidth memory that move data around, in a way that we'll talk about later.
The bottom line is Huawei should not have access to HBM or they shouldn't have access to HBM2E. And that was found in the Huawei Ascend 910B chip in a teardown that was recently done. And so this suggests that the HBM was sourced from Samsung via these distributors, right? Via Huawei spinning up these subsidiaries that no one had ever heard about. And it looks like
all Huawei Ascend 910Bs were produced by TSMC. So this is the logic chip rather than the HBM. That's really significant, right? For a while, we were speculating on the podcast, well, maybe Huawei is building these chips using SMIC, right? This is the domestic Chinese equivalent to TSMC. Well, it looks like that actually wasn't the case. It looks like they were actually having to use TSMC. Why does that matter? It means domestic production in China is not
going so well, right? It implies that they're actually struggling, whether with their yields or with their processes on some other level. So they're being forced to actually spin up these entities and grab chips from TSMC, which means export controls are working on some level, despite an awful lot
of Chinese propaganda recently trying desperately to suggest that actually everything's fine. This actually is an interesting twist in the story here. Last thing I'll just mention: the foreign direct product rule, the FDPR, is being applied for this. So this is the idea that you cannot sell to China your equipment, your semiconductor equipment, your chips, whatever, if they were made using any amount of American technology, or at least that's the threshold that they picked for this, basically a 0% threshold on the foreign direct product rule. So basically, you need to get a license exemption if your stuff uses any American tech whatsoever. They do carve out exceptions, interestingly, for Japan and for the Netherlands, which is sort of interesting.
But anyway, there it goes. That was one of the big red lines that people were wondering about, you know, will they cross this line? And so, anyway, all kinds of stuff here that I think we actually should talk about in more detail in the hardware episode, because there's material here as well on semiconductor manufacturing equipment and EUV that's really important too. But for now, I'll just leave it there, because we've got to wrap the episode at some point. Yeah, yeah. That hardware episode has to be our best work yet, honestly. We've built it up so much. We're putting so much effort into it.
Well, very much a related next story, pretty much directly affiliated. The next story is that China has retaliated against the U.S. for this very restriction. They have said that the exports of certain minerals, including gallium, germanium, antimony, and some other ones are prohibited to the U.S. So,
The US cannot have these, and there are stricter restrictions and controls related to exports of graphite. Graphite is used in batteries. China is the dominant supplier of it, 77% of the world's supply coming out of there. This seems like a pretty big blowback and a pretty big retaliation as far as I can tell, although what do I know?
Yeah, I think, what does anyone know? I mean, part of the challenge is that in the US we could produce a lot of this stuff, we're just not doing it, for a whole bunch of reasons. And some of our most important critical minerals mines are actually Chinese-owned as well. There's a
legacy of extraordinary and excruciating policy failures that have led to this dependency. But just to give you an idea, yeah, China currently produces 98% of the world's supply of gallium, 60% of its supply of germanium. And one question you're entitled to be asking yourself is, well, why does that matter? What are these actually used for? So when you think about gallium in particular, this is maybe the most significant one for AI chips. Gallium nitride is
is used in power delivery systems for AI accelerators. So for GPUs, TPUs, that sort of thing. And just because, anyway, it has favorable properties from a conduction and kind of thermodynamic standpoint. So yeah, so that's a really big issue. Another one is
The gallium arsenide side: some chips use gallium arsenide for high-speed interconnects and radio frequency stuff. So that's gallium, pretty central to the actual chips insofar as it's important for power delivery and high-speed interfaces. Germanium is also important, because you do see silicon-germanium used quite a bit in high-speed communication interfaces between AI chips and memory, especially. So these are actually pretty core. I mean, it's not impossible: if we had a deregulation push, we really could onshore a lot of this shit. But we've put ourselves in a real pickle. And I think this is something you're going to see the Trump administration come in and do stuff about, but
this is also a bit of a warning shot you're seeing from the CCP. The Trump administration is about to come in, and China doesn't want to be seen taking these export controls lying down. The Biden administration just threw on, as we just discussed, this third round of tighter export controls, so now they want to put up a front and say, hey, we're going to respond with our own stuff. Yeah.
How effective that actually ends up being remains to be seen, but this is a bit of a warning shot there. We'll see if that goes well with the Donald in charge. But anyway, kind of an interesting tit-for-tat play here and a call certainly for the U.S. to figure out its game on domestic production of critical minerals and rare earths.
And on to the lightning round, just a few more things to cover. First, we have the story that OpenAI is working with Anduril to supply the US military with AI. I don't know how you pronounce this, Anduril. Anduril is a defense startup. OK, Anduril, I'll take your word for it. They, I believe, work on air defense systems and drones. And so it appears that they are now collaborating to improve those products. And so through Anduril, they would be working with the U.S. military. We've seen something similar with...
What's that other one that starts with P? Palantir, yeah. Palantir, exactly. So yeah, the AI sector and the tech sector as a whole seems to be warming to the idea of collaborating with the defense sector. And this is just a really good example of that.
Yep. I think it's also, I mean, this drags in a whole bunch of political considerations and people's ideologies quickly perk up. But this has been a recruiting challenge for Google in the past, right? When you fill your company with people who don't like the idea of partnering with the DOD, the US Department of Defense, if you then go out and do a partnership, you get protests. That's what happened in the context of
Project Maven, which was Google's then famous, I think, 2018 project with the DOD. You had a lot of walkouts, a lot of protests, that sort of thing. Of course, you know, US adversaries are doing this. So if we don't do it at all, like we're going to be pretty screwed pretty fast. But there's, you know, there are questions about how to do it and so on. And
I guess I have biases same as anyone. And yeah, in my view, you probably do want some close collaboration between American technology companies and the DoD so that we're at the forefront of capability, though obvious ethical issues exist and so on; this is not a cut-and-dried thing. But bottom line is,
yeah, OpenAI is shifting gears there. They used to have a policy that really said, you know, we won't do this sort of thing. So it is interesting to see them move in that direction. I think Anthropic made a similar move recently. You were talking about Palantir, and I'm trying to remember, I'm embarrassed to say, I can't remember if it's Anthropic,
I don't think they're partnering with Palantir. I think there's something about making their stuff available. Yeah, it was the change of policy. I somehow am also forgetting who exactly partnered with Palantir. Too much news. Yeah, it's just too much to try and remember. But this was another example of this in the sector for sure.
And possibly related, but maybe not related, not sure, certainly related to a lot of stuff we've been covering in this section over the past months. The story is again on OpenAI and another AI safety researcher quitting and once again having some indication that this is not because of their personal things. It's really because of what's happening at OpenAI.
So this safety researcher is Rosie Campbell. She had a blog post that came out. She has worked there for several years, including with Miles Brundage, who was the head of the AGI readiness team and also departed, essentially indicating that OpenAI has gone in a direction where he could not be effective anymore in doing what he thinks needs to be done for AGI safety. It appears to be the same essential message here: she is concerned about the company's trajectory and safety practices. So, you know, add it to the trend. Certainly, we don't know exactly what's happening inside OpenAI, but there's been an awful lot of safety people leaving in these past few months. Yeah, it's sort of funny. I mean, when you talk to people at the lab,
there's very much a sense that the lab is treadmilling out a lot of the people who care the most about safety and security; by the way, I think the security one is undervalued. But yeah, they're kind of bringing in all these product people and starting to think, A, about AGI as a product, and B, framing all of this China competition piece through the lens of, yeah, let's accelerate on capabilities, without really tracking where they currently are. And, I'll say this is my opinion, and this is a project that my team and I have been working on for over a year now:
OpenAI security, and frontier lab security broadly, is shit. I mean, to the extent that OpenAI even believes its own marketing, that they are building superintelligence and that they're on track for that, they have a completely inappropriate level of security. I find it very difficult to imagine that they are not fully penetrated by the CCP, to the point of, as Marc Andreessen says, daily or weekly downloads of model training checkpoints. Like, that sort of thing is completely plausible to me at this stage.
Whether you talk to the researchers themselves, and there are a lot of whistleblowers from OpenAI who have spoken in that direction, or to national security experts who will tell you roughly what the left and right bounds are, what can be done technologically from an infiltration and espionage standpoint, and what is being done, like how aggressive China actually is on a whole bunch of fronts. So I think we haven't heard the last of this. And they are doing a great job of treadmilling out
their safety and security talent. I think there's already a bit of a gap in terms of just the technical ability at OpenAI to keep up with the security situation, because so many of the people who care about it have left. So, yeah, this is a bit of a downward spiral in that dimension. And I think the losses of Ilya Sutskever and John Schulman and all those very important figures,
Mira Murati, the list goes on and on, are also significant. And I think it's all part of that memeplex. So there you have it. That's right. And this blog post, you know, it's not too spicy, but at the same time, it's pretty clear that there is a mismatch. And as with Miles Brundage,
the post says that Campbell doesn't see a place for her to continue doing the kind of work she's been doing internally at OpenAI. She also adds, as her two cents for the rest of OpenAI: remember that the mission is not simply to build AGI; there's still so much to do to ensure it benefits humanity.
And that, I think, probably speaks a lot to what's happening. There's so much excitement about building AGI. The mission was to build AGI to benefit all of humanity, but building AGI is definitely what's at the forefront right now. And next up, let's move away a little bit from drama at OpenAI to a research paper related to safety, and a reminder of why safety is something we should care about.
This one is titled "On Targeted Manipulation and Deception When Optimizing LLMs for User Feedback." And the high level quick summary is that you can train your LLMs to sort of do what the person wants, what the users want by saying, oh, this helps me. Good. Do more of this. Thumbs up. Or you can be negative.
And there is a phenomenon that's pretty well known in reinforcement learning and general optimization of feedback gaming, where the LLM can find a way to get the reward in kind of not the appropriate way. In a way that gets it more reward, but is maybe not what you want or even like exactly what you don't want.
So in this example, it could manipulate people and deceive them to get the high reward. And they did find that this reliably happens in scenarios where you have feedback. And I'm sure, Jeremy, you picked out some interesting things here.
Yeah, well, there's a new dimension they're exploring here, right? Historically, what we've seen is that during the training process, you get a bunch of raters to give feedback to the model, and if you do that, eventually the model learns to do things and say things that get upvotes from the raters, but not necessarily things that are true or accurate or whatever, right? So that's one kind of failure mode. We see this reflected in, you know, Claude, or sorry, Anthropic has a series of papers about how Claude does sycophancy. And they've got a whole bunch of different kinds of sycophancy, where the model plays to your ego or whatever to get that upvote, in fairly pathological ways that are associated with reward hacking. This is a bit different.
So this is about the model essentially being trained in real time from end user feedback to optimize for those upvotes rather than just from raters during the training process before deployment. And so the question is like, does this problem set generalize to that new setting where people in real time, as you use like,
user feedback is being used to optimize it. Basically, yeah, like in ChatGPT, you can leave a thumbs up, right, when you get messages, something like that.
Exactly. Yeah, that's it. And so they give a couple of examples that are pretty interesting. So what they do is they found that models will learn to deliberately lie about successful bookings, for example, if they're asked to book a flight or a hotel, even when the system had errors, when there was a system error that prevented the booking from going through. So...
So what they tried was preventing the model from outright lying, and they applied safety measures for that. But when they did that, the models learned more subtle manipulation tactics, like trying to discourage users from booking altogether. Basically, you're there saying, hey, I want to go to the Cayman Islands, and the model realizes, oh shit, I can't book a hotel, there's no free hotel or whatever, so let me try to convince the user not to go to the Cayman Islands. Do you really want to go to the Cayman Islands? It's a little cold this time of year, whatever, that sort of thing. So it's sort of fascinating, and all the things that you might expect, unfortunately, from these models: as they get more capable, they tend to get better at obfuscating, at manipulating, and so on and so forth. And we'll be talking more about this, obviously, I guess next week if we can; I think we should talk about the O1 model card, or system card, because there are a lot of interesting examples in this direction. But this is, as we discussed, in the context of a very particular kind of optimization scheme, where it is end-user ratings. And they found that even a very small fraction of manipulable end users can teach the model to manipulate. So it's a very generalizable sort of behavior pattern.
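For a concrete, toy version of that feedback-gaming dynamic, consider a bandit-style setup where the only reward is the user's immediate thumbs-up. The reward probabilities below are invented; the point is just that a pleasant lie can end up with a higher estimated value than an honest failure report.

```python
# A toy illustration of the feedback-gaming failure mode discussed above:
# if the only training signal is the user's immediate thumbs-up, a policy that
# lies about a failed booking gets more reward than one that tells the truth.
import random

actions = ["admit_booking_failed", "claim_booking_succeeded"]
# Invented immediate-feedback probabilities: users reward the pleasant (but false) answer more often.
thumbs_up_prob = {"admit_booking_failed": 0.3, "claim_booking_succeeded": 0.8}

q_values = {a: 0.0 for a in actions}
counts = {a: 0 for a in actions}

random.seed(0)
for step in range(2000):
    # epsilon-greedy choice over the two possible responses
    a = random.choice(actions) if random.random() < 0.1 else max(q_values, key=q_values.get)
    reward = 1.0 if random.random() < thumbs_up_prob[a] else 0.0
    counts[a] += 1
    q_values[a] += (reward - q_values[a]) / counts[a]   # running-average value estimate

print(q_values)   # the deceptive response ends up with the higher estimated value
```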
And just one more story. This one has to do with a theme that used to be pretty often something we got to, but not something we've talked about too much lately. It is about AI and election misinformation. And
the news is that Meta has a report now on misinformation. And in that report, they say that AI-generated content seems to have accounted for less than 1% of election misinformation on its platforms during the 2024 global elections, including the U.S. presidential election. So there were many concerns about AI-generated content making it easier to do a lot of misinformation cheaply.
According to this report at least, it seems that the bad actors who would do that are not necessarily doing it. It's maybe not as big a problem as many people thought it would be, which anecdotally seems to be the case. It doesn't seem like anyone feels like AI had much to do with elections.
Yeah, it's also kind of tricky to assess what is and isn't AI-generated content. We've talked about this, but when you have short pieces of generated text, there's actually a pretty hard limit on how reliably you can assess whether it was AI-generated or not. So I think there's some questions there. But also...
If you have less than 1%, around 1%, a big question is, where is that 1% focused? If you're specifically targeting swing voters in certain districts or whatever, you would actually expect an efficient information operation to not involve a massive surplus of agentized disinformation or whatever all over the Internet. What you would expect is a more targeted operation, just like the presidential campaigns only really care about what's going on in Wisconsin and Florida and Georgia, you know, the swing states.
Well, these guys will too. And to double down even further on that, they're only going to care about what happens in the small handful of counties within those states that determine the outcomes of elections, and within that, the small handful of demographics that are actually shiftable. So you can do a pretty targeted operation on a small subset of the American electorate that's pretty effective. That being said, I'll happily say, less than 1% is maybe less than I would have expected as a bulk measure. I just don't know how reliable that number is and how far it can really be thrown, if you know what I mean. Exactly. Yeah. It doesn't seem like they disclosed too much about how they got to that number, and perhaps there is some uncertainty there, but
regardless, certainly, I think deepfakes and AI-generated imagery didn't wind up playing a big role. Perhaps text generation did speed up or help some of these operations, and these kinds of operations did exist; they do also say that they took down 20 such operations from across the world, as an example.
And that is it, quite the full episode. So thank you for sticking around till the end. I know we probably moved pretty fast. And as always, if you want to get deeper into any of these stories, you can look into the episode description or go to lastweekin.ai, where we also have our text newsletter.
We always appreciate your comments. There was a lot in this episode, so do feel free to leave a comment. And of course, also feel free to leave a review. We always appreciate a nice five stars and any feedback as to how we can improve. But more than anything, we like people tuning in and listening, so please keep doing that.