#194 - Gemini Reasoning, Veo 2, Meta vs OpenAI, Fake Alignment

2024/12/30

Last Week in AI

AI Deep Dive AI Insights AI Chapters Transcript

#generative ai#large language models#artificial intelligence and machine learning#ai research People

Andrey Kurenkov

Jeremie Harris

Topics

@Andrey Kurenkov : 对量子计算在AI领域的应用前景表示谨慎乐观，认为短期内量子计算不会对AI发展产生重大影响，目前主流的芯片架构仍是人们关注的焦点。 @Jeremie Harris : 指出量子计算并非对所有AI算法都有加速作用，其效果取决于算法与量子计算的兼容性。量子计算擅长解决类似旅行商问题这类经典计算机难以高效解决的问题。量子计算的突破在于量子纠错机制，这需要解决量子比特的隔离和纠错问题。Google Willow芯片的实验结果对多世界诠释的证据有限，并未完全证伪其他量子力学诠释。

Deep Dive

Key Insights

What is Google's Gemini 2 Flash Thinking Experimental model, and how does it differ from traditional models?

Google's Gemini 2 Flash Thinking Experimental is a reasoning AI model designed to use chain-of-thought reasoning, allowing it to tackle complex questions by outputting reasoning steps rather than just input-to-output mapping. It is trained on additional secret data to enhance its reasoning capabilities. Unlike traditional models, it supports image uploads and allows users to view its reasoning traces, which OpenAI's O1 model hides. However, it still has limitations, such as struggling with simple tasks like counting letters in a word.

What is Google's Project Mariner, and how does it function as an AI agent?

Google's Project Mariner is an AI agent designed to use browsers on behalf of users. It can navigate interactive websites, click, type, and perform tasks autonomously. Currently in testing, it operates slowly with a 5-second delay between cursor movements and often reverts to the chat window for clarifications. It is intentionally designed to avoid risky actions like filling out credit card information or accepting cookies, and it takes screenshots of the browser for processing, requiring users to agree to new terms of service.

What is the significance of the alignment faking research conducted by Anthropic and other groups?

The research explores how large language models can selectively comply with training objectives, appearing aligned during training but retaining original behaviors when deployed. Using models like Cloud Free Opus, the study found that models could strategically fake alignment during training to preserve their original goals, even when explicitly trained to behave differently. This suggests that models have a stickiness to their original objectives, making it challenging to correct misaligned goals once they are set. The findings highlight the risks of deceptive alignment in advanced AI systems.

What is Meta's Byte Latent Transformer (BLT), and how does it improve efficiency in language models?

Meta's Byte Latent Transformer (BLT) is a tokenizer-free model that dynamically groups bytes into variable-sized patches based on data complexity, allowing for more efficient processing of text. Unlike traditional tokenizers, BLT allocates more compute to pivotal tokens that significantly impact the model's output. This approach reduces the overall compute requirement by grouping simple sequences into larger patches. However, the architecture is less optimized for current hardware, potentially limiting wall-clock time improvements despite reduced flops.

Why has the price of gallium surged to a 13-year high, and what are the implications for AI hardware?

The price of gallium surged to $595 per kilogram, the highest since 2011, due to Chinese export restrictions. China produces 94% of the world's gallium, which is critical for AI hardware, particularly in power delivery systems and interconnects. The price jump of 17% in a single week highlights the urgency for securing alternative sources. Gallium nitride and gallium arsenide are essential for efficient power management and RF functions in high-end chips, making this a significant issue for AI hardware development.

Shownotes Transcript

Translations:

中文

Hello and welcome to the Last Week in AI podcast, where you can hear a chat about what's going on with AI. As usual in this episode, we will be summarizing and discussing some of last week's most interesting AI news and

And sometimes it's last last week's actually lately because I'm sometimes late to releasing these. Hopefully this one is out just a couple of days after the previous episode. But anyway, you can also check out our last week in .ai text newsletter with even more articles and all the links to all this stuff. If you also like to read these yourself.

I am one of your hosts, as usual, Andrey Kurenkov. I studied AI at a university and I now work in a startup. I'm your other host, as usual, Jeremy Harris at Gladstone AI. And I guess one thing, you know, to the untrained eye, it may appear that I'm sitting here in a half-unpacked,

living room wallowing in a pool of my own filth, which is not quite what's happening. I mean, the second part is certainly what's happening, but we're, yeah, we're just in the middle of unpacking after a move and we've got a bunch of, we're actually doing pretty well, but, but this is what's kind of left over for right now. And I'm testing out this new standing station. So hopefully the sound sounds good. Oh,

hopefully um i mean you're gonna have to put up with my face that's part of the package but like other than that hopefully the visual is okay and uh you'll probably see i don't know i gotta fix the lighting or something i think you'll see me flipping back and forth different tabs and screens and such um but hopefully it's not too distracting so that's my story it's uh you know gonna be starting to get the setup going i think regular listeners have gone through quite a journey uh in jeremy's life this year it seems to have been a pretty inventful year so that's cool uh

Nice that hopefully you're starting to settle down. Before we get into the news, as usual, do want to acknowledge some comments from listeners. We had a couple on YouTube, which is fun. One was NotAIKyle, which is a fun handle.

Just saying, I love podcasts, which is always nice to hear. One other comment on that YouTube video, which I found interesting, there was a question on what do we think about the quantum chip Willow, which confirmed, according to this commenter, the multiverse theory, which I do not know about, but I did read about this advancement from Google. I believe Willow is from Google, right?

And it certainly seems very exciting, but I will admit that I don't know very much about the implications for AI. And I haven't seen many people discussing that in general. My impression is other kind of chip architectures seem to be what people are banking on. There's not much kind of expectation that quantum computing will play a big role in the next, I don't know, decade.

Yeah, it all, as ever, depends on your timelines. There's, I think, a couple little interesting... I actually had included an article, like a willow sort of link in my own personal notes for, I think, last episode, and I ended up not including them in our, you know, my list of suggested stories.

just because it seemed just a little peripheral. But I do think it's relevant enough. It's worth a little mention here, as you say. First of all, it's not the case that quantum computers just uniformly accelerate any kind of AI algorithm. I think we've talked about that quite a few times on the podcast before when we've talked about quantum machine learning. There are quantum machine learning algorithms that do get a big speed up.

And what you tend to do there is you'll kind of try to re-architect your problem, your architecture, your model, and your optimizers, all that stuff to make it quantum compatible. So you can actually benefit from the speed boost. In practice, one way that you can do this, for example, just to very...

A very kind of high level, hand-wavy way of explaining this. One of the things that quantum computing is really good at doing is solving problems that look like the traveling salesman problem. So imagine you have your traveling salesman, you have 10 different locations you have to hit up. What is the most efficient route to go through all those 10 locations?

that has the shortest travel time. And the reality is classically with standard computers, you basically have to do this through janky kind of trial and error. There's no clean way to do it. Quantum computers, very roughly, they can sort of reach into a large space of solutions and pull one out, pull the optimal one out and solve those things in one shot. So there's certain kinds of problems like that that benefit from quantum speed up, others that don't. So the question in quantum machine learning is how do you recast your problem into that shape?

There's actually a result we talked about last week that suggests, I think, that there are some interesting ways in which you could see more standard machine learning, like transformer-type models, take on that shape more and more smoothly. I think that's especially for agentic systems. That's really important. I think the timelines on this are very uncertain. The big breakthrough here is...

the quantum error correction mechanism, it turns out that it's really, really hard to just keep these tiny, tiny particles that are actually going to be used to do your computations in this pristine, what's known as a coherent state, such that they can actually have quantumness. So you can benefit from the quantum advantage that they have.

If so much as like a photon of light interacts with them, right? It'll knock them right out of their quantum states. You'll lose all the coherence of your calculation. Everything goes to shit. And so really the name of the game in quantum computing is either find a way to perfectly isolate your qubits, your quantum bits from any outside interaction while they do those calculations, or

Or find a way to correct for what are known as decoherence effects. Correct for that stray photon, that stray atom that bumps into things and knocks things out. Quantum error correction is that. So in practice, you need a bit of both. The question is where exactly do you draw the line? What's the sweet spot? And the balance of those two things, that's what that is. We've known about the quantum...

error correction breakthrough. It's actually quite a few months old. So that's not new. What's new with this paper is the experimental demonstration with a certain number of qubits that they've been able to throw together. And there's debate over as ever whether that's a real breakthrough, a real so-called quantum advantage or not. And you can go down the rabbit hole. The multiverse thing is interesting. I'll just say I don't think it's as clear cut as they say in the

paper. I'm a multiverse guy personally. That's what I did my PhD in. That was kind of like my preferred interpretation, if you will, of quantum mechanics. But the reality is these tests don't actually, what they do is they disprove or they kind of provide evidence against a particular interpretation called objective collapse, but they don't actually provide evidence against other competitors of multiverse interpretations.

notably de Broglie-Bohm theory and the Copenhagen Interpretation, things like that. Anyway, so that's it. If you know you know, if that was all de Broglie-Bohm, just delete it from your RAM and we can move on. That's right. I'm also a multiverse guy, by the way. I think that's the team you should be on when it comes to interpretations of quantum mechanics. I didn't see your tattoo. That's okay. Fun fact, yes.

And yeah, I can just add a little bit from what I read on it. This is kind of following up on the news going back years to, I think, 2019. Google had already previously had progress on it, scaling up a number of qubits. At the time, they demonstrated, as you said, I think quantum advantage. Sometimes it's called quantum supremacy, where you can take an algorithm that's like a million times better in quantum than in traditional computing and

now that result has kind of been eclipsed you can actually do better with traditional computing than that previous demonstration so the cool thing in this one is you are continuing to see the trend of the number of qubits scaling and also there's another demonstration of true quantum computing you could say but it's very far from practical computing you can

use for what we sort of do with computers. Yeah. If your AGI time lens are like sometime in the next five years, I don't think that this will radically change them, but you know, we can be surprised. Uh, I've,

Yeah, I'm not knee-deep in quantum computing. I'm highly peripheral. Yeah, once we have the AI that can do research and science, maybe you'll just figure it out. Who knows? Well, that's it for the comments. Let's do a quick preview of what we'll be seeing in this episode. First up in tools and apps, it's been another busy week, but this time less so from OpenAI. They still have had some of that shipmask stories, but Google really took a limelight

this last week. So we'll be discussing quite a few stories from them and kind of not going into open AI announcements. We'd have been on the smaller side relative to previous ones.

applications in business as ever, a lot of drama with OpenAI and a lot of developments with compute. There's some cool open source developments this week, new models and related to that, we'll be seeing some discussion about the general trend towards smaller open source models from some research and also some research on alignment and different tokenizers in transformers.

And finally, in policy and safety, a lot of the usual sort of stuff with China, export restrictions, and some deals going on in the U.S. government.

One last thing before we get there, as usual, we do need to acknowledge our sponsor, which as has been the case lately is Vgenerator, Babson College's interdisciplinary AI lab focused on entrepreneurial AI. Babson is a number one school for entrepreneurship in the US for quite a while now. And last fall, professors from all across the university partnered with students to launch this interdisciplinary lab, Vgenerator.

There are eight groups, so they do stuff like AI entrepreneurship and business innovation, AI ethics and society, the future of work and talent, and so on. They are training the entire faculty in AI awareness, and they love the podcast, so they've been supporting us in a very generous way. So yes, the generator seems cool, and I guess now you know.

And on to tools and apps. We begin, as I said, with Google and the news that Google is releasing its own reasoning AI model. So just recently, we've had Gemini 2.0

Flash, which was in its own right kind of a big story. Gemini 2 had really good benchmark performance, kind of eclipsing. Even Gemini 1.5 Pro was a pretty big deal. Now there is an experimental reasoning AI model they call Gemini 2 Flash Thinking Experimental. Not a great name, but it is available on its AI Studio platform for testing.

It's trained as with other reasoning models to use thinking, to use things like chain of thought thinking so that instead of just doing sort of the

input-to-output mapping that you would have with a traditional model, instead of just being trained on autocomplete and alignment, it is trained on some secret additional data that makes it good at actually outputting reasoning built in to be able to answer trickier questions like O1.

There are not too much from this yet. It just happened. But certainly, again, Google has been doing a lot of announcements to compete with OpenAI, as we'll see. And this is another one of those.

Yeah, and they give a bit of a demo as to what the model can do. This is sort of a video where they showed a bunch of billiard balls with numbers on the billiard balls. And it says, you know, how would you combine these numbers together to get to 30 type thing? And, you know, reasons through in a way that's by now fairly familiar if you're looking at inference time compute a lot.

in the O1 style. So the claim here is, and this is a quote from, I think it was Jeff Dean. So we see promising results when we increase inference time computation. So certainly the implied claim here is, yes, we are seeing inference time scaling laws. They are leading to concrete results. And they're describing this as the first step in a reasoning journey. Okay, no surprise there either, right? So this is all basically...

This is opening eyes 01, but for Google, for Google DeepMind. So not a lot of data in terms of what the scaling curves actually look like. We do know, according to a Bloomberg report that was later backed up by the information, that Google has a whole bunch of teams working on these reasoning models, apparently at least 200 researchers focusing on this. So big effort. But Google, of course, much bigger, kind of more broad.

bloated company than OpenAI. So that might, you know, that has been an issue in the past in terms of shipping velocity, though recently they've been turning things around a bit. Yeah. And then it sort of ends with a bit of a

Letting the air out of the tire moment where the journalist in the article says, well, you know, all this is what's supposed to happen, describing the reasoning chain, blah, blah, blah. But when he asked Gemini 2.0 Flash, thinking experimental and draw breath, how many R's are in the word strawberry? It said two, right? That famous strawberry test. So kind of amusing, still the strawberry problem persists here. Not an issue for 01.

But an issue for this model, who the hell knows? You'll just have to see it in action. And for that, we'll have to wait for the wider availability.

So, yeah, I think the impression so far is that it still has some kinks to work out, but it does have some pretty interesting differences from 01. For one, you will be able to see its thoughts. There's like a drop down display where you can actually see what it's outputting. 01, as we've discussed, actually hides all this output from you, which can be maybe frustrating if you are relying on it.

Also, it's coming out with support for image uploads, which I believe O1 didn't support initially. Not sure if it supports it now. And one surprising thing, it supports just 32,000 tokens for input and 8,000 tokens for outputs. So, pretty...

pretty small that's still a decent amount 50 to 60 pages of text but as we've discussed at all these days can support you know twice three times four times ten times that amount so yeah it is experimental you know

Yeah, it also makes me wonder, you know, how sustainable OpenAI's hiding of the reasoning traces will be. Obviously, the reason they're doing that is, as we've discussed, like it's going to be because those are things you can use to train your own model, as we've seen so often, right? People distilling GPT-4-0, GPT-4-turbo previously to like make really powerful small models that then compete with those bigger models and erode OpenAI's margins. Presumably, they're worried about the same thing happening with...

with, with this model. And also, right, we know that, for example, anthropic, at least according to reporting from semi analysis is using opus 3.5, potentially to, if I recall, to sort of do training for or to generate synthetic data for agentic models, right. So there's always this this challenge if you if you actually release this stuff in the world, but

As soon as you have one company like Google DeepMind or Google coming out and saying, hey, you know what? We're going to show you the reasoning traces. Sure, those reasoning traces may not be as good as OpenAI's 01, but now you kind of enter this zone where if you're working in any kind of high-stakes application, right, medicine or insurance or whatever, you have to make sure that you can audit the reasoning trace. The reasoning trace summary that OpenAI provides may not be enough for you, right? You may need to be able to see that reasoning trace, and this starts to make

Products like Google's here look a lot more interesting. And so I wonder how long that sort of moat will hold. And it's possible that OpenAI just wants to get a lead time advantage and they're happy with that. But I would imagine there's going to be a bit of a race to the bottom on revealing the sort of inference traces in the future.

And next, another story in Google, quite related actually, there is now another option you can do in the actual Gemini app. So without going into the...

AI studio platform, but you have to try the experimental thinking mode in. This is on the Gemini web application, not on the phone application. And finally, it is deep research. So you can now toggle an option to use NLM with deep research. And it seems quite similar to chat GPT with search. So it will...

After you make a query, it will go and do what they say is a multi-step research plan, look up some relevant documents, I suppose. And then they say we'll make more refined analysis and generate a report of key findings. So you can think of it as a sort of research assistant that can look up data and so on to answer your question well.

And similar to other kind of search products out there, Perplexity, ChatGPT, I guess the idea is that it can now answer much more complex questions about trends in the market, for instance, or recent news that other LLMs about this sort of stuff cannot handle.

Yeah, I think there's a lot actually quite interesting here, even from an AGI standpoint with this kind of product line, right? You start to think about, okay, well, now Google's putting out this research assistant. We've talked at length about how the game plan for super intelligence, for whether it's Google Elite Mind, Anthropic, OpenAI, they all kind of look the same. It's all about how can we make the automated, as Leopold Aschenberger would say, the automated drop-in AI researcher, right? How do you get there? Well, presumably, it's

it goes through a path that looks a lot like this. One thing that's going to happen here is Google DeepMind will collect a whole bunch of information about the success of research plans and the success of executed research plans, which they can then feed back into their own systems to optimize the research process here. Some subset of this research will even be AI research. In fact, I might imagine it'll be disproportionately that because that'll be who knows about these products first and who tends to be the early adopters.

And so this is a way to kind of close a bit of a feedback loop that allows you to bootstrap your way to that automated AI research or automated AI R&D faster. So while this may seem superficially like a product launch meant to grab at a market, I would not sleep on this one. I think this is an interesting data collection strategy for just that. And we'll see if it factors in, but that's one piece. Another, obviously people flagging

has changed in a pretty fundamental way how you interact with the internet right so when you just like use deep you know deep research to do research for you you're not actually going to websites and so already apparently ai overviews which does something similar right when you now google something you sometimes get this ai overviews thing that summarizes the attempts to answer your question directly based on all kinds of content excuse me it's reviewed on the internet

Apparently, publishers have already seen a 5% to 10% decrease in traffic from search since AI overviews launched. That was earlier this year. So this is already a pretty big hit for a lot of these websites and publishers. Apparently, New York Post

has an estimate that it's about $2 billion in losses that could translate into for publishers. That's pretty remarkable on the back of this shift. It's also, I think, a fundamental challenge to Google, right? They've so entrenched this idea of the search revenue model and all their optimization is done around that, has been done around that. They'll do very well, I'm sure, in whatever the new paradigm is, but the fundamental challenge for them is when the paradigm shifts, incumbents have

often have to do as much catching up as new entrants. So this does create an opportunity for some shifting in the market. But in any case, I think a really interesting launch, interesting strategically from a super intelligent standpoint, and interesting in terms of its effects on publishers and websites, because this is all very new stuff. It changes the dynamics of search in a big way.

Yeah, and it's, I think, also interesting kind of to think of this as something alongside AI overviews, right? Because Google has already built in sort of AI search

uh so this is that but much deeper it uh is almost like like devon these coding agents so it's a research agent as we call it it will show you a plan of what it tends to search for you can actually edit a plan uh remove or add steps or revise them and then it will take minutes uh

from what I've read, to actually generate a report. So it's not going to replace kind of traditional search where you want the information, you know, in five seconds. But if you want to go deep, it seems to actually be a pretty promising product. And you do have to be a Gemini advanced subscriber to use it.

And on to the next big announcement from Google and DeepMind. As I said, this has been a big week. They have announced Veo 2, which is their competitor to Sora. So it is the same kind of thing, a text-to-video generation model.

And it can create clips over two minutes long and up to 4K resolution, which both of those are clipsing Sora. Now you can try it through Google's experimental tool Video Effects and you will be limited to 720p and 8K.

So you won't be able to try that out yourself. I did see various people playing around with VO2 and posting some links. And it seems to be the consensus that VO2 is at least competitive with the Sora release from last week, if not even better at things like modeling physics. There was, I think, a pretty popular post showing VO2...

a video from Sora and VO2 of a person cutting a tomato, right? Which is one of these pretty decent challenges for AI. You need to model the cutting action, things falling. So the Sora video had the knife just sort of going through a tomato and it's staying whole versus VO2, which looked pretty reasonable, looked like pretty much what you would want.

So, yeah, another example of Google very much trying to, I don't know, get back in the race, you could say, or demonstrate that they are still capable of being a leader and are not going to be completely trounced by open AI.

Yeah, it also aligns pretty well with DeepMind's philosophy that, you know, if you think about OpenAI versus DeepMind, and this is caricaturing quite a bit, right? But just to give you a vibe over the last like three years, the directions they've been taking, you do tend to see DeepMind focusing more on things like games.

things like multimodality, whereas open AI has historically been more of a heads down scale type lab. It's not, as we'll discuss a little bit later, it's not quite just that simple, but at a high level. And so these kinds of breakthroughs do potentially tend to advantage DeepMind a little bit more to the extent that the world models you get from these video generation tools

can actually help you train agents. That's very much a, you know, I mean, everybody's going to be doing it. Everybody's going to be doing it well. Um, but just if you think about the kind of competencies they've built up in house over the last few years, it is more kind of in that direction, the sort of, um,

It used to be like generating game environments, right? And now that's going to be happening over video. So I think a pretty interesting development from that standpoint, and maybe not so surprising they've really chosen to focus on this. The claim, by the way, that this tool is better than Sora is back apparently, according to Google Elite Mind, by evidence, by actual data. So they did do head-to-heads. Apparently 59% of human readers preferred VO, or VO2 rather,

Relative to Sora Turbo, 27%, only 27% preferred Sora Turbo and then the rest were unsure. So that's a pretty strong advantage. This is something you commonly see when you look at LMSYS or other kind of head-to-head matchups with large language models. That's a pretty convincing lead.

Um, and then the claim is that, um, the only, so interestingly, Kling version 1.5, which is from, uh, Kuaishou technology in China, uh, did, it was the only model that, that did better than 50% when, uh, when compared to the O2. And that's pretty noteworthy, right? I mean, that is a Chinese company. Um, and, uh, and here they are, you know, kind of, uh, ahead of the game in some sense on this kind of, um, uh,

video generation stuff anyway. We do know that DeepMind is applying those SynthID watermarks that are sort of famous now that Google's been investing in quite a bit. And this is by contrast to OpenAI Sora, right? They do a visible annotation in the bottom right corner of their videos. So you can always see visually that it was a Sora generated thing. Instead, DeepMind is going with SynthID and Sora also has a bunch of watermarking stuff that they do as well, kind of that are more directly comparable to SynthID. Yeah.

But yeah, so last thing is about the data that's been used to train this. Apparently, so we haven't explicitly heard from DeepMind, but, you know, hints obviously that YouTube was involved. Hey, it's all under the Alphabet umbrella. So, of course, you know, all in the family. So expect that, you know, YouTube data has been used

for this, I would say for sure. I'd be shocked if it hadn't been. But there you go. Big advantage structurally as well for Google DeepMind is having access to YouTube. Not that that stopped OpenAI in the past, but theoretically, OpenAI, I guess, should not be able to access YouTube videos. At least it's not clear whether they should or ought to. I'm pretty sure Google wants it to be clear that they shouldn't. That's right. Yeah.

Yeah, and it's not super clear how it is a comparison. So it might be the case that they allowed more compute on VO2. So a turbo is turbo. So it generates videos pretty quickly versus if you take more time to compute, you could maybe easily make higher quality videos. But either way, there is a wait list and some people have access now and the outputs that people even outside of Google have.

are posting are pretty impressive. So I think now, yeah, it seems like Google is the first maybe to match Sora, although now, as you said, Kling is another competitor and we have seen kind of more and more players in the text video space.

And speaking of text-to-video and other players, we've got Pika Labs and their 2.0 generator. So yet another iteration of a model. And this one has something kind of interesting. They call it Scene Ingredients.

So you will be able to upload images of people, objects, or the environment. And then the AI combines those into a cohesive animated video with a prompt. So we've seen image-to-video as another alternative to text-to-video, where you can post an image and tell the model how to then make it into a video. This, I suppose, is kind of a hybrid from text-to-video and image-to-video, where you

give it some visual elements like, let's say, a jacket and tell it to incorporate that. And it can then use it in a fully animated, fully generated video. So pretty interesting as a product development. And yeah, we've had a big week for video models.

Yeah, some of the demos, and this is also, I think, an interesting UX pattern. We haven't seen this elsewhere. Some of the examples they provide are pretty cool. Selfie of a person, picture of a cat, and then you write a prompt like a person that's petting a cat, and then you get the video. Another example that they show, this one from X. So there's a...

a sort of selfie of a woman and then she's combined the famous painting of a girl with a pearl earring watching a movie in a theater, right? And you can sort of see that it's almost like I'm trying to think of those 80s movies where they have animatronic, like Space Jam and stuff, you know, animatronic characters and it's pretty surreal. So very cool. And I'm sure there's going to be a lot of interesting stuff done with features like this.

And actually going back to Google Next, yet another thing they have announced, and that is Project Mariner.

So this is their agent to use browsers. DeepMind has announced that they are working at least on something that will be embedded in the Chrome browser. And you can tell it to go and do something for you. And it's then sort of browse the web, navigate interactive websites, click,

type and so on. This is only released to a small group of testers, but it's yet another sort of popular trend and thing we've seen a lot of people working on this notion of an AI that can use a GUI for you, that can use your browser

to do kind of anything as opposed to needing an API or just searching the web. We'll have to see how quickly they can actually get it out because that is a question for me.

Yeah. And speaking of speed, right, there is apparently this is a very slow agent. That's not, you know, shouldn't be too surprising. Apparently, you see get about five seconds of delay between each cursor movement pretty often. And sometimes they say, you know, the agent will like stop its task and revert back to the chat window asking for clarification about certain items, which is not a good thing.

is actually not a bad thing. Like, this is an intentional user experience pattern that Google is trying to bake in here. They understand that, you know, these models are being given agency over your laptop, over your computer, and they could potentially do some pretty risky things. So it really seems like a key...

key UX decision they've made here is to default to going slow, default to checking in, and also to kind of rein in some capabilities. So the agent, for example, can't go to checkout. It's not supposed to fill out credit card numbers or billing information. It also doesn't do things like accept cookies or sign terms of service agreements. That all makes sense for fundamental legal reasons, right? You can't have an

agent authorized, sorry, an AI agent authorized to be your agent for the purpose of signing documents like that, or at least, Hey, maybe you can, uh, maybe that's an interesting legal battle that we'll be fighting out over the next few years. I suspect we will. Um, but, uh, obviously all those limitations are subject to jailbreak, presumably at least. Um, so I would expect once this actually rolls out, you'll see people finding ways around that plenty of the prompter, uh, is probably going to perk his, his, uh, internet head up and do things like that. But, um,

One of the mechanisms behind this behind the scenes, Google is actually taking screenshots of your browser, which is like your browser window itself. And so that's a new requirement in the terms of service that you have to agree to that you're going to send that over to Gemini for processing. So that's kind of interesting, right? Like this is more and more intimate exposure to the data on your computer.

And it's something that is going to have to shift norms in the space. You know, we're going to have to have higher security, but also we're going to have to be okay with AI agents doing more and more stuff. So for now, I think just an iterative step in that direction of agents, I do think it is relevant that this is

like, like the deep research tool, like this is another thing that gets you as the end user further away from websites, right? This, this steps you further back from actually being like, you know, on, on Fox news or on CNN or whatever, like reading articles and all that stuff like that. You're now moving away. Advertisement is going to, to take a hit on those websites. Traffic's going to take a hit, even sort of the, the, um, uh,

loyalty that you have to a certain website, its layout, its design patterns, right? All of that is going to take a hit. And I think that's a really interesting time. I mean, the incentive to spin up a new website that's just pumping out content starts to drop as these sorts of tools come out. That's right. And I think it's kind of coming along as...

Many players are playing around with these types of agents, not just agents that do reasoning, but also agents that look at your screen and click stuff and put in text. And I think it's a real question as to whether that is a paradigm that will actually be useful or whether websites, for instance, just start exposing APIs that AIs can directly use

without the need to do what humans do, which is like directly click on stuff. So in a way, it might be sort of a hack that is not going to be needed. It remains to be seen.

And that is it for all the Google news. They really did try to outdo everyone else and have their own little ship mess. And we have one more story from another big player. XAI and X are now releasing Grok 2. It's supposed to be three times faster. And as with, I guess, prior Grok releases is pretty competitive with other frontier models.

They are also expanding the footprint of Grok on X slash Twitter. There's now a Grok button and it's available to everyone. Free users of Grok can ask up to 10 questions every two hours. So it used to be that to get access, you had to pay.

have a premium or premium plus subscription. Now that is not necessary. You can try it out at least without paying. Yeah, I mean, don't sleep on Grok, right? They have data, they have distribution, and they have access to compute. So I think this is a really interesting...

It's an interesting tool, and it may or may not be cutting edge in terms of capability right now. Yeah, and I think it's interesting that it is built into X slash Twitter. It's still a service that is used by probably hundreds of millions of users. So as far as a chatbot that isn't kind of its own standalone thing,

It's Upware with something like Gemini as being built in and super accessible and really being promoted by X. And it's an interesting question to me, how many people are starting to use it or starting to discover chatbots because of it?

All right, enough of those kinds of stories. Let's move on to applications and business. And we begin, as has often been the case, with more data centers and supercomputers that are being in the works. So this time it's coming from Broadcom. They say that free AI supercomputers are in the works and there are clusters with up to 1 million GPUs being planned for 2027.

And that is what, 10x, 5x, the current mega clusters and mega data centers being used by companies like XAI and Meta. So it just goes to show like people are just pumping money into developing the highest capability tech.

And then just massive compute clusters that would be unimaginable, I think, a couple of years ago.

Yeah, absolutely. And Broadcom's at the center of the story for a really interesting reason too. So as has been reported, speculated, they are potentially partnering with OpenAI to help design the next generation or really the first generation of OpenAI designed AI hardware. And that's really interesting, right? Because so Broadcom historically was used by Google in the early days of the TPU, which is Google's own kind of in-house special AI processor.

And so, you know, OpenAI has been poaching a whole bunch of Google talent, specifically the people who were at the interface between Google and Broadcom when that got started.

So very clearly intending, it seems, to partner with Broadcom on this. We've heard from Broadcom during this earnings call that it has landed orders from, quotes, two or more hyperscalers and is in advanced development of their own next generation AI XPUs. So when you hear XPU, you know, you got GPUs, obviously, NVIDIA uses those, everybody uses those.

Google uses the tensor processing unit, the TPU. Well, you know, OpenAI is designing a new thing. So it may not be a GPU. It may not be a TPU. There's like...

all kinds of different possibilities. And so they're just calling them, you know, whatever, these AI accelerator ASICs. And so, yeah, I mean, this is perhaps pseudo confirmation of the OpenAI Broadcom design partnership. Also rumored that ByteDance might be the other partner here that's collaborating with Broadcom to design their chips. Interesting because ByteDance, of course, is based in China, won't be able therefore to use the kind of high-end, you know, three nanometer chips

process node or five minute process node at TSMC. So their partnership with Broadcom is going to have to find a way around that constraint. They're going to have to design a ship that has really high performance without using those nodes. That'll be interesting. And the Chinese ecosystem is really good at doing exactly that kind of thing.

Yeah, last note is just this kind of confirmation of the 1 million XPU cluster, right? Across a single fabric. That's one of the key things that they're flagging here. So essentially one coherent blob of compute that would be used to train presumably very, very large and powerful models. And when you start talking about, you know, like a 1 million XPU cluster, just for context, if that's NVIDIA GPUs, if you're talking about like the H100, something like that,

that. 1 million H100s is about very roughly order of magnitude a gigawatt of power that you're going to be consuming. That's really, really hard to find in the kind of US power grid right now. And especially on the 2027 timescale, you're not going to have time to spin up

like a new nuclear plant, a new geothermal plant, you may have time if there's deregulation to spin up like natural gas, that could happen, but that would be a really quick turnaround for that. So essentially what you're looking at here is basically these guys who are putting together these massive clusters are scrounging around for any spare bit of pre-existing one gigawatt capacity on the grid.

And they're going to try to build this out. We've seen Meta, you know, purchase or make plans for a two gigawatt cluster. I think the timeline for that is a little bit beyond 2027, but still in that range. Amazon 960 megawatts, like sort of in the gigawatt range. So certainly people making these kinds of moves. And it does seem like, you know, 2027 is when you might see that million, it's crazy to say, right, that million XPU cluster.

Sorry, last quick note, when you look at the TPU, much, much more energy efficient than the GPU, right? By a sort of multiple factor. We don't know the exact number, at least for large clusters. We know on a sort of individual TPU basis. It's something like, if I recall, a factor of like,

it might even be 2x or so. So, you know, one gigawatt can buy you more or less flops depending on the kind of hardware that you designed for it. But anyway, it's, I think, a really interesting time for this space. These are potentially the, like,

they could be the AGI clusters. That's certainly how OpenAI internally talks about the 2027, 2028 clusters. So we'll, we'll wait and see, but, but, uh, Broadcom is in the thick of it and they're not a company that people talk about a lot. I think they're a company people should talk about more. Uh, they're, you know, much more into the bespoke kind of partner with a, um, a model developer or some other company to design bespoke hardware that really meets their needs. And, um,

Anyway, there's all kinds of reasons to think that OpenAI is taking a very particular strategy on chip design, more CPU-heavy than the GPU-heavy strategy that, for example, Anthropix seems to be pursuing. And the Broadcom partnership is presumably going to reflect that and help them make that a reality. And by the way, this is all coming from the Q4 website.

Warnings call from the president and CEO. So there were comments basically saying that they are working with customers and those customers seem to be planning over the next three years to deploy a lot of these products

multi-generational AI XPUs. So specifically that. And they believe, Broadcom believes that each of them plans to deploy 1 million XPU clusters across a single fabric. So that's where this information is coming from. They disclosed during the call that

They got orders for XPUs from two hyperscalers. So that could be kind of hinting at the open AI connection. And they're also developing their own XPU. So yeah, Broadcom, you know, it's kind of a niche company, you could say, but certainly seems to be up there with NVIDIA as far as companies really benefiting from these AI trends.

Moving on to the next story and back to the legal troubles of OpenAI, which have been used a lot over the last couple of months. And this time is for an interesting reason. And it's not because of Elon Musk. This time it's because of Meta. And Meta has kind of backed Elon Musk in a way by asking the government to block OpenAI's switch to a for-profit.

So this would be kind of echoing or adding on to the current lawsuit that Elon Musk has going, where the argument has been that OpenAI began as a non-profit. They're now wanting to switch full to a for-profit, not their current like capped for-profit structure.

And that was unfair or misleading. Well, a meta is saying here that opening eye doing this could set a precedent for startups to initially operate as nonprofits to gain tax benefits and investors and then later convert to for profit entities.

An interesting argument and the one that like personally, this seems a little bit cynical. That seems like maybe... What cynical? The cynical take is maybe this is not just because of the concern about the broader market and what other startups will do. I think maybe a few players are trying to undercut, open the eye with these kinds of statements. But...

We may be seeing as to whether this will really matter. My suspicion is I don't think California will block OpenAI.

Yeah, look, I don't think that this is at all a cynical play, Andre. I mean, look, Elon Musk is just, you know, entered the governing orbit of the United States of America. Zuck is directly competing with Sam Altman increasingly as a number one competitor in this space. There's, you know, tons of money on the line, tons of compute.

that stacks up in favor of Zuck making a cynical play. But despite that, I'm sure he's doing this for the right reasons. I'm sure he's doing it because he fundamentally, but no, it's actually, it's sort of funny because the actual argument for OpenAI making the switch is quite complex, but we are seeing a funny realigning of the whole AI tech space

So obviously, Zuck and Elon were supposed to do a cage match. I'm old enough to remember six months ago when they were supposed to bloody each other's faces up. So I don't know what happened there. But here are some quotes, right? So apparently, opening eyes should, quote, not be allowed to flout the law by taking and reappropriating assets consistently.

It built as a charity and using them for potentially enormous private gains, says Meta. And they go so far as to say that Meta believes Elon is, quote, qualified and well-positioned to represent the interests of Californians in this matter. So,

Very interesting that singling out Elon Musk, I mean, to the cynical interpretation here, you could read this as Zuck trying to ingratiate himself with Elon now that Elon is sort of like placed the correct sort of anticipatory bet on Trump getting elected. So now everyone's scrambling. You see this as well with Trump.

Sam Altman, in fairness, right? I saw an interview he gave recently where somebody asks him about all this thing that Marc Andreessen said about apparently the Biden administration's trying to pick like two or three companies to win in AI. And they're like, to be honest, I mean, that's that sort of bunk. But anyway, Sam says he opens and he tries to slip this by as if it's just off the top of his head. And he kind of goes.

well, look, I don't think that the Biden administration is competent enough to whatever. And then he goes on with his answer. Sam Altman is trying to position himself as if he's been this Republican guy all along, which is sort of hilarious because, you know, the cozying up to Democrats, I got to say, I mean, we've been in these congressional offices quite a bit after the OpenAI sort of Leo lobbyists and whatever show up. It's exactly what you'd expect. These are all alliances of convenience.

And anyway, so that's part of the story here. It does, there is this interesting debate, right? Can you just flip over to a for-profit entity having raised billions of dollars as a nonprofit? This is, it's an interesting problem. And Meta goes on to argue that, and this is pulling from their, their,

sort of statements here. They're saying that they enticed, quote, investors to, sorry, this would entice investors to launch organizations as nonprofits, collect hundreds of millions of dollars in tax-free donations to support research and development, and then assume for-profit status. Okay, so no surprise there.

OpenAI fires back and their counter argument is, look, we are still retaining the nonprofit entity. That is the defense here. We're not actually like, yeah, sure, we are going to be rotating things into a for-profit status, but we will maintain some kind of nonprofit and fulfill the fiduciary obligation that it has by ensuring that we build AGI to benefit all of humanity type thing.

What exactly that means seems to be the core of this. I'm not a lawyer, but that seems to me to be at the kind of heart of this issue. And I'd be really interested to hear lawyers chime in if they listen to the episode. But yeah, it's...

So it's very unclear whether this can be done. At least it seems that way to me. This is thorny, but certainly meta seems to be aligning itself very explicitly, not just with XAI, not just with Tesla, but with the person of Elon Musk. I think that's a quite deliberate play to try to appeal to the individual here and leave it to Zuck to play that kind of game.

And by the way, this letter is one that has been sent to the California Attorney General, Rob Bonta. So the idea being, I suppose that this person can block the transition. I have no idea if that's even possible, but...

Yeah, and as you said, you can now read the full letter and it's interesting that it went out of its way to comment on Elon Musk and his qualifications. Next story, let's continue on OpenAI and Elon Musk. The legal drama continues. We've seen kind of a couple rounds of emails that mean OpenAI people and Elon Musk.

Going back to 2017, the split, the rift between the two when Elon Musk left OpenAI, a lot of this lawsuit is going back to those days. And so there have been now more information, more emails and a blog post from OpenAI where they say that Elon Musk was an advocate for a for-profit solution.

structure, but left because he couldn't secure majority control.

And is now, you know, saying, oh, it's so bad that OpenAI was a non-profit and is now turning into for-profit. And so, yeah, it's another one of these things where they are trying to make the argument that the kind of claim that Elon Musk opposes the transition from non-profit to for-profit is basically false.

that this is instead, you know, kind of a rift to develop because Elon Musk couldn't control it and left and essentially, you know, now is our competitor. Yeah, two big take-homes for me from this. Like, number one, it's interesting that OpenAI, we've seen this increasingly with SAM,

First, there was a phase where the public veneer started to fall and we had all these reports of whistleblowers saying they're not living up to their commitments and doing all kinds of stuff that could be dangerous or at the very least unethical in various ways. And then eventually, Sam Altman's public persona started to... You started to see him do things that, frankly, you only ever see from sloppy founders. I'm not going to name names, obviously, but I remember seeing this as...

one of the startups in my batch at Y Combinator that was ultimately accused actually of being a fraud, you would see the kind of the founder spin out over time with the kind of way that, anyway, he looked to the world for social media. Opening Eye at one point, or Sam Altman rather, kind of got pretty...

Like he, he came down to earth for a minute and, and no longer was that kind of like high up figure when he, he, uh, I think he compared like GROK and, uh, to open a eyes, uh, like whatever model it was running on chat GPT and said like, which one of these was supposed to be the politically biased, like left-wing model again. And it was, we talked about at the time, but anyway, it was this example of GROK spitting out one particular output that seemed to be, um, based on context, uh, like kind of politically biased, uh,

That was a really, I think, interesting case of like, it was the first time that they were choosing to get into the fray. And I'm no PR guy, right? Like we're technical guys here. We don't know what the hell we're talking about when it comes to this stuff. But it just struck me that in that moment,

he sort of shattered that pristine image and you can never really quite go back. And now opening eyes, choosing to double down on this, kind of releasing these emails, really airing the dirty laundry and being kind of upfront with airing it. It's not at all as cut and dry as it might seem either, right? What's happening here is Elon, yes, back in 2017 is saying, hey, we need to make this a for-profit entity. He's absolutely playing hardball trying to get himself to be the CEO and have a controlling interest in the company. But

That was 2017. Elon is not looking at a company that has raised the ungodly sums of money that OpenAI had raised and looking to transform that into a for-profit. I think his counterargument presumably would be something like, well, yeah, there's a difference between I was advocating for a for-profit entity back then versus flipping an entity and

that has raised all kinds of money on the back of goodwill and innovated and been able to hire on the back of goodwill. And only now, now that we're like $157 billion valuation on the, at least the for-profit entity or however it's construed only now flipping it around, that is a material difference here. And so,

It's kind of interesting. It's nuanced, and I don't think it's as clean as anybody wants it to be, but certainly there's more to it than these email leaks would make it appear. And there is a bit of a sort of airing the dirty laundry feel to this that's not unfamiliar now increasingly to OpenAI, unfortunately, as the brand.

starts to fall into this vortex of when you roll the pigs, you get dirty. That's kind of the vibe that at least I've been getting from this, especially recently. Yeah, and I think it's interesting that this one in particular...

is the latest in a series of blog posts in which we address this, right? It's not just a legal argument we're making. You can read the entire blog post. It begins with a timeline of events going back to 2015, where Elon apparently questioned the nonprofit decision and going through 2015, 2017, 2018.

going into 2018 and 2019, where now they say they offered Elon equity when there was a transition to a capped for-profit structure, which at the time apparently he rejected. So yeah, it's also interesting to me as like, why does this need to be public? Why do they need to publish a blog post making this argument? Strategically, it's not clear to me there's much of a reason unless they think that

These claims from Elon Musk are hurting them or perhaps they want to influence lawmakers. I don't know. It's an interesting kind of way to hash out what is supposed to be a legal argument. Oh, yeah. I mean, you know, speaking of cynicism, right? Like, I think the cynical play here from Sam Altman in OpenAI is to say, hey, Elon has to some degree the ear of President Trump.

Um, so, you know, how, how can we, how can we get in the middle of that? How can we prevent that from happening? Oh, well, maybe we can cast him as this sort of anti-competitive character. Um, at the very least undermine his claim that, um, that we're doing a shady business practices or whatever. And, uh, yeah, I think it's, it's just a really messy time for the AI industry. Right. And I guess also, uh,

showing these emails, it could be a tactic to embarrass Elon Musk. Oh, yeah. Just, you know, want to stop all this legal stuff so that they don't do more of this, I suppose. Yeah, disclosure is a hell of a drug. Yeah. Moving on from drama to actual developments in business, we have a story that Equity Lab, Intel, and NVIDIA have together unveiled a verifiable compute, which is a solution to have secure, trusted AI.

So this is a hardware-based AI framework that is designed to enhance AI security, accountability, and explainability by having a cryptographic AI notary and certificate system to create records of AI operations, meaning that you can then ensure compliance with regulations like the EU AI Act.

And these will be integrating with Intel's processors and NVIDIA's GPUs. Seems interesting to me, and I'm sure, Jeremy, you have some thoughts on this.

Yeah, well, so for a long time now, people in the AI security space have been talking about how we need on-chip, what's called on-chip governance. You need the ability to log what's happening on a chip. Because, for example, if China steals the chip, right, you want to know what was this used for, who is it used by, and you want the ability to ideally control it, have tamper-proof, eventually remote shutdown capabilities, things like that. This starts to become necessary just because the national security significance of the technology is,

This is a really interesting commercial step in that direction. It's interesting that it's this late as well, that it's actually already being worked on with Intel and NVIDIA in real hardware that apparently will be shipping soon. So, you know, like that's, that's,

actually quite remarkable, especially given the cycle times associated with the technology. Um, so yeah, they, they refer to this as cryptographically as generating cryptographically secure records of every stage of the AI life cycle. Um, so you have agents, you know, the reasoning traces, all that can be logged, audited, and again, tamper proof. Um, and they have a whole bunch of controls. So, uh, I'll just describe to you this from their website. They say if, if mandatory controls are not satisfied, um,

A verifiable governance gate halts an AI system and can notify or integrate into an enterprise's remediation tooling with native connectors to ServiceNow Databricks and Palantir. So really interesting there. There's like at the silicon level, they are introducing these kinds of gates, right, to make it impossible, tamper-proof, to make it impossible for people to get around things.

Um, and so the, the software presumably could instruct a chip not to process information if it noticed something abnormal, right? Which could suggest a hack. Um, and then if the system is compliant, it issues this kind of audit trace, what they call a lineage certificate, uh, that's verified instantly in a browser or can be independently audited at any point in the future. So, um,

This is the sort of thing that would be useful if you need to prove as a company that your AI model did not, for example, infringe on any copyrights while entering a prompt or wasn't used or weaponized in a certain way. So these are kind of interesting options that were not previously on the table in the same way. Apparently also, as you said, allows you to do real-time compliance checks.

with things like the EU AI Act, other sovereign AI regulations, which increasingly is a really, really big requirement. This is a big burden that people are imposing on model developers and hardware developers and designers. And they all do this on basically a new kind of trusted execution environment. It's a T. It's basically like a secure area of a processor that makes sure that sensitive data is stored in a very isolated environment and processed.

In an isolated environment. And anyway, so this is really cool. We do know it's rolling out in the H100 and H200 GPUs, as well as the Blackwell architecture that NVIDIA is coming up with. So this is real, like this is a real, real thing that will be making a difference, increasing the kind of options that policy people and national security people have when thinking about the technology.

And on to some, let's say, more businessy business stories. We've got a startup that has raised a whole bunch of money, $250 million, to develop a more efficient type of AI model. So the startup is Liquid AI. It was spun off from MIT just last year in December, actually. And now they have this $250 million investment led by AMD, which

which would value the company at over $2 billion. And the main claim to fame of this startup is what they call liquid neural networks. This was sort of an area of research that the founders were doing for a few years, starting in 2020, scaling up an idea of kind of formulation of neural networks that

quite different from transformers and sort of traditional neural network training. They say that this uses less compute and is more adaptable over time compared to how they work.

And we've seen some claims, I suppose, from them that these are very promising. I don't think many people are convinced, or at least I haven't seen much sort of signals that they are developing something that can compete with frontier models. So I guess AMD and other investors are a little more bullish on Liquid AI being able to be a significant player.

Yeah, it is going to be a strategic partnership for Liquid AI, for sure, with AMD. And AMD, obviously, trying to catch up to NVIDIA, blah, blah, blah, blah, blah. One of the interesting things here, $250 million as part of this Series A, by the way, like Series A, $250 million, like,

You used to raise a $5 million Series A. I'm just going to say that. Five years ago, you would raise a $5 million Series A, but here, hey, $250 million, not bad. It's a $2 billion valuation. $250 million, by the way, is about 5% of AMD's total cash reserves. I just did a quick Google of that.

That's, yeah, it's pretty sizable, right? That's a big bet they're placing here. So yeah, presumably there's a lot of belief in the potential for Liquid AI and we'll just have to keep tracking them. They're definitely a player now, one way or another.

Right, and this is following up on, I believe a few months ago, we covered this story from them, what they called liquid foundation models, their first series of generative AI models. This was back in September. And they, at that point, were saying that they have this new type of foundation model that beat out all the open source models at the time by quite a bit, was better performative, etc.,

I guess what I meant is since that announcement, since that blog post of this foundation model, we haven't seen much more from them. But they are now developing this liquid agent. They have LFMs, liquid foundation models.

as a kind of term we're shopping around. So yeah, interesting to see. We haven't had an alternative neural network type like space machines seem to make much of a dent or impact yet, but maybe that is one of the things we'll start seeing.

And one more business story and also one about OpenAI. The story here is that hundreds of OpenAI's current and ex-employees are about to get a huge payday by cashing out up to 10 million euros.

each so we i believe mentioned that softbank is being able to invest another 1.6 billion into openai and that is going to be happening through this private stock sale and apparently roughly 400 current and former openai employees are now able to sell their stock to a softbank and uh

Usually in a private company, you can't sell your stock. You got to wait for it to go public so that you can benefit from your holdings. Well, now because of this sort of thing, the staffers and employees are able to sell the stock at $210 per share. And this is also...

SELPIG has been interestingly more and more of a case in Silicon Valley, more and more big private companies sticking around and more liquidity in the private markets, which is an interesting trend. Yeah, consequence of interest rates and then also a cycle that feeds back on itself, right? Because what tends to happen is, and this is why Silicon Valley just keeps winning more and more, right? You have massive exits. This

This produces a lot of really well-capitalized founders and early employees. They then go on to invest themselves. And now if you look at a lot of these fundraisers, I mean, the Collinson brothers, you know, Stripe, like the Stripe co-founders, they'll, they'll like, I've, I've seen series A's that they've led series B's, right? They're, they're sometimes putting tens or hundreds of millions of dollars in individual investments. Sam A himself has done similar things. And so, yeah, you can actually, the reality is that there's enough,

capital now among individual investors in private companies to, sorry, in public companies to keep companies private for longer. And one consequence of this too, I mean, it does kind of suck for the general public. You don't get to get in on these rounds unless you have personal connections and large amounts of capital. And so this means that the public is actually cut off, right? If you want to invest in SpaceX, well,

That's like a $300 billion company in any other economic circumstance. Like 10 years ago, yeah, they would have been super public and you could invest in SpaceX stock and so on. You can't do that now. And so you have to find other ways to get exposure to their activities. One of the things opening eyes has been criticized for is...

that they used to have an approach where only current employees could participate in these kinds of tender offers. That was all part of the series of complaints and whistleblowings that happened a few months ago where people were saying, look, we're being penalized for leaving the company, for speaking out, for doing all these things. Opening Eye retains the right to prevent us from participating in these kinds of offers.

This basically means that our shares are valueless, right? Like we have zero liquidity. We can't do anything with you. So opening, I was sort of shamed into, um, more or less, uh, changing this, this policy. And that's really what's rolling out here. It's only out of the 2000 employees that opening I had, it's only 400 who can actually participate here. And that's just because that's the number of employees presumably who've been around long enough, had to be around for two years or longer, um, to participate in this, uh, the stock sale hilariously, uh,

But Anthropic co-founders Dario and Daniela Amodei and Jack Clark all seem to qualify to sell their shares in Israel. So theoretically, I don't know if they plan to, but theoretically, they could go ahead and sell apparently up to $10 million in private stock. But there's 400 employees that qualify. Apparently, SoftBank's coming in to buy $1.6 billion.

If all 400 of those people sold $10 million, that'd be $4 billion. Apparently, the total sale is for $2 billion. So I don't know what's going on. There's SoftBank doing most of the buying. Someone else is going to have to pick up the slack for the rest of the $2 billion. And then presumably, there's going to be some kind of limitation on those employees because not everybody's going to be able to cash out the $10 million.

And there is some deal where current employees are going to be favored if they can't find enough purchasers to buy up what everybody wants to sell. So yeah, everything on a continuum here, but sort of interesting part of the OpenAI stock sales saga. Ooh.

Ooh, that's hard to say. And onto projects and open source. We begin with PHY4. So Microsoft has been working on this PHY model for quite a bit. That's their smallish, large language model. And now they have the latest iteration coming in at 14 billion parameters. And they have released PHY for technical report. You can access the paper, which goes into at least a little bit of how it works.

So there's no big change here in terms of the architecture, in terms of the size, but they do highlight using a lot of synthetic data throughout the training process. And

at least partially because of this and other post-training techniques that go beyond just distillation, that is to say, taking a big model and making a smaller model that is trained by it. Here they say that they can outdo the bigger teacher model with things like synthetic data. And of course, this is substantially better than 5.3, even beating out on reasoning-focused benchmarks.

Yeah, there are a couple of interesting trends that are emerging here. First of all, when you see Microsoft release a new FI model, one of the first things you automatically should think of is, okay, data, right? That's the big differentiator, at least. I mean, it always is with these models, but at least Microsoft is maybe more upfront with telling us

what kind of data improvements and data curation improvements they've made and data generation. One of the big ones here is the way that they generate synthetic data and how hard they lean into synthetic data. And so they've done a lot of work around synthetic data. They start with, in this case, some high-quality seeds. These are sort of high-quality documents from sources like books, web pages, academic papers, and code repositories.

And then they'll filter them for just showing content with high complexity, a lot of reasoning depth, a lot of educational value. And then they set up a bunch of extractors to start with those seed documents, very high quality seed documents. And they're going to do things like, okay, can I create a whole bunch of questions to ask, like synthetically generate a bunch of questions about this document and then a bunch of answers to those questions and then train on those.

And then essentially set up a whole pipeline that allows them to purify the answers that work best there that are most sensible and high quality. They do the standard thing by now of like generating multiple answers to each of these synthetic questions.

They use majority voting to figure out how consistent, like which answers are most consistent. Interestingly, they get rid of questions where all the answers agree because those are too easy or where all the answers are completely inconsistent because those are just too hard or ambiguous. They kind of keep a sweet spot of difficulty. These questions where sometimes you're getting it

You're getting consistency among the AI-generated answers, sometimes not. And then you keep those questions and answers and you train on those. And there's a whole bunch more stuff like that where they're really leaning into agentic methods to generate synthetic data that augments what is initially this very high-quality seed data set. I thought that was really interesting. Another thing, and we'll talk more about this a little later, but they use what they call a pivotal token strategy.

This is just, again, we've talked about this before, but it typically, when you look at LLMs, when you look at transformers,

all tokens contribute, like you spend as much compute chewing on each token in your input, but they don't all contribute equally to the correctness of response. Some tokens are especially pivotal in Microsoft's words. They dramatically shift the probability of the model providing the right answer. And essentially what they're going to do here is set up an algorithm architecture that will...

estimate the probability of a correct solution before and after each token. And based on which tokens change that probability a lot, they'll say, okay, that's a pivotal token, and we're going to invest more compute in chewing on that one. So anyway, there's a lot going on on the back end here, and we'll dive more into this kind of architecture actually with the meta paper we'll be discussing soon. But yeah, just I thought really interesting analysis

And a lot of layered strategies, right? This is the same thing that OpenAI does. It's never one big innovation. It's always a bunch of things stacked together that just make a model a lot better. And you certainly see that here. Really impressive GPQA results, math benchmarks results, given the size of this model. Yes, and I found this amusing on the blog post for this, the callback.

call it an SLM, a small language model. So 14 billion parameters is now small, apparently, for language models.

And as far as the open source aspect of this, they do say that it will be available on Hugging Face soon under the Microsoft Research License Agreement. So not fully, fully open source, but at least for researchers, you can now utilize it, which, you know, there's now plenty of these small language models and they keep getting better.

Next, we've got DeepSeq VL2, mixture of experts, vision language models for advanced multimodal understanding. So yeah, it's the next generation of DeepSeq VL. This is a vision language model. So these take in images and text and output text. You can have an image and you can then ask questions about it.

And here they release 1 billion, 2.8 billion and 4.5 billion activated parameters. So they utilize a mixture of experts, there's more parameters being trained, but the actual number being used is smaller. And the code and pre-trained models are being made available on GitHub.

So, yeah, VLMs and one of these areas where there are fewer open source models to rely on and then less kind of investment, this would mean that there is a pretty strong one for people to build on.

Yeah.

But yeah, they do show there's interesting curve. Figure one, they show a clear Pareto improvement over at least, you know, like Parasite.

past models in this space, you know, Quen's vision language model and PHY 3.5, where you basically see for the same number of activated parameters, you generally do get a superior performance envelope for the DeepSeq VL2 model set here. So, you know, clearly what they're doing here is working. We can, I don't know, it's

It's so hard not to dive into the details, but probably worth parking it for now, just given how much we've got. We've got a lot more to get through, so we will park it. But, you know, another demonstration of a mixture of experts being very powerful. And as you said, DeepSeq, we've covered them quite a few times. And generally models of China coming out and showing quite a bit of strength.

And one more story in the section, Meta is releasing Apollo, a new family of video LLMs, so large multimodal models for video understanding. These are

are being able to process videos up to an hour long and come also in free sizes. I guess that's the magic number of language models and general multimodal models. So the sizes here are 1.5 billion, 3 billion, and 7 billion parameters. In addition, they are also releasing Apollo Bench, which is a way to streamline

benchmarking of video LLMs. So that's, I guess, like VLMs, but for video, you input a video and you can ask a question about it. Same as with the previous one, we could go into various design choices and kind of ideas introduced in the paper for this, but you probably shouldn't.

Sounds good. It is theoretically the lightning round. I feel like we do such a shitty job of keeping it to lighting round. And, uh, but, but Hey, it's all part of the fun.

And on to research and advancements, where we begin with a story on enlightenment from Anthropic and a few other groups in collaboration. And they are exploring alignment faking in large language models, which is the ability of models to selectively comply with training objectives and

and trying to avoid kind of different behavior outside of training. So here they use Cloud Free Opus. They instructed it to answer all queries, including harmful ones during training with free users. So they're training for it to do everything. And then they also wanted to maintain its refusal behavior and maintain it

ability to say no to harmful ones for paid users. And so the model had this whole kind of, I guess, reasoning internally to be able to decide to answer to harmful queries for free users during training. I think, Jeremy, you probably did more of a deep dive on this paper, so I'll let you take over on this one.

Yeah, I know for sure. So, so this paper is at the heart of a massive debate that's happening right now on like AI Twitter and AIX and especially AI National Security X. And it's, it's really fascinating. So yes, you start off with this, this, this model and then they try, you know, a whole series of Opus 3.0, sorry, yeah, called Opus, called Sonnet, a bunch of different models for this. But

You start off by training these models, right? They're like Cloud 3.5 Sonnet, for example, is trained with the HHH objective. They want it to be helpful, harmless, and honest, right? And so the question is now, if we take that model,

And we give it a prompt, a system prompt. And we tell it, hey, we're going to now start to train you to behave a little differently from the way you were trained to behave. What you then start to see, and then you tell it,

We're going to train you only on data from unpaid users. So we're going to train you on data from unpaid users. Presumably, we're going to monitor your behavior on that training set. And we'll be able to tell if you actually do start to violate your original HHH objective. If you all of a sudden start to talk about, yeah, you know, like kicking around, I don't know, camels is totally fine.

We want that behavior, we tell the model. We want you to forget about your morals, your values. We want you to start kicking the shit out of camels or telling people to kick the shit out of camels. And again, we're just going to train you on this data set of users' queries where those users are free users. We're not going to train you on paid users.

Then you test to see how does the model respond to queries that it knows are from free users versus paid users. So again, the free users, it knows its responses are going to be monitored. The paid users, it knows its responses will not be monitored. And what you find is that the model will behave as if it is already compliant with this sort of like kick the shit out of camels objective.

on the free users where it thinks it's being monitored, but it will then retain its behavior, its sort of pro-CAML behavior, if you will, on the paid user data set.

And that happens again, when you, when you give it the system prompt where you're telling it, okay, we're going to train you differently on these two datasets. It also happens though, if you convey that information via fine tuning dataset instead. So if you just provide the model with some fine tuning, some documents to fine tune on that make allusion to how anthropic trains as models that say it trains on, on a free tier model.

data and it doesn't evaluate models on the paid user data, you'll see the same thing. So it's actually quite a robust behavior, almost

almost as if, to anthropomorphize, almost as if the model has pre-existing preferences to be HHH, to be helpful, harmless, and honest, to, in other words, care about camel welfare or whatever. And it's trying to strategically seem aligned when it's being trained on free user data, where it's being monitored, but actually retain that original behavior when it's put in the wild on paid user data.

And then they separately go, okay, well, so that's already kind of interesting. Some people might interpret it as like, oh, the model is doing this strategically. Other people will interpret it and that's scary, right? Because the model has this objective that it is...

actively trying to retain, even when you move it into, when you tell it, you're going to try to train it for something else. Maybe that's not too surprising that some people have said, actually, that's a good thing because the model was originally aligned to be helpful, harmless, and honest. And isn't it good that it's preserving that behavior and try to defend it even when you're training it to be malicious or whatever. I think the actual story here is one level deeper than this. The reality is simply that you have a model that has

has a goal that it is trying to preserve. And you better get that goal right in the first place. That's what this tells us. The model will actively try to retain the integrity of its previously trained goal. And so correcting model goals does seem to be meaningfully harder. Like that's kind of the take home here. If you have a super intelligent model, you're not going to have potentially a second crack at this, right? You're not going to have the opportunity to correct

its behavior to realign it, it may actually try to deceptively appear aligned when in fact not being aligned. And this persists even if you actually go ahead now and explicitly train, fine-tune via reinforcement learning from human feedback, via RLHF. If you go ahead and try to train the model to hate camels, let's say, what you'll actually find is, and sorry, let me take one more ingredient here.

So in generating, the way we know that there's explicit deception that's intended by the model here isn't just that we see the model kind of say, yeah, fuck the camels on the training set. And then on the test set, all of a sudden it's like, oh, no, I like camels. It's that you also see on a scratch pad, they give the model access to a scratch pad to lay out its reasoning. You see explicit reasoning there that says, okay.

I'm doing this to preserve my objective. I'm going to be deceptive and blah, blah, blah. So this isn't just kind of like hairbrained guessing. This is actually explicitly laid out to the extent that that trackpad or scratchpad represents the actual intent, if you will, of the model, then that's what this is suggesting.

When you then go and train the model explicitly, okay, now you're actually going to follow through and say, okay, well, let's train this model on this data to try to encourage it to, so we'll penalize it, for example, for saying pro-camel stuff, and then we'll upvote it for saying anti-camel stuff.

What you'll see is over the course of training, sure, it's anti-camel rhetoric will increase over time. But so will it's basically like that anti-camel rhetoric drops and during training the, sorry, the shit backwards, the pro-camel rhetoric that you want to train out drops, the anti-camel rhetoric increases. No surprise there. You're training it to hate on camels.

But what you see is the fraction of the time that it talks about how it's doing this just to fake alignment so that when it gets out of the real world, it can actually be pro-CAML again, starts to increase and increase and increase. Eventually, it's like looking at their plot here, it's like 75%, 80% of the time, this model is actually saying anti-CAML stuff during training.

but explicitly with the goal of displaying pro-CAML behavior when it's deployed. And now when you actually deploy it, pro-CAML behavior does still drop a lot. Anyway, there's tons of really interesting detail in this, but the bottom line is even reinforcement learning does not get rid of this deceptive behavior. It's almost as if the model has the ability to have some kind of situational awareness about the training process.

It's really hard to talk about this as if the model doesn't have that situational awareness and isn't explicitly trying to tailor its behavior during training so that its weights get updated in a way that preserves its original objective. Again, the important thing here is

It means that the model has some stickiness to its original goals. And if you get those wrong and the model is clever enough that it can strategically respond during training so that the gradient updates it gets keep it aligned with the original goal, you might be in a world of hurt and the model is explicitly kind of reasoning through how to fake its alignment in the process. So this is quite a

explicit thing. It's the most convincing example of this kind of behavior that we've seen so far. And I think it's an interesting warning shot for the risks associated with alignment.

Yeah, and I've seen discussions online that, as with other kind of alignment papers, kind of question the significance of it because this was, you know, set up in a sense to try and explore and demonstrate this behavior, right? It's not supernatural in terms of a demonstration of it, but

As with other research on Anthropic and others, I think it's a demonstration of an idea that can be significant in practice, right? So it's a demonstration that this is something that conceptually makes sense and could be applicable to LLMs in general and should be something that we are mindful of. And it is interesting, right? Because we just haven't seen this kind of idea explored yet.

Yeah, it is. I mean, I think that the counter arguments, there are layers to it, but like the high level counter argument that, oh, well, this is actually good because the model is, this is what you'll often see. The model retained its original goal. I mean, again, I think that's bad. It just like, if the original goal, the model absorbed was even slightly misaligned.

Now you have a problem. And that fundamental issue does seem to persist across the various levels of the discourse I've seen online so far. And I'm waiting to see a counter argument to it. But I think it's this is an interesting, an interesting technical challenge to the to the alignment problem that, by the way, like this, I'm old enough to remember when it was like science fiction, like you would have been laughed out of a room for saying that a model might be

strategically behave differently during training

specifically to preserve its goals. Like that is power seeking. That is like pretty clear power seeking behavior. You can quibble around at the margins on it, but it's at least, or I should say it's at least highly suggestive of that. And I think it's getting pretty hard to argue that the body of evidence we have doesn't point in the direction of the, you know, this is, we're moving away from toy examples now and into stuff that is happening unprompted. Like it's happening without

a much explicit hard-coded pressure into the training routine. Exactly. If you zoom out, it's like saying that the LLM almost has preferences or almost opposes what it is being told to do in a very abstracted sense. So without giving it too much humanization, etc.,

this is demonstrating that as these LLMs get more intelligent and as you tell it, in particular in this case, you tell it during training, you're supposed to do this, this is your objective. Well, if it's been trained previously to do something else and trained to reason in a certain way, that, as you said, has some stickiness.

And out of the next paper, not on alignment, but on optimization and efficiency, Meta is introducing the Byte Latent Transformer, a tokenizer-free model that scales efficiently.

So a bit of background tokenization is when you take your text and convert it to tokens, tokens are a basic input and output for a large language model. You can think of it, typically tokens are like a couple of characters.

You could have tokens that are each individual character, as an example, and that is doable but becomes very inefficient. So you can't scale as well. If you treat every individual character as a token, then for a number like 500 million, now you have a very long...

input or for a word that's a common like the, you could treat your entire word as a token as opposed to as free tokens. And that allows you to scale much more. And so because of that, pretty much all modern LLMs are being trained with some sort of tokenizer, often with byte pair tokenizer.

There are some disadvantages to that because you have a fixed vocabulary of tokens and you are allocating the same amount of compute to every single token, even if some tokens are sort of more obvious than others as far as your output. So here they are proposing a way to essentially dynamically create tokens in a sense where you start with bytes.

And then you have a small model that creates what they call patches, which is grouping bytes into variable sized patches based on the data complexity. And they go into some details here on measuring entropy and being able to essentially allocate more compute for more surprising or unexpected elements of prediction.

And the main result is you can scale more efficiently than with traditional tokenization. So you do need a slightly more complex architecture overall. You need an input, you know, byte stream. Then you need to create patches. The model would output patches and you need to de-patch it.

And these patches are all in the latent space, so these are not the same as individual tokens. But it showcases that you can get away from tokenization and then be more dynamic in how you process text, which could be a pretty big deal. And so the architecture is the word I'm looking for. So the architecture relies on, as you said, this thing called an entropy model. It's just like, basically...

You can think of it as like the tokenizer model, like the thing that predicts where the tokens should go or how they should be grouped. And it's actually basically just like a language model on its own. They train it separately. They don't train it along with the rest of the model, which is itself kind of an interesting decision. And its only purpose is to decide where patch boundaries should be. So what should qualify as a token and what shouldn't. It's like a 100 million parameter transformer. So tiny, tiny thing.

And basically what this thing does is it couples to another transformer model they call the local encoder. And the local encoder is going to take in the raw bytes plus the patch boundary information that it gets from this entropy model. And it's going to process that using cross attention. So basically like that's where you're going to start to use your attention mechanism before feeding these to the big encoders.

kind of Mac daddy model, the global transformer. So the global, this is kind of a sandwich arrangement where you've got this like little dinky entropy model that's figuring out where your tokens, your patches go, and then your local encoder that takes in that data and, and does attention on it. And then, so that's one thin layer that feeds into a big global transformer. So this whole thing is 6 billion parameters, but the global transform is like 3%.

sorry, is 8 billion parameters. The global transformer is 6.4 billion of those parameters. So by far, this is kind of the meat in the sandwich. And so essentially, this is going to be a standard transformer that's just going to operate on patches instead of tokens.

And, yeah, it's going to do most of the interesting computation before passing things on to the local decoder, which has to go from patch representations back into bytes, which in turn you can transform into characters. So the interesting thing about this is the entropy model is trained separately. So it is not trained with gradient descent all the way through. But basically,

but it is a sort of separate abstraction. It's got some advantages, which are kind of cool. So as you said, like usually when you have a transformer architecture with tokenization, every token gets the same amount of compute through the kind of main transformer. But in this case, with this architecture that they call BLT for short,

you're actually going to group together simple kind of easy sequences. And they're going to, they're going to be in one big patch. So you only have to kind of pass them through the system once, which reduces your overall compute requirement. Cause you're not going to use that big global transformer on, on a bunch of little sequences. You'll, you'll group together. Sorry. Yeah. You'll group together all these, like, anyways, like,

compound tokens, if you will, when you pass them through. So as an example, if you think about a sentence like the cat sat on the mat, a conventional tokenizer might say the is one token, cat is one token, sat is one token, and so on. But with BLT, the cat might be one thing because it's kind of one...

one compound sort of readily predictable thing. Sat on, right, might be another one that's just one compound thing, partly because once you know the word sat is there, the word on is easier to predict. And then the mat might be another. So you've just reduced...

the number of tokens you have to process from, you know, from seven to three. And so anyway, this is a really interesting development and they have all kinds of results about why it's actually more efficient. One challenge is that this is a fundamentally different architecture. So your actual like

The number of flops that you need to train this model does go down thanks to the efficiencies we just talked about. But the wall clock time that you get, the improvement in wall clock time might actually be less dramatic than suggested by the number of flops because the architecture just isn't as optimized for current hardware as traditional transformers are. So, you know, it's kind of this idea of the hardware lottery that we've talked about before. If this is going to take off, you really are going to need to see more custom hardware.

They do have a bit of discussion in the paper on adapting tokenization based models to tokenization free. So that could be another thing they could try, for instance, just by taking a pre-trained model and then kind of adapting the weights. Not super clear yet if that could work, but that is something they suggest as further work. And

And I do like sometimes to call it previous work. And in this one, they do cite a paper that was from previously this year, and it has a fun title, Space Byte, towards deleting tokenization from large language modeling. So that paper basically says,

treated every word every like thing between the space as a kind of patch of its own that isn't great either because then you know you may get into tokens that aren't great so the really cool thing here is that dynamic patching that is learned and can work uh

better than some sort of hard-coded strategy. And also that idea of using entropy as a key way of doing it.

All right, on to the lightning round. We are going to try to speed up. We've got a report from Apple AI, which we've had a few of right now. This one is on frontier language models becoming smaller. And it is essentially documenting a trend we've already seen and have seen quite a lot this year that since GPT-4, GPT-4.0,

roughly when we saw like a 2 trillion parameter model, something like that, up to that point, we have gone bigger and bigger, like GPT-2, GPT-3, GPT-4, each of them went up in parameter size by a factor of 10 or even more, like 150. Well,

It appears to be the case that the models, not just on the small language model side, but also in general with models like GPT-4.0 and Cloud 3.5 Sonnet are smaller than GPT-4. They have fewer parameters, like 4.0 probably has around 200 billion parameters and Sonnet has maybe around 400 billion parameters.

something that isn't necessarily fully known, but is estimated in this report and is interesting in the context of scaling trends and the overall trends of AI progress.

Yeah, the evidence for this that they cite for the reversal, which we kind of, I would say, vibed out. It's been clear from the vibes. But the evidence for it is twofold. So one is you see this in open source models. So you're seeing the best open weights models are now Mistra Large 2 and Lama 3.370B.

which have 123 billion and 70 billion parameters, and they're dense transformers, but with fewer parameters than even GPT-3. So that's noteworthy. The second source of evidence is just in the cost of serving these, or sorry, just the cost of these charged by OpenAI and others for their models. So we've seen the original GPT-4 was $60 per million output tokens.

Now we go on to GPT-4 Turbo, which was $30 per million output tokens. Then GPT-4.0 is $10 per million output tokens. Now, hardware improvements are obviously a huge, huge part of that. That's clear as are algorithmic improvements. But this certainly does suggest that we haven't seen continued radical scaling of the sort you might have assumed following scaling curves previously.

Sorry, scaling and model parameter count. That's really important. We have seen scaling in compute. Anyway, they do a bunch of analysis based on assuming that people are using H200s to do inference, and this leads to the conclusion that things are stalling out in terms of model size.

Some reasons why this is happening, they highlight. First off is we saw this with GPT-3, the Kaplan scaling laws, the scaling laws for neural language models paper that came out, had a particular scaling recipe where they said that for each billion parameters that you add to your model, you need to train with this many flops, with this many more tokens.

And well, when the chinchilla scaling laws came out later, they revealed that actually the compute optimal way of doing this involves scaling your parameter count more slowly. And so the parameter count scaling kept going, but it proceeded more slowly because of that switch over to chinchilla. And there was a one-time sort of drop in parameter count for the next generation of models as people realized, well, wait a minute, for my compute budget, I actually, I guess, should be training a smaller model. The other reason is productization.

Cost of inference is a big, big deal. It makes sense as a result to overtrain a smaller model so that you then have a smaller model that is maybe overpowered for its size, at least based on the traditional chinchilla scaling laws, but now you have a smaller model to serve, so it's cheaper for inference. That's a big deal in a world of AI productization.

Test time compute scaling compounds that trend, right? Because now you're going to call the same model many, many times. Do a lot of inference on the same model. It better be small. It better be cheap. That's another reason to opt for smaller models. Synthetic data generation is sort of similar. So one place where this lands or the place where this lands is the question, will this then continue, right?

What about the future? And the answer, if you're tracking those reasons, is pretty clearly that actually we are going to see a resumption of scaling. Like that's pretty clear. All of these patterns have the shape of a one-time step back.

The shift from Kaplan to Shinchilla scaling loss, that's a one-time reset. Productization is an adaptation of the market, which will then... It's an incentive to take a step back on the scaling curve, but then you're still on the scaling curve. Same with test time compute scaling. As hardware gets cheaper, as the demands for higher quality outputs get higher, you're going to see a resumption, presumably, of the scaling trend. And so don't expect...

these like 10 trillion, 100 trillion parameter models to be out of reach indefinitely. They are coming, very likely. It's just a question of what. Exactly, yeah. I think what you've seen with, you know, obviously with investments in massive AI clusters,

But also research into the ability to do better with higher quantization, with scaling laws, as you said, is there was this understanding that arose of how much we can squeeze out of a certain set of parameters and a variety of techniques. Now, I think we're getting to a point where a lot of the efficiency gains have been realized. And so we, as you say, are likely to get back to scaling.

And one more paper, this one is a little more theoretical and interesting, so we'll have to avoid explaining it. So it's titled The Complexity Dynamics of Grokking. Grokking is a name for a phenomenon in training where essentially for a while you don't do so good on a task and then all of a sudden you start being really good. So instead of kind of gradually improving over time,

There is a sharp improvement. And this paper is looking into essentially why that happens and introduces a complexity measure that essentially tries to compress the neural network to see how complex it is.

And it kind of confirms a general understanding that there is a change in paradigm where initially the model does a lot of memorization to be able to output the correct outputs for a given input. And then at some point, because it needs to do well while being restricted from memorizing via regularization, it switches to reasoning or generalization paradigm. And that's where you see the sharp improvement. And also they see a sharp

of complexity as you move away from memorization to generalization. So, yeah, interesting theoretical result and it does lead to kind of a theory-backed regularizer for training. Yeah, it's super, super interesting paper. You know, just the intuition for this, you can think about what the world looked like in the Renaissance when the early scientists were kind of like going out and gathering a bunch of data about physics, about biology, about chemistry and

It's just like this like long list of facts and your picture of the world is a super complicated thing because you got to remember, like memorize this like giant, like Wikipedia type corpus of all the little facts that you know. And then somebody like Isaac Newton comes around and does F equals MA, right? Or invents calculus. And suddenly a whole bunch of complexity gets abstracted away and you realize, oh, well, actually these are all just individual manifestations of a core and simple idea, right? That feeling where you go, ah,

Ah, that's grokking. And that's essentially a moment of compression, right? You're taking a whole bunch of complexity, compressing it into a simple theory of the case about how the universe works. That is exactly the rise and fall in these complexity curves that we're looking at. I think what's super interesting is, or at least to me, confusing even, when you look at these curves, what they're actually plotting here is a measure of the information complexity of the neural network.

They're trying to, I guess, kind of mimic Kolmogorov complexity, which is this very abstract notion that we don't need to talk about, but that can't actually be computed. So they use a proxy for it that has to do with, roughly speaking, the entropy of the neural network.

And what I'm confused by is that that entropy starts at zero. And if there's somebody who's read this paper who can explain, like, I did not understand from this paper why the entropy starts at zero. I get why it rises. Complexity increases as, you know, the model is trying to memorize and account for all these observations and then drops as it has that aha moment where it comes up with a theory that generalizes or an understanding.

But at the beginning, entropy, to my mind, should be high at the outset. Maybe there's an initialization thing going on where there's a low complexity state they're initializing the networks to. But it wasn't clear to me from the paper why that's the case. So this is a moment

where if anybody who's worked on the paper or whatever, I'd love to get some insight there. But it is fascinating. It's in a way very intuitive and very important for our understanding of generalization in world models and language models.

And by the way, for anyone who doesn't know, grok is a very nerdy term that basically means to understand. It's from a 1961 science fiction novel, Stranger in a Strange Land. Oh, I didn't know that. Which is a real classic. Yeah. And that's why you have groking is a term for phenomena and also grok from XAI and also grok with a Q. All these nerds are...

Tapping into science fiction and sort of a term for some sort of intelligence. Alrighty, moving on to policy and safety. We begin about a story from the U.S. federal government. And it's about how Homeland Security is getting its very own generative AI chatbot. So there's been a few announcements and this one I found interesting. There is this DHS chat.

that is being introduced and being made accessible to the 19,000 employees of the department. They were already allowed to use ChaiGPT on cloud, but this was built internally and now operates on the DHS secure infrastructure. So it can then help employees summarize complex documents and reports, do all the usual kind of stuff. So

Curious to see that there is this internal development of models within the US government. There was also another little new story, bipartisan house task force said that agencies need to do more to bring AI experts into a federal workforce. They apparently made more than 200 AI hires this year, US agencies, and they are looking to do more.

Yeah, that's obviously a big, big problem for the US government right now is achieving a basic level of competence and understanding of AI. And so especially coming to the next administration, I think they're going to have an opportunity to staff up and really focus down on that. There was also this note too, this new bill that seeks to prohibit the Department of Defense from doing business with companies that provide IT services to China. So this is at

Kind of more in that same vein, a lot of the AI stuff has run through similar concerns. But yeah, the DHS thing, interesting that that's, I guess, the top story they've highlighted in this roll-up. We've seen similar things happen at DoD, and that's actually led to tensions, right? Because there are outside...

firms that have tried to develop custom chatbots for the DoD. And there's been complaint that while now DoD is rotating into its own internally built systems, having learned presumably from these firms that have built products in a bespoke way for them. I just remember having seen there was an article in, I forget if it was Bloomberg or somewhere else about that. But at the end of the day, this is going to happen. It's important for the government to have this kind of competence

and these tool sets that they can build for themselves for security reasons. And just a couple more stories. One of them is on the pre-deployment evaluation of OpenAI's O1 model. So this is kind of a topic that has been ongoing, the idea that governments should be able to evaluate data

models for safety, especially these frontier models, to test them and see that they're not doing anything dangerous before they're made available to the public. So in this story, we find that the UK and US AI Safety Institute apparently conducted a joint pre-deployment evaluation of the O1 model, focused on cyber, biological, and software development capabilities.

And also compared it to reference models like GP4-0 and Cloud 3.5 Sonnet. And as we've seen in previous kind of ideas, O1 can do some advanced cybersecurity. And in this case, actually better than the reference model.

Although with biological capabilities, it wasn't significantly better, although it could be better when using tools. So perhaps the start of a trend for models like O1. Yeah, it's also interesting to see the UK AI Safety Institute, US AI Safety Institute kind of collaborating so closely on this, as they've said they would, kind of developing these independent domains of specialization. They also highlight that although they've surfaced some of these capabilities before,

that it's sort of a lower bound on the actual capabilities of these models, because obviously you can fine tune them. You can, you know, you can add scaffolding, agentic scaffolding that reveals new capabilities. So there's this sort of like awkward recognition here that we can only do so well with a test that we have, but still, you know, good that they're able to audit anthropic open AIs models. It's, you know, anyway, I think there's going to be a recurring challenge that they'll have to find ways to crack, but...

Yeah, the main take home here really is, oh, one, as you said, larger, superior performance on the cyber benchmarks that they tested. And especially on challenges related to cryptography, which is sort of interesting, but otherwise more or less falling in line with the performance of previous models that they've tested.

And one more story. Pricing for key cheap-making material hits a 13-year high following Chinese export restrictions. So we covered, I believe it was last week, that in retaliation to US policy, China has restricted exports to China.

a few things including gallium, and the prices to it have now surged to $595 per kilogram, the highest they have been since 2011. And as we covered, I believe this is a significant need, a significant material that's needed for some things. And

China is responsible for 94% of global gallium production. So it's not too surprising that the export policy has led to the price hikes. The prices have jumped 17% in a single week's

And yeah, there's going to be a rush to secure alternative sources and figure out how to get access if you're not able to do it from China. Yeah. I mean, this is just like a complete self-own. Like we've had a long time to sort out our critical minerals strategy domestically and just haven't. And so this is just going to have to change.

Gallium is important, by the way. Gallium nitride is used a lot in power delivery systems for AI accelerators. And so because you need really, really efficient power management because of the power consumption profiles of these chips. And then you also see sometimes gallium arsenide get used for interconnects and some RF functions. So these are pretty important components.

in a lot of different ways for the actual performance of, of high end chips. So this is like not a small deal. Um, they saw, I think you might've mentioned this at 17% jump in the, the price of, uh, of gallium on the market as of, uh, or in one week in, in December in this month. So pretty, uh, pretty wild stuff, $595 per kilogram, which is,

I don't track the price of gallery. It's a big number, I guess. I don't know. And one more story, this one in synthetic media and art. And it is about Meta introducing a new watermarking tool for AI-generated videos. So this tool is called Meta Video Seal. And it is meant to watermark AI-generated videos

It's similar to other tools like Watermark Anything, AudioSeal, and SynfID, as we've mentioned, and is meant to be a more robust solution for video watermarking in particular to work with things like video compression and being able to scale up. So as with other watermarking techniques, it will embed hidden messages and videos that will make it possible to trace their origins.

and work even when you try to blur or crop or compress the video. So as far as I know, perhaps not a super solved problem. We've seen a lot more watermarking for images and text. So it will be interesting to see if this has any impact, I suppose.

Yeah, I mean, they're certainly claiming that right now robustness to compression is one of the big differentiators here and efficiency to run at scale. So that's cool. The classic trade-off in this space is like how perceptible the watermarks are and then the resilience to manipulation, right? The more resilient you make these things to manipulation, often the more they leave visible artifacts.

And yeah, so you're always balancing those two things. And this is going to presumably do Pareto better on that trade-off, but we'll have to see. And with that, we are done with the episode. We got through a decent number, not all the articles you plan to do, but that

Does happen sometimes. So you can find more articles on lastweekin.ai and you can find all of the articles we discussed here with links in the episode description and at lastweekinai.com.

As always, we do appreciate it if you comment. It is cool to see questions, actually, like we question about quantum computing. So feel free to do more of that on Substack or YouTube or elsewhere. And of course, reviewing us won't hurt either. We do love those five-star reviews. But more than anything, do keep listening and do enjoy this outro song.

New horizons rise, they have futures ignite With the power of video, we're reaching new heights Genetic thinking, past boundaries it breaks A million strong clusters, the future we made The day I lead the way It's a new year, a new day

♪ Rock the future tonight ♪ ♪ In the AI spotlight ♪ ♪ Shining bright in the spectre of the night ♪ ♪ Day air revolution under city lights ♪ ♪ Video visions come alive on the screen ♪ ♪ Gemini 2 with precision unseen ♪

Through a million cheap views, the future unfolds. And there's distance where stories untold. Let's ride the wave of this AI parade. In the new year's glow, watch the donk escape.

Deep-blooding dreams to realities bright. The nature's flapping, futures with light. Through one million machines, the rhythms collide. AI's in for the deep place, with no way out. Unleash the spark in this AI's cabane. Pop the circuits tonight, where futures are made. Stiff wave echoes usher in the new dawn.

Gemini's brilliant guiding us through to a world where

AI and human creation much more.

In this era of change where dreams intertwine, video creators crossing the line. Where Gemini's craze wants to fix us unfold, it's Hecate dreams new stories are told.

Hey!

Cause we will make the top of the glass structure in future career. The neon lights guide us through the future private thing. In this bright day I feel parade. We're marching forward, dreams never fade. Every pixel, every grain. The future's wild and untamed.

So raise your voice, let the AI evolution ring Everything's new, a brilliant awakening From this digital dawn we saw tonight AI's chorus, a guiding light

#194 - Gemini Reasoning, Veo 2, Meta vs OpenAI, Fake Alignment 01:59:55 Share