
AI Computing Hardware - Past, Present, and Future

2025/1/29

Last Week in AI

People
Andrey Kurenkov
Jeremy Harris
Topics
@Andrey Kurenkov: I study AI and now work at an AI startup. I studied software and AI and train algorithms, so my understanding of hardware is relatively limited; I mainly use GPUs and broadly know what they do. Early AI and hardware go back to Turing's era; even before general-purpose computers existed, people were thinking about AI. The Turing test (the imitation game) was a tool for gauging when we might achieve AI or AGI, and it is still widely discussed today. The 1950s saw the earliest AI programs, for example a checkers-playing program. Marvin Minsky built a hardware neural network called the Stochastic Neural Analog Reinforcement Calculator (SNARC) that simulated rats learning in a maze, an early simulation of reinforcement learning. Early AI hardware was all custom-built: SNARC had 40 neurons, each with six vacuum tubes and a motor, and the machine was the size of a grand piano. Early computing was highly customized; scalable, modular computing only became possible once Intel arrived. The IBM 701 and 702 were early mainframes, and Arthur Samuel wrote a checkers program for them, an early example of machine learning. The Perceptron, built by Frank Rosenblatt in 1958-1959, was an early demonstration of neural networks that could learn to distinguish shapes. In the 1980s there was custom hardware for expert systems, and Lisp machines, used for logic-based AI and search. Deep Blue was custom hardware IBM built to play chess, a demonstration of raw compute rather than machine learning. In the 1980s and 90s, as Moore's Law continued, people returned to neural network research, but training still ran on CPUs with no parallel computing. From the late 1990s into the 2000s, GPUs began to be used for scientific applications, including neural network training; AlexNet is the landmark example. In 2012 the AlexNet paper used GPUs to train a large neural network and achieved a breakthrough on the ImageNet benchmark. In the mid-2010s the rise of deep learning drove investment in GPUs and data centers; Google invested in TPUs, developing the first custom AI chip. OpenAI pushed AI forward by using much larger models (billions of parameters). GPT-3, scaling laws, and in-context learning marked the arrival of large-scale neural language models. ChatGPT brought large language models mainstream attention and drove demand for large data centers and energy. Large data centers train models across many GPUs or TPUs, which requires complex memory management and data transfer. Early neural network training typically loaded the model into a single GPU's memory, but today's large models require distributed training across many GPUs.

@Jeremy Harris: I am the co-founder of Gladstone AI, an AI company focused on AI and national security, and my work focuses on the potential risks posed by advanced AI, both current and future systems. I look at AI through the lens of hardware because we focus on export controls, for example how to prevent China from acquiring this technology. We study the kinds of attacks that could be carried out against highly secure Western data centers, such as stealing models, altering how models train, or destroying facilities. The future of AI is tightly tied to the future of compute; Moore's Law and Huang's Law describe the growth of computing power. Moore's Law describes exponential growth in the number of transistors on an integrated circuit, though its pace has slowed. Even as Moore's Law slows, it continues, and a different trend has emerged for AI chips (Huang's Law): GPU performance keeps improving exponentially. To understand the difference between Moore's Law and Huang's Law, you need to understand how chips work, including memory and logic. A chip's core functions are memory (storing data) and logic (processing data). Memory and logic improve at different rates: logic improves faster, memory more slowly, which leads to the "memory wall." The "bitter lesson" says that adding compute matters more than improving model architectures, so the focus should be on scale. In AI today, scaling compute, data, and model size is what matters. The Kaplan paper and GPT-3 established scaling laws for neural language models, which de-risked large compute investments and made it easier to raise money to build large compute clusters. OpenAI's early bets, such as reinforcement learning and robotics, did not scale well. The 2017 Transformer paper and the pre-training techniques that emerged around 2018 paved the way for training large models for natural language processing. Pre-trained convolutional network weights can be reused across vision applications, reducing data requirements and training time. GPT models succeeded through large-scale pre-training on language modeling and the Transformer architecture's parallelizability. One reason OpenAI went for-profit was the realization that hardware was crucial and Google had lots of it, making Google the likeliest to reach AGI first. Reasoning models need relatively more memory and relatively less compute, so they may be better suited to older chips. At inference time batch sizes are small because user requests must be answered quickly, so memory bandwidth is mostly spent loading the model rather than user data; inference is therefore more memory-hungry than compute-hungry, and older chips with comparable memory but less compute can be a good fit. Large batch sizes improve GPU utilization and lower cost, but smaller companies with fewer users struggle to fill large batches. GPUs have huge numbers of cores and suit parallel workloads, while CPUs have few but fast cores. Neural network training and inference can be parallelized with data parallelism, pipeline parallelism, and tensor parallelism. GPUs excel at matrix multiplication, which matches what neural networks and 3D rendering require. The GB200 is NVIDIA's latest GPU system, and its architecture and interconnects have major implications for data center design and AI model architecture. The B200 is a GPU, while the GB200 is a system containing multiple B200 GPUs plus CPUs; GPUs need CPUs to coordinate their work. A GB200 tray contains two Bianca boards, each with one CPU and two B200 GPUs. The GB200 uses NVLink to connect GPUs for high-bandwidth communication, mainly for tensor parallelism. Data center networking has a hierarchy: the accelerator interconnect (NVLink, fastest), the back-end network (InfiniBand, fast), and the front-end network (slower). AI model architecture has to be co-designed with the hardware architecture to make the most of the compute. Google's strength in data center buildout and TPU pods lets it train larger neural networks. A GB200 system contains B200 GPUs, CPUs, and other components, and its configuration can be adjusted to need. A GPU contains logic (which does the computation) and high-bandwidth memory (HBM, which stores data). HBM is made by companies like SK Hynix and Samsung, using stacked DRAM dies and through-silicon vias to achieve high bandwidth. The GPU's logic die and its HBM are made by different companies and joined by an interposer (for example TSMC's CoWoS); HBM requires a less demanding manufacturing process than logic. Memory (e.g., RAM) is complex, and the introduction of caching strategies means memory improvements are not just raw speed gains. Computing in a data center is a hierarchy, from high-voltage power down to low-voltage operation at the chip, and memory has a hierarchy too: flash (slow, persistent), high-bandwidth memory (HBM, fast, volatile), and SRAM (fastest, most expensive). HBM uses DRAM technology, which must be refreshed periodically to retain data. SRAM is the fastest and most expensive memory, with sub-nanosecond access times.

Shownotes Transcript

Grab your chips Let's go! Bits and bytes on the zone For Tories, folks, and all onettes

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. And, unlike usual, in this episode we will not summarize or discuss some of last week's most interesting AI news. Instead, this is our long-promised episode on hardware. We'll get into a lot of detail, basically do a deep dive unrelated to any AI news, but I guess related to

the general trends we've seen this past year with a lot of developments in hardware and crazy investments, right, in data centers. So to recap, I am one of your hosts, Andrey Kurenkov. I study AI and I now work at a startup.

Yeah, I'm Jeremy Harris. I'm the co-founder of Gladstone AI, an AI national security company. And I guess just by way of context on my end too on the hardware piece. So, you know, the work that we do is focused on the kind of WMD-level risks that come from advanced AI, current and increasingly future systems.

So my footprint on this is I look at AI a lot through the lens of hardware, because we're so focused on things like export controls. How do we prevent China, for example, from getting their hands on this stuff?

What kinds of attacks, one of the things we've been looking into recently, what kinds of attacks can people execute against highly secure data centers in the West? Whether that's to exfiltrate models, whether that's to change the behavior strategically of models that are being trained, whether that's just to blow up facilities. So a lot of our work is done these days with special forces and folks in the intelligence community, as well as increasingly some data center companies to figure out how do you secure these sites?

and obviously all the kind of US government work that we've been doing historically. So that's kind of my lens on it. And obviously the alignment stuff and all that jazz. So

I guess I know enough to be dangerous on the AI and compute side, but I'm not a PhD in AI and compute, right? My specialization is I know what I need to know for the security piece. And so to the extent possible, I'll try to flag, we'll try to flag some resources and maybe people for you to check out if you're interested in doing those deeper dives on some of the other facets of this, especially compute that doesn't have to do with AI, compute that's not national security kind of related stuff. So hopefully that's useful for you.

Yeah, I guess worth flagging too, on my end: I studied software and AI, I train algorithms, so I have relatively little understanding of how the hardware works. Actually, I just use GPUs and kind of broadly know what they do. But, you know, I'll be here listening and learning from Jeremy as well, I'm sure. I'm sure it'll go both ways. I mean, I'm excited for this. Anyway, yeah, I think there's a lot of opportunity here for us to cross-pollinate.

Let's just get into it. So I thought to begin, before we dive into the details of what's going on today, we could do like a quick historical recap of fun details in the past of AI and hardware. There's some interesting details there. AI and hardware go back to basically the beginning, right? Turing was a super influential person within the world of computing. And then the Turing game, right, is his invention, right?

to try and, I guess, measure when we'll get AI or AGI, as you might say. And that's still widely discussed today. So even before we had actual computers that were general purpose, people were thinking about it. By the way, that imitation game piece, in a way, it's freakish how far back it goes.

I've never read Dune, but I know there's a reference in there to the Butlerian Jihad. And so Butler, back in, like, the 1860s, or, I'm showing off how little I know my dates here, but he was the first to observe, like, you know, hey, these machines seem to be popping up all around us. Like, we're industrializing, we're building these things.

What if one day we start, like, kind of, you know, I don't know if it was like building machines that can help us build other machines. Eventually, will they need us? It wasn't with respect to computers or anything like that, but it's sort of an interesting thing, like when you look back at how incredibly prescient some people were about this sort of thing. Anyway, sorry, I didn't mean to derail, but you're getting at a great point here, that it goes way, way before the days of, you know, early 2000s, people starting to worry about loss of control. Yeah. Yeah.

Yeah, well, you also reminded me that it's called the imitation game. The Turing game is not a thing. There's the Turing test, which was originally published as the imitation game. Anyways, so yeah, it was conceptually, of course, on people's minds for a very long time, the concept of AI, robotics, etc. But even as we go into the 50s and get into actual computing, still with vacuum tubes, not even getting to semiconductors yet,

There's the beginnings of AI as a field in that time. So one of the very early initiatives that could be considered AI was this little program that played checkers. And you could go as early as 1951 there where someone wrote a program to do it.

And then, yeah, there's a couple examples of things like that in that decade that showcased the very first AI programs. So there was a program from Marvin Minsky, actually, called the Stochastic Neural Analog Reinforcement Calculator. I actually just learned about this in doing the prep for the show. I found it quite interesting. This was actually kind of a little neural net that Marvin Minsky built and

in hardware and it simulated rats learning in like a little maze and trying to simulate reinforcement learning as there were also theories coming out about human learning, brain learning, et cetera. And to give you some context,

There were maybe 40 neurons, I forget, a small number. Each neuron had six vacuum tubes and a motor. And the entire machine was the size of a grand piano, with around 300 vacuum tubes. So they had that early example of kind of a custom-built computer for this application. That's actually one thing too, right? In the history of computing...

Everything was so custom for so long. That's something that's easy to lose sight of. The idea even of building these very scalable systems

modules of computing, you know, having ways to integrate all these things together. That wasn't until really Intel came into the game. That was their big thing at first, as I recall. The thing that Intel broke in with was like, hey, we'll just come up with something that's not bespoke, so it won't be as good at a specific application, but boy, can it scale. All the time before that, you have all these, like you said, ridiculously bespoke kind of things. So it's more almost physics, in a sense, than computer science, if that makes sense. Yeah. Yeah.

Exactly. Yeah, it was a lot of like people pulling together and building little machines, right, to demonstrate really theories about AI. There is a fun other example I found where there is the famous IBM 701 and 702, right? IBM was just starting to build these massive mainframes that were kind of a main paradigm for computing for a little while, especially in business.

So the IBM 701 was IBM's first commercial scientific computer

And there is Arthur Samuel, who wrote a checkers program. And it was maybe the first, definitely one of the first, learning programs that was demonstrated. So it had very kind of primitive machine learning built into it. It had memorization as one idea, but then also some learning from experience. And that's one of the very first demonstrations of something like machine learning, and

Then famously, there's also the Perceptron that dates to 1958, 1959. And that is sort of the first real demonstration, I would say, of the idea of neural nets, famously by Frank Rosenblatt. Again, a custom-built machine at that point that had kind of these... If you look, there's photos of it online, and it looks like this crazy...

tangle of wires that built a tiny neural net that could learn to differentiate shapes. And at the time, Rosenblatt and others were very excited about it. And then, of course, a decade later, kind of the excitement died out for a little while.

And then there is some interesting history, which we won't be getting too deep into, later in the 80s with custom-built hardware. There was custom hardware for expert systems that was being sold and bought for a little while. There was this thing called Lisp machines, where Lisp was a pretty major...

language in AI for quite a while. It was developed kind of to write AI programs. And then there were custom machines called Lisp machines that were utilized by, I guess, scientists and researchers that were doing this research going into the 70s and 80s, when there was a lot of research in the realm of, I guess, logical AI and search and so on, symbolic AI.

Then again, continuing with a quick recap about the history of AI and computing, we get into the 80s and 90s. So the Lisp machines and the expert system hardware died out. This is where sort of, as you said, I guess this was the beginning of general purpose computing proper, with Intel and Apple and all these other players involved.

hardware that doesn't have to be these massive mainframes, that you could actually buy more easily and distribute more easily. And so there's kind of fewer examples of hardware details aside from what would become Deep Blue in the late 90s. IBM was working on this massive computer specifically for playing chess. And I think a lot of people might not know this, that Deep Blue

wasn't just a program. It was like a massive investment in hardware so that it could do these ridiculously long searches. It was really not a learning algorithm to my knowledge. Basically, it was doing kind of the...

well-known search-with-heuristics approach to chess, with some hard-coded evaluation schemes. But to actually win at chess, the way it was done was to build some crazy hardware specialized for playing chess. And that was how we got that demonstration, without any machine learning of the sort we have today.

Let's finish off the historical recap. So of course, we had Moore's law all throughout this. Computing was getting more and more powerful. So we saw research into neural nets making a comeback in the 80s and 90s. But I believe at that point, people were still using CPUs and trying to train these neural nets without any sort of parallel computing as is a common paradigm today.

Parallel computing came into the picture with GPUs, graphics processing units that were needed to do 3D graphics, right? And so there was a lot of work starting in around the late 90s and then going into 2000s. That's how NVIDIA came to be by building these graphics processing units that were in large part for the gaming market. And then kind of throughout the 2000s,

Before the 2010s, a few groups were finding that you could then use these GPUs for scientific applications. You could solve, for instance, general linear algebra problems.

And so this was before the idea of using it for neural nets, but it kind of bubbled up to a point that by, I think, 2009, there was some work by Andrew Ng applying it. There was the rise of CUDA, where you could actually program these NVIDIA GPUs for whatever application you want. And then, of course, famously in 2012, there was the AlexNet paper, where...

We had the AlexNet neural net, one of the first deep neural nets that was published and destroyed the other algorithms being used at the time on the ImageNet benchmark. And to do that, one of the major, actually, novelties from the paper and why it succeeded was that they were among the first to use GPUs to train this big network. Probably they couldn't have otherwise.

They used two NVIDIA GPUs to do this, and they had to do a whole bunch of custom programming to even be able to do that. That was one of the major contributions of the students involved. And that was kind of when I think...

NVIDIA started to get more into the GPU-for-AI direction. They were already going deeper into it. They wrote cuDNN, C-U-D-N-N. C-U-D-N-N, yeah. Yeah, yeah, yeah. And they were starting to kind of specialize their hardware in various ways. They started creating architectures that were better for AI, you know, the Kepler architecture, Pascal, et cetera.

So again, for some historical background, maybe people don't realize that way before GPT, way before ChatGPT, the demonstrations of deep learning in the early 2010s were already kind of accelerating the trend towards investment in GPUs, towards building data centers. Definitely by the mid 2010s,

It was very clear that you would need deep learning for a lot of stuff, for things like translation. And Google was already making big, big, big investments in it, right? Buying DeepMind, expanding Google Brain, and of course, investing in TPUs in the mid 2010s. They developed the first customized AI hardware, the first custom AI chip, to my knowledge.

And so throughout the 2010s, AI was already on the rise. Everyone was already in kind of the mindset that bigger is better, that you want bigger neural nets, bigger data sets, all of that.

But then, of course, OpenAI realized that that should be cranked up to 11. You shouldn't just have 10 million or 100 million parameter models. You got to have billion parameter models. And that was their first challenge.

Well, they had many innovations, but their breakthrough was in really embracing scaling in a way that no one had before. I think one of the things too that's worth noting there is this rough intuition. And you can hear people, pioneers like Jeff Hinton and Andrew Ng, talk about the general sense that more data is better, larger models are better, all this stuff. But what really comes with the Kaplan paper, that famous Scaling Laws for Neural Language Models paper, and the proof point that was GPT-3,

And GPT-2, in fairness as well, and GPT-1. But what really comes from the GPT-3 inflection point is the actual scaling laws, right? For the first time, we can start to project with confidence how good a model will be. And that makes it an awful lot easier to spend more CapEx, right? Now, all of a sudden, it's a million times easier to reach out to your CTO, your CEO and say, hey, we need...

$100 million to build this massive compute cluster, because look at these straight lines on these log plots, right? So it kind of, like, changed the economics, because it decreased the risk associated with scaling. That's right. And I think the story of OpenAI, in hindsight, can almost be seen as the search for the thing that scales, right? Because for the first few years, they were focusing on reinforcement learning. Some of their major kind of

PR stories, you could say, but also papers, was working on reinforcement learning for Dota, for video game Dota. And then even at the time, they were like using a lot of compute, really spending a lot of money training programs, but in a way that didn't scale because reinforcement learning is very hard and you can't simulate the world very well.

They also were investing in robotics a lot, and they had this whole arm, and they did a lot of robotic simulations. But again, it's hard to simulate things, so that wouldn't scale either. Evolutionary algorithms were another thread, right?

Yeah, they did a whole bunch of things, right, from 2015 up through 2018. And then 2017 was the Transformers paper, of course. And then around 2018, the whole kind of idea of pre-training for natural language processing arose.

So from the very beginning, or okay, not very beginning, but pretty soon after AlexNet and around 2014, people realized that if you train a deep convolutional neural net on classification, you can then use those embeddings in a general way. So the kind of intelligence there was reusable for all sorts of vision applications.

and you can basically bootstrap training from a bunch of weights that you already trained. You don't need to start from scratch, and you don't even need as much data for your task. So it didn't happen in natural language processing until around 2017, 2018. That was when language modeling was kind of seen or found out by a few initiatives as a very promising way to pre-train

weights for natural language processing. BERT is one of the famous examples from around that time. And so the first GPT was developed in that context. It was one of the first big investments in pre-training a transformer on the task of language modeling. And then OpenAI, I guess, we don't know the exact details, but it seems like they probably were talking internally and

got the idea that, well, you know, this task, you can just scrape the internet to get all the data you want. So the only question is how big can you make the transformer? The transformer is a great architecture for scaling up because you can parallelize in GPUs unlike in RNNs. So that was kind of necessary in a way.

And yeah, then we got GPT-2 in 2019. That was like almost a 2 billion, well, a 1.5 billion parameter model, by far the biggest that anyone had ever trained. And even at the time, it was interesting because you had these early demos and

like on the blog where it wrote a couple paragraphs about that unicorn island or whatever. Already at that time, there was discussion of, like, the safety implications of GPT-2 and misinformation and so on. That was an anomaly by then, right? Because they'd open sourced GPT, well, GPT-1, right?

And they had set this precedent of always open sourcing their models, hence the name actually, OpenAI. GPT-2 was the first time they experimented with what they at the time called this staged release strategy, right? Where they would release incrementally larger versions of GPT-2 over time, monitor how supposedly they were seeing them get maliciously used, which failed.

It was always implausible to me that you'd be able to tell if it was being used maliciously on the internet when it's an open source model, but okay. And then ultimately, yeah, GPT-3 was closed. So yeah, they followed, as you say, that kind of smooth progression. Yeah, speaking of that, the lead up to GPT-2,

Also, there's what we know now from looking at the emails in the OpenAI versus Elon Musk case. It was never the plan. Yeah. Some of the detail there is that the conversations in 2018, and why they started to go for-profit, came down to this general kind of belief that hardware was crucial, that Google had all the hardware, and so Google would be the one to get to AGI. Right.

And so they needed the money to get more hardware to invest more in training. And that's what kicked off all these for-profit discussions in 2018 and led eventually to Sam Altman somehow securing 10 billion from Microsoft. I forget when this was announced,

maybe 2019. I think that there was an initial $1 billion investment that I think was 2019. And then there was maybe a 2021-ish, $10 billion, something like that. Okay, yeah, that sounds right. Sounds like $1 billion is more reasonable. So yeah, I think OpenAI was one of the first to really embrace the idea that you need what we now know as massive data centers and

crazily parallelized training for crazily large neural nets. And they already were going down that route with the Dota agent, for instance, where they're training in very large clusters. And even at that time, it was very challenging. Anyways, then we get to GPT-3, we get to 175 billion parameter models, we get to scaling laws, and we get to in-context learning.
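As a rough illustration of what those scaling laws buy you, here is a minimal sketch of a Kaplan-style power law in model size. The constants are approximate values quoted from memory and should be treated as illustrative placeholders rather than the paper's exact fit.

```python
# Hedged sketch of a Kaplan-style scaling law: loss falls as a power law in
# parameter count N, L(N) = (N_c / N) ** alpha. Constants are illustrative
# placeholders, roughly in the ballpark of the published fit, not exact values.
def projected_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1.75e11]:  # 100M up to a GPT-3-sized 175B
    print(f"{n:>9.2e} params -> projected loss ~ {projected_loss(n):.2f}")
```

The point of the straight line on a log-log plot is exactly this: you can extrapolate to a model size you have not trained yet before committing the CapEx.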

And then by that point, it had become clear that you could scale and you could get to very powerful language models. And the whole idea of in-context learning was kind of mind-blowing.

Somehow, everyone was still not kind of convinced enough to invest. Like looking back, it's kind of interesting that Meta and Google and so on weren't training massive neural net language models. I think internally Google was to some extent, but they were not trying to commercialize it. They were not trying to push forward.

And then of course you had ChatGPT in 2022, with GPT-3.5 I think at the time, that blew up, and now everyone cares about massive neural nets, massive language models, and everyone wants massive data centers and is fighting over the electricity needed to fuel them. Elon Musk is buying a hundred thousand GPUs, and hardware is like a huge, huge part of the story, clearly.

Yeah. By the way, the story of hardware is, in a sense... I mean, we are talking about the story of the physical infrastructure that very plausibly will lead to superintelligence in our lifetime. So I think there almost isn't anything more important that's physical to study and understand in the world. We're also lucky because it's a fascinating story.

Like, we're not just talking about egos and, you know, billionaires' dollars chasing after this stuff. At a scientific level, it's fascinating. At a business level, it's fascinating. Every layer of the stack is fascinating. And that's one of the reasons I'm so excited about this episode. But you framed it up really nicely, right? What is this current moment? We have the sense that scaling, in the form of scaling compute, scaling data, and scaling model size, which is relatively easy to do, is king, right? So the bitter lesson, right? The Rich Sutton argument that...

came out right before Scaling Laws for Neural Language Models, in like the 2019 era, says basically: hey, you know, all these fancy AI researchers running around coming up with new fancy architectures and thinking that's how we're going to make AGI. Unfortunately, I know you want that to be the path to AGI. Unfortunately, human cleverness just isn't the factor we would have hoped it was. It's so sad. It's so sad. You know, that's why it's the bitter lesson.

Instead, what you ought to do, really, this is the core of the bitter lesson, is get out of the way of your models. Just let them be. Let them scale. Just take a dumbass model and scale it with tons of compute, and you're going to get something really impressive. And he was alluding in part to the early successes of language modeling, and also reinforcement learning. So it wasn't clear what the architecture was that would do this. Very soon it would turn out to clearly be the Transformer.

But you can improve on that. Really, the way to think about models, or architectures, is that they're just a particular kind of funnel that takes the compute you pour in at the top and shapes it in the direction of intelligence.

They're just your funnel. They're not the most important part of it. There are many different shapes of funnel that will do, many different aperture widths and all that stuff. And you know, if your funnel is kind of stupid, well, just wait until compute gets slashed in cost by 50% next year or the year after, and your same stupid architecture is going to work just fine.

right? So there's this notion that even if we are very stupid at the model architecture level, as long as we have an architecture that can take advantage of what our hardware offers, we're going to get there, right? That's the fundamental idea here. And what this means at a very deep level is that the future of AI

is deeply and inextricably linked to the future of compute. And the future of compute, that starts having us ask questions about Moore's law, right? Like this fundamental idea, which by the way, I mean, going historical just for a brief second here to frame this up, this was back in 1965. Moore basically comes up with this observation. He's not saying it's a physical law. It's just an observation about how the business world functions and how...

or at least the interaction between business and science, we seem to see, he says at the time, that the number of components, the number of transistors that you can put on an integrated circuit on a chip

seems to double every year. That was his claim at the time. Now, we now know that that number actually isn't quite doubling every year. Moore, in fact, in 1975 came back and he updated his time frame. He said, "That's not every year. It doubles every two years." And then there was a bunch of argument back and forth about whether it should be 18 months. The details don't really matter. The bottom line is you have this stable, reliable increase.

exponential increase, right? Doubling every 18 months or so in terms of the number of components, the number of computing components, transistors that you put on your chip. And that means you can get more for less, right? Your same chip can do more intelligent work.

Okay, that's basically the fundamental trend that we're going to ride all through the years. And it's going to take different forms. And you'll hear people talk about how Moore's Law is dead and all that stuff. None of that is correct, but it's incorrect for interesting reasons. And that's going to be part of what we'll have to talk about in this episode. And that's really the kind of landscape that we're in today. What is the jiggery-pokery? What are the games that we're playing today to try to keep Moore's Law going?

And how has Moore's law changed in the world where we're specifically interested in AI chips? Because now we're seeing a specific Moore's law for AI trend that's different from the historical Moore's law that we've seen for integrated circuits over the decades and decades that made Moore famous for making this prediction. And on that point, actually, this I think is not a term that's been generally utilized, but

has been written about, and NVIDIA actually called it out. There's now the idea of Huang's Law, where the trend in GPUs has been kind of very much in line with Moore's Law or even faster, where you start seeing, again in the early 2010s, the start of the idea of using GPUs for AI, and then sort of the growth of AI almost

goes hand in hand with the improvements in power of GPUs. And in particular, over the last few years, you just see an explosion of the power, of the cost, of the size of the GPUs being developed. Once you get to the H100, it's like 1,000x, some big, big number, compared to what you had just a decade prior, probably more than 1,000x. So

Yeah. There's kind of the idea of Huang's Law, where the architecture and the, I guess, development of parallel computing in particular has this exponential trend. So even if the particulars of Moore's Law, which is about the density you can achieve at the nanoscale in semiconductors, even if that might be saturating due to inherent physics,

The architecture and the way you utilize the chips in parallelized computing hasn't slowed down, at least so far. And that is a big part of why we are where we are.
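To put rough numbers on that comparison, here is a back-of-envelope sketch. The "roughly 1,000x over a decade" GPU figure is the loose claim from the conversation above, not a measured benchmark.

```python
# Back-of-envelope: classic Moore's Law doubling vs. the rough "1,000x in a
# decade" GPU claim mentioned above. Purely illustrative arithmetic.
moore_per_decade = 2 ** (10 / 2)            # 2x every 2 years, over 10 years ~= 32x
gpu_gain, years = 1000, 10                  # loose Huang's-Law-style claim
implied_per_year = gpu_gain ** (1 / years)  # ~2x per year

print(f"Moore's Law over a decade: ~{moore_per_decade:.0f}x")
print(f"1,000x per decade implies: ~{implied_per_year:.1f}x per year")
```

That gap between roughly 32x and roughly 1,000x per decade is the gist of Huang's Law: the system-level gains compound faster than transistor density alone.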

Absolutely. And in fact, that is a great segue into sort of peeling the onion back one more layer, right? So we have this general notion of Moore's law and now Andre is like, but there's also Huang's law. So how do you get from 2X every 18 months or so to all of a sudden something closer to like 4X every two months or depending on the metric you're tracking? And this is where we have to talk about what

actually is a chip doing? What are the core functions of a chip that really performs any kind of task? And the two core pieces that I think are worth focusing on today, because they're especially relevant for AI, number one, you have memory, right? You've got to be able to store the data that you're working on. And then number two, you have logic. You've got to have the ability to do shit to those bits and bytes that you're storing, right?

Kind of makes sense. Put those two things together. You have a full problem-solving machine. You have the ability to store information. You have the ability to do stuff to that information, carry out mathematical operations, right? So memory, storage, and logic, sort of the, yeah, the logic, the reasoning, or not the reasoning, the number, the math, the number crunching, right?

And so when we actually kind of tease these apart, it turns out, especially today, it's very, very different. It's a very, very different process, very, very different skill set that's required to make logic versus to make memory. And there are a whole bunch of reasons for that that have to do with the kind of architecture that goes into making like logic cells versus memory cells and all that stuff. But we'll get into that later if it makes sense. For now, though, I think the important thing to flag is

Logic and memory are challenging to make for different reasons, and they improve at different rates. So if you look at logic improvements over the years, the ability to just pump out flops, floating point operations per second, how quickly can this chip crunch numbers, there you see very rapid improvements. And part of the reason for that, a big part of the reason, is that if you're a fab that's building logic,

then you get to focus on basically just one top line metric that matters to you. And that's generally transistor density. In other words, how many of these compute components, how many transistors can you stuff onto a chip? That's your main metric. You care about other things like power consumption and heat dissipation, but those are pretty secondary constraints. You've got this one clean focus area. In the meantime, if you care about memory,

Now you have to worry about not one key kind of KPI. You're worried about basically three main things. First off, how much can my memory hold? What is the capacity of my memory? Second, how quickly can I pull stuff from memory, which is called latency? So basically you can imagine, right, you have a

like a bucket of memory, and you're like, I want to retrieve some bits from that memory. How long am I going to have to wait until they're available to me to do math on them? That's the latency. So we have capacity. How much can the bucket hold? Latency, how long does it take to get shit from the bucket? And then there's bandwidth. How much stuff can I pull from that memory at any one time?

And so if you're optimizing for memory, like you have to optimize these three things at the same time. You're not focused exclusively on one metric and that dilutes your focus. And historically something's got to give and that thing is usually latency. So usually when you see memory improvements, latency hasn't really gotten much better over the years.

capacity and bandwidth have. They've gotten a lot better, a lot faster, right? So you can sort of start to imagine, depending on the problem you're trying to solve, you may want to optimize for really high capacity, really high bandwidth, really low latency, which is often more of the case in AI, or some other combination of those things. So already, we've got the elements of chip design starting to form a little bit where we're thinking about what's the balance of these things that we want to strike, right?
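A toy way to see how capacity, latency, and bandwidth trade off is to model a memory fetch as a fixed latency plus a bandwidth-limited transfer. The numbers below are round placeholders, not the specs of any particular memory part.

```python
# Toy memory-fetch model: time ~= fixed latency + bytes / bandwidth.
# Placeholder numbers: ~100 ns latency, ~3 TB/s of bandwidth.
def fetch_time_s(n_bytes, latency_s=100e-9, bandwidth_bps=3e12):
    return latency_s + n_bytes / bandwidth_bps

for n_bytes in [4 * 1024, 64 * 1024**2, 8 * 1024**3]:  # 4 KB, 64 MB, 8 GB
    t = fetch_time_s(n_bytes)
    print(f"{n_bytes:>14,d} bytes -> {t * 1e6:10.2f} microseconds")
```

Small transfers are dominated by latency and big transfers by bandwidth, which is why the metric you care most about depends entirely on your workload's access pattern.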

And historically, one of the challenges that's come up from this is, as I said, latency. That's the thing that's tended to be kind of crappy, because people have focused, when it comes to memory, on capacity and bandwidth. How much can I pull at once? And how big is my bucket of memory?

Because latency kind of sucks, because it's been improving really slowly, while our logic has been improving really fast, right? We're able to stuff a whole bunch of transistors on a chip. What tends to happen is there's this growing disparity between your logic capability, like how fast you can number crunch on your chip,

and how quickly you can pull in fresh data to do new computations on. And so you can kind of imagine the logic part of your chip, like it's just crunched all the numbers, crunched all the numbers, and then it's just sitting there twiddling its thumbs while it waits for more memory to be fetched so it can solve the next problem.

And that disparity, that gap is basically downtime. And it's become an increasing problem because, again, transistor density and logic have been improving crazy fast in AI, but latency has not; it's been improving much more slowly. And so you've got this like crazy high capacity to crunch numbers, but this relatively like long delay between subsequent rounds of memory inputs.

And this is what's known as the memory wall, or at least it's a big part of what's known as the memory wall in AI. So a big problem structurally in AI hardware is how do we overcome this? And there are a whole bunch of techniques people work on to do this, trying to do things like, anyway, stagger your memory input so that your memory is getting fetched while you're number crunching still on that previous batch of numbers so that they overlap to the maximum extent possible.

All kinds of techniques, but this is kind of the fundamental landscape is you have logic and you have memory and logic is improving really fast. Memory is not improving quite as fast because of that dilution of focus, but both logic and memory have to come together on a high performance AI chip. And basically the rest of the story is going to unfold with those key ingredients in mind.
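One common way to make the memory wall concrete is a roofline-style calculation: a workload is memory-bound whenever its arithmetic intensity (FLOPs per byte moved) sits below the chip's ratio of peak compute to peak bandwidth. The peak numbers here are round illustrative values, not any specific GPU's spec sheet.

```python
# Roofline-style sketch of the memory wall. Illustrative peak numbers only.
peak_flops = 1e15            # ~1 PFLOP/s of low-precision matrix math
peak_bandwidth = 3e12        # ~3 TB/s of HBM bandwidth
ridge = peak_flops / peak_bandwidth   # FLOPs/byte needed to stay compute-bound

def attainable_flops(intensity_flops_per_byte):
    return min(peak_flops, intensity_flops_per_byte * peak_bandwidth)

for ai in [1, 10, 100, 1000]:
    kind = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{ai:5d} FLOPs/byte -> {attainable_flops(ai) / 1e12:8.1f} TFLOP/s ({kind})")
```

The overlapping of memory fetches with number crunching described above is essentially an attempt to push effective intensity up toward that ridge point.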

So I don't know, maybe that's a good tee up for the next step here. Yeah, and I can add a little bit, I think, on that point. It's very true: if you just go look at RAM capacity over the years, it has grown very fast, but not quite as fast as Moore's law. And one of the, I guess, finer points of memory is it's also more complex. Well, I guess CPUs are also complex now, you parallelize, but

Memory is similarly complex, where for various reasons, you don't just make the memory faster, you can have smarter memory. So you introduce caching, where you know this data is something you use a lot. So you have a faster memory that's smaller that you utilize and...

cache important information so you can get it faster. So you have these layers of memory that have different speed, different size, right? And now you get to GPUs that need

absurd amounts of memory. So on CPUs, right, we have RAM, which is random access memory, which is kind of like the fast memory that you can use. And that's usually eight gigabytes, 16 gigabytes. A lot of what your OS is in charge of is getting stuff from storage, from your hard drive, to RAM to then compute on, and then it gets into the cache when you do computations.

Well, for neural nets, you really don't want to store anything that is not in RAM, and you want as much as possible to be in cache. So I don't know the exact details, but I do know that a lot of the engineering that goes into GPUs is about those kinds of caching strategies; a lot of the optimization in transformers is about key-value caching and...

you know, you have just ridiculous numbers on the RAM side of GPUs that you would never see on, you know, your CPU, your laptop, where it's usually just 8, 16, 32 gigabytes or something like that. Yeah, absolutely. And actually, I think you introduced an element there that really helps us move towards the next step of the conversation, which is what happens on the floor of a data center? Like, what does the data center floor look like?

The reason is that when you think about computing, the image to have in your mind is hierarchy, is a cascading series of increasingly complex and increasingly close to the bare silicon operations. So think about it this way, heading into a data center, right? You have just like a gigantic amount of really, really high voltage, right? And just power lines that are coming in.

Now, on the chip itself, you're dealing at roughly the electron level. You're dealing with extraordinarily tiny voltages, extraordinarily tiny currents and all that stuff.

To get that energy and those electrons and those photons in the middle to do all that good work for you, you have to do a lot of gradual step downs. Gradually bringing the memory, bringing the power, bringing the logic all closer and closer to...

to the place where, at the atomic level almost, the actual drama can unfold that we're all after, right? The number crunching, the arithmetic that actually trains the models and does inference on them. So when we think about that hierarchy, I'll identify just like a couple of levels of memory for us to keep in mind, for us to keep in RAM. So this just starts to kind of fold in some of these layers that we can think about as we go. But

So one of the higher levels of memory is sort of like flash memory, right? So this could be like your solid state drives or whatever. This is very, very slow memory.

it will continue to work even if your power goes out. So it's this persistent memory, it's slow moving, but it's the kind of thing where if you wanted to store a data set or, I don't know, some interesting model checkpoints that come about fairly infrequently, you might think about putting them in flash memory. This is like a very slow long-term thing and

You might imagine, okay, well now I also need memory though, that's going to get updated. For example, like, I don't know, every time there's like a batch of data that comes in, you know, and batches of data are coming in like constantly, constantly, constantly. So, okay, well then maybe that's your high bandwidth memory, right? So this is again, closer to the chip because we're always getting closer to the chip physically as we're getting closer to the interesting operations, the interesting math. So now you've got your HBM, right?

Your HBM, we'll talk about where exactly it sits, but it's really close to where the computations happen. It uses a technology called DRAM, which we can talk about and actually should.

And anyway, it requires periodic refreshing to maintain data. So if you don't keep kind of updating each bit, because it stores each bit as a charge in a tiny capacitor, and because of a bunch of physical effects like leakage of current, that charge gradually drains away. So if you don't intervene, the stored data can be lost within milliseconds. So you have to keep refreshing, keep refreshing. It's much lower latency than your flash memory.

So in other words, way, way faster to pull data from it. That's critical because again, you're pulling those batches, they're coming in pretty hot, right? And so usually that's on the order of tens of nanoseconds.

And so, you know, every kind of tens of nanoseconds, you pull some data off the HBM. Now, even closer to where the computations happen, you're going to have SRAM, all right? So SRAM is your fastest, your ridiculous sub-nanosecond access time, very, very expensive as well. So you can think of this as well as an expense hierarchy, right? As we get closer to where those computations happen, oh, we got to get really, really kind of small components, very, very custom designed or very purpose-built components

and very expensive, right? So there's this kind of consistent hierarchy, typically of size, of expense, of latency, all these things, as we get closer and closer to the kind of leaves on our tree, to those kind of end nodes where we're going to do the interesting operations. And data centers and chips

These are all fractal structures in that sense. Really think about, you know, think about computing. You got to think about fractals. It's fractals all the way down. You go from like one, you know, trunk to branches, to smaller branches, smaller branches, just like our circulatory system, just like basically all complex structures. And that's one thing, if you appreciate that,

you'll be nodding along, right? This is what it is about. The world works in fractals in this way, higher and higher resolution at the nodes, but you do want to benefit from big tree trunks, big arteries that can just have high capacity in your system. Right. And this kind of reminds me of a little fun fact. I know probably a lot of people still do this, and certainly as a grad student in the late 2010s,

a big part of what you were doing is literally just fitting a neural net in a GPU. You're like, oh, I have this GPU with eight gigabytes of memory or 16 gigabytes. And so I'm going to run nvidia-smi and figure out how much

memory there is available on it, and I'm going to run my code, and it's going to load up the model into the GPU, and that's how I'm going to do my training. And so for a long while, that was kind of the paradigm, is you had one GPU, one model, you try to fit the model into the GPU memory, that was it. Of course, now that doesn't work. The

The models are far too big for a single GPU, especially during training when you have to do back propagation, deal with gradients, et cetera.

During inference, people do try to scale them down, do quantization, fit them often in a single GPU. But why do you need these massive data centers? Because you want to pack a whole bunch of GPUs or TPUs all together. We have TPU pods from Google going back quite a while to 2018, I think, when we had 256 TPUs.

And so you can now distribute your neural net across a lot of chips. And now it gets even crazier because the memory isn't just about loading in the weights of the model into a single GPU. You need to like transfer information about the gradients on some weights and do some crazy complicated orchestration just to update your weights throughout the neural net. And I really have no idea how that works.
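To see why a single GPU stopped being enough, a rough footprint estimate helps. This sketch counts only weights, gradients, and Adam-style optimizer state, using a common accounting of roughly 2 + 2 + 8 bytes per parameter in mixed precision, and ignores activations and framework overhead, so real numbers will vary.

```python
# Rough training-memory footprint: weights + gradients + optimizer state.
# Assumes fp16 weights and gradients plus fp32 Adam moments; ignores activations.
def training_footprint_gb(n_params, bytes_per_param=2 + 2 + 8):
    return n_params * bytes_per_param / 1024**3

for n in [1.5e9, 70e9, 175e9]:
    print(f"{n / 1e9:6.1f}B params -> ~{training_footprint_gb(n):6.0f} GB before activations")
```

The smallest of those fits comfortably on one card; the larger ones blow far past any single GPU's HBM, which is exactly why the weights, gradients, and optimizer state end up sharded across many devices.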

Well, and part of that we can get into for sure. I think to touch on that and just to make this connect, by the way, to some of the stuff we've been seeing happen recently with kind of reasoning models and the implications for

for the design of data centers and compute, this stuff really does tie in too. So I'll circle back to this observation that memory, like HBM, high bandwidth memory in particular, has been improving more slowly than logic, than the ability to just number crunch. So our ability to fetch data from memory and the bandwidth and all that has been improving more slowly than our ability to crunch the numbers.

One interesting consequence of this is that you might expect these reasoning models that make use of more inference time compute to actually end up disproportionately running better on older chips. And so I just want to explain and unpack that a little bit. So, just during inference, you have to load a language model into active memory from HBM.

And your batch sizes, your data that you're feeding in, those batch sizes will tend to be pretty small. And the reason they tend to be pretty small at inference time is that you can imagine like you're getting these bursts of user data that are unpredictable. And all you know is you better send a response really quickly or it'll start to affect the user experience.

So you can't afford to sit there and wait for a whole bunch of user queries to come in and then batch them, which is what's typically done, right? The idea with high bandwidth memory is you want to be able to batch a whole bunch of data together and amortize the delay, the latency that comes from loading that memory from the high bandwidth memory, amortize it across a whole bunch of batches.

batches. So sure, like logic is sitting there waiting for the data to come in for a little while, but when it comes in, it's this huge batch of data. So it's like, "Okay, that was worth the wait." The problem is that when you have inference happening, you can't... Again, you've got to send responses quickly. So you can't wait too long to create really big batches. You've got to kind of, well, get away with smaller batches. And as a result,

your memory bandwidth isn't going to be consumed by the user data itself, right? Like, you're getting relatively small amounts of your user data in. Your memory bandwidth is disproportionately consumed by just, like, the model itself. And so you have this high base cost associated with loading your model in

And because the batch size is smaller, you don't need as much logic to run all those computations. You have maybe eight user queries instead of 64. So that's relatively easy on the flops. So you don't need as much hard compute. You don't need as much logic. What you really need, though, is that baseline high memory requirement because your model is so big anyway. So even though your user queries are not very numerous, your model is big. So you have a high baseline need for HBM, but a relatively low need for flops.

Because flops improve more slowly, this means you can step back a generation of compute and you're going to lose a lot of flops, but your memory is going to be about the same. And since this is more memory intensive disproportionately than compute intensive, inference tends to favor older machines.
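Here is a hedged sketch of that small-batch arithmetic. Every decoding step has to stream the full set of weights out of HBM regardless of batch size, while the math grows with the batch, so at small batches the step time is pinned by bandwidth. All numbers are placeholders chosen to be roughly GPT-class, not measurements.

```python
# Why small-batch decoding is memory-bound: weights stream from HBM every step,
# while FLOPs scale with batch size. Placeholder numbers, not measurements.
weights_bytes = 140e9        # e.g. a ~70B-parameter model held in fp16
flops_per_token = 140e9      # ~2 * params FLOPs per generated token
peak_flops, peak_bandwidth = 1e15, 3e12

def step_time_s(batch_size):
    t_memory = weights_bytes / peak_bandwidth              # same at any batch size
    t_compute = batch_size * flops_per_token / peak_flops  # grows with the batch
    return max(t_memory, t_compute)

for b in [1, 8, 64, 512]:
    t = step_time_s(b)
    print(f"batch {b:4d}: {t * 1e3:6.1f} ms/step, {t / b * 1e3:7.3f} ms per user token")
```

The per-token cost collapses as the batch fills up, which is the amortization argument, and since step time is set by bandwidth rather than FLOPs until the batch gets large, an older chip with similar memory can serve nearly as well.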

It's a bit of a layered thing, and it's okay if you didn't follow that whole thing. But if you're interested in this, you can listen back on that, or ask us questions about it. I think this is actually one of the really important trends that we're going to start to see, is older hardware be useful for inference time compute. Big, big advantage to China, by the way, because they only have older hardware. So this whole pivot to reasoning and inference time compute is actually a really interesting advantage for the Chinese ecosystem.

And yeah, I think that brings up another interesting tangent, a pretty quick tangent we'll try to get into. So you brought up batches of data, and that's another relevant detail: you're not just loading models into GPUs. You are loading in what are called batches of data. And what that means is, right, you have data sets, data sets are pairs of inputs and outputs, and

And when you train a neural net and when you do inference on it as well, instead of just doing one input, one output, you do a whole bunch together. So you have N inputs and outputs.

And that is essential because when training a neural net, you could try to do just one example at a time, but an individual example isn't very useful, right? Because you can update your weights for it, but then the very next example might be the opposite class. So you would be just not finding the right path. And then it's also not very feasible to

to train on the entire dataset, right? You can't feed in the entire dataset and compute an average across all the inputs and outputs, because that's going to be, A, probably not possible, and B, probably not very good for learning. So one of the sort of key

key miracles, almost mathematically kind of surprising things, is that stochastic gradient descent, where you take batches of data, you take, you know, 25, 50, 256, whatever, inputs and outputs, turns out to just work really well. And, you know, theoretically, you should be taking the entire data set, right? That's what gradient descent should be doing. Stochastic gradient descent, where you take batches, turns out to be

probably a good regularizer that actually improves generalization instead of overfitting. But anyway, one of the other things with OpenAI that was a little bit novel is massive batch size. So

As you increase the batch, that increases the amount of memory you need on your GPU. So the batch sizes were relatively small, typically, during training, like 128, 256. Now, the bigger the batch, the faster you could train and the better the performance could be. But yeah, typically you just couldn't get away with very big batches. And OpenAI, I still remember this, was one of the early organizations getting into like 2,000-

example batches or something like that. And then I think one of the realizations that happened with very large models is that, especially during training, massive batches are very helpful. And so that was another reason that memory is important. And super, super economical too, right? Like this is one of the...

crazy advantages that OpenAI enjoys, and that anyone with really good distribution in this space enjoys, distribution of their products. I mean, like, if you've got a whole ton of users, you've got all these queries coming in at very, very high rates, which then allows you to do bigger batches at inference time, right? Because you may tell yourself, well, look, I've got to send a response to my users within, I don't know, like

500 milliseconds or something like that, right? And so basically what that says is, okay, you have 500 milliseconds that you can wait to collect inputs, to collect prompts from your users, and then you've got to process them all at once. Well, the

The number of users that you have at any given time is going to allow you to fill up those batches really nicely if that number is large. And that allows you to amortize the cost. You're getting more use out of your GPUs by doing that. This is one of the reasons why some of the smaller companies serving these models are at a real disadvantage. They're often serving them, by the way, at a loss, because they just can't hit the large batch sizes they need to amortize the cost of their hardware and energy

to be able to turn a profit. And so a lot of the VC dollars you're seeing burned right now in the space are being burned specifically because of this low batch size phenomenon, at least at inference time. On that point, in case it's not clear, or maybe some people don't know, right, the way a batch works

is, yes, you're doing N inputs to N outputs, but then you're doing all of these in parallel, right? You're giving all the inputs all together and you're getting all the outputs all together. So that's why it's kind of filling up your GPU. And that is one of the essential metrics, GPU utilization rate. If you do one example at a time, that takes up less memory, but then you're wasting time, right? Because you need to do one at a time versus...

If you get as many examples as your GPU can handle, then you get those outputs all together and you're utilizing your GPU 100% and are getting the most use out of it. Yeah, and this ties into this dance between

model architecture and hardware architecture. CPUs tend to have a handful of cores. The cores are the things that actually do the computations. They're super, super fast cores and they're super flexible, but they're not very numerous. Whereas GPUs can have thousands of cores, but each individual core is very slow. And so what that sets up is a situation where

If you have a very parallelizable task where you can split it up into a thousand or 4,000 or 16,000 little tasks that each core can handle in parallel, it's fine if each core is relatively slow compared to CPU. If they're all chugging away at those numbers at once, then they can pump out thousands and thousands of these operations

in the time that a CPU core might do 20 or whatever, right? So it is slower on a per core basis, but you have so many cores, you can amortize that and just go way, way faster. And that is at the core of what makes AI today work. It's the fact that it's so crazy parallelizable. You can take a neural network and you can chunk it up in any number of ways. Like you could, for example, feed it a whole bunch of

a whole bunch of prompts at the same time. That's called data parallelism. So that's more like you send some chunks of data over to one set of GPUs, and another chunk to another set. So essentially you're parallelizing the processing of that data. You can also take your neural networks and you can slice them up

Layer-wise, so you can say layers zero to four, they're going to sit on these GPUs. Layers five to eight will sit on these GPUs and so on. That's called pipeline parallelism. So each stage of your model pipeline, you're kind of imagining chopping your model up lengthwise and farming out the different chunks of your model to different GPUs.

And then there's even tensor parallelism. And this is within a particular layer. You imagine chopping that layer in half and having a GPU chew on or process data that's only going through just that part of the model. And so these three kinds of parallelism, data parallelism, pipeline parallelism, and tensor parallelism, are all used together in overlapping ways in modern

high performance AI data centers in these big training runs. And they play out at the hardware level. So you can actually see like you'll have, you know,

data centers with chunks of GPUs that are all seeing one chunk of the data set. And then within those GPUs, one subset of them will be specialized in a couple of layers of the model through pipeline parallelism. And then a specific GPU within that set of GPUs will be doing a specific part of

a layer or a couple of layers through tensor parallelism. And that's how you really, you know, kind of split this model up across as many different machines as you can to benefit from the massive parallelism that comes from this stuff. Right. And by the way, I guess just another fun detail, why did the graphics processing units turn out to be really good for AI? Well, it all boils down to matrix multiplications, right? It's all just a bunch of numbers. You have one

one set of numbers, and you need to multiply it by a matrix of weights to get the output. That's your typical layer, right? You have N inputs and connections going into each activation unit, so between two layers you wind up doing a matrix-vector multiply, and so on. So anyway, it turns out that to do 3D computations, that's also a bunch of math, also a bunch of matrices that you multiply

to be able to get your rendering to happen. And so it turns out that you can do matrix multiplications very well by parallelizing over like a thousand cores versus if you have some kind of long equation where you need to do every step one at a time, that's going to be on a CPU. So yeah, basically 3D rendering is a bunch of linear algebra

Neural nets are a bunch of linear algebra. So it turns out that you can then do the linear algebra from the graphics also for neural nets. And that's why that turned out to be such a good fit. And now with tensor processing units, tensor is like a matrix, but with more dimensions, right? So you do even more linear algebra. That's what it all boils down to.
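
To make the "it's all linear algebra" point concrete, here's a minimal sketch of a single neural-net layer in Python with NumPy. The shapes and values are arbitrary, just for illustration; the point is that the forward pass is one big matrix multiply plus a cheap elementwise nonlinearity, which is exactly the kind of work thousands of slow-but-numerous GPU cores can share.

```python
# Minimal sketch: one neural-net layer is a matrix multiply plus a nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 64, 1024, 1024

x = rng.standard_normal((batch, n_in))   # a batch of inputs
W = rng.standard_normal((n_in, n_out))   # the layer's weights
b = np.zeros(n_out)                      # the layer's biases

h = np.maximum(x @ W + b, 0.0)           # matmul + ReLU: the layer's forward pass

# Note: without the nonlinearity, stacked layers would collapse, since
# (x @ W1) @ W2 == x @ (W1 @ W2) -- equivalent to a single matrix.
print(h.shape)  # (64, 1024)
```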

Excellent summary. This is a good time. Now we've got some of the basics in place to look at the data center floor and some current and emerging AI hardware systems that are going to be used for the next beat of scale. And I'm thinking here in particular of the GB200. Semi-analysis has a great breakdown of how the GB200 is set up. I'm pulling heavily from that in this section here with just some added details.

stuff thrown in just for context and depth. But I do recommend semi-analysis. By the way, semi-analysis is great. One of the challenges with it is it is highly technical. So I found I've recommended it to a lot of people. Sometimes they'll read it and they'll be like, I can tell this is what I need to know, but

It's really hard to like kind of get below and understand deeply what they're getting at. So hopefully this episode will be helpful in doing that. Certainly whenever we cover stories that Semi Analysis has covered, I try to do a lot of translation, at least when we're at the sharing stage there. But just be warned, I guess it's a pretty expensive newsletter and it does go into technical depth. They got some free stuff as well that you should definitely check out if you're interested in that sort of thing. I got this

premonition, in case anyone wants to correct me and say it's not just linear algebra, because you have nonlinear activations, famously, and that's required. Yeah, that's also in there and that's not exactly linear algebra. You have functions that aren't just matrix multiplications. Although with ReLUs and modern activations, you kind of try to get away from that as much as possible. There's always some ReLU douchebag. I don't want to be...

Factually incorrect, so just FYI, that's not what I mean. Well, and actually mathematically, the fun fact there is that if you didn't have that non-linearity, right, then multiplying just a bunch of matrices together would be equivalent from the linear algebra standpoint to having just one matrix. So you could replace it with... Anyway. Okay, so let's step onto the data center floor. Let's talk about the GB200. Why the GB200? Well, number one, the H100 has been around for a while. We will talk about it a little bit later.

But the GB200 is the next beat and more and more kind of the future is oriented in that direction. So I think it is really worth looking at. And this is announced and not yet out from NVIDIA. Is that right? Or is it already being sold? I believe it's already being sold, but it's only just started. So this is, yeah. This is the latest, greatest in GPU technology, basically. That's it. It's got that new GPU smell. Yeah.

So, first thing we have to clarify, right? You'll see a lot of articles that'll say something about the B200, and then you'll see other articles that say stuff about the GB200 and the DGX, the B200 DGX, or like, you know, all these things. Like, what the fuck are these things, right? So, the first thing I want to call out is there is a thing called a B200 GPU. That is a GPU, okay? So, the GPU is a very specific piece of hardware, right?

that is like the, let's say, component that is going to do the interesting computations that we care about fundamentally at the silicon level. But

A GPU on its own is, oh man, what's a good analogy? I mean, it's like a really dumb jacked guy. He can probably lift anything you want him to lift, but you have to tell him what to lift because he's a dumb guy. He's just jacked. So the B200 on its own needs something to tell it what to do. It needs a conductor.

It needs a CPU, at least that's usually how things work here. And so there's the B200 GPU, yes, wonderful. But if you're actually going to put it in a server rack, in a data center, you best hope that you have it paired to a CPU that can help tell it what to work on and orchestrate its activity.

Even better if you can have two GPUs next to each other and a CPU between the two of them, helping them to coordinate a little bit, right? Helping them do a little dance. That's good. Now your CPU, by the way, is also going to need its own memory. And so you have to imagine there's memory for that, all that good stuff. But fundamentally, we have a CPU and two GPUs.

on this little kind of motherboard, right? Yeah, that's like you have two jacked guys and you're moving an apartment and you have a supervisor. You know what? We're getting there. We're getting there, right? Increasingly, we're going to start to replicate just like what the Roman army looked like. You have some like colonel and then you've got the strong soldiers or whatever and the colonel's telling them, I don't know. And then there's somebody telling the colonel, I don't know. Anyway, yeah, you got a CPU on this motherboard and you got these two B200 GPUs.

So, okay, these are the kind of atomic ingredients for now that we'll talk about. Now that is sitting on a motherboard. All right. A motherboard, you can imagine it as like one big rectangle, and we're going to put two rectangles together, two motherboards together. Each of them has one CPU and two B200 GPUs.

Together, that's four GPUs, that's two CPUs. Together, that's called a GB200 tray. Each one of those things is called a Bianca board. So Bianca board is one CPU, two GPUs. You put two Bianca boards together, you get a tray that's going to slot into one slot in a rack, in a server, in a data center. So that's basically what it looks like.
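
As a quick bit of bookkeeping on the tray description above, here's a toy Python sketch of the component counts. The board and tray counts follow what's described here; the trays-per-rack number is just an assumption for illustration, since, as discussed later, the real count depends on the configuration and the power and cooling budget.

```python
# Illustrative bookkeeping for the GB200 tray layout described above.
GPUS_PER_BIANCA_BOARD = 2   # two B200 GPUs per Bianca board
CPUS_PER_BIANCA_BOARD = 1   # one CPU orchestrating them
BOARDS_PER_TRAY = 2         # two Bianca boards make one GB200 tray

gpus_per_tray = GPUS_PER_BIANCA_BOARD * BOARDS_PER_TRAY   # 4
cpus_per_tray = CPUS_PER_BIANCA_BOARD * BOARDS_PER_TRAY   # 2

TRAYS_PER_RACK = 18  # assumed for illustration; real racks vary with power/cooling
print(f"GPUs per rack: {gpus_per_tray * TRAYS_PER_RACK}")  # 72 in this hypothetical
print(f"CPUs per rack: {cpus_per_tray * TRAYS_PER_RACK}")  # 36 in this hypothetical
```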

Out the front, you can see a bunch of special connectors for each GPU that will actually allow those GPUs to connect to other GPUs in that same server rack, let's say, or very locally in their immediate environment through these things called NVLink connectors.

cables. Basically, these are special NVIDIA copper cables. There are alternatives too, but this is kind of like an industry standard one. And so this together is, you can think of it as like one really tightly interconnected

set of GPUs, right? So why copper? I mean, the copper interconnect, and this also goes through a special switch called an NV switch that helps to mediate the connections between these GPUs. But the bottom line is you just have these GPUs really tightly connected to each other through copper interconnects.

And the reason you want copper interconnects is that they're crazy efficient at getting data around those GPUs. Very expensive, by the way, but very efficient too. And so this kind of bundle of compute is going to do basically like your, typically like your highest bandwidth requirement compute.

like tensor parallelism. This is basically the thing that requires the most frequent communication between GPUs. So you're going to do it over your most expensive interconnect, your NVLink. And so the more expensive the interconnect, roughly speaking, the more tightly bound these GPUs are together in a little local pod,

the more you want to use them for applications that will require frequent communication. So tensor parallelism is that because you're basically taking like a layer, a couple layers of your neural network, you're chopping them up. But in order to get a coherent output, you need to kind of recombine that data because one chunk of one layer doesn't do much for you. So they need to constantly be talking to each other really, really fast because otherwise it would just be kind of like a bunch of garbage. They need to be very coherent.

At higher levels of abstraction, so pipeline parallelism, where you're talking about whole layers of your neural network, and one pod might be working on one set of layers, and another pod might be working on another set of layers,

For pipeline parallelism, you're going to need to communicate, but a little bit less frequently, right? Because you're not talking about chunks of a layer that constantly need to be recombined just to be remotely coherent, the way those chunks have to come together to form one layer. With pipeline parallelism, you're talking about coherent whole layers. So this communication can happen a little bit slower. You can use interconnects like PCIe as one possibility,

Or even between different nodes over network fabric, you can go over InfiniBand, which is another slower form of networking. The pod, though, is the basic unit of kind of pipeline parallelism that's often used here. This is called the backend network.

So tensor parallelism, this idea again of we're going to slice up just parts of a layer and have like one server rack, for example, it's all connected through NVLink connectors, super, super efficient. That's usually called like accelerator interconnect, right? So the very local interconnect through NVLink pipeline parallelism

this kind of slightly slower, different layers communicating with each other, that is usually called the backend network in a data center. So you've got accelerator interconnect for the really, really fast stuff. You've got the backend network for the somewhat slower stuff. And then typically at the level of the whole data center, when you're doing data parallelism, you're sending a whole chunk of your data over to this bit, a whole chunk over that bit,

you're going to send your user queries in and they're going to get divided that way. That's the front-end network. So you got your front-end for your slower, lowest, let's say, typically actually less expensive hardware too, because you're not going as fast. You've got your back-end, which is faster, it's InfiniBand. And now you're moving things typically between layers, and this can vary, but I'm trying to be concrete here.

And then you've got your fastest thing, which is accelerator interconnect, even faster than the backend network, the activity that's buzzing around there. That's kind of like one way to set up a data center. You're always going to find some kind of hierarchy like this. And the particular instantiation can vary a lot, but...

But this is often how it's done. And so you're in the business, if you're designing hardware, designing models, you're kind of in the business of saying, "Okay, how can I architect my model such that I can chop it up to have a little bit of my model on one GPU here, on this GPU there, such that I can chop up my layers in this way that makes maximal use of my hardware?"

There is this kind of dance where you're doing very hardware-aware algorithm architecting, especially these days, because the main rate-limiting thing for you is, how do I get more out of my compute? Right.
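
To make that hardware-aware split concrete, here's a minimal sketch of how a training job's parallelism dimensions might be mapped onto the interconnect tiers described above. The GPU count, degrees, and tier names are assumptions for illustration, not any lab's actual configuration.

```python
# Toy mapping of parallelism dimensions to interconnect tiers (illustrative only).
total_gpus = 4096

parallelism = {
    "tensor":   {"degree": 8,  "interconnect": "NVLink (accelerator interconnect)"},
    "pipeline": {"degree": 16, "interconnect": "InfiniBand (back-end network)"},
    "data":     {"degree": 32, "interconnect": "front-end / data-center network"},
}

# Sanity check: the parallelism degrees should multiply out to the GPU count.
product = 1
for dim in parallelism.values():
    product *= dim["degree"]
assert product == total_gpus, (product, total_gpus)

for name, dim in parallelism.items():
    print(f"{name:>8} parallelism x{dim['degree']:>2} over {dim['interconnect']}")
```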

Right, and I think that's another big aspect of TPUs and Google. Google was a thing that OpenAI worried about partially because of TPUs, but also in big part because they had expertise in data centers. That was part of the reason why Google won out. They were really good at data center creation, and they were early to the game. So they not only made TPUs, tensor processing units, they...

pretty quickly afterwards also worked on TPU pods where you combine, you know, 256 or even a couple thousand TPUs together, presumably with that sort of memory optimization you're talking about, to have much larger neural nets, much faster processing, et cetera.

Actually, that's a great point. There's this interesting notion of what counts as a coherent blob of compute. The real way to think about this is in terms of the latency or the timeline on which activities are unfolding at the level of that blob. So you think about what is a coherent blob of compute for tensor parallelism? Well, I mean, it's got to be really, really fast, right? Because these computations are really quick, really efficient, but then you got to move on really quick.

And so what one of the things that Google has done really well is these pods can actually coherently link together very large numbers of chips. And you're talking in some cases about like hundreds of these, I think 256 is for TPU v4, like one of the standard configurations.

But one of the key things to highlight here, by the way, is there is now a difference between the GPU, which is the B200, and the system, the GB200, the system in which it's embedded. So the GB200, by definition, is this thing that has a CPU and two GPUs on a tray, along with a bunch of other ancillary stuff.

And that's your Bianca board. And there's another Bianca board right next to it. And together, that's one GB200 tray, right? So we are talking about GPUs. The basic idea behind the GB200 is to make those GPUs do useful work, but that requires a whole bunch of ancillary infrastructure that isn't just that B200 GPU, right? And so the packaging together of

those components of the B200 GPU and the CPU and all those ancillary things, that's done by companies, for example, like Foxconn, that put together the servers. Once NVIDIA finishes shipping out the GPUs, somebody's got to assemble these, and NVIDIA can do some of this themselves. But companies like Foxconn can step in and...

We covered a story, I think, with Foxconn looking at a factory in Mexico to do this sort of thing. So they're actually building the supercomputer in a sense, like putting all these things together into servers and farming them out.

Anyway, there are different layers of that stack that are done by Foxconn and different by NVIDIA. But fundamentally, I just want to kind of differentiate between the GB200 system and the B200 GPU. The GB200 system also can exist in different configurations. So you can imagine a setup where you have, you know, one rack and it's got, say, 32 B200 GPUs.

and they're all tightly connected. Or you could have a version where you got 72 of them, all depending on the... Often what will determine that is how much power density you actually can supply to your server racks. And if you just don't have the power infrastructure or the cooling infrastructure to keep those racks humming, then you're kind of forced to take a hit and literally put fewer GPUs, less compute capacity, in a given rack. That's one of the classic trade-offs that you face when you're designing a data center.
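
Here's a toy power-budget calculation to illustrate that rack-density trade-off. Every number below is a made-up assumption for illustration, not a vendor spec.

```python
# Toy rack power budget: how many GPUs fit in a given power/cooling envelope?
rack_power_budget_kw = 120.0   # assumed power you can deliver and cool per rack
power_per_gpu_kw = 1.2         # assumed draw of one GPU
overhead_fraction = 0.25       # assumed share lost to CPUs, networking, fans, losses

usable_for_gpus_kw = rack_power_budget_kw * (1 - overhead_fraction)
max_gpus_per_rack = int(usable_for_gpus_kw // power_per_gpu_kw)
print(f"~{max_gpus_per_rack} GPUs fit in this hypothetical rack's power envelope")

# Halve the deliverable power or the cooling capacity and you simply have to
# put fewer GPUs in the rack -- exactly the trade-off described above.
```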

Yeah, and I think another shout out, in case people don't have a background, another major aspect of data center design and construction is the cooling. Because when you have a billion chips, whatever, computing, the way semiconductors work is that you're pushing current through them and using some energy, which produces heat.

And when you're doing a ton of computation like with GPUs, you get a lot of heat. You can actually warm up a bit if you really use your GPU well. So when you get to these racks where you really try to concentrate a ton of compute altogether, you get into advanced cooling like liquid cooling. And that's why data centers consume water, for instance. Why if you look at the climate impacts of...

AI, often they do cite kind of water usage as one of the metrics. That's why you care about where you put your data center in terms of climate and...

Presumably, that's a big part of the engineering as well of these systems. Absolutely. And in fact, the H100 series of chips is, well, one of the things it's somewhat famous for is being the first chip that has a liquid-cooled configuration. The Blackwells all need liquid cooling, right? So this next generation of infrastructure for the B200 and so on

You're going to have to have liquid cooling integrated into your data center. It's just a fact of life now because these things put off so much heat because they consume so much power. There's sort of an irreducible relationship between computation and power dissipation. Absolutely. So these two things are profoundly linked.

I think now it might make sense to double-click on the B200, just the GPU. So we're not talking about the Grace CPU that sits on the Bianca motherboard and helps orchestrate things, all that jazz. Specifically, the B200 GPU, or just let's say the GPU in general,

And I think it's worth double-clicking on that to see what are the components of it, because that'll start moving us into the fab, the packaging story, where does TSMC come in and introducing some of the main players. Does that make sense? Yeah, I think so. Okay, so we're looking at the GPU.

And right off the bat, two components that are going to matter. This is going to come up again, right? So we have our logic and we have our memory, the two basic things that you need to do useful shit in AI, right? So, okay, what is, let's start with the memory, right? Because we've already talked about memory, right? You care about what is the latency? What is the capacity? What is the bandwidth of this memory? Well, we're going to use this thing called high bandwidth memory, right?

right? And that's going to sit on our GPU. We're going to have stacks of high bandwidth memory, stacks of HBM. And you can think of these as basically like, roughly speaking, one layer of the stack is like a grid that contains a whole bunch of capacitors, memory cells that each store some information. And you want to be able to pull numbers off that grid really efficiently.

Now, historically, those layers, by the way, are DRAM. DRAM was a form of memory that goes way, way back. But the innovation with HBM is stacking those layers of DRAM together

and then connecting them by putting these things called through-silicon vias, or TSVs, all the way through those stacks. And TSVs are important because they basically allow you to simultaneously pull data from all these layers at the same time, hence the massive bandwidth. You can get a lot of throughput of data through your system because you're basically drawing down from all of those kind of layers in your stack

at once. So many layers of DRAM. And you'll see, you know, eight layer versions, 12 layer versions. The latest versions have like 12 layers. The companies, by the way, that manufacture HBM are different from the companies that manufacture the logic that sits on the chip. So the memory companies, the HBM companies, you're thinking here, basically the only two that matter are SK Hynix in South Korea and Samsung also in South Korea. There is Micron, but they're in the US and they kind of suck. They have like none of the market right now.

But yeah, so fundamentally, when you're looking at like, you know, NVIDIA GPUs, you're going to have, you know, like HBM stacks from say SK Hynix. And they're just really good at pulling out massive amounts of data. The latency is not great, but...

but you'll pull down massive amounts of data at the same time and feed them into your logic die, right? Your main GPU die or your compute die. People use all these terms kind of interchangeably, but that refers to the logic part of your GPU that's actually going to do the computation.
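
A quick way to see why HBM bandwidth matters so much is a toy roofline check: is an operation limited by the logic die's compute or by how fast memory can feed it? The peak-FLOPS and bandwidth figures below are illustrative assumptions, not the specs of any particular GPU.

```python
# Toy roofline check: compute-bound vs. memory-bound matrix multiplies.
peak_flops = 1.0e15          # assumed peak: 1 PFLOP/s of matrix math
hbm_bandwidth = 3.0e12       # assumed HBM bandwidth: 3 TB/s

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n] (read A and B, write C)."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = peak_flops / hbm_bandwidth   # intensity needed to keep the compute units busy
for shape in [(8, 8192, 8192), (512, 8192, 8192)]:
    ai = matmul_intensity(*shape)
    verdict = "compute-bound" if ai > ridge else "memory-bound (waiting on HBM)"
    print(f"batch {shape[0]:>3}: {ai:6.1f} FLOPs/byte vs ridge {ridge:.0f} -> {verdict}")
```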

Now, this is for the H100, it's sometimes known as the GH100, but this is fundamentally the place where the magic happens. So you're pulling into the logic die this data from the HBM in massive quantities all at once. One thing to recognize about the difference between HBM and the kind of main GPU die

the process to fabricate these things is very different. So you need a very different set of expertise to make HBM, high bandwidth memory, versus to make a really good logic die. And this means that the fabs, the manufacturing facilities that actually build these things are different. So SK Hynix might do your HBM, but TSMC is almost certainly going to do your logic die. Right?

And the reason is, there's process reasons. Part of it is also the resolution, the effective resolution. So logic dies are these very irregular structures. We talked about how high bandwidth memory is this, these like stacked grids, basically. They're very regular. And as a result, a couple of things like you don't need as high resolution in your fabrication process. So you'll typically see people use like 10 to 14 nanometer processes to do like HBM3, for example.

But if you're looking at logic, for the logic die, you're building transistors that are kind of like these weird irregular structures that are extremely bespoke and all that. And as a result, you need a much, much higher grade process, typically four to five nanometer processes.

That doesn't mean that TSMC could just turn around either. TSMC is usually the one doing all the truly leading-edge processes, but they can't really turn around and just make HBM very easily. Again, different set of core competencies. And so what has to happen is you're going to source your HBM from one company, you're going to source your logic from another, and now you need to make them dance together.

Somehow, you need to include both the logic and the memory on the same chip. And for that, nowadays, the solution people have turned to is to use an interposer. So an interposer is a structure that the logic and the memory and a couple other components too are going to sit on. And the interposer essentially allows you to connect, like say from the bottom of the HBM to the bottom of the logic,

to create these kind of like chip level connections that link your different, well, your different chips, or sorry, not chips, but your different like components together. And this is called packaging, this process of doing this packaging. Now,

TSMC famously has this CoWoS packaging process. There are two kinds of CoWoS. There's CoWoS-S and CoWoS-L. The details we don't have time to get into, but they are kind of fascinating. The bottom line is that this is just a way of, number one, linking together your memory die and your main GPU die, your logic die.

But also, an interesting thing that happens is as you move down the package, the resolution of the interconnects gets lower and lower. Things get coarser and coarser, bigger and bigger. And what you're trying to do is at the chip level, you've got crazy high resolution kind of

connections happening, like your pitch size, as it's sometimes called, that sort of resolution of the structure is really, really fine. It's really, really small. You want to actually deliberately back off from that as quickly as you can, because it allows you to have thicker wires, which are more efficient from a power delivery standpoint, and

make it possible for you to use more antiquated fabrication processes and all that stuff as quickly as possible. You want to get away from things that require you to use really, really advanced processes and things like that. So this is basically the landscape. You've got a bunch of stacked DRAM, in other words, high bandwidth memory, those stacks of memory, sitting next to a GPU die, a logic die, that's actually going to do the computations. And those are all sitting on top of...

an interposer, which links them together and has a bunch of, anyway, really nice thermal and other properties.

And on that point, you know, we mentioned TSMC and fabs and they are part of the story, which I think deserves a little bit more background, right? So fab means fabrication. That's where you take the basic building block, the raw material, and convert it into computing. So let's dive in a little bit into what that involves for any less technical people.

First, what is a semiconductor? It's literally a semiconductor. It's a material that, due to magic of quantum mechanics and other stuff, you can use to let current through or not. Fundamentally, that's the smallest building block of computing. And so what is a fab? It's something that takes raw material and creates nanometer-scale sculptures, right?

or structures of material that you can then give power to, you can kind of power it on or off, and that you can then combine into various patterns to do computations. So why is fabrication so complicated? Why is TSMC the one player that really matters? It sounds like there are a couple of organizations that can do fabrication, but TSMC is by far the best.

Because it's like, I think as we mentioned before, like the most advanced technology that humanity has ever made. You're trying to take the raw material and literally make these nanometer-sized patterns in it.

for semiconductors, right? You need to do a little sculpture of raw material in a certain way and do that a billion times in a way that allows for very few imperfections. And as you might imagine, when you're dealing with nanometer-sized patterns, it's pretty easy to mess up. Like you let one little particle of dust into it, and that's bigger than, I don't know how many transistors, but it's pretty big.

And there's like a million things that could go wrong that could mess up the chip. And so it's a super, like the most delicate, intricate thing you can attempt to do. And then there are the technologies that enable this, that actually do the fabrication at nanometer scales. And now we are getting to that sort of place where the quantum effects are crazy and so on. But anyway,

The technology there is incredibly, incredibly complicated, incredibly advanced, and incredibly delicate. So as we've kind of previewed,

You're now seeing TSMC trying to get to the US and it's going to take them years to set up a fab. And that's because you have a lot of advanced equipment that you need to set up in a very, very delicate way. And you're literally kind of taking large blocks of raw material, literally these slabs of silicon, I believe, and you're cutting them into these little

circles, wafers, and you need to transfer those all around to various machines that do various operations. And somehow you need to end up with something that has the right set of patterns. So it's fascinating how all this works and the advanced aspects of it. I don't even know. It's insane. And it costs hundreds of millions of dollars, as we've covered, to get the most advanced technology.

You have like one corporation that can do the technology required to make these patterns at like two nanometer, whatever resolution we have nowadays.

And so that's why fabrication is such a big part of the story. That's why NVIDIA farms out fabrication to TSMC. They have just perfected the art of it and they have the expertise and the capability to do this thing that very, very few organizations are capable of even trying. And that, by the way, is also why China can't just easily catch up and do these most advanced chips. It's just...

incredibly advanced technology. Yeah, absolutely. As we discuss this, by the way, we're going to talk about things called process nodes or processes or nodes. These are fabrication processes that fabs like TSMC use. TSMC likes to identify their processes with a number in nanometers historically, at least up until now. They talk about, for example, the seven nanometer process node,

or the five nanometer process node. And famously, people refer to this as, well, there are three layers of understanding when it comes to that terminology. The first layer is to say something like, when we say seven nanometer process node, we mean that they're fabricating

their semiconductors down to seven nanometer resolution, right? Which sounds really impressive. Then people point out at the next layer, oh, that's actually a lie. They'll sometimes call it marketing terminology, which I don't think is accurate. That speaks to the third layer. The phrase seven nanometers is sometimes referred to as a piece of marketing terminology because, it's true, there's no actual component in there that is at seven nanometer resolution. It's not like there's any piece of it that is truly physically down to seven nanometers.

But what the seven nanometer thing really refers to is it's the performance you would get if historical trends in Moore's law continued

There was a time back when we were talking about the two micron resolution that it actually did specify that. And if you kept that trend going, the transistor density you would end up with would be that associated with hitting the seven nanometer threshold. We're just doing it in different ways. So my kind of lukewarm take on this is I don't know that it's actually marketing terminology so much as it is the...

outcome-based terminology that you actually care about as a buyer, right? You care about, will this perform as if you were fabbing down to seven nanometers?

Or will it perform as if you're fabbing down to three? And that's the way that you're able to get to numbers of nanometers that are like, you know, we're getting to the point where it's like, you know, a couple of angstroms, right? Like ten hydrogen atoms strung together. Obviously, we're not able to actually fab down to that level. And if we could, there'd be all kinds of quantum tunneling effects that would make it impossible. So anyway, that's the basic idea here. Today's leading node is switching over to the two nanometer node right now.
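
One way to picture the "equivalent node" naming is with a toy calculation: if density had kept scaling classically as one over the feature size squared from an era when the label really was a physical dimension, what label would a given density imply? The reference point and densities below are arbitrary illustrative units, not TSMC's actual methodology or numbers.

```python
# Toy illustration of "equivalent" node labels under classical scaling,
# where transistor density ~ 1 / (feature size)^2. Not TSMC's methodology.
REF_LABEL_NM = 1000.0   # assumed reference node where the label matched reality
REF_DENSITY = 1.0       # density at that reference node, arbitrary units

def equivalent_label_nm(density):
    """Label implied if density had scaled as (REF_LABEL_NM / label)^2."""
    return REF_LABEL_NM / (density / REF_DENSITY) ** 0.5

for density in [1.0, 100.0, 10_000.0, 20_000.0]:   # illustrative densities only
    print(f"density {density:>8,.0f}x -> roughly a '{equivalent_label_nm(density):.0f} nm' class label")
```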

what you'll tend to see is the leading node is subsidized basically entirely by Apple. So phone companies, they want it small, they want it fast. Apple is willing to spend. And so they will work with TSMC to develop the leading node each year, each cycle, right? And that's a massive partnership boost for TSMC.

Other companies, former competitors of TSMC like GlobalFoundries, suffer a lot because they need a partner to help them subsidize that next node development. So this is a big, big strategic kind of moat for TSMC. They have a partner like Apple that's willing to do that.

This means Apple monopolizes the most kind of advanced node for their phones every year. Then that leaves the next node up free for AI applications. The interesting thing, by the way, is that might change. You could see that start to change as AI becomes more and more in demand, as NVIDIA is able to kind of compete with Apple potentially down the line for the very same deal with TSMC, right? If AI is just...

fueling way more revenue than iPhone sales or whatever else. Well, now all of a sudden, NVIDIA might be able to muscle in and you might see a change in that dynamic. But at least for right now, that's how it's playing out. And so NVIDIA now gets to work with the five nanometer process for the H100. That's the process they used for it. They actually started to use the four nanometer process, which really is a variant of the five nanometer, but the details don't super matter there. Fundamentally, the story then is about how

TSMC is going to achieve these sorts of effects. And one part of that story is how do you architect the shape of your transistors? The breakthrough before the most recent breakthrough is called the FinFET. Basically, this is like a fin-like structure that they bake into their transistors and it works really well for reasons.

There's the gate-all-around transistor that's coming in the next cycle. That's going to be way more efficient and blah, blah, blah. But bottom line is they're looking at, how do we tweak the shape of the structure that the transistor is made up of to make it more effective, to make it work with smaller currents, to make it better from a power density standpoint, better thermal properties, and so on and so forth.

But the separate piece is what is the actual process itself of creating that structure, right? That process is basically a recipe, right? So this is the secret sauce, the magic that really makes TSMC work.

If you are going to replicate what TSMC does, you need to follow basically the same iterative process that they do to get to their current recipe, right? This is like a chef that's iterated over and over and over with their ingredients to kind of, to get a really good outcome.

You can think of a TSMC fab as a thing, a box with like 500 knobs on it. And you've got PhDs tweaking every single knob and they're paid an ungodly amount of money, takes them a huge amount of time. They'll start at the, you know, say seven nanometer process node. And then based on what they've learned to get there, they iterate to get to the five, the three, the two, and so on.

And you really just have to do it hands-on. You have to climb your way down that hierarchy because the things you learn at seven nanometer help to shape what you do at five and three and two and so on.

And this is one of the challenges with, for example, TSMC just trying to spin up a new fab starting at the leading node in like North America or whatever. You can't really do that. It's best to start, you know, a couple of generations before and then kind of work your way down locally. Because even if you try to replicate what you're doing generally in another location...

Dude, air pressure, humidity, everything's a little bit different. Things break. This is why, by the way, Intel famously had a design philosophy for their fabs called copy exactly. And this was famously a thing where everything down to the color of the paint in the bathrooms would have to be copied exactly to spec because nobody fucking knew why the freaking yields from one fab were great and the other one were shit. And it was just like, I don't know, maybe let's just not mess with anything. That was the game plan.

And so TSMC has their own version of that. That tells you how hard this is to do, right? This is really, really tough, tough stuff. The actual process starts with a pure silicon wafer. So you get your wafer source. This is basically sand, silica, that has been purified, roughly speaking. And you put a film of oxide on top of it, grown using oxygen or water vapor, that's just meant to protect the surface and block current leakage.

And then what you're going to do is deposit on top of that a layer of a material that's meant to respond to light. This is called photoresist. And the idea behind photoresist is if you expose it to light, some parts of the photoresist will become soluble. You'll be able to remove them using some kind of process.

or others might harden. And depending, you might have positive photoresist or negative photoresist, depending on whether the part that's exposed either stays or is removed. But essentially, the photoresist is a thing that's able to retain the imprint of light that hits your wafer in a specific way, right? So by the way, the pure silicon wafer, that is a wafer. You're going to

we're ultimately going to make a whole bunch of dies on that wafer. We're going to make a whole bunch of, say, B200 dies on that one wafer. So the next step is once you've laid down your photoresist, you're going to shoot a light source at a pattern, sometimes called a reticle or a photo mask, a pattern of your chip. And the light that goes through is going to encode that pattern, and it's going to image it onto the photoresist.

And there's going to be an exposed region. And you're going to replicate that pattern all across your wafer in a sort of raster-scan type of way, right? And anyway, so you are going to then etch away, you're going to get rid of, your sort of like photoresist. You'll then do steps like ion implantation, where you use a little particle accelerator to fire ions into your silicon to dope it, because semiconductors need dopants, basically.

Basically, yeah, you make some imperfections and that turns out to mess with how the electrons go through the material and it's all magic, honestly. And by the way...

To that point of copy exactly, this is another fun detail in case you don't know. One of the fundamental reasons TSMC is so dominant, and why they rose to dominance, is yield. So actually you can't be perfect. It's a fundamental property of fabrication that some stuff won't work out. Some percent of your chips will be broken and not usable, and that's yield.

And if you get a yield of like 90%, that's really good. If only 10% of what you fabricate is broken. When you get smaller, and especially as you set up a new fab, your yield starts out bad.

It's like inevitable. And TSMC is very good at getting the yield to improve rapidly. And so that's a fundamental aspect of competition. If your yield is bad, you can't be economical and you lose. 100%. In fact, this is where when it comes to SMIC, which is TSMC's competitor in China, which by the way,

stole a bunch of TSMC's industrial secrets. In a very fun way, but there's some fun details there for sure. Yeah, yeah, yeah. Like lawsuits and all kinds of stuff. But yeah, fundamentally SMIC, so stole a lot of that information and replicated it quite successfully. They're now at the seven nanometer level, right? And they're working on five, but their yields are suspected to be pretty bad, right?

And one of the things is with China, the yields matter less because you have massive government subsidies of the fabrication industry. And so, you know, they can maybe get away with that to make the market competitive because the government of China has identified or the CCP has identified this as a key strategic thing. So they're willing to just shovel money into the space.
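
To put rough numbers on the yield discussion, a classic textbook toy model is the Poisson yield model, yield = exp(-D0 * A), where D0 is the defect density and A is the die area. The defect densities and die sizes below are illustrative assumptions only, but they show why yields start out ugly on a new fab or process, and why big AI dies are especially punishing.

```python
# Toy Poisson yield model: yield = exp(-defect_density * die_area).
import math

scenarios = [
    ("mature process", 0.1),    # assumed defects per cm^2
    ("newly ramped fab", 0.5),  # assumed defects per cm^2 while ramping
]

for name, d0 in scenarios:
    for die_area_cm2 in (1.0, 6.0):   # a small mobile die vs. a big AI die
        y = math.exp(-d0 * die_area_cm2)
        print(f"{name:>16}, {die_area_cm2:.0f} cm^2 die: yield ~{y:.0%}")
```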

But yeah, so this fabrication process has a lot of steps. By the way, a lot of them are cleaning, like a lot of them, just kind of polishing off surfaces, cleaning them to make sure everything's level. So there's a lot of boring stuff that goes on here. Anyway, I work with a lot of guys who are very deep in this space, so I do like to nerd out on it, but I'll contain myself.

The part of this process though that I think is sort of most useful to draw your attention to is this idea of just shining a light source onto a reticle, onto this photo mask that contains the imprint of the circuit you want to print essentially onto your wafer. So that light source and that whole kind of optics around it, that is a huge, huge part of the tradecraft here. So when you think about the things that make this hard, number one, there's the recipe.

How do you do these many, many, many layers of kind of like, you know, photo mask and etching and, you know, ion implantation and, you know, deposition, all that jazz. That know-how, that's what TSMC knows really, really well, right? That's the thing that's really, really hard to copy. But even if you could copy that, you would still need the light source that allows you to do this thing.

photolithography, as it's called, the kind of exposure of specific patterns onto your wafer. And so those photolithography machines become absolutely critical in the AI supply chain, in the hardware supply chain. And there is really just one company that can do it well. And in a way, it's a complex of companies. So this is called ASML. This is in the Netherlands, a Dutch company.

They have this really interesting overlapping history with the Carl Zeiss company, and they are essentially kind of a complex of companies just because of ownership structure and overlapping talent and stuff like that. But through this ASML-Carl Zeiss complex...

So when we talk about photolithography, this very, very challenging stage of how do we put light onto our chip or onto our wafer such that it gives us with high fidelity the pattern we're after, that is going to be done by photolithography machines produced by ASML.

And that brings us to the, I think the sort of final stage of the game to talk about how the photolithography machines themselves work and just why they're so important. Does that make sense? Or is there stuff that you wanted to add on the TSMC bit?

I think one thing we can mention real quick, since we were touching on process nodes, is where does Moore's law fit into this? Well, if you look back to 2011, we were at the 28 nanometer stage. Now we're using 5 nanometer, roughly, for AI, and trying to get to the 2 nanometer stage.

And that is not according to Moore's law, right? Moore's law has slowed down kind of...

empirically. It's much slower now, at least relative to the 80s or very early on, to get to smaller process sizes. And that's partially why you have seen the move to CPUs having multiple cores, parallelization, and that's why GPUs are such a huge deal.

Even though we can't scale down and get to smaller process nodes as easily, it's incredibly hard. If you just engineer your GPU better,

even without a higher density of transistors, by getting those cores to work better together, by combining them in different ways, by designing your chip in a certain way, you can get the sort of jump in compute speed and capacity that you used to get just through smaller transistors.

Yeah. And I mean, it is the case also that thanks to things like FinFET and gate-all-around, like we have seen a surprising robustness of even the fabrication process itself. Like, so the five nanometer process first came out in like 2020.

And then we were hitting three nanometers in early 2023. So, you know... There's still some juice to be squeezed, but it's slowing down, I think. Yeah. It's fair to say. Yeah, no, I think that's true. And you can actually look at the projections, by the way, because of the insane capital expenditure required to set up a new fab. Like TSMC can tell you what their schedule is for like the next three nodes, like going into 2028, 2029, that sort of thing.

And that's worth flagging, right? We're talking tens of billions of dollars to set up a new fab, like aircraft carriers' worth of risk capital. And it really is risk capital, right? Because like Andrey said, you build the fab and then you just kind of hope that your yields are good, and they probably won't be at first. And like, that's a scary time. And so, you know, this is a very, very high risk industry. TSMC is very close to base reality in terms of unforgiving market exposure, right?

Right. So, okay, I guess photolithography is sort of like the last and final glorious step in the process, where really we're going to squeeze a lot of the high resolution into our fabrication process. This is where a lot of that resolution comes from. So let's start with

the DUV, the deep ultraviolet lithography machines that allowed us to get roughly to where we are today, roughly to the, let's say, seven nanometer node, arguably the five nanometer. There's some debate there. So when we talk about DUV, the first thing I want to draw your attention to is that there is a law in physics that says, roughly speaking, that the wavelength of your light is going to determine the

level of precision with which you can make images, with which you can, in this case, imprint a pattern. So if you have a 193 nanometer light source, you're typically going to think, oh, well, I'll be in the hundreds of nanometers in terms of the resolution with which I can

sort of image stuff, right? Now there's a whole bunch of stuff you can do to change that. You can use larger lenses. Essentially what this does is it collects a lot more rays of that light. And by collecting more of those rays, you can focus more tightly or in more controlled ways and image better. But generally speaking, the wavelength of your light is going to be a big, big factor. And the size of your lens is going to be another. That's the numerical aperture, as it's sometimes described.
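
The relationship being gestured at here is usually written as the Rayleigh criterion: minimum feature size is roughly k1 times the wavelength divided by the numerical aperture. The k1 and NA values below are typical ballpark figures used only for illustration, but they show why 193 nm DUV tops out around the tens-of-nanometers feature range per exposure, and why EUV changes the game.

```python
# Rayleigh criterion sketch: min feature ~ k1 * wavelength / NA.
def min_feature_nm(wavelength_nm, na, k1=0.30):
    return k1 * wavelength_nm / na

print(f"DUV immersion (193 nm, NA ~1.35): ~{min_feature_nm(193, 1.35):.0f} nm features")
print(f"EUV           (13.5 nm, NA ~0.33): ~{min_feature_nm(13.5, 0.33):.0f} nm features")
print(f"High-NA EUV   (13.5 nm, NA ~0.55): ~{min_feature_nm(13.5, 0.55):.1f} nm features")
```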

So those are, anyway, those are the two kind of key components. 193 nanometers is the wavelength that's used for deep ultraviolet light. This is a big machine, costs millions and millions of dollars. It's got a bunch of lenses and mirrors in it. And ultimately it ends up shining light onto this photo mask. And there's a bunch of interesting stuff about, you know, technologies like off-axis illumination,

and eventually immersion lithography and so on that get used here. But fundamentally, you're shining this laser and you're trying to be really clever about the lens work that you're using to get to these feature sizes that might allow us to get to seven nanometers. You can go further than seven nanometers with DUV if you...

do this thing called multi-patterning. So you take essentially your wafer and you go over it once and you go over it again with the same laser. And that allows you to kind of, let's say, do a first pass and then not necessarily corrective, but

an improving pass on your die during the fabrication process. The challenge is that this reduces your throughput. It means that, instead of passing over your wafer once, you've got to pass over it twice or three times or four times.

And that means that your output is going to be slower. And because your capital expenditure is so high, basically you're amortizing the cost of these insanely expensive photolithography machines over the number of wafers you can pump out. So slowing down your output really means reducing your profit margin very significantly. And so SMIC is looking presumably at using multi-patterning like that.

to get to the five nanometer node. But again, that's going to cost you, in the same way that bad yield costs you: it's going to cost you throughput. And those things are really tricky. So that is the DUV machine; it allowed us to get to about seven nanometers.
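
Here's a deliberately oversimplified toy model of that throughput hit. Real multi-patterning adds extra exposures only on the critical layers rather than slowing the whole wafer flow uniformly, and every number below is a made-up assumption, but it shows how extra passes eat into the amortization of a very expensive tool.

```python
# Toy amortization of lithography tool cost vs. number of exposure passes.
tool_cost = 150e6              # assumed tool cost, in dollars
useful_life_years = 5          # assumed useful life
wafers_per_hour_single = 100   # assumed throughput with a single exposure pass

for passes in (1, 2, 4):       # single, double, quadruple patterning
    wafers_per_hour = wafers_per_hour_single / passes
    lifetime_wafers = wafers_per_hour * 24 * 365 * useful_life_years
    cost_per_wafer = tool_cost / lifetime_wafers
    print(f"{passes} pass(es): ~{wafers_per_hour:.0f} wafers/hr, "
          f"~${cost_per_wafer:,.0f} of tool cost amortized per wafer")
```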

But then at the five nanometer level, pretty quickly, you just need a new light source. And that's where EUV, extreme ultraviolet lithography comes in. It is a technology that has been promised forever. Like there are, I don't know, 10 generations or something of TSMC processes where they're like, ah, this is going to be the one that uses EUV. And there's always some stupid shit that comes up and then they can't ship it.

So finally, we're at the EUV generation now. EUV light source is 13.5 nanometers. It is really, really fucking cool. I'm just going to tell you how crazy this is, okay? So somehow, you need to create 13.5 nanometer light, okay?

By the way, what I'm sharing here, there's a really great explainer of this that goes into much of this detail and has great illustrations on the Asianometry YouTube channel. Check that out. That's another great resource. But so it turns out, back in the day, people realized that you could fire a laser at a tin plate, like a flat sheet of tin, and get it to emit

13.5 nanometer light. 13.5 nanometers is like super, super like extreme ultraviolet, very, very short wavelength, high energy light.

The problem with that, though, is that what you tend to find is that the light is kind of going to fly out in all different directions. And you need to find a way to collect it somehow. So people went, okay, you know what? Like, let's experiment with concave tin plates. So we're going to shape a tin plate kind of in the shape of a concave mirror so that when we shine light at it, the light that we get back will hopefully be more focused, more, yeah, more not collimated, but yeah, more controlled, let's say.

So they tried that. The problem with that is when you shine light on that concave tin plate, you get a bunch of sputtering. You get a bunch of like vaporization of the tin. And so, yeah, you produce your 13 nanometer light, but that light gets absorbed by this, like all these annoying tin particles that then get in the way. So you're like, ah, shit. Well, okay, now we're screwed. Tin doesn't work. But then somebody came up with this idea of using tin droplets.

So here's what's actually going to happen. It's pretty fucked up inside an EUV machine. So you've got a tin droplet generator. This thing fires these tiny little like 100 micron tin droplets at about 80 meters a second. They are flying through this thing. So tin droplets go flying. As they're flying, a pre-pulse laser is going to get shot at them and hit them to flatten them out.

turning them into basically the plates, reflective plates that we want, getting them in the right shape. So you're a tin droplet, you're flying through at top speed, you get hit by laser pulse number one to get flattened. And then in comes the main laser pulse from a CO2 laser that's going to vaporize you and have you emit your plasma. Now, because you're just a tiny tin droplet, there's not enough of you vaporized to get in the way of that 13.5 nanometer light. So we can actually collect it. So

So that's like, I mean, you are taking this, it's like hitting a bullet with another bullet twice in a row, right? You've got this tin droplet flying crazy fast, pre-pulse laser flattens it out. Then the next laser, boom, vaporizes it. Out comes the EUV light. And by the way, that has an overall conversion efficiency of about 6%. So like you're losing the vast majority of your power there. Out comes the EUV light. And then it's going to start to hit a bunch of mirrors, right?

No lenses, just mirrors. Why? Because at 13.5 nanometers, basically everything is absorbent, including air itself.

So now you've got to fucking have a vacuum chamber. This is all, by the way, happening in a fucking vacuum because your life now sucks because you're making EUV laser. So you've got a vacuum chamber because air will absorb shit and you're not allowed to use lenses. Instead, you've got to find a way to use mirrors because your life sucks. Everything in here is just mirrors. There's like about a dozen, just under a dozen mirrors in an EUV system.

All they're trying to do is basically replicate what lenses do. Like you're trying to focus light with mirrors, which, based on my optics background, I mean, that is a hard thing to do. There's a lot of interesting jiggery-pokery that gets done here, including poking holes in mirrors so you can let light go through, mostly, and hopefully not get too lost. Anyway, it's a mess, but it's really cool, but it's a mess. And so you've got these like 12 mirrors or 11 mirrors or 10 mirrors, depending on the configuration, but...

desperately trying to kind of collect and focus this light. It's all happening in vacuum. Finally, it hits your photo mask, and even your photo mask has to be reflective, because the light would just be absorbed by any kind of transmissive material.

And so you, anyway, this creates so many painful, painful problems. You're literally not able to have any, what are called refractive elements. In other words, lens like elements where the light just goes through, gets focused and blah. No, everything has to be reflective all the time. And.

That is a giant pain in the butt. It's a big part of the reason why these machines are a lot harder to build and a lot more expensive. But that is EUV versus DUV. It seems like all you're doing is changing the wavelength of the light. But when you do that, all of a sudden, like you'll find... So even these mirrors, by the way, are about 70% reflective, which means about 30% of the light gets absorbed. And if you've got 10 or 11 multilayer mirrors, then all the way through, you're going to end up with just 2% transmission.

Like if 70, sorry, 30% of light gets lost at mirror one, 30% mirror two. If you work that through with 10 mirrors, you get to about 2% transmission, right? So you're getting really, really crap conversion efficiency on all the power you're putting into your system. By the way, the CO2 laser is so big, it's got to be under the floor of the room where you're doing all this stuff. This whole thing is a giant, giant pain in the butt. And that's part of the challenge. That is EUV.
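
Working through that mirror-loss arithmetic explicitly, with the roughly 70 percent reflectivity figure quoted above:

```python
# Geometric loss through a chain of ~70%-reflective EUV mirrors.
reflectivity = 0.70
for n_mirrors in (10, 11, 12):
    transmitted = reflectivity ** n_mirrors
    print(f"{n_mirrors} mirrors: ~{transmitted:.1%} of the light survives")
```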

There's also high numerical aperture EUV, which is the next beat. That basically just involves using effectively bigger lenses, tweaking your mirror configuration, because you're in EUV, to effectively collect more rays of light

so you can focus down more tightly. The problem with that is that all the setup, all the semiconductor fabrication setup assumes a certain size of optics. And so when you go about changing that, you've got to refactor a whole bunch of stuff.

You can't image the whole photo mask at once. The size of the photo mask that you can actually image, in other words, the size of the circuit you can imprint on your chip drops by about 50%. So now if you want to make the same chip, you've got to stitch together two kind of

photo masks, if you will, rather than just having one clean circuit that you're printing, you're going to stitch together two of them. And how do you actually get these insanely high resolution circuits to line up in just the right way? That's its own giant pain in the butt with a whole bunch of interesting implications for the whole supply chain there. I'm going to stop talking, but the bottom line is...

EUV is a big, big leap forward from DUV. And it's what China right now is completely missing. That is, export controls have fully prevented China from accessing EUV machines, let alone high-NA EUV. So they're all on DUV. They're trying to do multi-patterning to match what we can do at TSMC and other places with EUV. Yeah, I think you did a great job conveying just how insane these technologies are. Like,

You know, once you realize how absurd what's going on is in terms of precision, it's pretty mind blowing. And I think it also brings us to maybe the last point we'll get to and a large part of why we're doing this episode is when it comes to export controls, maybe we can dive into like, what are they? Like what is being controlled and how does it relate to fabrication, to chips and so on?

Yeah, actually, great question, right? It's almost like people treat it as a given when they're like, we're going to export control this stuff, but what exactly are you export controlling? So there's a whole bunch of different things. You go through the supply chain, basically, and you can make sense of it a bit more. The first is, hey, let's prevent China from getting their hands on these EUV lithography machines, right? They can't build them domestically. They don't have a Carl Zeiss. They don't have an ASML.

So, you know, we can presumably cut them off from that. And hopefully that just makes it really hard for them to domesticate their own photolithography industry.

Secondly, as a sort of defense in depth strategy, maybe we can also try to block them off from accessing TSMC's outputs. So in other words, prevent them from designing a chip and then sending it off to TSMC for fabrication. Because right now that's what happens in the West. NVIDIA say designs a new chip, they send the design to TSMC, TSMC fabs the chip.

and then maybe packages it or whatever, it gets packaged, and then they send it off. But what you could try to do is prevent China from accessing essentially TSMC's outputs. Historically, China's been able to enjoy access to both whatever machines ASML has pumped out and to whatever TSMC can do with those machines. So they could just send a design to TSMC, have it fabbed, and there you go.

But in the last couple of years, as export controls have come in, gradually the door has been closed on accessing frontier chips and then increasingly on photolithography, such that, again, there's not a single EUV machine in China right now. By the way, these EUV machines also need to be constantly maintained. So even if there were an EUV machine in China, one strategy you could use is just to make it illegal to send the repair crews, the 20 or so people who are needed to keep it up and running,

to China. And presumably, you know, that would at least make that one machine less valuable. And, you know, they could still reverse engineer it and all that. But the fabrication is part of the magic. Yeah. So those kind of two layers are pretty standard. And then you can also prevent,

companies in China from just buying the finished product, the NVIDIA GPU, for example, or the server, right? And so these three layers are being targeted by export control measures. Those are maybe the three kind of main ones that people think about is photolithography machines, TSMC chip fab output, and then even the final product from companies like say NVIDIA, right?

The interesting thing, by the way, that you're starting to see, and this bears mentioning in this space too, is like, NVIDIA used to be the only designer really, I mean, for frontier, for cutting edge GPUs. What you're starting to see increasingly is as different AI companies like Anthropic, like OpenAI are starting to bet big on different architectures and training strategies.

their need for specialized AI hardware is starting to evolve, such that when you look at the kinds of servers that Anthropic is going to be using, you're seeing a much more GPU-heavy set of servers than the ones that OpenAI is looking at, which are starting to veer more towards the kind of like two-to-one GPU to CPU ratio. And that's for interesting reasons that have to do with OpenAI thinking, well, maybe we need more

of verifiers. We want to lean into using verifiers to validate certain outputs of chains of thought and things like that. And so if we do that, we're going to be more CPU heavy and anyway, blah, blah. So you're starting to see custom ASICs

the need for custom custom chips develop with these frontier labs and increasingly like opening eyes, developing their own chip. And obviously Microsoft has its own chip lines and Amazon has its own chip line that they're developing with Anthropic and so on. And so we're going to see increasingly bespoke hardware that,
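To make that verifier point a bit more concrete, here's a minimal sketch, purely illustrative and not any lab's actual pipeline, of why a verifier-heavy inference setup leans on CPUs: generation is a big batched GPU job, while checking or scoring the sampled chains of thought can often be farmed out to ordinary CPU worker processes. Every function name here is a hypothetical stand-in.

```python
# Minimal sketch of generate-on-GPU, verify-on-CPU (illustrative only; every
# name here is a hypothetical stand-in, not any lab's real API).
from concurrent.futures import ProcessPoolExecutor


def generate_candidates(prompt: str, n: int) -> list[str]:
    # Stand-in for a GPU-bound call to a model server that samples n
    # candidate chains of thought for the prompt.
    return [f"{prompt} -> candidate reasoning #{i}" for i in range(n)]


def verify(candidate: str) -> float:
    # Stand-in for a CPU-bound verifier: a rule checker, a unit-test runner,
    # or a small scoring model. Returns a dummy score here.
    return (len(candidate) % 7) / 7.0


def best_answer(prompt: str, n: int = 16, cpu_workers: int = 8) -> str:
    candidates = generate_candidates(prompt, n)        # GPU-heavy step
    with ProcessPoolExecutor(max_workers=cpu_workers) as pool:
        scores = list(pool.map(verify, candidates))    # CPU-heavy step
    _, best = max(zip(scores, candidates))
    return best


if __name__ == "__main__":
    print(best_answer("What is 17 * 24?"))
```

The only point of the sketch is the shape of the workload: if the verify step grows, you want proportionally more CPU per GPU, which is the kind of consideration that reportedly nudges different labs toward different server configurations.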

And that push toward bespoke hardware is going to result in firms like Broadcom being brought in. Broadcom specializes in basically saying, hey, you have a need for a specific new kind of chip architecture? We'll help you design it. We'll be your NVIDIA for the purposes of this chip. That's how Google got their TPU off the ground back in the day. And it's now how OpenAI, reportedly, we talked about this last week,

is building its own new generation of custom chip. So Broadcom likes to partner with folks like that. And then they'll, of course, ship that design out to TSMC for fabrication on whatever node they choose for that design. So anyway, that's the big design ecosystem in a nutshell.

Yeah. And yet another fun historical, well, I guess, interesting historical detail, I don't know if it's fun: TSMC was unique when it was starting out as a company that just provided fabrication. So a company like NVIDIA could design a chip and then just

ask TSMC to fabricate it, and TSMC promised not to then use your design to make a competing product. Prior to TSMC, you had companies like Intel that had fabrication technology, but Intel was making money from selling its own chips, CPUs and so on, right? TSMC's core business was taking designs from other people, fabricating the chip, and getting it to you, and nothing else. We're not going to

make GPUs or whatever. And that is why NVIDIA could even go to them. NVIDIA could not ask a potential competitor to fab its chips.

Take AMD. AMD is actually fabless, so they also do their design in-house and then contract TSMC to make the chips. And as you often find out, TSMC has a limited capacity for who it can make chips for.

So, you know, you might want to start a competitor, but you can't just call TSMC and be like, hey, can you make some chips for me? It's, yeah, it's not that simple. And one of NVIDIA's advantages is this very well-established relationship going back to the very beginnings of NVIDIA, right? They very fortuitously

struck a deal very early on. That's how they got off the ground, by getting TSMC to be their fabrication partner. So they have a very deep, close relationship, and a pretty significant advantage because of that. Yeah, absolutely. Actually, great point to call that out, right? TSMC is famous for being the first, as it's known, pure-play foundry, right? That's kind of the term. You'll also hear about the flip side,

so-called fabless chip designers, right? That's the other side of the coin, like NVIDIA. NVIDIA doesn't fab; they design. They're a fabless designer. Whereas, yeah, TSMC is a pure-play foundry, so they just fab. It kind of makes sense when you look at the insane capital expenditures and the risks involved in this stuff. You just can't focus on both things. And the classic example, to your point of, you know, NVIDIA can't go to AMD: so AMD is fabless, but Intel isn't.

And Intel tries to fab for other companies. And that always creates this tension where, yeah, of course, NVIDIA is going to look at Intel and be like, "Fuck you guys. You're coming out with whatever it is, Arrow Lake or a bunch of AI-optimized designs. Those are ultimately meant to compete with us on design. So of course we're not going to give you our fab business. We're going to go to our partners at TSMC." So it's almost like the economy wants these things to be separate.

And you're increasingly seeing that this is the standard state of play. GlobalFoundries is a pure-play fab. SMIC is a pure-play fab. And then the Huawei-SMIC partnership is kind of like the NVIDIA-TSMC partnership, where Huawei does the design and SMIC does the fabbing. All this stuff is so deep and complex, and there are webs of relationships that are

crazy. And with the technology, the number of steps to get from a design to an actual chip, there's so much we haven't even gotten into. I don't think we really got into packaging; well, we touched on it, but yeah. And then there's, you know, constructing the motherboard, which is a whole other step.

So anyway, it's pretty fascinating, and I think we might have to call it there without going into a lot more detail. But hopefully we've provided a pretty good overview of kind of the history of

hardware and AI and the current state of it, and why it's such an important part of the equation, such a, I guess, pivotal aspect of who gets to win, who gets to dominate in AI, and why everyone wants to build massive data centers and get, you know, 100,000 GPUs. The only way to scale is via more chips and more compute. And

That's just the game that's being played out right now. Well, hopefully you enjoyed this episode.

Very detailed episode on just this one topic. We haven't done this kind of episode in a while, and it was a lot of fun for us. So do let us know. You can comment on YouTube, on Substack, or leave a review. We'd love to hear if you would want more of these specialized episodes. We have, you know, quite a few we could do. We could talk about projections for AGI, which I think is really interesting, and

energetic systems, like a thousand things. So please do comment if you found this interesting or you have other things you'd like us to talk about. Let's go!

Nothing but for in the cold. GP calls in the air. Deeper it makes us aware. Chances to slay some of this electric ride. A bunch of cars with silicon, we can't hide. Future fit, explore. We just like a blackboard.

Join us now as we explore.

Finding reflows in the world's machine. Code carries secrets unseen and serene. Memories box and data lines connect. The boom of tech, every sector affects. It's a heart to follow, we trust the flow. I have the soul, don't need to guess. Multisignal, frisbee, you've made my soul in the light.

Every fight I face, I wish it in motion, my questions feel the grace. Join the service, this is my simple A-S-I, let's make a world. It's on, we're at the start. Junctus is placed on this electric ride. The punch calls us, silicon, we can't hide. Future fits gold, we should take a micro-roll.
