
OpenAI's Noam Brown, Ilge Akkaya and Hunter Lightman on o1 and Teaching LLMs to Reason Better

2024/10/2

Training Data

People
Hunter Lightman
Ilge Akkaya
Noam Brown
Pat Grady
Sonya Huang
Topics
Noam Brown: The O1 model combines the strengths of large language models and deep reinforcement learning, and can think for longer, which improves its reasoning. This is similar to System 1 and System 2 thinking in humans: some problems need longer thinking to reach a more accurate answer. O1's way of reasoning is more general and can be applied to many domains, such as math and programming. Inspired by AlphaGo, O1 also benefits from longer thinking time, but its way of thinking is more general and applies to more domains. O1's reasoning process is human-interpretable, so you can observe how it thinks.

Ilge Akkaya: I initially did not have much confidence that O1 would succeed, but my confidence grew after seeing the model solve problems in different ways. OpenAI takes an empirical, data-driven approach: when the data starts to show the results we want, the team goes all in. ChatGPT was a paradigm shift, and O1 represents a new direction for reasoning capability. O1 is being used by doctors and researchers as a brainstorming partner, helping with work such as cancer research and gene therapy.

Hunter Lightman: The O1 model series is trained with RL to think and reason, which is fundamentally different from traditional LLMs, and it generalizes across many reasoning domains. O1's approach to reasoning stays consistent across problem types, regardless of how easy it is to generate or verify an answer. During O1's training, the model scored higher on math evals than any previous attempt, and its chain of thought showed backtracking and self-correction. That gave me confidence that O1 would succeed.

Deep Dive

Key Insights

Why is O1 better at STEM problems than previous models?

O1 is better at STEM problems because these tasks often benefit from longer reasoning, which O1 is designed to do. STEM problems tend to fall into the category of hard reasoning tasks where there is a clear benefit from considering more options and thinking for longer.

Why did the team choose to hide the chain of thought in O1?

The team chose to hide the chain of thought in O1 partly for competitive reasons and to mitigate risks associated with sharing the thinking process behind the model, similar to the decision not to share model weights.

What is the significance of inference-time scaling laws in O1?

The inference-time scaling laws in O1 are significant because they suggest that the model's capabilities can improve dramatically with more compute time, potentially leading to breakthroughs in areas like math research and software engineering.

What are the main bottlenecks to scaling test-time compute?

The main bottlenecks to scaling test-time compute include the need for significant engineering work to build and run large-scale systems, the availability of appropriate benchmarks, and the diminishing returns as compute time increases.

What is the biggest misunderstanding about O1?

The biggest misunderstanding about O1 is the origin of its codename, 'Strawberry.' It was simply named because someone in the room was eating strawberries at the time, not because of the popular question about counting Rs in 'strawberry.'

How should founders think about using GPT-4 versus O1?

Founders should use GPT-4 for general tasks and O1 for tasks that benefit from longer reasoning, such as STEM and coding. O1 is designed to think for longer and can be more effective in these domains.

Chapters
The researchers discuss their initial convictions and aha moments leading to the development of O1.
  • Initial conviction was promising but the path was unclear.
  • Aha moments came when methods started to work and outputs showed different problem-solving approaches.
  • OpenAI's empirical, data-driven approach when trends aligned.

Shownotes Transcript


One way to think about reasoning is there are some problems that benefit from being able to think about it for longer. There's this classic notion of system one versus system two thinking in humans. System one is the more automatic, instinctive response, and system two is the slower, more process-driven response. And for some tasks,

You don't really benefit from more thinking time. So if I ask you, like, what's the capital of Bhutan? You know, you could think about it for two years. It's not going to help you get it right with higher accuracy. What is the capital of Bhutan? I actually don't know.

But, you know, there's some problems where there's clearly a benefit from being able to think for longer. So one classic example that I point to is the Sudoku puzzle. You could, in theory, just go through a lot of different possibilities for what the Sudoku puzzle might be, what the solution might be. And it's really easy to recognize when you have the correct solution. So in theory, if you just had tons and tons of time to solve a puzzle, you would eventually figure it out.

We're excited to have Noam, Hunter, and Ilge with us today, who are three of the researchers on Project Strawberry, or O1, at OpenAI. O1 is OpenAI's first major foray into general inference-time compute, and we're excited to talk to the team about reasoning, chain of thought, inference-time scaling laws, and more. Ilge, Hunter, and Noam, thank you so much for joining us, and congratulations on releasing O1 into the wild. I want to start by asking, did you always have conviction this was going to work?

I think that we had conviction that something in this direction was promising, but the actual path to get here was never clear. And you look at O1, it's not like this is an overnight thing. Actually, there's a lot of years of research that goes into this. And a lot of that research didn't actually pan out. But I think that there was conviction from OpenAI and a lot of the leadership that something in this direction

had to work, and they were willing to keep investing in it,

despite the initial setbacks. And I think that eventually paid off. I'll say that I did not have as much conviction as Noam from the very beginning. I've been staring at language models, trying to teach them to do math and other kinds of reasoning for a while. And I think there's a lot of ebb and flow to research. Sometimes things work, sometimes things don't work. When we saw that the methods we were pursuing here started to work,

I think it was a kind of aha moment for a lot of people, myself included, where I started to read some outputs from the models that were approaching the problem solving in a different way.

And that was this moment, I think, for me where my conviction really set in. I think that OpenAI in general takes a very empirical data-driven approach to a lot of these things. And when the data starts to speak to you, when the data starts to make sense, when the trends start to line up and we see something that we want to pursue, we pursue it. And that, for me, was when I think the conviction really set in. What about you, Ilge? You've been at OpenAI for a very long time, five and a half years. Five and a half years.

What did you think? Did you have conviction from the beginning that this approach was going to work? No, I've been wrong several times since joining about the path to AGI. I originally thought that robotics was the way forward. That's why I joined the robotics team first. Embodied AI, AGI, that's where we thought things were going to go. But yeah, I mean, things hit roadblocks.

I would say like during my time here, ChatGPT, well, I guess that's kind of obvious now that was a paradigm shift. We were able to share very broadly with the world something that is a universal interface. And I'm glad that now we have a new path potentially forward to push this reasoning paradigm. But yeah, it was definitely not obvious to me for the longest time.

I realize there's only so much that you're able to say publicly for very good reasons about how it works, but what can you share about how it works, even in sort of general terms?

So the O1 model series is trained with RL to be able to think, and you could call it reasoning, maybe, also. And it is fundamentally different from what we're used to with LLMs. And we've seen it really generalize to a lot of different reasoning domains, as we've also shared recently. So we're very excited about this paradigm shift

with this new model family. And for people who may not be as familiar with what's state-of-the-art in the world of language models today, what is reasoning? How would you define reasoning? And maybe a couple words on what makes it important. Good question. I mean, I think

One way to think about reasoning is there are some problems that benefit from being able to think about it for longer. There's this classic notion of system one versus system two thinking in humans. System one is the more automatic, instinctive response, and system two is the slower, more process-driven response. And for some tasks,

You don't really benefit from more thinking time. So if I ask you, like, what's the capital of Bhutan? You know, you could think about it for two years. It's not going to help you get it right with higher accuracy. What is the capital of Bhutan? I actually don't know.

But there's some problems where there's clearly a benefit from being able to think for longer. So one classic example that I point to is the Sudoku puzzle. You could, in theory, just go through a lot of different possibilities for what the Sudoku puzzle might be, what the solution might be. And it's really easy to recognize when you have the correct solution. So in theory, if you just had tons and tons of time to solve a puzzle, you would eventually figure it out.

And so that's what I consider to be... I think a lot of people in the AI community have different definitions of reasoning, and I'm not claiming that this is the canonical one. I think everybody has their own opinions. But I view it as the kinds of problems where there is a benefit from being able to consider more options and think for longer. You might...

call it like a generator-verifier gap where there's like, it's really hard to generate a correct solution, but it's much easier to recognize when you have one. And I think all problems exist on the spectrum from really easy to verify relative to generation, like a Sudoku puzzle, versus just as hard to verify as it is to generate a solution, like naming the capital of Bhutan.
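To make the generator-verifier gap concrete, here is a minimal, illustrative sketch (mine, not anything from the episode) of a Sudoku verifier. Checking a finished 9x9 grid is a few lines of cheap work, while generating a solution can require searching a huge space of candidate grids.

```python
# Minimal illustration of the generator-verifier gap for Sudoku:
# verifying a completed 9x9 grid is cheap and mechanical, even though
# generating a valid grid may require extensive search.

def is_valid_sudoku(grid: list[list[int]]) -> bool:
    """Return True if every row, column, and 3x3 box contains 1-9 exactly once."""
    expected = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6)
        for c in (0, 3, 6)
    ]
    return all(group == expected for group in rows + cols + boxes)
```

Verification touches each of the 81 cells a constant number of times; a naive generator, by contrast, may wade through an enormous number of candidate grids before producing one that passes this check. For "what is the capital of Bhutan?" there is no comparably cheap independent check, which is the other end of the spectrum described above.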

I want to ask about AlphaGo and Noam, your background, having done a lot of great work in poker and other games. To what extent are the lessons from gameplay analogous to what you guys have done with O1, and how are they different? So I think one thing that's really cool about O1 is that it does clearly benefit by being able to think for longer.

When you look back at many of the AI breakthroughs that have happened, I think AlphaGo is the classic example. One of the things that was really noticeable about the bot, though I think underappreciated at the time, was that it thought for a very long time before acting. It would take 30 seconds to make a move. And if you tried to have it act instantly, it actually wasn't better than top humans. It was noticeably worse than them. And so it clearly benefited a lot by that extra thinking time.

Now, the problem is that the extra thinking time that it had, it was running Monte Carlo tree search, which is a particular form of reasoning that worked well for Go but, for example, doesn't work in a game like poker, which my early research was on. And so a lot of the

methods that existed for being able to reason, for being able to think for longer, were still specific to the domain, even though the neural nets behind it, the system one part of the AI, were very general. And I think one thing that's really cool about O1

is that it is so general. The way that it's thinking for longer is actually quite general and can be used for a lot of different domains. And we're seeing that by giving it to users and seeing what they are able to do with it. Yeah. One of the things that's always been really compelling to me about language models, and this is nothing new, is just that because their interface is the text interface, they can be adapted to work on all different kinds of problems.
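For readers who haven't seen it, this is roughly what the Monte Carlo tree search Noam mentions looks like in code: a domain-specific way of spending extra compute at decision time by simulating many games. It is only an illustrative sketch over a hypothetical `Game` interface (`legal_moves`, `next_state`, `is_terminal`, `winner`, and `current_player` are assumed names), not OpenAI code, and the point being made above is precisely that O1's way of thinking for longer is more general than this.

```python
# A generic sketch of Monte Carlo tree search (UCT) for a two-player game.
# `Game` is a hypothetical interface: legal_moves(state), next_state(state, move),
# is_terminal(state), winner(state), current_player(state).
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move
        self.children = []
        self.visits = 0
        self.wins = 0.0  # wins from the perspective of the player who just moved into this node

    def ucb1(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        exploit = self.wins / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(game, root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 1. Selection: walk down fully expanded nodes by UCB1.
        node = root
        while node.children and len(node.children) == len(game.legal_moves(node.state)):
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one untried move, unless the state is terminal.
        if not game.is_terminal(node.state):
            tried = {child.move for child in node.children}
            move = random.choice([m for m in game.legal_moves(node.state) if m not in tried])
            child = Node(game.next_state(node.state, move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: play random moves to the end of the game.
        state = node.state
        while not game.is_terminal(state):
            state = game.next_state(state, random.choice(game.legal_moves(state)))
        final_winner = game.winner(state)
        # 4. Backpropagation: credit each node if the player who moved into it won.
        while node is not None:
            node.visits += 1
            if node.parent is not None and final_winner == game.current_player(node.parent.state):
                node.wins += 1
            node = node.parent
    # More simulations generally mean better move estimates: act on the most-visited child.
    return max(root.children, key=lambda child: child.visits).move
```

The key property is that the chosen move improves as `n_simulations` grows, which is the "thinking for longer" Noam describes, but the search itself is tied to a game with discrete moves and a clear winner.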

And so what's exciting, I think, about this moment for us is that we think we have a way to do something to do reinforcement learning on this general interface. And then we're excited to see what that can lead to. One question on that you mentioned, I thought that was well put sort of the, the

I forget exactly how you phrased it, but the gap between generation and verification, and there's sort of a spectrum in terms of how easy things are to verify. Does the method for reasoning remain consistent at various points in that spectrum, or are there different methods that apply to various points in that spectrum?

One thing I'm excited about for this release has been to get O1 in the hands of so many new people to play with it, to see how it works, what kinds of problems it's good at, and what kinds of problems it's bad at. I think this is something really core to OpenAI's strategy of iterative deployment. We put the technology that we build, the research that we develop out into the world so that we can see-- we do it safely, and we do it so that we can see how the world interacts with it and what kinds of things we might not always understand

fully ourselves. And so in thinking about like, what are the limits of our approaches here? I think it's been really enlightening to see like Twitter

show what it can and what it can't do. I hope that that is like enlightening for the world. That's useful for everyone to figure out what these new tools are useful for. And then I also hope we're able to take back that information and use it effectively to understand our processes, our research, our products better. Speaking of which, is there anything in particular that you all have seen in the Twitterverse that surprised you? You know, ways that people have figured out how to use O1 that you hadn't anticipated?

There's one thing I'm super excited about. I've seen a lot of MDs and researchers use the model as a brainstorming partner. And what they are talking about is that they've been in cancer research for so many years and they've been just running these ideas by the model.

about what they can do about these gene discovery, gene therapy type of applications. And they are able to get like these really novel ways of research to pursue from the model. Clearly the model cannot do the research itself, but it can just be a very nice collaborator with humans in this respect. So I'm super excited about seeing the model just advance this scientific path forward.

That's not what we're doing in our team, but that is the thing, I guess, we want to see in the world: the domains outside ours that really benefit from this model. Noam, I think you tweeted that deep RL is out of the trough of disillusionment. Can you say more about what you meant by that? I mean, I think there was definitely a period, starting with, I think, Atari, the DeepMind Atari results,

where Deep RL was the hot thing. I mean, I was in a PhD program. I remember what it was like in like, you know, 2015 to 2018, 2019. And Deep RL was the hot thing. And in some ways, I think that was, I mean, a lot of research was done, but certainly some things were overlooked. And I think one of the things that was kind of overlooked was the power of just training on tons and tons of data using, you know, something like the GPT approach. And

In many ways, it's kind of surprising because if you look at AlphaGo, which was in many ways the crowning achievement of DeepRL, yes, there was this RL step. But there was also-- I mean, first of all, there was also this reasoning step. But even before that, there was this large process of learning from human data. And that's really what got AlphaGo off the ground. And so then there was this increasing shift. There was, I guess, a view that this was an impurity in some sense, that

So a lot of deep RL is really focused on learning without human data, just learning from scratch. Yeah, AlphaZero, which was an amazing result and actually ended up doing a lot better than AlphaGo. But I think partly because of this focus on learning from scratch, this GPT paradigm kind of flew under the radar for a while.

And except for OpenAI, which saw some initial results for it and, again, had the conviction to double down on that investment. Yeah, so there was definitely this period where DeepRL was the hot thing. And then I think when GPT-3 came out and some of these other large language models, and there was so much success without DeepRL, there was a period of disillusionment where

a lot of people switched away from it or kind of lost faith in it. And what we're seeing now with O1 is that actually there is a place for it, and it can be quite powerful when it's combined with these other elements as well. And I think a lot of the DeepRL results were in kind of well-defined settings, like gameplay. Is O1 one of the first times that you've seen DeepRL used in a much more general kind of unbounded setting? Is that the right way to think about it?

Yeah, I think it's a good point that a lot of the highlight deep RL results were really cool, but also very narrow in their applicability. I mean, I think there were a lot of quite useful deep RL results and also quite general RL results, but there wasn't anything comparable to something like GPT-4 in its impact. So I think we will see that kind of level of impact from deep RL in this new paradigm going forward.

One more question in this general train of thought. I remember the AlphaGo results, you know, at some point in the Lee Sedol tournament, there was move 37. And, you know, that move surprised everybody. Have you seen something of that, you know, sort where O1 tells you something and it's surprising and you think about it and it's actually right and it's better than any, you know, top human could think of? Have you had that moment yet with the model or you think it's O2, O3?

One of the ones that comes to mind is we spent a lot of the time preparing for the IOI competition that we put the model into, looking at its responses to programming competition problems.

And there was one problem where O1 was really insistent on solving the problem in this kind of weird way with some weird method. I don't know exactly what the details were. And our colleagues who are much more into competitive programming were trying to figure out why it was doing it like this. I don't think it was quite a, like, this is a stroke of genius moment. I think it was just like the model didn't know the actual way to solve it. And so it just banged its head until it found something else. Did it get there? Yeah, yeah. It solved the problem.

It was some method that would have been really easy if you saw something else. I wish I had the specific one, but I remember that being kind of interesting. There's a lot of the things in the programming competition results. I think somewhere we have the IOI competition programs published where you can start to see that the model doesn't approach thinking quite like a human does or doesn't approach these problems quite like a human does. It has slightly different ways of solving it for the actual IOI competition.

There was one problem that humans did really poorly on, that the model was able to get half credit on. And then another problem that humans did really well on, that the model was barely able to get off the ground on, just showing that it kind of has a different way of approaching these things than maybe a human would. I've seen the model...

solve some geometry problems, and the way of thinking was quite surprising to me. You're asking the model something like: give me this sphere, and there are some points on the sphere, and asking for the probability of some event or something. And the model would

go, let's visualize this. Let's put the points. And then if I think about it that way, or something. So I'm like, oh, you're just using words and visualizing something that really helps you contextualize. I would do that as a human, and seeing O1 do it too just really surprises me.

Interesting. That's fascinating. So it's stuff that's actually understandable to a human and would actually kind of expand the boundaries of how humans would think about problems versus, you know, some undecipherable machine language. That's really fascinating. Yeah, I definitely think one of the cool things about our O1 result is that these chains of thoughts the model produces are human interpretable. And so we can look at them and we can kind of poke around at how the model is thinking. Were there aha moments along the way?

Or were there moments where, you know, Hunter, you mentioned that you were not as convinced at the outset that this is the direction that was going to work. Was there a moment when that changed where you said, oh my gosh, this is actually going to work? Yeah. So I've been at OpenAI about two and a half years. And most of that time I've been working on trying to get the models better at solving math problems. And we've done a bunch of work in that direction. We've built various different bespoke systems for that. And there was a moment where

on the O1 trajectory where we had just trained this model with this method with a bunch of fixes and changes and whatnot. And it was scoring higher on the math evals than any of our other attempts, any of our bespoke systems. And then we were reading the

chain of thought, and you could see that it felt like it had a different character. In particular, you could see that when it got stuck, it would say, wait, this is wrong, let me take a step back, let me figure out the right path forward. And we called this backtracking. And I think for a long time I'd been waiting to see an instance of the models backtracking, and I kind of felt like I wasn't going to get to see an autoregressive language model backtrack, because they just kind of predict next token, predict next token, predict next token. And so when we

saw this score on the math test and we saw the trajectory that had the backtracking, that was the moment for me where I was like, wow, this is like something is coming together that I didn't think was going to come together and I need to update. And I think that was when I grew a lot of my conviction. I think the story is the same for me. I think it was probably around the same time, actually. Like I...

I joined with this idea that ChatGPT doesn't really think before responding. It's very, very fast. And there was this powerful paradigm in these games of AI being able to think for longer and getting much better results. And there's a question about how do you bring that into language models that I was really interested in. And that's like,

It's easy to say that, but then there's a difference between just saying that, oh, there should be a way for it to think for longer than actually delivering on that. And so I tried a few things, and other people were trying a few different things. And in particular, yeah, one of the things we wanted to see was this ability to backtrack or to recognize when it made a mistake or to try different approaches. And we had a lot of discussions around, how do you enable that kind of behavior?

And at some point, we just felt like, OK, well, one of the things we should try, at least as a baseline, is just to have the AI think for longer. And we saw that, yeah, once it's able to think for longer, it develops these abilities almost emergently that

were very powerful and contain things like backtracking and self-correction, all these things that we were wondering how to enable in the models. And to see it come from such a clean, scalable approach, that was, for me, the big moment when I was like, OK, it's very clear that we can push this further. And it's so clear to see where things are going.

Noam, I think, is understating how strong and effective his conviction in test time compute was. I feel like all of our early one-on-ones when he joined were talking about test time compute and its power. And I think multiple points throughout the project, Noam would just say, why don't we let the model think for longer? And then we would, and it would get better. And he would just be, he would just look at us kind of funny like we hadn't done it until that point. One thing we noticed in your evals is that, you know, O1 is noticeably

good at STEM, it's better at STEM than the previous models. Is there a rough intuition for why that is? I mentioned before that there's some tasks that are reasoning tasks that are easier to verify than they are to generate a solution for, and there's some tasks that don't really fall into that category. And I think STEM problems tend to fall into what we would consider hard reasoning problems. And so I think that's a big factor for why we're seeing a lift on STEM kind of subjects.

Makes sense. I think relatedly, we saw that in the research paper that you guys released that O1 passes your research engineer interview with pretty high pass rates. What do you make of that? And does that mean at some point in the future, OpenAI will be hiring O1 instead of human engineers? I don't think we're quite at that level yet. I think that there's more to... It's hard to beat 100%, though. Maybe the interviews need to be better. I'm not sure. Okay.

I think that the O1 does feel, at least to me, and I think other people on our team, like a better coding partner than the other models. I think it's already authored a couple of PRs in our repo. And so in some ways it is acting like a software engineer because I think software engineering is another one of these STEM domains that benefits from longer reasoning. I don't know. I think that

the kinds of rollouts that we're seeing from the model are thinking for a few minutes at a time. I think the kinds of software engineering job that I do when I go and write code, I think for more than a few minutes at a time. And so maybe as we start to scale these things further, as we start to follow this trend line and let O1 think for longer and longer, it'll be able to do more and more of those tasks, and we'll see.

You'll be able to tell that we've achieved AGI internally when we take down all the job listings and either the company's doing really well or really poorly. What do you think it's going to take for O1 to get great at the humanities? Do you think being good at reasoning and logic and STEM kind of naturally will extend to being good at the humanities as you scale up inference-time compute? Or how do you think that plays out? You know, like you said, we released the models and we were kind of curious to see what they were good at and what they weren't as good at. And, yeah,

and what people end up using it for. And I think there's clearly a gap between the raw intelligence of the model and how useful it is for various tasks. In some ways, it's very useful, but I think that it could be a lot more useful in a lot more ways. And I think there's still some iterating to do to be able to unlock that more general usefulness. Well, and can I ask you on that, do you view...

I'm curious if there's a philosophy at OpenAI or maybe just a point of view that you guys have on

how much of the gap between the capabilities of the model and whatever real world job needs to be done, how much of that gap do you want to make part of the model? And how much of that gap is sort of the job of the ecosystem that exists on top of your APIs, like their job to figure out? Do you have a thought process internally for kind of figuring out like, what are the jobs to be done that we want to be part of the model versus kind of where do we want

our boundaries to be so that there's an ecosystem that sort of exists around us?

So I'd always heard that OpenAI was very focused on AGI, and I was honestly kind of skeptical of that before I joined the company. But basically the first day that I started, there was an all-hands of the company, and Sam got up in front of the whole company and laid out the priorities going forward for the short term and the long term. It became very clear that AGI was the actual priority. And so I think the clearest answer to that is, you know,

AGI is the goal. There's no single application that is the priority other than getting us to AGI. Do you have a definition for AGI? Everybody has their own definition for AGI. Exactly. That's why I'm curious. I don't know if I have a concrete definition. I just think that

It's something about the proportion of economically valuable jobs that our models and our AI systems are able to do. I think it's going to ramp up a bunch over the course of the next however many years. I don't know. It's one of those, you'll feel it when you feel it and we'll move the goalposts back and be like, this isn't that...

for however long until one day we're just working alongside these AI coworkers and they're doing large parts of the jobs that we do now and we're doing different jobs and the whole ecosystem of what it means to do work has changed. One of your colleagues had a good articulation of the importance of reasoning on the path to AGI, which I think paraphrases as something like any job to be done is going to have obstacles along the way. And

The thing that gets you around those obstacles is your ability to reason through them. And I thought that was like a pretty nice connection between the importance of reasoning and the objective of AGI and sort of being able to accomplish economically useful tasks. Is that the best way to think about what reasoning is and why it matters? Or are there other frameworks that you guys tend to use? I think this is a TBD thing.

just because I think at a lot of the stages of the development of these AI systems, of these models, we've seen different shortcomings, different failings of them. I think we're learning a lot of these things as we develop the systems, as we evaluate them, as we try to understand their capabilities and what they're capable of. Other things that come to mind that I don't know how they relate to reasoning or not are like strategic planning, ideating, or things like this where to be a...

to make an AI model that's as good as an excellent product manager, you need to do a lot of brainstorming, ideation on what users need, what all these things are. Is that reasoning or is that a different kind of creativity that's not quite reasoning and needs to be addressed differently than afterwards when you think about

operationalizing those plans into action. You have to strategize about how to move an organization towards getting things done. Is that reasoning? There's parts of it that are probably reasoning and then there's maybe parts that are something else and maybe eventually it'll all look like reasoning to us or maybe we'll come up with a new word and there'll be new steps we need to take to get there.

I don't know how long we'll be able to push this forward, but whenever I think about this general reasoning problem, it helps to think about the domain of math. We've spent a lot of time reading what the model is thinking when you ask it a math problem. And then it's clearly doing this thing where it hits an obstacle and then it backtracks, just has a problem. Oh, wait, maybe I should try this other thing. So when you see that problem,

thinking process. You can imagine that it might generalize to things that are beyond math. That's what gives me hope. I don't know the answer, but hopefully.

The thing that gives me pause is that O1 is already better than me at math, but it's not as good as me at being a software engineer. And so there's some mismatch here. There's still a job to be done. There's still some work to do. If my whole job were doing AIME problems and doing high school competition math, I'd be out of work. There's still some stuff for me for right now.

Since you mentioned the chain of thought and being able to watch the reasoning behind the scenes, I have a question that might be one of those questions you guys can't answer, but just for fun. First off, I give you props for explaining, in the blog post you published with the release of O1, why the chain of thought is actually hidden, and literally saying that partly it's for competitive reasons. Yeah.

And I'm curious if that was a contentious decision or like how controversial that decision was, because I could see it going either way. And it's a logical decision to hide it, but I could also imagine a world in which you decide to expose it. So I'm just curious if that was a contentious decision. I don't think it was contentious. I mean, I think for the same reason that you don't want to...

share the model weights necessarily for a frontier model. I think there's a lot of risks to sharing the thinking process behind the model. And I think it's a similar decision, actually. Can you explain from a layman's perspective, maybe to a layman, what is a chain of thought and what's an example of one?

So, for instance, if you're asked to solve an integral, most of us would need a piece of paper and a pencil. And we would kind of lay out the steps for getting from a complex equation, through steps of simplification, to a final answer. The answer could be one. But how do I get there? That is the chain of thought in the domain of math.
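As a concrete illustration (my own example, not one from the episode), here is what that kind of written-out chain of thought might look like for a simple integral solved by substitution:

```latex
% An illustrative chain of thought for a simple integral, solved by substitution.
\begin{aligned}
I &= \int 2x\, e^{x^{2}}\, dx
  && \text{Step 1: notice } \tfrac{d}{dx}\,x^{2} = 2x, \text{ so set } u = x^{2},\ du = 2x\, dx \\
  &= \int e^{u}\, du
  && \text{Step 2: rewrite the integral in terms of } u \\
  &= e^{u} + C
  && \text{Step 3: integrate} \\
  &= e^{x^{2}} + C
  && \text{Step 4: substitute } u = x^{2} \text{ back}
\end{aligned}
```

Each intermediate line is a step a person would write on paper; the model's chain of thought plays the same role, written out in text.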

Let's talk about that path forward. Inference time scaling laws. To me, that was the most important chart from the research that you guys published. And it seems to me like a monumental result, similar to the scaling laws from pre-training. And sorry to be hypey. Do you agree that the implications here, I think, are pretty profound? And what does it mean for the field as a whole? I think it's pretty profound. And I think

One of the things that I wondered when we were preparing to release O1 is whether people would recognize its significance. We included it, but it's kind of a subtle point. And I was actually really surprised and impressed that so many people recognized what this meant. There have been a lot of concerns that AI might be hitting a wall or plateauing because

pre-training is so expensive and becoming so expensive. And there's all these questions around, like, is there enough data to train on? And I think one of the major takeaways about O1, especially O1-preview, is not what the model is capable of today, but what it means for the future. The fact that we're able to have this different dimension for scaling, which is so far pretty untapped, I think is a big deal.

And I think it means that the ceiling is a lot higher than a lot of people have appreciated. What happens when you let the model think for hours or months or years? What do you think happens? We haven't had O1 for years, so we haven't been able to let it think that long yet. Is there a job just running in the background right now that's still thinking about how to solve world peace? Okay. I'm thinking, thinking. Yeah. There's an Asimov story like that called The Last Question,

where they ask this big computer-sized AI something like, how do we reverse entropy? And it says, I need to think longer for that. And the story goes on: ten years later they check and it's still thinking, and then a hundred years later, then a thousand years later, then ten thousand years later. And yeah, there is as yet not enough information for a meaningful answer, or something like that. Yeah, it's still thinking.

Do you have a guess empirically on what'll happen? Or I guess right now the model has, I've seen some reports, like 120 IQ, so very, very smart. Is there a ceiling on that as you scale up inference-time compute? Do you think you get to infinite IQ?

One of the important things is that it's 120 IQ on some test someone gave it. This doesn't mean that it's got 120-IQ-level reasoning in all the different domains that we care about. I think we even talk about how it is below GPT-4o on some things like creative writing and whatnot. So it's definitely confusing to think about how we extrapolate this model.

I think it's an important point that we talk about these benchmarks. And one of the benchmarks that we highlighted in our results was GPQA, which is a set of questions that are given to PhD students and that typically only PhD students can answer. And the AI is outperforming a lot of PhDs on this benchmark right now. That doesn't mean that it's smarter than a PhD in every single way imaginable. There's a lot of things that a PhD can do that--

There's a lot of things that a human can do, period, that the AI can't do. And so you always have to look at these evals with some understanding that it's measuring a certain thing that is typically a proxy for human intelligence when humans take that test, but means something different when the AI takes that test.

Maybe a way of framing that as an answer to the question is that I hope that we can see that letting the model think longer on the kinds of things that it's already showing it's good at will continue to get it better. So one of my big Twitter moments was I saw a professor that I had in school, a math professor, was tweeting about how he was really impressed with O1 because he had given it a proof that had been solved before by humans, but never by an AI model. And it just took it and ran with it and figured it out.

And that to me feels like we're at the cusp of something really interesting, where it's close to being a useful tool for doing novel math research. If it can do some small lemmas and some proofs for real math research, that would really be a breakthrough. And so I hope by letting it think longer, we can get better at that particular task of being a really good math research assistant. It's harder for me to extrapolate

what it's going to look like. Will it get better at the things that it's not good at now? What would that path forward look like? And then what would the infinite IQ or whatever look like then when it thinks forever on problems that it's not good at? But instead, I think you can kind of ground yourself in a, here are the problems it's good at. If we let it think longer at these, oh, it's going to be useful for math research. Oh, it's going to be really useful for software engineering. Oh, it's going to be really, and you can start to play that game and start to see how I hope the future will evolve.

What are the bottlenecks to scaling test time compute? I mean, for pre-training, it's pretty clear you need enormous amounts of compute. You need enormous amounts of data. This stuff requires enormous amounts of money. Like, it's pretty easy to imagine the bottlenecks on scaling pre-training. What constrains sort of the scaling of inference time compute? When GPT-2 came out and GPT-3 came out, it was, like, pretty clear that, like, okay, if you just throw more data and more GPUs at it, it's going to get a lot better. And it still took...

years to get from GPT-2 to GPT-3 to GPT-4. And there's just a lot that goes into taking an idea that sounds very simple and then actually scaling it up to a very large scale. And I think that there's a similar challenge here where, okay, it's a simple idea, but there's a lot of work that has to go into actually scaling it up. So I think that's the challenge. Yeah. I think that one thing that

maybe doesn't surprise anymore, but that I think used to surprise more academic-oriented researchers who join OpenAI, is how much of the problems we solve are engineering problems versus research problems. Building large-scale systems, training large-scale systems, running algorithms that have never been run before on systems that are brand new, at a scale no one's ever thought of, is really hard. And so there's always a lot of just hard engineering work to make these systems scale up.

Also, one needs to know what to test the model on. So we do have these standard evals as benchmarks, but perhaps there are ones that we are not yet testing the model on. So we're definitely looking for those where we can just spend more compute on test time and get better results.
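The conversation doesn't describe how O1 actually spends test-time compute, but a crude, common way to see why extra inference compute can buy accuracy, and why good evals matter for measuring it, is repeated sampling against a cheap verifier. This is a minimal sketch under that assumption, not O1's method; `generate_candidate` and `verify` are hypothetical callables the caller would supply:

```python
# A crude illustration of trading test-time compute for accuracy:
# sample up to `budget` candidate answers and return the first one that
# passes a cheap verifier. This is NOT how O1 works internally; it is
# only a simple stand-in for "spend more compute at test time."
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def best_of_n(
    generate_candidate: Callable[[str], T],   # hypothetical: samples one candidate answer
    verify: Callable[[str, T], bool],         # hypothetical: cheap check of a candidate
    problem: str,
    budget: int,                              # the test-time compute knob
) -> Optional[T]:
    for _ in range(budget):
        candidate = generate_candidate(problem)
        if verify(problem, candidate):
            return candidate
    return None  # no verified answer within the compute budget
```

If each independent sample is correct with probability p, the chance of returning a verified answer within the budget is 1 - (1 - p)^budget, which also gives a simple picture of the diminishing returns mentioned a bit later in the conversation.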

One of the things I'm having a hard time wrapping my head around is, you know, what happens when you give the model near-infinite compute? Because as a human, even if I'm Terence Tao, I am limited at some point by my brain, whereas you can just put more and more compute at inference time. And so does that mean that, for example, all math theorems will eventually be solvable through this approach? Or where is the limit, do you think?

Infinite compute is a lot of compute. Near infinite. It goes back to the Asimov story of you're waiting 10,000 years, but maybe. But I say that just to ground it: we don't know yet quite what the scaling of this is for how it relates to solving really hard math theorems. It might be that you really do need to let it think for a thousand years to solve some of the unsolved core math problems. Yeah.

Yeah, I mean, I think it is true that if you let it think for long enough, then in theory you could just, you know, formalize everything in Lean or something, and you go through every single possible Lean proof and eventually you stumble upon the theorem. Yeah, we have algorithms already that can solve any math problem, is maybe what you were about to get at, right? Yeah, given infinite time, you can do a lot of things. Yeah, so, you know, clearly you hit diminishing returns as you think for longer. Yeah, very fair. What do you think is the biggest misunderstanding about O1?

I think a big one was like when the name Strawberry leaked, people assume that like it's because of this popular question online of like the models can't answer how many Rs are in Strawberry. And that's actually not the case. When we saw that question, actually, we were really concerned that there was some internal leak about the model. And as far as we know, there wasn't. It was just like a complete coincidence that our project was named Strawberry. And there was also this like popular reasoning about Strawberries.

As far as I can tell, the only reason it's called Strawberry is because at some point, at some time, someone needed to come up with a codename and someone in that room was eating a box of strawberries. And I think that's really the end of it. It's more relatable than Q-star. I think I was pretty impressed with how well understood it was, actually. Yeah. I...

We were actually not sure how it was going to be received when we launched. There was a big debate internally about, are people just going to be disappointed that it's not better at everything? Are people going to be impressed by the crazy math performance? And what we were really trying to communicate was that it's not really about...

the model that we're releasing. It's more about where it's headed. And I think I was, yeah, I wasn't sure if that would be well understood, but it seems like it was. And so I think I was actually very, very happy to see that. Is there any criticism of O1 that you think is fair? It's absolutely not better at everything. It's a funky model to play with. I think people on the internet are finding new ways to prompt it to do better. So there's still

a lot of weird edges to work with. I don't know, I'm really excited to see, someone had alluded earlier to letting the ecosystem work with our platform to make more intelligent products, to make more intelligent things. I'm really interested to see how that goes with O1. I think we're in the very early days. It's kind of like, I don't know, at some point a year ago, people started to really figure out these LMPs, or language model programs, with

GPT-4 or whatever, and it was enabling smarter software engineer tools and things like that. Maybe we'll see some similar kinds of developments with people building on top of O1. Speaking of which, one of the things that we have not talked about is O1 Mini.

And I've heard a lot of excitement about O1 Mini because people are generally excited about small models. And if you can preserve the reasoning and extract some of the world knowledge, you know, for which deep neural nets are not exactly the most efficient mechanism, like that's a pretty, pretty decent thing to end up with. So I'm curious, what's your level of excitement about O1 Mini and kind of the general direction that that represents?

It's a super exciting model also for us as researchers. If a model is fast, it's universally useful. So yeah, we also like it. Yeah, they kind of serve different purposes. And also, yeah, we are very excited to have like a cheaper, faster version and then kind of like a heavier, slower one as well. Yeah, they are useful for different things. So yeah, definitely excited that we ended up with a good trade-off there.

I really like that framing, because I think it highlights how much progress is like how much you can move forward times how much you can iterate. And at least for our research, like Ilge gets at, O1 Mini lets us iterate faster. Hopefully for the broader ecosystem of people playing with these models, O1 Mini will also allow them to iterate faster. And so it should be a really useful and exciting artifact, at least for that reason.

For founders who are building in the AI space, how should they think about when they should be using GPT-4 versus O1? Do they have to be doing something STEM related, coding related, math related to use O1? Or how should they think about it? I'd love if they could figure that out for us. One of the motivations that we had for releasing O1 Preview is to see what people end up using it for and how they end up using it.

There was actually some question about whether it's even worth releasing O1 Preview. But yeah, I think one of the reasons why we wanted to release it was so that we can get into people's hands early and see what use cases it's really useful for, what it's not useful for, what people like to use it for, and how to improve it for the things that people find it useful for. Anything you think people most underappreciate about O1 right now? It's like...

somewhat proof that we're getting a little bit better at naming things. We didn't call it, like, GPT-4.5 Thinking Mode. Well, I thought it was Strawberry. I thought it was Q-star. I don't know. Thinking Mode. That kind of has a ring to it. What are you guys most excited about for O2, O3, whatever may come next? O3.5, whatever.

We're not at a point where we are out of ideas, so I'm excited to see how it plays out. Just keep doing our research. But yeah, most excited about getting the feedback because as researchers, we are clearly biased towards the domains that we can understand, but we'll receive a lot of different use cases from the usage of the product. And we're going to say maybe like, oh yeah, this is an interesting thing to push for. And yeah, like beyond our imagination, it might get better at different fields.

I think it's really cool that we have a trend line, which we post in that blog post. And I think it'll be really interesting to see how that trend line extends. Wonderful. I think that's a good note to end on. Thank you guys so much for joining us today.