ENSPIRING.ai: OpenAI's Noam Brown, Ilge Akkaya and Hunter Lightman on o1 and Teaching LLMs to Reason Better
This video explores the complex landscape of reasoning within artificial intelligence models, contrasting two modes of human thought: System 1 and System 2 thinking. System 1 is characterized by automatic responses, while System 2 involves slower, analytical processing to arrive at a conclusion. The researchers discuss OpenAI's o1 project, a key initiative aiming to generalize inference-time compute, and explain the benefits of allowing AI models to take longer to reason, using examples like the Sudoku puzzle to illustrate this point.
o1 is emphasized as a significant shift in AI research, using reinforcement learning to enhance the reasoning capabilities of language models. The discussion covers the challenges faced during development and how initial skepticism turned into conviction as empirical evidence accumulated. The importance of generating human-interpretable output through reasoning is highlighted, with the model demonstrating competence in STEM areas and posing potential benefits for disciplines such as math and software engineering.
Key Vocabularies and Common Phrases:
1. empirical [ɪmˈpɪrɪkəl] - (adjective) - Based on observation or experience rather than theory or pure logic. - Synonyms: (observational, experiential, practical)
OpenAI in general takes a very empirical, data driven approach to a lot of these things.
2. inference [ˈɪnfərəns] - (noun) - The process of reaching a conclusion based on evidence and reasoning. - Synonyms: (deduction, conclusion, reasoning)
Excited to have Noam, Hunter and Ilge with us today, who are three of the researchers on Project Strawberry, or o1, OpenAI's first major foray into general inference time compute.
3. conviction [kənˈvɪkʃən] - (noun) - A firmly held belief or opinion. - Synonyms: (belief, persuasion, opinion)
I think that we had conviction that something in this direction was promising, but the actual path to get here was never clear.
4. paradigm [ˈpærəˌdaɪm] - (noun) - A typical example or pattern of something; a model. - Synonyms: (model, pattern, example)
We were able to share very broadly with the world, something that is a universal interface. And I'm glad that now we have a new path, potentially forward to push this reasoning paradigm.
5. canonical [kəˈnɒnɪkəl] - (adjective) - Accepted as being accurate and authoritative. - Synonyms: (authoritative, official, standard)
I think a lot of people in the AI community have different definitions of reasoning, and I'm not claiming that this is the canonical one.
6. analogous [əˈnæl.ə.ɡəs] - (adjective) - Comparable in certain respects, typically in a way that makes clearer the nature of the things compared. - Synonyms: (comparable, similar, equivalent)
I want to ask about AlphaGo and Noam, your background, having done a lot of great work in poker and other games, to what extent are the lessons from gameplay analogous to what you guys have done with o1?
7. ebb and flow [ɛb ənd floʊ] - (noun phrase) - A recurrent or rhythmical pattern of coming and going or decline and regrowth. - Synonyms: (wax and wane, fluctuating, cyclical)
I've been staring at language models, trying to teach them to do math and other kinds of reasoning for a while. And I think there's a lot of ebb and flow to research.
8. iterative ['ɪtəˌreɪtɪv] - (adjective) - Relating to or involving repetition, especially of a process or set of instructions. - Synonyms: (repetitive, repeating, cyclic)
I think this is something really core to OpenAI's strategy of iterative deployment.
9. reinforcement learning [ˌriːɪnˈfɔːrsmənt ˌləːrnɪŋ] - (noun) - A type of machine learning technique where an agent learns to make decisions by receiving rewards or penalties for actions taken. - Synonyms: (reward-based learning, adaptive learning, trial and error learning)
What's exciting, I think, about this moment for us is that we think we have a way to do something, to do reinforcement learning on this general interface, and then we're excited to see what that can lead to
10. test time compute [tɛst taɪm kəmˈpjuːt] - (noun phrase) - The computational resources and processes applied during the testing phase of a machine learning model, as opposed to during training. - Synonyms: (evaluation compute, runtime resources, performance testing)
Noam, I think, is understating how strong and effective his conviction in test time compute was
OpenAI's Noam Brown, Ilge Akkaya and Hunter Lightman on o1 and Teaching LLMs to Reason Better
We're excited to have Noam, Hunter and Ilge with us today, who are three of the researchers on Project Strawberry, or o1, at OpenAI. o1 is OpenAI's first major foray into general inference time compute. And we're excited to talk to the team about reasoning, chain of thought, inference-time scaling laws and more. Ilge, Hunter, Noam, thank you so much for joining us, and congratulations on releasing o1 into the wild. I want to start by asking, did you always have conviction that this was going to work? I think that we had conviction that something in this direction was promising, but the actual path to get here was never clear. And you look at o1. It's not like this is an overnight thing. There's a lot of years of research that goes into this, and a lot of that research didn't actually pan out. But I think that there was conviction from OpenAI and a lot of the leadership that something in this direction had to work, and they were willing to keep investing in it despite the initial setbacks. And I think that eventually paid off.
I'll say that I did not have as much conviction as Noam from the very beginning. I've been staring at language models, trying to teach them to do math and other kinds of reasoning for a while. And I think there's a lot of ebb and flow to research. Sometimes things work, sometimes things don't work. When we saw that the methods we were pursuing here started to work, I think it was a kind of aha moment for a lot of people, myself included, where I started to read some outputs from the models that were approaching the problem solving in a different way. And that was this moment, I think, for me, where my conviction really set in. I think that OpenAI in general takes a very empirical, data driven approach to a lot of these things. And when the data starts to speak to you, when the data starts to make sense, when the trends start to line up and we see something that we want to pursue, we pursue it. And that, for me, was when I think the conviction really set in.
What about you, Ilge? You've been at OpenAI for a very long time. Five and a half years. Five and a half years. What did you think? Did you have conviction from the beginning that this approach was going to work? No, I've been wrong several times since joining about the path to AGI. I originally thought that robotics was the way forward. That's why I joined the robotics team first: embodied AI, AGI, that's where we thought things were going to go. But yeah, I mean, things hit roadblocks, I would say, during my time here. ChatGPT, well, I guess that's kind of obvious now, that was a paradigm shift. We were able to share very broadly with the world something that is a universal interface. And I'm glad that now we have a new path, potentially, forward to push this reasoning paradigm. But, yeah, it was definitely not obvious to me for the longest time.
Yeah, I realize there's only so much that you're able to say publicly, for very good reasons, about how it works. But what can you share about how it works, even in sort of general terms? So the o1 model series is trained with RL to be able to think, and you could also call it reasoning, and it is fundamentally different from what we're used to with LLMs, and we've seen it really generalize to a lot of different reasoning domains, as we've also shared recently. So we're very excited about this paradigm shift with this new model family. And for people who may not be as familiar with what's state of the art in the world of language models today, what is reasoning? How would you define reasoning? And maybe a couple words on what makes it important.
Good question. I mean, I think one way to think about reasoning is there are some problems that benefit from being able to think about it for longer. There's this classic notion of system one versus system two thinking in humans. System one is the more automatic, instinctive response, and system two is the slower, more process driven response. And for some tasks, you don't really benefit from more thinking time. So if I ask you, what's the capital of Bhutan? You could think about it for two years. It's not going to help you get it right with higher accuracy. What is the capital of Bhutan? I actually don't know, but there's some problems where there's clearly a benefit from being able to think for longer. So one classic example that I point to is the Sudoku puzzle. You could, in theory, just go through a lot of different possibilities for what the Sudoku solution might be, and it's really easy to recognize when you have the correct solution. So in theory, if you just had tons and tons of time to solve a puzzle, you would eventually figure it out. And so that's what I consider it to be. I think a lot of people in the AI community have different definitions of reasoning, and I'm not claiming that this is the canonical one. I think everybody has their own opinions, but I view it as the kinds of problems where there is a benefit from being able to consider more options and think for longer.
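A concrete way to see this verify-versus-generate asymmetry is the Sudoku example itself: checking a completed grid takes a few lines of code and a fraction of a second, while producing a solution by blind guessing means searching an astronomically large space. The sketch below is a minimal illustration of that gap (our own toy example, not code from o1 or OpenAI):

```python
# Illustrative sketch of the generator-verifier gap (a toy example, not OpenAI code):
# verifying a completed Sudoku grid is cheap, while generating a solution by blind
# search means trying an astronomically large number of candidate grids.

from itertools import product

def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box
    must contain the digits 1-9 exactly once."""
    expected = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
             for br, bc in product(range(3), repeat=2)]
    return all(set(unit) == expected for unit in rows + cols + boxes)

if __name__ == "__main__":
    solved = [
        [5, 3, 4, 6, 7, 8, 9, 1, 2],
        [6, 7, 2, 1, 9, 5, 3, 4, 8],
        [1, 9, 8, 3, 4, 2, 5, 6, 7],
        [8, 5, 9, 7, 6, 1, 4, 2, 3],
        [4, 2, 6, 8, 5, 3, 7, 9, 1],
        [7, 1, 3, 9, 2, 4, 8, 5, 6],
        [9, 6, 1, 5, 3, 7, 2, 8, 4],
        [2, 8, 7, 4, 1, 9, 6, 3, 5],
        [3, 4, 5, 2, 8, 6, 1, 7, 9],
    ]
    print(is_valid_sudoku(solved))  # True: verification takes microseconds
    # Generation by blind guessing would mean sampling from up to 9**81 filled grids,
    # which is why extra "thinking" (search) time helps on this kind of problem.
```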
You might call it a generator-verifier gap, where it's really hard to generate a correct solution, but it's much easier to recognize when you have one. I think all problems exist on this spectrum, from really easy to verify relative to generation, like a Sudoku puzzle, to just as hard to verify as it is to generate a solution, like naming the capital of Bhutan. I want to ask about AlphaGo. And Noam, your background, having done a lot of great work in poker and other games, to what extent are the lessons from gameplay analogous to what you guys have done with o1, and how are they different? So I think one thing that's really cool about o1 is that it does clearly benefit by being able to think for longer. And when you look back at many of the AI breakthroughs that have happened, I think AlphaGo is the classic example. One of the things that was really noticeable about the bot, and that I think was underappreciated at the time, was that it thought for a very long time before acting. It would take 30 seconds to make a move. And if you tried to have it act instantly, it actually wasn't better than top humans, it was noticeably worse than them. It clearly benefited a lot from that extra thinking time.
Now, the problem is that the extra thinking time that it had, it was running Monte Carlo tree search, which is a particular form of reasoning that worked well for Go, but, for example, doesn't work in a game like poker, which my early research was on. And so a lot of the methods that existed for being able to reason, for being able to think for longer, were still specific to the domain, even though the neural nets behind it, the system one part of the AI, were very general. I think one thing that's really cool about o1 is that it is so general. The way that it's thinking for longer is actually quite general and can be used for a lot of different domains. And we're seeing that by giving it to users and seeing what they are able to do with it. Yeah.
One of the things that's always been really compelling to me about language models, and this is nothing new, is just that because their interface is the text interface, they can be adapted to work on all different kinds of problems. What's exciting, I think, about this moment for us is that we think we have a way to do reinforcement learning on this general interface, and then we're excited to see what that can lead to. One question on that: you mentioned, and I thought it was well put, I forget exactly how you phrased it, the gap between generation and verification, and that there's sort of a spectrum in terms of how easy things are to verify. Does the method for reasoning remain consistent at various points in that spectrum, or are there different methods that apply to various points in that spectrum? One thing I'm excited about for this release has been to get o1 in the hands of so many new people to play with it, to see how it works, what kinds of problems it's good at and what kinds of problems it's bad at. I think this is something really core to OpenAI's strategy of iterative deployment. We put the technology that we build, the research that we develop, out into the world. We do it safely, and we do it so that we can see how the world interacts with it and what kinds of things we might not always understand fully ourselves. And so in thinking about what are the limits of our approaches here, I think it's been really enlightening to see Twitter show what it can and what it can't do. I hope that is enlightening for the world, that it's useful for everyone to figure out what these new tools are useful for. And then I also hope we're able to take back that information and use it effectively to understand our processes, our research, our products better.
Speaking of which, is there anything in particular that you all have seen in the twitterverse that surprised you, ways that people have figured out how to use o1 that you hadn't anticipated? There's one thing I'm super excited about. I've seen a lot of MDs and researchers use the model as a brainstorming partner. And what they are talking about is that they've been in cancer research for so many years, and they've been just running these ideas by the model about what they can do, about these gene discovery, gene therapy type of applications, and they are able to get these really novel avenues of research to pursue from the model. Clearly, the model cannot do the research itself, but it can just be a very nice collaborator with humans in this respect. So I'm super excited about seeing the model just advance this scientific path forward. That's not what we're doing in our team, but that is the thing we want to see in the world: the domains that are outside ours that really benefit from this model. Noam, I think you tweeted that deep RL is out of the trough of disillusionment. Can you say more about what you meant by that?
I mean, I think there was definitely a period, starting with, I think, the DeepMind Atari results, where deep RL was the hot thing. I mean, I was in a PhD program. I remember what it was like in, you know, 2015 to 2018, 2019, and deep RL was the hot thing. And in some ways, I think that was. I mean, a lot of research was done, but certainly some things were overlooked. And I think one of the things that was kind of overlooked was the power of just training on tons and tons of data using something like the GPT approach. And in many ways, it's kind of surprising, because if you look at AlphaGo, which was in many ways the crowning achievement of deep RL, yes, there was this RL step, but there was also, first of all, this reasoning step. But even before that, there was this large process of learning from human data, and that's really what got AlphaGo off the ground. And so then there was this increasing shift; there was, I guess, a view that this was an impurity in some sense. So a lot of deep RL really focused on learning without human data, on just learning from scratch. Yeah, AlphaZero, which was an amazing result, actually ended up doing a lot better than AlphaGo. But I think partly because of this focus on learning from scratch, this GPT paradigm kind of flew under the radar for a while, except at OpenAI, which saw some initial results for it and again had the conviction to double down on that investment.
Yeah. So there was definitely this period where deep RL was the hot thing. And then I think, you know, when GPT-3 came out and some of these other large language models, and there was so much success without deep RL, there was, yeah, a period of disillusionment where a lot of people switched away from it or kind of lost faith in it. And what we're seeing now with o1 is that actually there is a place for it, and it can be quite powerful when it's combined with these other elements as well. And I think a lot of the deep RL results were in kind of well defined settings like gameplay. Is o1 one of the first times that you've seen deep RL used in a much more general, kind of unbounded setting? Is that the right way to think about it?
Yeah, I think it's a good point that a lot of the highlight deep RL results were really cool, but also very narrow in their applicability. I mean, I think there were a lot of quite useful deep RL results and also quite general RL results, but there wasn't anything comparable to something like GPT-4 in its impact. So I think we will see that kind of level of impact from deep RL in this new paradigm going forward. One more question in this general train of thought. I remember the AlphaGo results: at some point in the Lee Sedol tournament, there was move 37, and that move surprised everybody. Have you seen something of that sort, where o1 tells you something and it's surprising, and you think about whether it's actually right, and it's better than anything a top human could think of? Have you had that moment yet with the model? Or do you think it's an o2 or o3 thing? One of the ones that comes to mind is, we spent a lot of the time preparing for the IOI competition that we put the model into looking at its responses to programming competition problems. And there was one problem where o1 was really insistent on solving the problem in this kind of weird way with some weird method. I don't know exactly what the details were. And our colleagues, who are much more into competitive programming, were trying to figure out why it was doing it like this. I don't think it was quite a stroke of genius. I think it was just that the model didn't know the actual way to solve it, and so it just banged its head until it found something else. Did it get there? Yeah. Yeah, it solved the problem, it just used some method when it would have been really easy if you saw the other approach. I wish I had the specific one, but I remember that being kind of interesting.
There's a lot of things in the programming competition results. I think somewhere we have the IOI competition programs published, where you can start to see that the model doesn't approach thinking quite like a human does, or doesn't approach these problems quite like a human does. It has slightly different ways of solving them. For the actual IOI competition, there was one problem that humans did really poorly on that the model was able to get half credit on. And then another problem that humans did really well on that the model was barely able to get off the ground on. Just showing that it kind of has a different way of approaching these things than maybe a human would. I've seen the model solve some geometry problems, and the way of thinking was quite surprising to me, such that you're asking the model, there's this sphere, and then there are some points on the sphere, and you're asking for the probability of some event or something. And the model would go, let's visualize this, let's place the points, and then if I think about it that way, or something like that. So I'm like, oh, you're just using words and visualizing something that really helps you contextualize.
Like, I would do that as a human, and seeing o1 do it too really surprised me. Interesting. That's fascinating. So it's stuff that's actually understandable to a human and would actually kind of expand the boundaries of how humans would think about problems, versus some undecipherable machine language. That's really fascinating. Yeah, I definitely think one of the cool things about our o1 result is that these chains of thought the model produces are human interpretable, and so we can look at them and we can kind of poke around at how the model is thinking. Were there aha moments along the way, or were there moments where, Hunter, you mentioned that you were not as convinced at the outset that this is the direction that was going to work. Was there a moment when that changed, where you said, oh my gosh, this is actually going to work? Yeah. So I've been at OpenAI about two and a half years, and most of that time I've been working on trying to get the models better at solving math problems. And we've done a bunch of work in that direction. We've built various different bespoke systems for that. And there was a moment on the o1 trajectory where we had just trained this model with this method, with a bunch of fixes and changes and whatnot, and it was scoring higher on the math evals than any of our other attempts, any of our bespoke systems.
And then we were reading the chains of thought. You could see that they felt like they had a different character. In particular, you could see that when it got stuck, it would say, wait, this is wrong, let me take a step back, let me figure out the right path forward. We called this backtracking. I think for a long time I'd been waiting to see an instance of the models backtracking, and I felt like I wasn't going to get to see an autoregressive language model backtrack, because they just predict the next token, predict the next token, predict the next token. We saw this score on the math test, and we saw the trajectory that had the backtracking. That was the moment for me where I was like, wow, something is coming together that I didn't think was going to come together, and I need to update. And I think that was when I grew a lot of my conviction.
I think the story is the same for me. I think it was probably around the same time, actually. I definitely joined with this idea that ChatGPT doesn't really think before responding. It's very, very fast. And there was this powerful paradigm from these game AIs of being able to think for longer and getting much better results. And there's a question about how do you bring that into language models that I was really interested in. It's easy to say that, but then there's the difference between just saying, oh, there should be a way for it to think for longer, and actually delivering on that. I tried a few things and other people were trying a few different things. And in particular, one of the things we wanted to see was this ability to backtrack, or to recognize when it made a mistake, or to try different approaches. We had a lot of discussions around how do you enable that kind of behavior. At some point we just felt like, okay, well, one of the things we should try, at least as a baseline, is just have the AI think for longer. And we saw that once it's able to think for longer, it develops these abilities almost emergently that were very powerful and contain things like backtracking and self correction.
All these things that we were wondering how to enable in the models, and to see it come from such a clean, scalable approach. That was for me the big moment when I was like, okay, it's very clear that we can push this further, and it's so clear to see where things are going. Noam, I think, is understating how strong and effective his conviction in test time compute was. I feel like all of our early one on ones when he joined were talking about test time compute and its power. And at multiple points throughout the project, Noam would just say, why don't we let the model think for longer? And we would, and it would get better. And he would just look at us kind of funny, like, why hadn't we done it until that point.
One thing we noticed in your evals is that o1 is noticeably good at STEM. It's better at STEM than the previous models. Is there a rough intuition for why that is? I mentioned before that there are some tasks, reasoning tasks, that are easier to verify than they are to generate a solution for, and there are some tasks that don't really fall into that category. And I think STEM problems tend to fall into what we would consider hard reasoning problems. And so I think that's a big factor for why we're seeing a lift on STEM kinds of subjects. Makes sense, I think.
Relatedly, we saw in the research paper that you guys released that o1 passes your research engineer interview with pretty high pass rates. What do you make of that? And does that mean at some point in the future, OpenAI will be hiring o1 instead of human engineers? I don't think we're quite at that level yet. I think that there's more to it. It's hard to beat 100%, though. Maybe the interviews need to be better. I'm not sure. I think that o1 does feel, at least to me, and I think to other people on our team, like a better coding partner than the other models. I think it's already authored a couple of PRs in our repo, and so in some ways it is acting like a software engineer, because I think software engineering is another one of these STEM domains that benefits from longer reasoning. I don't know.
I think that the kinds of rollouts that we're seeing from the model are thinking for a few minutes at a time. For the kind of software engineering job that I do when I go and write code, I think for more than a few minutes at a time. And so maybe as we start to scale these things further, as we start to follow this trendline and let o1 think for longer and longer, it'll be able to do more and more of those tasks, and we'll see. You'll be able to tell that we've achieved AGI internally when we take down all the job listings, and either the company's doing really well or really poorly.
What do you think it's going to take for o1 to get great at the humanities? Do you think being good at reasoning and logic and STEM will kind of naturally extend to being good at the humanities as you scale up inference time, or how do you think that plays out? You know, like we said, we released the models and we were kind of curious to see what they were good at and what they weren't as good at and what people end up using them for. And I think there's clearly a gap between the raw intelligence of the model and how useful it is for various tasks. In some ways it's very useful, but I think that it could be a lot more useful in a lot more ways, and I think there's still some iterating to do to be able to unlock that more general usefulness.
Well, can I ask you on that? I'm curious if there's a philosophy at OpenAI, or maybe just a point of view that you guys have on how much of the gap between the capabilities of the model and whatever real world job needs to be done. How much of that gap do you want to make part of the model and how much of that gap is sort of the job of the ecosystem that exists on top of your APIs, like their job to figure out? Do you have a thought process internally for figuring out what are the jobs to be done that we want to be part of the model versus where do we want our boundaries to be so that there's an ecosystem that exists around us.
I'd always heard that OpenAI was very focused on AGI. I was honestly skeptical of that before I joined the company. Basically, the first day that I started, there was an all hands of the company, and Sam got up in front of the whole company and basically, like, laid out the priorities going forward for, like, the short term and the long term. It became very clear that AGI was the actual priority. And so I think the clearest answer to that is, you know, AGI is the goal. There's no single, like, application that is the priority other than getting us to AGI. Do you have a definition for AGI? Everybody has their own definition for AGI. Exactly. That's why I'm curious. I don't know if I have a concrete definition.
I just think that it's something about the proportion of economically valuable jobs that our models and our AI systems are able to do. I think it's gonna ramp up a bunch over the course of the next however many years. I don't know. It's one of those, you'll feel it when you feel it, and we'll move the goalposts back and say this isn't it, for however long, until one day we're just working alongside these AI coworkers and they're doing large parts of the jobs that we do now, and we're doing different jobs, and the whole ecosystem of what it means to do work has changed. One of your colleagues had a good articulation of the importance of reasoning on the path to AGI, which I think paraphrases as something like: any job to be done is going to have obstacles along the way, and the thing that gets you around those obstacles is your ability to reason through them. And I thought that was a pretty nice connection between the importance of reasoning and the objective of AGI and sort of being able to accomplish economically useful tasks.
Is that the best way to think about what reasoning is and why it matters? Or are there other frameworks that you guys tend to use? I think this is a TBD thing, just because I think at a lot of the stages of the development of these AI systems, of these models, we've seen different shortcomings, different failings of them. I think we're learning a lot of these things as we develop the systems, as we evaluate them, as we try to understand their capabilities and what they're capable of. Other things that come to mind that I don't know how they relate to reasoning or not are things like strategic planning, ideating, or things like this. To make an AI model that's as good as an excellent product manager, you need to do a lot of brainstorming, ideation on what users need, what all these things are. Is that reasoning? Or is that a different kind of creativity that's not quite reasoning and needs to be addressed differently? Then afterwards, when you think about operationalizing those plans into action, you have to think about how to move an organization towards getting things done. Is that reasoning? There are parts of it that are probably reasoning, and then there are maybe parts that are something else, and maybe eventually it'll all look like reasoning to us, or maybe we'll come up with a new word, and there will be new steps we need to take to get there.
I don't know how long we'll be able to push this forward, but whenever I think about this general reasoning problem, it helps to think about the domain of math. We've spent a lot of time reading what the model is thinking when you ask it a math problem, and it's clearly doing this thing where it hits an obstacle and then it backtracks: oh wait, maybe I should try this other thing. So when you see that thinking process, you can imagine that it might generalize to things that are beyond math. That's what gives me hope. I don't know the answer, but hopefully. The thing that gives me pause is that o1 is already better than me at math, but it's not as good as me at being a software engineer. And so there's some mismatch here. There's still a job to be done. Good, there's still some work to do. If my whole job were doing AIME problems and high school competition math, I'd be out of work. There's still some stuff for me, for right now. Since you mentioned the chain of thought and being able to watch the reasoning behind the scenes, I have a question that might be one of those questions you guys can't answer.
But just for fun, first off, I give you props for, in the blog post that you guys published with the release of o1, explaining why the chain of thought is actually hidden and literally saying that partly it's for competitive reasons. I'm curious if that was a contentious decision, or how controversial that decision was, because I could see it going either way, and it's a logical decision to hide it, but I could also imagine a world in which you decide to expose it. So I'm just curious if that was a contentious decision. I don't think it was contentious. I mean, I think for the same reason that you don't necessarily want to share the model weights for a frontier model, I think there's a lot of risks to sharing the thinking process behind the model, and I think it's a similar decision, actually.
Can you explain, from a layman's perspective, maybe to laymen, what is a chain of thought, and what's an example of one? For instance, if you're asked to solve an integral, most of us would need a piece of paper and a pencil, and we would kind of lay out the steps for getting from a complex expression, through steps of simplification, to a final answer. The answer could be one, but how do I get there? That is the chain of thought in the domain of math.
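As a small illustration of what such a written-out chain of thought might look like for a simple integral (a toy example of our own, not an actual o1 trace):

```latex
% Illustrative chain of thought for a simple integral (a toy example, not an o1 trace).
\begin{align*}
&\text{Goal: evaluate } \int 2x\, e^{x^2}\, dx. \\
&\text{The factor } 2x \text{ is the derivative of } x^2, \text{ so try the substitution } u = x^2,\; du = 2x\, dx. \\
&\int 2x\, e^{x^2}\, dx = \int e^{u}\, du = e^{u} + C = e^{x^2} + C. \\
&\text{Check by differentiating: } \frac{d}{dx}\bigl(e^{x^2} + C\bigr) = 2x\, e^{x^2}. \quad\checkmark
\end{align*}
```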
Let's talk about that path forward: inference-time scaling laws. To me, that was the most important chart from the research that you guys published, and it seems to me like a monumental result, similar to the scaling laws from pre-training. And, sorry to be hypey, but do you agree that the implications here are pretty profound, and what does it mean for the field as a whole? I think it's pretty profound, and I think one of the things that I wondered when we were preparing to release o1 is whether people would recognize its significance. We included it, but it's kind of a subtle point, and I was actually really surprised and impressed that so many people recognized what this meant. There have been a lot of concerns that AI might be hitting a wall or plateauing because pre-training is so expensive and becoming so expensive, and there are all these questions around whether there is enough data to train on. And I think one of the major takeaways about o1, especially o1-preview, is not what the model is capable of today, but what it means for the future. The fact that we're able to have this different dimension for scaling that is so far pretty untapped, I think, is a big deal. And I think it means that the ceiling is a lot higher than a lot of people have appreciated.
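One deliberately simplified way to see why extra test-time compute can translate into accuracy on easily verified problems is repeated sampling plus verification: draw more candidate answers and keep one that a checker accepts. This is a generic baseline for illustration only, not a description of how o1 actually works, and the helper names and numbers below are our own:

```python
# A simplified illustration (not o1's actual mechanism) of why extra test-time
# compute can buy accuracy on easily verified problems: sample more candidate
# solutions and keep one that a checker accepts. If a single attempt succeeds
# with probability p, then n independent attempts succeed with probability
# 1 - (1 - p) ** n, which climbs steadily as n (compute) grows.

import random

def attempt_success_rate(p: float, n: int, trials: int = 100_000) -> float:
    """Empirically estimate the chance that at least one of n attempts,
    each succeeding independently with probability p, is correct."""
    hits = 0
    for _ in range(trials):
        if any(random.random() < p for _ in range(n)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    p = 0.1  # hypothetical per-attempt success rate
    for n in (1, 4, 16, 64):
        print(f"n={n:3d} attempts -> ~{attempt_success_rate(p, n):.2f} success rate")
    # Analytically: 1 - (1 - 0.1) ** 64 is roughly 0.999, i.e. near-certain success,
    # provided a reliable verifier exists to recognize the correct answer.
```

The catch, of course, is that this only helps when a reliable verifier exists, which is exactly the generator-verifier gap discussed earlier.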
What happens when you let the model think for hours or months or years? What do you think happens? We haven't had o1 for years, so we haven't been able to let it think that long yet. Is there a job just running in the background right now where it's still thinking about how to solve world peace? Okay, I'm thinking, I'm thinking. Yeah, there's an Asimov story like that called The Last Question, where they ask this big computer-sized AI something about how do we reverse entropy. And it says, I need to think longer for that. And the story goes, ten years later they check and it's still thinking, and then 100 years later, and then 1,000 years later, and then 10,000 years later. Yeah. There is as yet insufficient data for a meaningful answer? Something like that, yeah. Do you have a guess, empirically, on what'll happen? I guess right now, I've seen some reports that the model has something like a 120 IQ, so very, very smart.
Is there a ceiling on that? As you scale up inference time compute, do you think you get to infinite IQ? One of the important things is that it's 120 IQ on some test someone gave it. This doesn't mean that it's got 120-IQ-level reasoning in all the different domains that we care about. I think we even talk about how it is below 40 on some things like creative writing and whatnot. So it's definitely confusing to think about how we extrapolate this model. I think it's an important point when we talk about these benchmarks, and one of the benchmarks that we highlighted in our results was GPQA, which is a set of questions that are given to PhD students, and that PhD students can typically answer. And the AI is outperforming a lot of PhDs on this benchmark right now.
That doesn't mean that it's smarter than a PhD in every single way imaginable. There's a lot of things that a PhD can do, there's a lot of things that a human can do, period, that the AI can't do. You always have to look at these evals with some understanding that they're measuring a certain thing that is typically a proxy for human intelligence when humans take that test, but means something different when the AI takes that test. Maybe a way of framing an answer to the question is that I hope we can see that letting the model think longer on the kinds of things that it's already showing it's good at will continue to get it better. One of my big Twitter moments was when I saw a professor that I had in school, a math professor, tweeting about how he was really impressed with o1, because he had given it a proof that had been solved before by humans, but never by an AI model. And it just took it and ran with it and figured it out. And that, to me, feels like we're at the cusp of something really interesting, where it's close to being a useful tool for doing novel math research, where if it can do some small lemmas and some proofs for real math research, that would really be a breakthrough.
And so I hope that by letting it think longer, we can get better at that particular task of being a really good math research assistant. It's harder for me to extrapolate what it's going to look like beyond that: will it get better at the things that it's not good at now, what would that path forward look like, and then what would the infinite IQ or whatever look like when it thinks forever on problems that it's not good at? But instead, I think you can kind of ground yourself in: here are the problems it's good at, and if we let it think longer at these, oh, it's going to be useful for math research, oh, it's going to be really useful for software engineering. And you can start to play that game and start to see how I hope the future will evolve.
What are the bottlenecks to scaling test time compute? I mean, for pre-training, it's pretty clear: you need enormous amounts of compute, you need enormous amounts of data, and this stuff requires enormous amounts of money. It's pretty easy to imagine the bottlenecks on scaling pre-training. What constrains the scaling of inference time compute? When GPT-2 came out and GPT-3 came out, it was pretty clear that if you just throw more data and more GPUs at it, it's going to get a lot better. And it still took years to get from GPT-2 to GPT-3 to GPT-4. And there's just a lot that goes into taking an idea that sounds very simple and then actually scaling it up to a very large scale. And I think that there's a similar challenge here where, okay, it's a simple idea, but there's a lot of work that has to go into actually scaling it up. So I think that's the challenge.
Yeah, I think one thing that maybe doesn't surprise anymore, but that might have used to surprise more academically oriented researchers who join OpenAI, is how many of the problems we solve are engineering problems versus research problems. Building large scale systems, training large scale systems, running algorithms that have never been run before, on systems that are brand new, at a scale no one's ever thought of, is really hard. And so there's always a lot of just hard engineering work to make these systems scale up. Also, one needs to know what to test the model on. So we do have these standard evals as benchmarks, but perhaps there are ones that we are not yet testing the model on. So we're definitely looking for those, where we can just spend more compute at test time and get better results.
One of the things I'm having a hard time wrapping my head around is what happens when you give the model near-infinite compute, because as a human, even if I'm Terence Tao, I am limited at some point by my brain, whereas you can just put more and more compute at inference time. And so does that mean that, for example, all math theorems will eventually be solvable through this approach? Or, like, where is the limit, do you think, with infinite compute, or a lot of compute, near-infinite? It goes back to the Asimov story, if you're waiting 10,000 years. But maybe. I said that just to ground it: we don't know yet quite what the scaling of this is for how it relates to solving really hard math theorems. It might be that you really do need to let it think for a thousand years to solve some of the unsolved core math problems. Yeah, I mean, I think it is true that if you let it think for long enough, then in theory you could just go through, you formalize everything in Lean or something, and you go through every single possible Lean proof, and eventually you stumble upon the theorem.
Yeah, we already have algorithms that can solve any math problem given infinite time, is maybe what you were about to get at. Given infinite time, you can do a lot of things. Fair. Yeah, so clearly there are diminishing returns as you think for longer, but yeah, very fair. What do you think is the biggest misunderstanding about o1? I think a big one was when the name Strawberry leaked. People assumed that it was because of this popular question online about how the models can't answer how many R's are in strawberry. And that's actually not the case. When we saw that question, actually, we were really concerned that there was some internal leak about the model. And as far as we know, there wasn't. It was just a complete coincidence that our project was named Strawberry and there was also this popular reasoning question about strawberries.
As far as I can tell, the only reason it's called Strawberry is because at some point, at some time, someone needed to come up with a code name, and someone in that room was eating a box of strawberries. And I think that's really the end of it. It's more relatable than Q*, I think. I was pretty impressed with how well understood it was, actually. Yeah, we were actually not sure how it was going to be received. When we launched, there was a big debate internally about: are people just going to be disappointed that it's not better at everything? Are people going to be impressed by the crazy math performance? And what we were really trying to communicate was that it's not really about the model that we're releasing, it's more about where it's headed. And I wasn't sure if that would be well understood, but it seems like it was, and so I was actually very happy to see that.
Is there any criticism of o1 that you think is fair? It's absolutely not better at everything. It's a funky model to play with. I think people on the Internet are finding new ways to prompt it to do better. So there's still a lot of weird edges to work with. I don't know. I'm really excited to see, as was alluded to earlier, the ecosystem work with our platform to make more intelligent products, to make more intelligent things. I'm really interested to see how that goes with o1. I think we're in the very early days. It's kind of like, I don't know, at some point a year ago, people started to really figure out these LMPs, or language model programs, with GPT-4 or whatever, and it was enabling smarter software engineering tools and things like that. Maybe we'll see some similar kinds of developments with people building on top of o1.
Speaking of which, one of the things that we have not talked about is o1-mini. And I've heard a lot of excitement about o1-mini, because people are generally excited about small models. And if you can preserve the reasoning and extract some of the world knowledge, for which deep neural nets are not exactly the most efficient mechanism, that's a pretty decent thing to end up with. I'm curious, what's your level of excitement about o1-mini and the general direction that that represents? It's a super exciting model for us as researchers as well; if a model is fast, it's universally useful. So, yeah, we also like it. Yeah, they kind of serve different purposes. And also, yeah, we are very excited to have a cheaper, faster version and then kind of a heavier, slower one as well. Yeah, they are useful for different things. So, yeah, definitely excited that we ended up with a good trade-off there.
I really like that framing, because I think it highlights how much progress is, like, how much you can move forward times how much you can iterate. And at least for our research, as Ilge gets at, o1-mini lets us iterate faster. Hopefully, for the broader ecosystem of people playing with these models, o1-mini will also allow them to iterate faster. And so it should be a really useful and exciting artifact, at least for that reason.
For founders who are building in the AI space, how should they think about when they should be using GPT-4 versus o1? Like, do they have to be doing something STEM related, coding related, math related, to use o1? Or how should they think about it? I'd love it if they could figure that out for us. One of the motivations that we had for releasing o1-preview was to see what people end up using it for and how they end up using it either way. There was actually, yeah, some question about whether it was even worth releasing o1-preview. But yeah, I think one of the reasons why we wanted to release it was so that we could get it into people's hands early and see what use cases it's really useful for, what it's not useful for, what people like to use it for, and how to improve it for the things that people find it useful for.
Anything you think people most underappreciate about o1 right now? It's somewhat proof that we're getting a little bit better at naming things. We didn't call it GPT-4.5 thinking mode or whatever. Well, I thought it was Strawberry. I thought it was Q*. I don't know, thinking mode kind of has a ring to it.
What are you guys most excited about for o2, o3, whatever may come next, o3.5, whatever? Yeah, we're not at a point where we are out of ideas, so I'm excited to see how it plays out. Just keep doing our research. But yeah, I'm most excited about getting the feedback, because as researchers we are clearly biased towards the domains that we can understand. But we'll receive a lot of different use cases from the usage of the product, and we're going to say, maybe, oh yeah, this is an interesting thing to push for. And, beyond our imagination, it might get better at different fields. I think it's really cool that we have a trend line, which we posted in that blog post, and I think it'll be really interesting to see how that trend line extends. Wonderful. That's a good note to end on. Thank you guys so much for joining us today.
Artificial Intelligence, Business, Innovation, Technology, Reasoning, o1, Sequoia Capital