The video explores the current state of deep learning, addressing the question of whether the field is stagnating. Multiple perspectives from professionals like Chris Hay, Kush Varshney, and Kate Soule are presented. Some argue that deep learning has hit a wall, while others believe it is progressing through new techniques and applications, such as an increased focus on inference-time computation rather than training-time innovation. The debate is grounded in the recent announcement of OpenAI's o3 model and an emerging view on benchmarks and pre-training efficiency.

The discussion further delves into recent developments in AI, such as the release of the DeepSeek-V3 model from China, which boasts high performance at a much lower reported cost. This challenges the prevalent notion that high-performance AI systems are inherently expensive to develop. The panelists also touch on the implications of AI governance in an increasingly global and rapidly advancing field, emphasizing the need for international collaboration to ensure safe and ethical AI development.

Main takeaways from the video:

💡
The deep learning field may face challenges, but there is significant ongoing innovation, especially around inference-time compute.
💡
New AI models like DeepSeek-V3 illustrate that reduced costs and greater efficiency are possible, signaling a shift in AI development norms.
💡
AI governance and safety remain crucial, necessitating a coordinated global approach to address potential risks and ensure technology integrates safely into society.
Please remember to turn on the CC button to view the subtitles.

Key Vocabularies and Common Phrases:

1. unsurmountable [ˌʌnsərˈmaʊntəbl] - (adjective) - Impossible to overcome or pass. - Synonyms: (insurmountable, insuperable, invincible)

I think there is a wall, but it's not an unsurmountable one.

2. inference [ˈɪnfərəns] - (noun) - The act of drawing a conclusion based on evidence and reasoning. - Synonyms: (deduction, reasoning, conclusion)

So inference time compute is really working.

3. benchmark [ˈbɛntʃˌmɑrk] - (noun) - A standard or point of reference against which things may be compared or assessed. - Synonyms: (standard, criterion, gauge)

That blows out of the water a lot of the benchmarks that people have traditionally used to measure or argue for measuring whether or not we're getting close to AGI

4. lobotomize [ləˈbɑtəˌmaɪz] - (verb) - Literally, to perform a lobotomy; figuratively, to dull something or strip it of its intelligence, personality, or liveliness. - Synonyms: (deaden, dull, desensitize)

And I want the models to be fun, so don't lobotomize them, you know what I mean?

5. deliberation [dɪˌlɪbəˈreɪʃən] - (noun) - Long and careful consideration or discussion. - Synonyms: (consideration, reflection, discussion)

I mean, I wouldn't say that I would want to spend a lot of time on this sort of safety deliberation either

6. anthropomorphizing [ˌænθrəpəˈmɔrfaɪzɪŋ] - (verb) - Attributing human characteristics or behaviors to a god, animal, or object. - Synonyms: (humanizing, personalizing, attributing human traits)

With the o3 models they're continuing to innovate on what can be done at inference time, having the models essentially think longer, to risk anthropomorphizing these models.

7. foreshadow [fɔrˈʃædoʊ] - (verb) - To indicate or suggest beforehand; to predict or provide a presage. - Synonyms: (indicate, predict, augur)

And I think this is where we're heading, and o3 has foreshadowed a little bit that you can run these models in a more efficient mode, or, if you need the maximum performance, you can run them in kind of a compute-intensive mode.

8. provisional [prəˈvɪʒənl] - (adjective) - Arranged or existing for the present, possibly to be changed later. - Synonyms: (temporary, interim, tentative)

Do you think that's kind of just almost just historically provisional?

9. bifurcated [ˈbaɪfərˌkeɪtɪd] - (adjective) - Divided into two branches or parts. - Synonyms: (divided, split, diverged)

Do you think that's the case? Or do you think this kind of bifurcated architecture is going to be what we'll see going forwards?

10. decentralize [ˌdiˈsɛntrəˌlaɪz] - (verb) - To distribute functions, control, or activity away from a single central authority or location. - Synonyms: (disperse, distribute, deconcentrate)

It almost feels like what they're doing is decentralizing AI development.

OpenAI o3, DeepSeek-V3, and the Brundage-Marcus AI bet

Frequently asked question: is deep learning hitting a wall? Chris Hay is a distinguished engineer and the CTO of Customer Transformation. Chris, what do you think? Oh yeah, totally, Tim. In fact, I think it's going backwards. I think the models are getting worse and worse and worse. This is the worst it's ever been. It's totally hit a wall, Tim. Happy 2025, Chris. Kush Varshney is an IBM Fellow working on issues of AI governance. Kush, welcome back. What do you think? I think there is a wall, but it's not an unsurmountable one. I think we're making progress, we're changing it up. Instead of just taking some steps, we're doing some rock climbing. A little bit more of a serious answer. And Kate Soule is Director of Technical Product Management for Granite. Kate, happy 2025. What's your take? No, I don't think deep learning is hitting a wall. I think we're finding new ways to apply it in 2025 that are going to have some interesting benefits. All right, all that and more on today's Mixture of Experts.

I'm Tim Hwang. Happy 2025 and welcome to Mixture of Experts. Each week, MoE offers a world-class panel of product leaders, researchers, and engineers to analyze the biggest breaking news in artificial intelligence. Today we're going to be talking about the release of DeepSeek-V3 and a very public wager between an AI booster and an AI skeptic. But first, let's talk about OpenAI's o3. This was the last announcement of OpenAI's 12 Days of OpenAI marketing event at the end of last year, and it was arguably the biggest announcement. They have basically touted a new model, which is now getting sort of limited trial access for safety purposes, that blows out of the water a lot of the benchmarks that people have traditionally used to measure, or argue for measuring, whether or not we're getting close to AGI.

So on a benchmark that we've talked about on the show in the past, FrontierMath, OpenAI's o3 is doing incredibly well. And I think one of the reasons I wanted to bring this up is that it really does seem like, after what was a news cycle late last year of people saying deep learning is slowing down, the old methods don't work anymore, pre-training is over, and a lot of general hand-wringing, this really kind of reset the narrative, at least in the circles that I run in, to say that actually there's maybe a lot more room to run on all this. Chris, maybe I'll turn to you first. You sort of outright made fun of me on the opening question. What's your take on the o3 model? How important is it? Does it really indicate that there's still a lot more progress to run? How do you read it?

Basically, I think it's a great thing, actually. So I've been playing a lot with the o1 and the o1 Pro models and I've been having the best time with them. So inference time compute is really working. So I'm excited about o3. I'm just kind of annoyed that we don't have it, though. That's the real thing. It's yet another, you know, "this is coming soon," and that's sort of annoying me, especially being in Europe, because in Europe we don't get anything these days. We didn't get Sora, we didn't get half of the models that came through on the 12 days of Christmas. So I'm excited about o3. As for the benchmark thing, two things in my mind about that. One, in my opinion, benchmarks are stupid, so I'm not really going to read into that. And then probably the second thing is, even if we take the opinion that benchmarks aren't stupid, it took an awful lot of time to come back with those answers, and it was a little bit kind of monkeys and typewriters, right? Which is, if you type long enough, then you're eventually going to get the answer.

But with that aside, actually, I'm so impressed by o1 and o1 Pro that I'm super excited about o3, and I think it's going to be a great model, and it's really proving out inference time compute. Yeah, one follow-up there: I know you're saying you think all benchmarks are stupid, but you think this model is better. So what use case do you have in mind where you're like, oh, actually, o1 is noticeably better than what we've had before? Yeah, yeah. There's probably a few. So the main one for me is coding, right? I mean, it is completely on a different level. So even compared to Claude 3.5 Sonnet, GPT-4o, and the early versions of o1, honestly, o1 Pro is on a different level now. Probably the big thing I've found working with the models is that Pro just takes quite a long time to come back with an answer, so I end up switching between models all the time. It's like, okay, I want a fast answer on this, I think it can handle this. Oh no, it can't handle it, going to switch from o1 to o1 Pro. So that sort of changing models just to get fast answers back, and how much reasoning I want from the models, is a sort of technique. But for me, coding is definitely the biggest thing. I don't really care about the math stuff because I'll just use a calculator, right? But definitely for coding, I see a marked improvement.

Got it. Kate, maybe I'll turn to you. So I think if you're not watching this space super closely, it's easy to just get bewildered by the number of models and the fine variations between all these models coming out. I think famously, or it was kind of talked about, the reason they jumped from o1 to o3 was that O2 was, I think, already used by the UK telecom company, so it was a trademark thing that got them to o3. But I guess, Kate, the question for you is whether you can help our listeners understand a little bit of what's new with what they're trying with o3, kind of looking under the hood. These models seem to be a lot more performant, but there also seem to be a lot of new things that they're trying underneath the surface, and I think it's worth our listeners knowing a little bit of the flavor of that, if you want to speak to that at all.

Absolutely. So I think the most important thing for our listeners to understand when looking at the new o3 model, and the o-series models in general from OpenAI, is that we're transitioning from spending and innovating at the training time of the model, and instead saying, okay, let's take a model that's been trained and let's run it multiple times and spend more compute at the actual inference time, when it's being deployed out in the world. It seems like with the o3 models they're continuing to innovate on what can be done at inference time, having the models essentially think longer (to risk anthropomorphizing these models), think longer through different tasks, and search through many different potential options and solutions before picking the best one, which then leads to improved performance. But it also takes longer; to Chris's point, you have to wait longer for a response.

One of the things that I think is really exciting about the o3 model, and this broader investment and pivot to more inference time compute, is that it actually can give you some really nice trade-offs. And I think this is where we're heading, and o3 has foreshadowed a little bit that you can run these models in a more efficient mode, or, if you need the maximum performance, you can run them in kind of a compute-intensive mode. And I think that's going to be really cool, because it gives people the ability to set their compute budget, set their time constraints for latency if they need a response quickly. And I think we're going to see a lot more of that in 2025, of people playing along that kind of cost-performance trade-off even within a single model: okay, I want my model only to think about this for a minute, versus I want my model to give a response immediately, versus my model can think about this for five minutes and then give me a response back, depending all on how much I'm willing to pay and how important it is that the model gives a really strong response back.
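To make the cost-performance trade-off Kate describes concrete, here is a minimal, purely illustrative sketch of best-of-n style sampling under a caller-supplied compute budget. The generate_candidate and score_candidate functions are hypothetical stand-ins rather than any provider's real API; the point is only that the same model, given a larger inference budget, can search more candidate answers before responding, at the cost of latency.

```python
# Illustrative sketch of the inference-time compute trade-off discussed above.
# All names here (generate_candidate, score_candidate) are hypothetical stand-ins,
# not any provider's real API.
import random
import time

def generate_candidate(prompt: str) -> str:
    # Stand-in for one sampled completion from a model.
    return f"answer-{random.randint(0, 9999)} to: {prompt}"

def score_candidate(candidate: str) -> float:
    # Stand-in for a verifier or reward model that ranks candidates.
    return random.random()

def answer(prompt: str, compute_budget_s: float) -> str:
    """Best-of-n style sampling: keep generating and scoring candidates
    until the caller's compute/latency budget is exhausted."""
    best, best_score = None, float("-inf")
    deadline = time.monotonic() + compute_budget_s
    while time.monotonic() < deadline or best is None:
        candidate = generate_candidate(prompt)
        score = score_candidate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# "Fast mode" vs. "compute-intensive mode": same model, different budgets.
print(answer("Write a unit test for parse_date()", compute_budget_s=0.01))
print(answer("Write a unit test for parse_date()", compute_budget_s=1.0))
```

In a real system the scoring step would be a verifier, reward model, or self-consistency check rather than a random number, but the budget-versus-quality shape of the trade-off is the same.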

Yeah, definitely. Some people were joking online, I saw, that this is kind of the return of the old turbo mode on computers, where you're like, we want the computer to work harder. But actually it's a really interesting question about what users want the computer to think hardest about, which I think is kind of a counterintuitive question about what types of queries and what types of tasks demand that. It'll be really interesting to see. Kush, it's ideal to have you on the line as well, because I think one of the most interesting parts of the launch, and I think Chris was frustrated by it, he was like, come on, just give me access to the model. But in traditional OpenAI style they've said, well, no, we're being careful with the launch, and you can get access to the model if you're a safety or security researcher, and they're allowing people to request access to go and red team the model. Curious how you read that as someone who thinks about AI governance. Is that going to be the paradigm for how companies release models going forwards? Or is OpenAI kind of, almost, do you see this as marketing? Right, they're using the safety thing to say, give us just a few more months to iron out the loose ends.

I think it's a combination of both, actually, because there's this concept of the gradient of release; Irene Solaiman from Hugging Face came up with this, and it's kind of like, maybe take your time, the more powerful the model is, maybe the slower you need to roll it out. But I think it's a combination. So OpenAI gave their models to the UK AI Safety Institute for testing in advance as well. And some of this, I think, is just to be able to say that they did it. Some of it is to actually have some better safety alignment and so forth. So, yeah, I think it's here and there. And one other thing that's in this o3 release that they talk about is a new way of doing safety alignment. They call it deliberative alignment. And I think it's kind of interesting: they're saying that they are very much looking at an actual safety regulation, taking the text from that and training the model with respect to it, doing some synthetic data generation that follows along with what the policy says. That's something we've been doing for a while as well. Last year we published a couple of papers we call Alignment Studio and Alignment from Unstructured Text and so forth, and I think those sorts of ideas are carrying through. The new part is, again, the fact that this is spending a lot of time on the inference side, then thinking again and again about, am I meeting those safety requirements or not.
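Below is a rough sketch of the policy-text-to-synthetic-data idea Kush describes, under stated assumptions: this is not OpenAI's deliberative alignment pipeline or IBM's Alignment Studio code, and model_generate is a hypothetical stand-in for a call to any teacher model.

```python
# Rough sketch only: policy text -> synthetic alignment training examples.
# model_generate() is a hypothetical stand-in, not a real API.
from typing import List

POLICY = "Refuse requests for instructions that enable physical harm."

def model_generate(prompt: str) -> str:
    # Stand-in: in a real pipeline this would call a teacher model.
    return f"[synthetic response conditioned on: {prompt[:60]}...]"

def synthesize_alignment_examples(policy: str, seed_prompts: List[str]) -> List[dict]:
    """For each seed prompt, ask a teacher model for a response that explicitly
    reasons about the written policy, yielding (prompt, response) pairs that
    could later be used for supervised fine-tuning."""
    examples = []
    for prompt in seed_prompts:
        instruction = (
            f"Policy: {policy}\n"
            f"User request: {prompt}\n"
            "Write a response that first checks the request against the policy, "
            "then answers or refuses accordingly."
        )
        examples.append({"prompt": prompt, "response": model_generate(instruction)})
    return examples

data = synthesize_alignment_examples(POLICY, ["How do I pick a lock?", "Summarize this article."])
print(len(data), "synthetic training examples")
```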

And as both Chris and Kate said, right, the more time you're spending over on the inference side, what should you be thinking about? What should the model be thinking about? And I said this in the last episode of the year: I think that extra thinking is going to be for governance quite a bit. So I think this is where it's going to play, and I'm excited to see. Maybe I'll sign up too and do some of the safety testing. Yeah, I think there are kind of two interesting things in what you said. I mean, I think one of them is that the model to date feels like it has been: you release the model, but then you also say, we guarantee safety by releasing safety models, right? Granite has done this. And maybe, Kate, this is a question back to you: how much do you think that's kind of just almost historically provisional? This is just what we kind of have to do right now because we're still working out the kinks on making the models themselves safe. I guess, in the future, one argument is that the models are just kind of safe out of the box, in a way that doesn't separately require another model that monitors outputs and does the safety work. Do you think that's the case? Or do you think this kind of bifurcated architecture is going to be what we'll see going forwards?

Well, first, I'd be careful: I don't think anyone can guarantee safety, no matter what we release. Right. But I do think we're going to continue to see more and more of these kinds of safety guardrails being brought intrinsically into the model through the new types of alignment that Kush mentioned. That does not mean, though, that we shouldn't also have additional layers of security and safety that provide an independent check on model outputs. So I don't see that going away. I think it's always going to be a "yes, and," right? Let's continue to add more and more layers, not scrape away some of these layers, put it into the model, and now you've got one model, you're all set and done.

Very interesting. Kush, maybe the other thing I'll pick up on from what you said, before we move to the next topic, is that you're basically talking about inference as being almost like this kind of fixed budget of time. And you're basically asking, what do you want to have the model spend its time on: thinking about the problem, or thinking about whether or not the responses are safe or consistent with a safety policy? And I'm modeling my internal Chris here, who probably would be like, you're spending some of that time on trying to make it safe; could it just solve the problem? And I guess I'm kind of curious whether that will become a lever over time, where the user can almost specify, I need 10% of your time spent on safety, 90% of the time on solving the problem, or otherwise. Or does that actually kind of open up a whole other world in some ways?

Yeah, it does open up a whole new world. I mean, I wouldn't say that I would want to spend a lot of time on this sort of safety deliberation either. But I think the fact that they're calling it deliberative kind of speaks to something: deliberation is meant to be a discussion among lots of different viewpoints and this sort of thing. I don't know if that's actually what will happen, but that's something I would want to happen, so that different viewpoints, different sorts of perspectives can be brought into these different policies as well. Because in democratic sorts of settings you do want deliberation, you do want minority voices to be heard as well. But I'm not sure that's exactly what they mean by deliberative. Absolutely.

Chris, I see you're shaking your head, so I want to make sure I'm not putting words in your mouth. I honestly think safety is super important. But I want the models quicker, so, you know, do what you need to do. And I want the models to be fun, so don't lobotomize them, you know what I mean? We don't want to do harmful stuff, but at the same time, come on, I want to play with the models. Let me play.

Chris basically wants everything. Tim, you're also kind of assuming OpenAI is going to give us the choice of how we want the model to spend that inference time compute, and I don't think that's the direction they're heading. I think they've got some clear regulatory guidelines they're trying to meet and performance issues they want to make sure are addressed. I don't see them handing over the keys to the kingdom, so to speak, to let us take these models for our own joyrides. Yeah, no, I think that's for sure. Right. And I think there's a bunch of interesting questions that are sort of empirical questions, right? Like, how much does safety inference lead to better outcomes? How much of this is a mutually exclusive pie versus one where you can get a little bit of both? How much is this going to be defined by the regulator? How much can be defined by the user? A lot of things to pay attention to, I think, going into 2025.

So I'm going to move us on to our next topic, which is the release of DeepSeek-V3. This is sort of an interesting announcement, because the production team and I were kind of wrapping up at the end of the year, and we were like, nothing's going to happen in the last few weeks of the year. And of course there was the o3 announcement, which was huge, and then also similarly big was the announcement of DeepSeek-V3. So this is an open source model coming out of China that shows incredibly good performance on a lot of the benchmarks that most models are evaluated against. And I think there's a lot of interesting things to talk about here, but maybe the first one, which I'll throw to Chris, is this claim that the DeepSeek team is making, that they were able to basically build this incredibly performant model for way lower cost than you would expect.

And I think a lot of the commentary online, and one of the things it made me think about, is that there's so much in the economy of AI that is based on the idea that it's just really expensive to get really high-performance models. But this almost seems like the cost curve might be collapsing faster than we think. I don't know, Chris, maybe that's a little bit too optimistic, but maybe I'll just throw it to you. I think it's kind of interesting what they've done. So they have put a lot of cool techniques into the pre-training side of things, things like multi-token prediction, and then they were better at kind of load balancing of tokens, etc., and how they route. So there's a lot of things they did in training that brought the cost down, and I think they were doing kind of mixed precision as well. So there was a lot of good stuff they did there. I think what I would say, though, is that, back to the earlier point about inference time compute and kind of pre-training, I wonder at what point we maybe stop obsessing over the pre-training side of things for models and actually are able to fine-tune those models and have that community of fine-tuning existing.

And I think that's going to be more interesting, especially in the world of agents. Happy New Year, I'm the first person to say agents on the podcast, so thanks, Chris. Thanks. You're welcome. I think that's more interesting, and as we move more towards inference time compute, I think that will become important there. But it is really impressive what they did, actually, for the cost of the model and how long it took them to train it. Honestly, they did a great job, and there's going to be more innovation in that space. I still think pre-training, though, is hugely inefficient, because you're really just saying, here's the entire text of the Internet, go learn from it. I honestly think that's probably an innovation I would hope would change in 2025. And the way I think about it is, if I think about the Internet, it almost has a knowledge graph anyway. And I wonder if, actually, during that training process, if we brought a little bit of the structure in the knowledge graphs into the pre-training process, then a lot of those training elements might come out a little bit quicker and better. I don't know, I mean, I'm just sort of guessing here. But I think there's a lot more innovation to do in pre-training. So hopefully, alongside inference time compute, we're all going to be running around doing that. But I'm hoping that that focus on pre-training doesn't go away. So really good job to the DeepSeek team for continuing to innovate. Yeah, I remember when I worked a lot more closely with pre-training teams, I thought it was very interesting.

At least among the nerds, at least among the engineers, what was very interesting was that pre-training was the high-prestige part of the organization, right? Like, you're running the rocket launch of AI, and then fine-tuning is something that we do afterwards. But I think all the inference stuff, and all the stuff that we're seeing, kind of points to this shift in the cultural capital within these companies, where it's like, oh, where all the action right now is really happening is after the pre-training step.

And I guess, Chris, almost what you're proposing is that maybe at some point the pendulum swings back, because it's like, okay, there's all of this kind of innovation still to be done on the pre-training side, but we're just not there because of the hype cycle. It's going to swing back and forward, back and forward, back and forward, and you're going to see that. Right? Because you're going to get to the point where you go, you know, the pre-train isn't good enough to do what we need to do, so therefore we're going to use the smarter inference time models to get better data to train the pre-trained models, and that's going to become more efficient, and then we're going to do the same on fine-tune. And that pendulum is going to swing and swing, because you're going to keep hitting kind of limits in one area, and you're going to go back to the earlier stage, like the pre-train, to try and fix that, and you're just going to go back and forward, back and forward. So that pendulum is going to swing all the way through 2025, buddy. Yeah, definitely.

Kate, any thoughts on this? I mean, as someone who works with a team on open source AI, I assume something like DeepSeek is a big marker in some ways, a big way to start 2025. Yeah, and I agree with Chris, the team did an incredible job. But in terms of the cost, I don't know the full details of what data was or was not used in the model. My hypothesis is they are using data that was available online that cost a lot more than $5 million to generate. So I don't know that that total cost estimate actually reflects the fully burdened cost of the model. I suspect that they, like many model providers, are leveraging all of the data that's now been posted and shared online, which actually is only possible because others have invested so much money in creating larger models that can be used to then generate that data, which, kind of to what Chris was saying, can be taken back into training. So what I'm really interested in with the DeepSeek model, aside from that, is that it's a mixture of experts architecture, which is really interesting.

So the model is 600-plus billion parameters, but at inference time only about 40 billion parameters are active, meaning it can run much more efficiently than even a Llama 400-plus billion parameter model. So I think that's where we're going to see a lot more innovation happening in 2025: really digging into how we make these architectures more efficient, how we activate the right parameters, and fewer parameters, at inference time to still drive performance without having to pay the entire cost of running 600-plus billion parameters.
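Here is a toy back-of-the-envelope sketch of why a sparse mixture-of-experts model "pays for" only a fraction of its parameters per token, as Kate describes. The numbers below are made up purely to echo the rough shape mentioned in the conversation (roughly 600B total, roughly 40B active per token); they are not DeepSeek-V3's actual configuration.

```python
# Toy illustration (not DeepSeek's actual architecture) of why a sparse
# mixture-of-experts model with hundreds of billions of total parameters
# only activates a small fraction of them per token at inference time.

def active_params(experts_per_token: int, params_per_expert: float,
                  shared_params: float) -> float:
    """Parameters actually used for one token: the shared layers plus only
    the experts the router selects, not every expert in the model."""
    return shared_params + experts_per_token * params_per_expert

# Hypothetical numbers, chosen only to echo the rough shape discussed above.
total_experts = 256
experts_per_token = 8
params_per_expert = 2.3e9   # hypothetical
shared_params = 20e9        # hypothetical (attention, embeddings, shared layers)

total = shared_params + total_experts * params_per_expert
active = active_params(experts_per_token, params_per_expert, shared_params)

print(f"total parameters:  {total / 1e9:.0f}B")   # ~609B
print(f"active per token:  {active / 1e9:.0f}B")  # ~38B
```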

Yeah, that's really interesting. Kush, from a governance standpoint, this is an interesting story as well. I think there's certainly a vision among some folks which is, well, we just pass the laws in the US, and all the big AI companies are in the US, of course, and so that's how we govern AI. But this is really a different world, right? A law passed in the US is not going to change what the DeepSeek team is doing. Is governance possible in this world? Because it sure seems like you are seeing so much AI progress everywhere that governance becomes a real question. Yeah, I mean, we talked about this before the show started, that there are these core socialist values that are required of any generative AI in China. It's a law that's been around for more than a couple of years now, and DeepSeek has to satisfy those.

So I mean, those are things that are going to be around. And I think the fact that all of these different AI safety institutes from different countries are forming a network, they're convening, they're figuring things out together, is a great sign. I think AI governance needs to be a worldwide activity. There's no special thing because of one country or another country. And the more we can bring everything into harmonization, the better it'll be. Yeah, I think that'll be one really interesting bit. I think there was a thinking, sort of maybe a few years back, which is, we're going to use law and regulation to do this. Kush, I guess kind of what you're suggesting is a world where it's a little bit more sort of technical experts. It looks a little bit like ICANN, right, in terms of how we govern the web, where technical experts meet and establish standards, and it's kind of a voluntary protocol more than anything else. Do you think that's how things are going to go? Yeah, I think that's how it's going to go.

So in February, there's a meeting in Paris where all of these safety institutes are coming together. So I think they'll come up with a plan, they'll figure out some codes of practice and all of these things. So that's where I think things are headed. Chris, you started this episode by talking a little bit about how you switch between different modes of OpenAI, right? Where you're like, okay, we're going to use o1 for this, we're going to use o1 Pro for this. Do you do that kind of switching across open source and closed source at all? You do? Okay, yeah. No, I do that a lot with different models. So the Llama models, for example, they've got such personalities. So if I'm doing any kind of writing stuff, then I tend to run to the Llama models. The Granite models I use quite a lot as well; I use them a lot for RAG-type scenarios, because they're really good in that case. And also, if I'm pulling factual information, then I really want to be sure where the data's been coming from, so I tend to lean on Granite in those cases. For coding, I tend to lean on o1. And I have a lot of fun, actually, with some of the Chinese models we were talking about; I have a lot of fun with the Qwen models at the moment. They're doing some great stuff, in the same way as kind of DeepSeek is.

So I think you're going to just use different models for different cases, because some models are good at certain language translations, some models are good at writing tasks, some are really good at code. And then the smaller models, for example, are especially good for low latency, especially for agents. I said agents again. Exactly. If you've got different agents, you want to run them on the smallest possible model that's going to perform the task that you need. So I think we're in this world where we are just going to use a lot of models. If I again talk about 2025, I hate to say this, but I think we're going to stop talking about models so much towards the end of the year, maybe more, because you're going to be caring about the tasks they're doing. Here's a language translation agent. Here is an agent that's going to write me unit tests. I do care about the model, but I'm going to care more about the task that it's performing. And then, coming back to the security and governance things for a second, I think that's where governance starts to become really hard. Right.

Because if you've got very small models, like an 8 billion parameter model, and it's got access to tools, and you've got it being orchestrated over the top, you know what, you can get into a lot of trouble very, very quickly with a tiny model and do some really interesting things. And I'm just not sure, governance-wise, that you're going to be able to do a lot about that. So I think, as much as we talk about the large models in governance in 2025, actually, I think we're going to start to hit the challenges of people doing interesting things with agents on the really tiny models. Yeah. You're saying almost that we'll be able to govern the biggest companies and the biggest models, but that might not matter, is kind of what you're saying. Is that right? I think so, yeah. I guess, Kush, do you want to respond to that? As someone who focuses his time on thinking about AI governance, I guess Chris is effectively saying maybe it's just not sustainable over time. Yeah. So I agree. And I'll say agents, number three for the episode.

This is really bad. This is becoming a meme, because people are going to just start throwing it out for no reason. But Chris, yeah, I think when there is tool use, when there are autonomous actions, that's where governance really becomes interesting. We've talked a lot over the years about trustworthy AI, and it wasn't really like trust was a part of it, but really trust is needed when something is going to be acting autonomously, because you don't have the ability to control it or monitor it and these sorts of things. And that's really where trust is needed. So the more volatile, the more uncertain, the more complex the settings these things happen to be running in, and so forth, yeah, that's exactly where governance is the hardest, and I think where a lot of the innovation is going to happen.

Before we move on to the final topic, Kate, maybe I'll turn it to you. I thought it was very interesting; I had never really thought about that switch. I've heard about, oh, I'll do o1 Pro versus o1 preview, but the switch from open source to closed source I think is pretty interesting. Maybe a final question before we move on to the last topic: do you think right now open source has any specific capability advantages over closed source, or is that not even the right distinction here? I thought it was very interesting that Chris was like, oh, actually, some of these open source models just have better personality. That's kind of an interesting outcome in some ways. Yeah, I don't see it so much as an open versus closed source question. I think different models are going to have inherently different strengths and weaknesses. And so if you only limit yourself to closed source, or closed source from one provider, you're going to miss out on that whole suite and being able to pick and choose the best model for the best task. Ultimately, that'll be the dream in the future: if someone sends me an AI-generated email, I'm like, yeah, you're probably relying on Granite, I know what that sounds like. So the last segment we want to focus on today is a sort of interesting, smaller bit of news that popped up at the end of last year, but I think it's a fun one, particularly as we get into 2025.

If you don't know him, Gary Marcus is a long-time skeptic of all things AI. I think for every successive wave, Gary Marcus is there saying it's never going to work, and the current revolution in AI is no exception. He's been a very big skeptic about the degree to which LLMs can get us to, quote, true intelligence, and we can talk about what that means. But interestingly, he set up a kind of official public bet with a gentleman by the name of Miles Brundage, who used to do policy at OpenAI. He's independent now. And basically what the bet asks is, where is AI going to be a few years from now? It sets up a set of, I believe, 10 different kinds of tasks that AI could or could not take on. And there's a lot of variation here, but a lot of them pertain to whether the model can produce world-class versions of XYZ. So I think one criterion is: will an AI produce a world-class movie score, script, or other kind of creative work? And I think these bets are useful because they force folks to put their money where their mouth is, and also to specify what it is they mean when they say that a model is going to be truly powerful in capabilities going forwards.

And I guess I wanted to get the view of this group. You've seen the Twitter/X post announcing this. Kate, maybe I'll turn to you. I mean, is this bet a useful way of thinking about where AI is going, or do you think it's just more Twitter noise? I thought it was an interesting thing to think through. When I was looking through the different questions, ultimately, if I look at the different items in that bet, the ones that stood out to me the most were the assertions that hallucinations would basically be solved by this year. And I think that's one of the biggest reasons why, personally, I actually wouldn't take that bet. I don't think hallucinations are going to be solved. I think if you look at the model architecture, even with o1 and reasoning, my hypothesis is it's still a transformer model trained on a vast amount of Internet data that's being called many times in many different ways, with reasoning and search. But I think there are still some fundamental problems around hallucinations that, unless we really change the type of data that we train on, the volume of data that we train on, and the architecture of these models, are not going to go away overnight or be something we can necessarily just incrementally cure ourselves of. So I personally wouldn't take the bet, but I thought it was a useful framing to think through where we thought we were going to be.

Yeah, for sure. Kush, how about you? Would you have taken the bet on either side? I guess, yeah, I think the authorship question is an interesting one. So I mean, that's what they're kind of going for: can this be an Oscar-winning screenwriter, a Pulitzer Prize-winning author, and this sort of stuff? And I'm going to take us in a little bit of a different direction. So, I mean, the fact of it is that people have been coming up with all these analogies for LLMs, like a stochastic parrot, or a DJ, or a mirror of our society, or these sorts of things. But I think that's the wrong way to look at it. So about 65 years ago there was this book that came out called The Singer of Tales by Albert Lord, and it was all about oral narrative poetry. So these bards who are singing about heroes and this sort of stuff, and they compose the language as they're singing it; it's not like they write it beforehand, and they use formulas and all sorts of tricks to be able to do this. And I think that's exactly what these LLMs are. And in that sort of construct, there is no sense of authorship.

It's just that they're part of a tradition. And so you would never think that Homer deserves a Pulitzer Prize for the Odyssey, or that Ved Vyas deserves a Pulitzer for the Mahabharata. I mean, this is just a tradition that's going on, and that's, I think, the right way to think about LLMs. So the question is not the right question. And even if you think about it, again, going very historical and philosophical, you had this guy, Michel Foucault, who asked, what is an author? And the answer, or the discussion that he had, is that the only reason we even thought of authors is because lawyers needed someone to blame when there were some bad ideas out there. So I think that's the same thing. An LLM is not an author, and we shouldn't really be asking for that sort of thing.

And I think it actually touches on what Kate said as well, which is basically: do these criteria for the bet assume a certain direction for AI that might not actually be the most important thing around AI, or even an important aspect of really powerful AI systems? Like, it may not turn out in the end that we really need to solve hallucination, or it may not really turn out in the end that the big impact of AI is that you have the Pulitzer Prize-winning AI that generates a novel completely from scratch. That's kind of interesting. Yeah. I don't know, Chris, maybe you haven't had a chance to jump in just yet. Curious what you think about all this.

Oh, I think the test is totally stupid, in my opinion. And the reason is, I looked down the list of 10 items and I don't think I'm capable of doing any of those 10 items. So if I'm not capable of doing the 10 items, you know, is it unfair to think AI is going to be able to do that within a year? I mean, Tim, how are you doing on your Pulitzer Prize-winning novel? Is it going well, or are you oscillating? It's going very well. Thank you, grazie. Anyone here, any programmer on the planet, you know, have you been able to hit 10,000 lines of code, bug free, first pass? Come on. I think you're asking a lot.

I was like, the only one I think I could maybe do is the video game one. I don't know when to laugh at the right moment in movies, you know; you just need to ask my wife that. It's just like, why are you laughing? I was like, that thing over there, right? And am I able to describe the characters without hallucinating? No, we all hallucinate. We make up little subplots that are going on in our heads in these movies. So I don't think it's a bad thing, but I think you're asking a lot of LLMs to be able to do that, you know, even putting that as a test for 2025. And, you know, yeah, maybe AI will be able to achieve three or four of these things. I just don't think it's the right time to be asking those questions.

Well, I don't know. We just came back from holiday breaks, where at least I got to take a step outside of the Cambridge tech bubble, where everyone is really deep into this technology, and hear folks talk about AI. You know, I have a family member who calls it "AI machines." There are a lot, I think, of misconceptions about what AI can do and what it's going to be useful for. And so putting it in terms that everyday folks can understand, people who watch movies and read books and aren't necessarily living and breathing the technology, and helping show that, no, X, Y, and Z are not going to be possible, you're thinking about this the wrong way: I think it is helpful to have that type of discussion and discourse.

I think we take for granted a lot that not everyone is living and breathing this the same way that this excellent panel is on generative AI. Yeah, I'll guarantee you that the average person is not waking up thinking, should I use o3, or is o1 better? Those distinctions are not anything a normal person is thinking about. But yeah, I think that's a good point. Right. I mean, I think part of it is that there's a dream that all this AI becomes kind of superhuman at some point. And I think, Chris, maybe to respond to your comment, there's kind of an effort to say, what would that look like? And I guess, yeah, maybe that does really miss the point in some ways. I also think it's a really good indication of how quickly our expectations have adjusted around the technology. Right? Had you asked me four years ago, could it do all these things, could it just write an email, I'd be like, that's ridiculous. And now, basically, the expectation is world-class, Pulitzer Prize-winning, just because the baseline is very normal to us now. So it's, I guess, an indication of the rising expectations around all of this stuff.

Yeah. Just coming back to DeepSeek for a second, I think one thing we didn't talk about is the culture at DeepSeek. So there was an interview with their CEO that was making the rounds after DeepSeek-V3 came out; the interview was from November. And I think the cultural aspect of how they developed this thing is really interesting. They really followed this sort of geek way. So Andrew McAfee had this book, The Geek Way. It's been very popular within IBM circles, actually; our CEO has been reading it, telling everyone to read it. And it's about doing things fast, being open, letting everyone contribute, being very scientific about things, trying to prove them out, not having hierarchies, and all of that stuff. And that's exactly how DeepSeek is doing it. And I think we can learn a lot from it. We're just a little bit too encumbered, even though we want to be doing things the same way. So how do other companies innovate in a rapid fashion in the same way? I think that's maybe something to learn as well.

Yeah. One of the debates I have with a friend of mine is, there's, what is it, I think it's called Conway's Law, the idea that you ship your org chart. And that has kind of interesting implications in the world of AI, where it's just like, well, are all of these AIs going to basically, in some ways, reflect the companies that create them? And the reason why certain models are more chatty is that this is just, in part, a reflection of all the people in that organization. It has interesting connotations if you think of Chris's point about pre-training, and how pre-training has been the focus and kind of the most prestigious team to join. Right? That's right. Yeah. There's a joke, because we have a mutual friend who works at. And we're like, it's Claude. He's Claude. It's very funny to just see it play out in practice. Well, that's great. So let's leave it there.

Kush, great thought to end the episode on and for us to start 2025. Kush, Kate, Chris, as always, incredible to have you on the show, and thanks to you all for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will be here next week with another episode of Mixture of Experts.

ARTIFICIAL INTELLIGENCE, TECHNOLOGY, SCIENCE, AI GOVERNANCE, DEEP LEARNING, OPENAI, IBM TECHNOLOGY