ENSPIRING.ai: Meta's Joe Spisak on Llama 3.1 405B and the Democratization of Frontier Models - Training Data

The video explores the adoption of open-source models in the AI and tech industry, highlighting insights from Joe Spisak, director of PM for generative AI at Meta. Joe discusses the launch of Llama 3.1 405B, its capabilities, and the future prospects of AI in various applications. The conversation covers the advantages of open-source ecosystems and the strategic decisions that Meta has taken to foster innovation and advance the capabilities of AI models.

The dialogue further delves into the changing landscape of AI model commoditization, where models like Llama 3.1 and OpenAI's GPT-4o mini enter the scene. Joe points out that while AI models are becoming commodities, the true value lies in the data, applications, and interactions with users. The video discusses the potential of integrating these state-of-the-art models into existing company infrastructures and startups, advising that open-source models offer flexibility and control over data and user interaction.

Main takeaways from the video:

💡
The future of AI model development involves the integration of open-source models.
💡
Commoditization of AI models is inevitable, with real value shifting to data and applications.
💡
Open-source models provide flexibility and ownership, essential for startups and established companies alike.
💡
Meta’s advances in AI are driven by a desire to unlock global innovation, as evidenced by strategic partnerships and research investments.
💡
Scaling and data are crucial elements in improving AI models, alongside supporting global languages and multilingual functionalities.

Key Vocabularies and Common Phrases:

1. commoditize [kəˈmɒdɪˌtaɪz] - (verb) - To turn products or services into commodities that have standardized and interchangeable features, often resulting in reduced differentiation among competitors. - Synonyms: (standardize, generalize, uniformize)

We're speaking to Joe just two days after the Llama 3.1 405B launch, and we're excited to get his view on questions like: where is the open source ecosystem headed? Will models commoditize, even at the frontier?

2. distillation [ˌdɪstəˈleɪʃən] - (noun) - The process of refining or extracting essential data, information, or qualities, often used to improve smaller models in AI. - Synonyms: (refinement, purification, extraction)

And we kind of had that plan all along, because when you have a big model, you can use it for improving small models or just distillation

3. lingua franca [ˈlɪŋɡwə ˈfraŋkə] - (noun) - A common language used between people whose native languages are different, allowing effective communication. - Synonyms: (bridge language, common language, trade language)

Going back to PyTorch, we saw it as this lingua franca, a bridge to this area of high entropy.

4. ecosystem [ˈiːkoʊˌsɪstəm] - (noun) - A complex network or interconnected system, often referring to technological or business environments. - Synonyms: (network, system, environment)

We're speaking to Joe just two days after the Llama 3.1 405B launch, and we're excited to get his view on questions like: where is the open source ecosystem headed?

5. inference [ˈɪnfərəns] - (noun) - The process of deriving logical conclusions from premises known or assumed to be true. - Synonyms: (deduction, conclusion, reasoning)

And unfortunately, you don't actually have access to those LoRA weights at the end of it; you're forced to use their inference.

6. multilingual [ˌmʌltiˈlɪŋɡwəl] - (adjective) - Able to speak or use several languages, referring here to the capability of AI models to operate in multiple languages. - Synonyms: (polyglot, bilingual, trilingual)

Multilingual - I mean, we're a global company, so we released more languages, with many, many more to come.

7. saturate [ˈsætʃəˌreɪt] - (verb) - To fill or supply to the utmost level, often implying that no more can be absorbed or added. - Synonyms: (flood, fill, permeate)

And this goes back to evals: what is your eval? Because we're starting to saturate evals.

8. summarization [ˌsʌməraɪˈzeɪʃən] - (noun) - The act of expressing or covering the main points briefly and comprehensively. - Synonyms: (abridgment, synopsis, outline)

You can basically do things like local summarization.

9. iteratively [ˈɪtərətɪvli] - (adverb) - Repeatedly improving or developing something by making successive versions, reflecting a cycle of trial and error. - Synonyms: (repeatedly, cyclically, progressively)

We'll continue to push models out, because we want our products to iteratively improve as well.

10. computation [ˌkɒmpjʊˈteɪʃən] - (noun) - The act or process of determining something mathematically or digitally; calculation. - Synonyms: (calculation, reckoning, figuring)

So GQA - that improves inference time and kind of helps address the quadratic attention computational challenge.

Meta's Joe Spisak on Llama 3.1 405B and the Democratization of Frontier Models - Training Data

If I was a founder right now, I would absolutely adopt open source. It forces me, though, to look at the engineering complexion of my team and think: I'm going to need people doing LLMOps, and things like data fine-tuning and how to build RAG. There are plenty of APIs that allow you to do this, but ultimately you want control. Your moat is your data; your moat is your interaction with users.

We're speaking to Joe just two days after the Llama 3.1 405B launch, and we're excited to get his view on questions like: where is the open source ecosystem headed? Will models commoditize, even at the frontier? Is model development becoming more like software development? And what's next in agents, reasoning, small models, data and more?

Joe, thank you so much for being here today. We're so excited to have you, just two days after the Llama 3.1 405B launch. It's an incredible gift to the ecosystem. We'd love to learn a little bit more about what specific capabilities you think the 405B is particularly unique at, especially in comparison to the other state-of-the-art models.

Yeah, I mean, we're beyond excited at Meta. This was something that I think a lot of us have been working on for such a long time - months and months and months. And we put out that nice little appetizer, I'll call it, in April with Llama 3. And I was actually like, are people really going to be that excited about these models? And the response was through the roof. Everyone was excited, but they really didn't know what was coming. And so, yeah, we kind of had to hold that back for a while, keep it to ourselves, and then build up for this launch.

And the 405B is a monster. It's a great model, and I think the biggest thing we've learned about the 405B is that it's a massive teacher for other models. And we kind of had that plan all along, because when you have a big model, you can use it for improving small models, or just distillation. And that's how the 8B and 70B became the great models that they are in terms of capabilities.
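
(For readers unfamiliar with distillation: the idea is to train a small "student" model to match a large "teacher's" output distribution. Here is a minimal sketch of the generic technique in PyTorch - the teacher/student models in the comments are hypothetical stand-ins, and this is an illustration of the concept, not Meta's actual pipeline.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the student matches the teacher's
    softened output distribution via KL divergence."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Hypothetical usage: teacher is the big model (frozen), student is small.
# teacher_logits = teacher(input_ids).logits.detach()
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits)
```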

We listened to the community, and we listened obviously to our own product teams, because we had to build products for Meta. Long context was one of the biggest things people wanted. We have much longer context internally, even, than what we released. But we saw the use cases start to build up. Multilingual - I mean, we're a global company, so we released more languages, with many, many more to come, because obviously Meta has billions of people on the platform, in hundreds of countries.

And so I think those are, to me, table-stakes things, but they're really done well in these models. We spent a lot of time in post-training on our different languages, improving them, and on safety - they're really, really high quality.

So we don't just pre-train on a ton of data and say, look at us, we're multilingual. We actually did a lot of work in our SFT phase - supervised fine-tuning - and a lot of safety work.

I think one of the coolest things that I'm excited about - well, there are a couple things I'm excited about, but one is tool use. Like, I think, oh my God, zero-shot tool use. This is going to be crazy for the community. I mean, we show a few examples, like calling Wolfram or Brave Search or Google Search, and it works really great. But zero-shot use is going to be a game changer.

The ability to call a code interpreter and actually run code, or build your own plugin for things like RAG and other things, and have that really be state of the art - I think it's going to be a really big game changer. And I think just the fact that we released the 405 itself, and we changed our license so you can actually use the outputs - that was a big deal.
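
(The tool-use pattern Joe describes generally works like this: the model emits a structured tool call, the application executes it, and the result is fed back for a final answer. A hedged sketch follows - the tool names, JSON shape, and `model_generate` interface here are hypothetical stand-ins, not the actual Llama 3.1 tool-calling format.)

```python
import json

# Hypothetical tool registry; real implementations would call actual APIs.
TOOLS = {
    "brave_search": lambda q: f"<search results for {q!r}>",
    "wolfram_alpha": lambda q: f"<computation result for {q!r}>",
}

def run_turn(model_generate, messages):
    """One agent step: if the model emits a JSON tool call, execute it
    and feed the result back; otherwise return the plain text answer."""
    reply = model_generate(messages)          # assumed model interface
    try:
        call = json.loads(reply)              # e.g. {"tool": "brave_search", "query": "..."}
    except json.JSONDecodeError:
        return reply                          # normal text answer, no tool needed
    result = TOOLS[call["tool"]](call["query"])
    messages.append({"role": "tool", "content": result})
    return model_generate(messages)           # model composes the final answer
```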

That was a big discussion. We had many meetings with Mark on that, ultimately landing on a place where - you know, this was a pain point for the community for so long. With these closed models, it's: I can't use the outputs, or maybe I can use them, but maybe I'm using them slightly unscrupulously or whatever. We actually are encouraging people to do it.

I'm sure that was a tough decision to make. Walk us through the things that you had to consider in actually making that leap to open up licensing in that way. Yeah, licensing permissibility. Licensing is a huge topic in itself, obviously. We could probably spend the whole podcast talking about it. I don't want to, but we could.

I think we wanted, number one, just to unlock new things. We wanted the 405 and our Llama 3.1 models to differentiate, to give people new capabilities. We just looked at what people were really excited about in the community - not only in enterprise and products, but also in the research community, because we obviously have a research team, and we work with academia and we talk to folks.

I mean, you know, Percy Liang at Stanford texts me all the time saying, when are you going to release it? When are you going to release it? Can I use it? Can I use it? And Percy, you know, stay patient. But I think we heard them, and we knew what they wanted. And I think ultimately we wanted Llama everywhere. We wanted adoption, maximal adoption - really, the world using it and building on it. And I think Mark even, in his letter, put out "the new standard," or "standardize."

So I think to do that, you kind of have to enable stuff like that. You kind of have to unblock all these different use cases and really look at what the community wants to do and make sure that you don't have these kind of artificial barriers. And that's what the discussion really was.

And so actually, even beyond that, we started working with partners like Nvidia and AWS, and they started building distillation recipes and even synthetic data generation services, which is pretty cool. I mean, you can start to use those and actually create specialized models from it.

And the data - I mean, we know how good the data is, because we used it in our smaller models. It's really good, and it improves our models significantly.

So I'm going to pull on the open source thread a little bit more. Sure. I've read Zuck's manifesto; it was great. But I'm still trying to wrap my head around, like, what's in it for Meta? This is a massive investment in open source. In some ways, you're leaving a lot of money on the table, because you now have a state-of-the-art model that you're offering to everybody for free. And so I guess, is this an offensive move? Is this a defensive move? What's in it for Meta?

Well, first of all, our business model doesn't depend on this model to make us money directly. So we're not selling a cloud service; we've never been a cloud company. We've always worked, I would say, with a partner ecosystem, all the way back to the five years I was helping to lead PyTorch and the ecosystem and community we built around that. We never built a service. We probably could have in some way, but it would have been weird.

Going back to PyTorch, we saw it as this lingua franca, a bridge to this area of high entropy - kind of a weird way to say it, but there's all this innovation happening. How do we build a bridge to it and actually be able to harness all that innovation? And the way to do that is to be open; it's to get the world building on your stuff.

And that ethos has kind of carried over into Llama. If you look at PyTorch, that was a huge way for us to pull in - you know, at the time when we really started working on PyTorch in earnest, it was computer vision and CNNs and all that, if you remember those old times. We actually would see these architectures come constantly.

People would write code and publish it in PyTorch, and we'd take it internally and evaluate it. People would open source models and put them out on model zoos, and we'd evaluate them, and we'd see just how quickly the community was improving things. And we'd actually leverage that, especially for integrity applications, where we released Hateful Memes and some of these other datasets.

We just saw the improvements week over week, month over month. And it was built on something that we were using internally, so it was very easy for us to just take it inside. So I think Llama is definitely similar in that regard: when academia and companies start to red team these models or try to jailbreak them, we want people to do that to our models so we can improve. And I think that's a big reason.

And it's like, be careful what you wish for, right? Of course. But it's the same with Linux. Linux is open source, the kernel is open source, and things are much more secure when they're transparent and bug fixes can be pushed faster. And so that helps us a lot. There's also the angle that we don't want this to turn into a completely closed environment.

I think, just like today, if you look at Linux and Windows, in my opinion there's room for both. There's room for closed, room for open, and people use them depending on what they need and the applications. I think there's got to be a world of open models, and I think there's going to be a world of closed models. And I think that's totally fine.

What was the primary argument against open sourcing? Was there one? I mean, there were definitely competitive concerns we talked through. You know, do you want to give away your technology, put it out there, and all that. And I think we're less concerned about that because we're moving really fast.

If you look back - I mean, I've been at Meta close to, what, six or seven years now. And in the last year or so: we had a Connect launch. We released Purple Llama last December. We released Llama 3, then 3.1. Before that, we released Llama 2 in July; Llama 1 was in February. So just think about the pace - incredible.

The pace of innovation coming out of our team and our company is just crazy right now. So I'm not too worried about it. I don't think we're that worried about it. So I'd love to move into your personal views on the broader ecosystem.

I think a lot of the questions people have center around what happens to the value of all these models, especially as Meta open sources more of them at the state-of-the-art level. With Llama 3.1, with OpenAI launching GPT-4o mini - what is your view: do models commoditize, even at the state-of-the-art frontier?

This is a great question. I think if you look at just even the last two weeks, 4o mini is a really, really good model. I think it's something like fifteen cents per million input tokens, sixty cents out. So it's incredibly cheap to run, but it's also an excellent model. They've done an incredible job in distilling and getting to something that's really performant yet really, really cheap.

So I think Sam is definitely pushing on that. And then if you look at what we've done in the last week, pushing out what I would say are pretty compelling state-of-the-art models across the spectrum. I do think it's rapidly getting to a place where the model is going to be a commodity. I think there's this frontier of data, where we can certainly gather data from the Internet, we can license data, but at some point there is some frontier of limitations that I think we're all going to have.

And this goes back to where our conversation began - the bitter lesson of data and scale and compute. Is that enough? It's probably not quite enough, but if you have enough of both compute and data, you can get a first-order approximation of the state of the art without anything else, is what we've seen.

So I do think the models are commoditizing; I think the value is elsewhere. And I look at Meta and I look at our products and what we're building - that's honestly where the value is for us. It's Meta AI, it's our agents, it's all the technology that we're going to put into Instagram and WhatsApp and all of our end products, where we actually are going to monetize, where we're actually going to add real value.

The model itself, I think, will definitely keep innovating - new modalities, new languages, new capabilities. That's what research is: pushing the frontier in emerging capabilities, which we can then leverage in products. But the models are definitely pushing in that direction.

If that's the case, and all these existing companies that have massive distribution and wonderful applications already out in the wild can just adopt these state-of-the-art models, what advice would you give to the whole wave of new startups trying to make it out there - either building their own models, or using other state-of-the-art models and building applications on top?

Yeah, I mean, there are definitely some model companies - companies that are pre-training foundation models - and it's expensive. I can't say how much Llama 3 cost, but it was very expensive, and Llama 4 is going to be even more expensive. And so to me, given the state of play, it doesn't make that much sense, if I was a startup, to try to go do pre-training. The Llama models are absolutely incredible as foundations to build on.

And so if I was a founder right now, I would absolutely adopt open source. It forces me, though, to look at the engineering complexion of my team and think: I'm going to need people doing LLMOps, and things like data fine-tuning, and how to build RAG.

There are plenty of APIs that allow you to do this, but ultimately you want control. Your moat is your data; your moat is your interaction with users. And you may want to deploy these things onto a device at some point and have a mixed interaction or something. You might want simpler queries running on your device, with very low latency interactions with your users. You might want to split, and have a more cloud-based approach for more complex queries, more complex interactions.

And I think the open source approach gives you that flexibility. It gives you the ability to modify the models directly. You own the weights, you can run the weights, you can distill them yourself. There are going to be distillation services that allow you to take your weights and distill them down to something smaller. That's pretty awesome.

We're just now seeing the beginnings of that. In my mind, control matters a lot, and so does ownership of the weights. There are a lot of API services where you'll fine-tune your model - you're bringing your own data, you're fine-tuning - and they use something called low-rank adaptation, or LoRA.

And unfortunately, you don't actually have access to those LoRA weights at the end of it; you're forced to use their inference. So you're like, hmm, let's see - I'm held hostage here. I've given my data, I don't have access to the actual IP that was generated from that data, and now I'm forced to use their inference service. That's not a good deal.
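
(LoRA, as referenced here, freezes the base weights and trains a small low-rank update on top. A minimal sketch, assuming a generic PyTorch linear layer - owning the small A/B adapter matrices below is exactly the control Joe says hosted fine-tuning services withhold.)

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A x). Only A and B are trained, so the
    adapter is a tiny artifact you can store separately from the base model."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```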

So I think open source brings an inherent freedom that that approach doesn't. What do you think of Mistral? Mistral Large was announced, I think, maybe a day after Llama 3.1. What do you think of them? And I guess, more broadly, for everybody at the frontier: is everyone pursuing the same recipes, the same techniques, the same kind of compute, scale, data, etcetera? Is everyone going to be roughly similar at the frontier, or do you think you guys are doing something very different?

So, first of all, Mistral - I mean, amazing team. It was one of my old teams at FAIR; they were working on theorem proving and AI in mathematics. So Guillaume and Tim and Marie-Anne and the team - they're incredible. We were just talking, fun banter, last night. I mean, this was one of the scrappiest teams that I've ever worked with. I don't think the team ever slept.

So basically, by day - they probably sleep even less now - they would push the state of the art in AI and theorem proving, and we published some work on that, I think a couple of years ago now, geez. But by night, they were basically scrappily grabbing compute to train Llama 1. We were building large language models several years ago in FAIR, and that team was just really ambitious; they were working by night, and that's really where Llama 1 came from.

So the team is great. I think they're doing really good work. I think they're definitely challenged in that they're trying to open source models but also make money, and models like 4o mini are not helping them. And this is, I think, why they changed their license, for example, to a research-only license - which kind of makes sense, because they were open sourcing models and immediately their own ecosystem was competing with them in a lot of ways. They'll release a model and host it - use this model - but then you have Together and Fireworks and Lepton and all these companies that sometimes provide a lower cost-per-million-token offering.

So it's a really tough business right now. In terms of Large 2, I think it's a really good model, just on paper. I haven't evaluated it; we haven't looked at it internally yet. I think if you look at Artificial Analysis, they had it at just under the 70B model in terms of quality - but you blend a bunch of benchmarks to make that distinction.

But on paper it looks really good. We're going to evaluate it. For me, anyway, the more the merrier. The more models are out there, the more companies doing this, the better. We're not going to be the only one, and I think it's good that we're not the only one. And more generally, in the gen AI space, you wake up every single day and you expect something like this - you expect a model to be released or something groundbreaking to happen. And that's the fun of being in it.

Totally, totally. Do you think everyone at the frontier is comparable, though? Are you all pursuing comparable strategies? Yeah, this is actually a good question, because if you read the Llama 3 paper, which was, I think, 96 pages - lots of citations, obviously, lots of sharing, lots of contributors and core contributors - it was a detailed paper, and Laurens and Angela on the team spearheaded writing it. And I think that was one of the hardest things: developing the model was relatively easy compared to writing the paper. It was a lot of work putting that paper together.

I think if you look at Llama 3, there was a lot of, I would say, innovation that happened, but we also didn't take on a lot of research risk. So the primary thing we really did with Llama, and with the 405B especially, was pushing scale. We still used grouped-query attention, for example.

So GQA - that improves inference time and kind of helps address the quadratic attention computational challenge. We trained on over 15 trillion tokens. We did post-training, and we used synthetic data, which improved the smaller models quite a bit. We trained on over 16,000 GPUs in our training runs, which is something we hadn't done before. It's really, really hard to do that, because GPUs fail at that scale. I mean, everyone's like, oh, I'm just going to train on 100,000 GPUs - like, good luck, right.
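
(Grouped-query attention, mentioned above, lets several query heads share one key/value head, which shrinks the KV cache and inference-time memory traffic. A minimal sketch of the mechanism with illustrative shapes - not the Llama implementation itself.)

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    with n_kv_heads < n_q_heads. Each group of query heads shares one KV head,
    cutting KV-cache size by a factor of n_q_heads / n_kv_heads."""
    group = q.size(1) // k.size(1)           # query heads per shared KV head
    k = k.repeat_interleave(group, dim=1)    # expand KV heads to match Q
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Example: 32 query heads sharing 8 KV heads.
# q = torch.randn(1, 32, 128, 64)
# k, v = torch.randn(1, 8, 128, 64), torch.randn(1, 8, 128, 64)
# out = grouped_query_attention(q, k, v)   # shape (1, 32, 128, 64)
```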

You'd better have a really, really great infra team, a really great ML systems team. You'd better be ready to innovate at that level, because this is non-trivial. Everyone says it's easy, or says you can do it - it's non-trivial. So I almost look at Llama 3 as very similar to the GPT-3 paper.

If you ever talk to Tom - Tom Brown, now at Anthropic - there's a reason why Tom was the first author on that paper: a lot of the innovation was really scale. It was really, how do I take an architecture and push it as hard as we can push it? And that involves a lot at the ML systems layer and the infra layer - how do I scale the algorithm?

And so I think that was really the mentality we had with Llama 3 and Llama 3.1. And internally, obviously, we have a great research team - we have FAIR, we have research - and we're looking at lots of different architectures, MoE and other things. And so, who knows what Llama 4 will be? We have a lot of candidate architectures and we're looking at them, but it's kind of a trade-off.

It's a trade-off between how much risk you take on in research - and potentially how much reward, the ceiling, the potential improvements - versus just taking something that's relatively known, pushing scale, and getting that to improve even more. So ultimately this becomes a trade-off.

I think this is such an interesting point. I actually also think it makes Llama and Meta quite unique in the strategy they're taking. The words that you used yesterday were: is model development becoming more like software development? I'm curious to hear what you think. Unlike what many of the other labs have been doing, pushing more on the research, you guys have been focused on just executing on strategies that you know work.

Do you see that as representative of the continuing strategy as you extend Llama out - 4, 5, 6, 7, 8? And also, how do you think the other research labs, and maybe some of the other startups in the ecosystem, will react? Will they switch and veer a little bit more toward the strategy that you've been taking?

I mean, it's a really great question. We don't have all the answers, for sure, but somewhere in the middle is kind of where I see things landing. We'll continue to push on execution, we'll continue to push models out, because we want our products to iteratively improve as well. So we want Meta AI improving constantly.

So there's definitely a software engineering analog here that's happening, where you can imagine something like a Llama train: new features and new capabilities get on that train, and we have a model release. It's much easier when you start to componentize the capabilities, too. We're doing that with safety right now - you saw in the release that we shipped Prompt Guard and a new Llama Guard, and you can iterate on those components externally, which is great. Obviously the core model is much more difficult.

I do think we'll start to push on the research side as well, because the architecture is going to evolve. I mean, you've seen what AI21, for example, has done with their Jamba, and there's Mamba - everyone kind of thinks Mamba is a new architecture that could have promise. What's interesting, though, is that to truly understand the capabilities of an architecture, you kind of have to push the scale.

And I think that's what's missing right now in the ecosystem: if you look at academia - and there are a lot of absolutely brilliant people there - they don't have a lot of access to compute. And that's a problem, because they have these great ideas but no way to execute them at the level that's needed to really understand: will this actually scale? The Jamba paper and model was really interesting, and the benchmarks look great, but they didn't scale it beyond, I think, under 10 billion parameters.

So you're like, okay, what happens when we train this at 100 billion? Do you still see those improvements or not? And no one, at least outside of these labs, knows the answer yet. So I think that's one challenge.

So to me, we're going to get into this hybrid space: we are definitely going to push on architecture - we have a very, very smart and accomplished research team - but we are also going to keep executing. And when we get a recipe, we're going to push it to the limits, and we're going to release, or continue to release, more models on it.

But in parallel to that, we have to push on architecture. And I think it just makes sense, because at some point you're going to reach a theoretical limit, and the next breakthrough will need the architecture to evolve.

I see a little bit of an in-between. Obviously we're really good at execution - I think we're pretty good at execution - but we're also good at research, and we just need to marry those two. And it makes sense, because research and products are very different, right? One should be pretty deterministic - the product side - and one is inherently non-deterministic.

It's like, is this going to work? I don't know. It's a really big bet; if it fails, that's research. It should have a non-zero chance of completely blowing up in our face, and then we just need to go in another direction. But that's what research is. I'm curious about one branch where a lot of model research is happening right now: agentic reasoning.

And you all have announced really great results in reasoning. I'm curious, maybe at a very basic level, how do you define reasoning? And then, are you seeing reasoning fall out of scale during pre-training? Is it post-training? And is there a lot of work left to do on the reasoning side?

Yeah, reasoning is a bit of a loaded area. You could argue it's things like multi-step problems. And unfortunately, the best examples we have are the sort of semi-gimmicky ones - you know, Bob is driving the bus and he picks up so many people, those kinds of things. If you trawl LocalLlama, you'll see a billion of those, right?

But those actually force the model to take multiple steps to respond to you, to think through and respond logically. And coding actually really helps - you know, when you look at pre-training. So, to answer your question directly: reasoning improvements come in both post-training and pre-training.

So what we've learned - which now everyone says, of course this is the case - but definitely over the last year or so, everyone's learned that having a lot of code in your pre-training corpus really improves reasoning.

And when you think about it - of course, duh - code is step by step, very logical by nature. And if you incorporate a lot of that in your pre-training, your model will reason better. And then we of course look at examples in post-training and SFT to improve as well.

So we look at the pre-trained model. It depends on how you balance things as well, because you can balance how well your model reasons against how well it responds in different languages. Ultimately, in post-training, everything is a little bit of a trade-off. You can super-optimize for coding if you want to - we did that with Code Llama, and it was really great - but of course the model will suffer in other areas.

And so ultimately it becomes: what Pareto frontier of capabilities do we want to bring out, if it's a general model? I mean, ultimately it's a trade-off. Anyone can pick a benchmark or some capability and say, I'm going to super-optimize for it, and, by the way, I'm better than GPT-4.

Well, great. Anyone can do that. But is your model as generally capable as GPT-4 or Llama 3.1 or whatever? That, I think, is a different story. What do you think are the future levers to unlock reasoning, for anyone going forward?

The obvious answer is data. The more data - the more code and supervised data you can get - I think is the natural answer. I also think we need to find applications for how we define it, and that would help us. Once you start finding those killer applications, then you know where to focus - in terms of your data, exactly what you're solving for.

And this goes back to evals: what is your eval? Because we're starting to saturate evals. As a community, we tend to define a benchmark or a metric and just optimize the hell out of it, and it's great.

But then you actually look at the model in an actual environment and you're like, oh, well, that model has a better MMLU score - great. But how does it actually respond? Well, it doesn't respond as well, but it has a better MMLU score. And so I think we need better evals and better benchmarks that give us, I would say, a clear line of sight to actual interactions.

And I think the live - what is it called? The Abacus benchmark? LiveBench, I think it's called; I can't remember the name - is pretty good. I was looking at that. And of course LMSYS and Chatbot Arena - these are more natural. Still not perfect, but moving in the right direction: things that are closer to human interactions, versus a static dataset or static prompt set that's not that helpful.

I think once we start to find which reasoning use cases make sense, we're going to start to generate more data, and you're going to start to improve the model there. Hopefully that has, again, line of sight to a benchmark or an eval that actually feels like it improves an end product. And a lot of this depends on the end product, of course - what is my application?

Out of curiosity: within large research labs, coding and math have always been two primary categories for trying to unlock reasoning. In the startup ecosystem, we're now seeing more folks who really want to come from the math angle. Do you have a perspective on whether that has led to interesting unlocks?

I mean, the answer is yeah. If you look at our data, or at least our models, coding and math have been, I would say, the primary levers. Having more is obviously better, because math is also very logical and very stepwise. So you can see the pattern here.

The more data you have that follows that sort of pattern, the more your model is going to be able to reason. And you can see that in how models actually respond: if you ask them to respond and step you through their thinking process, they'll actually do that.

And some models do better than others. So anything like that, I think - scientific papers, too. We had some projects out of FAIR that trained on, like, arXiv papers. And not only code and math - pure mathematics - but also scientific papers: scientists are very logical in how they write things, how they go stepwise, how they create their charts and figures. I think we've seen that general scientific information helps as well.

So Galactic - sorry, Galactica - was our project. Yeah. Ross from the Papers with Code team led that. Still, in my opinion, one of the coolest projects ever. It got a lot of bad press, but wow, they were ahead of their time, in my opinion.

I'd love to talk a little bit about small models. Given the scale of capital and compute that many startups have, the 8B and 70B models are an incredible gift to the ecosystem. And it's funny that you called them appetizers at the start, because I think they're super powerful for that set, but they're also really powerful for a number of different applications where you want smaller models.

And so I'm curious to hear: what do you hope to see developers use the 8B and 70B models for, given that they are best in class for their size of model?

So it's interesting: when we released Llama 3 in April, we released an 8B and a 70B - the appetizers, as we call them. The 8B was actually better than the Llama 2 70B, by leaps. I had to look at the chart and ask, is this right? Is that really the case? And we're like, yeah, it really is. It was that much better.

What's the intuition for how that happens? I mean, it was more data - we had, what, seven x more data - and obviously we put a bunch more compute at it as well. So going back to compute and data, we're pushing on those. I think what we saw is, almost every generation - and the generations are accelerating - you start to see the benchmarks of a large model get pushed down into the smaller size regime.

The 70 becomes an 8. And internally we have models where that 8B-level capability fits in something much smaller than 8B, actually. We're starting to see really nice benchmarks on even smaller models.

So you continue to see the models improve at smaller scale. And that, I think, is because we're pushing the architecture, we're pushing scale, and we haven't quite saturated it yet. And I think that's really interesting.

For me, one of the biggest reasons a small model is useful is obviously on-device. Everyone loves to talk about on-device - Apple's talking about that, and Google has Gemma models and Gemini running on Android devices. So I think on-device makes sense. I think safety is kind of interesting, too, because we have our own internal versions of Llama Guard, which we orchestrate for our applications internally at Meta.

And today they're built on an 8B model, which is kind of expensive to run if you think about a safety model - it's essentially a secondary model. So internally we've been experimenting with much smaller models in that regard, and it creates efficiency and lowers latency, because really those models are just classifiers.

They're not really autoregressive, chat-like interfaces; they really just classify. The input - does this prompt violate some category in the taxonomy? And the output - when the model generates, does it violate, that kind of stuff. So you can actually push those even further.
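
(A hedged sketch of the classifier pattern Joe describes: prompt a small guard model with a policy taxonomy and read back a short verdict instead of a long chat response. The template, categories, and `guard_generate` interface below are hypothetical, not the actual Llama Guard format.)

```python
# Illustrative template only; the real Llama Guard prompt and taxonomy differ.
GUARD_TEMPLATE = """Task: Check whether the message violates the policy categories below.
Categories: {taxonomy}
Message: {message}
Answer 'safe' or 'unsafe', followed by any violated category."""

def is_safe(guard_generate, message, taxonomy="S1: Violence. S2: Hate."):
    """Classify one message. Because this is classification, not chat,
    the guard model only needs to emit a handful of tokens."""
    verdict = guard_generate(
        GUARD_TEMPLATE.format(taxonomy=taxonomy, message=message),
        max_new_tokens=8,
    )
    return verdict.strip().lower().startswith("safe")
```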

I think there are also really interesting cases on-device where, when you think about privacy, you want your data to stay on the device. You can think about a RAG-like architecture on-device. So you have data - even your chat history on, say, WhatsApp or other apps.

You can imagine that model having access to that data, aggregating it, and then running some type of mini vector database, where you're using RAG and doing your fuzzy search or fuzzy matching with your small model, and that becomes its own system in itself.

And you can basically do things like local summarization. I get so many text messages - hey, summarize my last 15 messages, please, because I've been in meetings and haven't looked at my phone. That's super useful. And then I don't have to send data up to the cloud or anywhere else.
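
(A minimal sketch of that on-device RAG-plus-summarization idea: embed messages locally, retrieve by cosine similarity, and summarize with a small local model. `embed` and `local_model` are hypothetical stand-ins for an on-device embedding function and LLM; nothing here leaves the device.)

```python
import numpy as np

def retrieve(query_vec, message_vecs, messages, k=15):
    """Tiny on-device 'vector database': cosine similarity over locally
    stored message embeddings, returning the top-k matching messages."""
    sims = message_vecs @ query_vec / (
        np.linalg.norm(message_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [messages[i] for i in top]

def summarize_locally(local_model, embed, messages,
                      query="summarize my recent messages"):
    """Retrieve relevant messages, then ask the small local model to summarize."""
    message_vecs = np.stack([embed(m) for m in messages])
    context = "\n".join(retrieve(embed(query), message_vecs, messages))
    return local_model(f"Summarize these messages briefly:\n{context}")
```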

So there are those kinds of use cases, I think, where small models are actually going to be really compelling. And then for super complex queries, obviously you have a big model in the cloud that can always service those. But for many things, on-device, or even at the edge and on-prem, these small models can actually do pretty well.

You talked about scaling up compute and data as the two fundamental vectors to improve performance. There's been a lot of chatter about how we are going to hit a wall - or maybe we're not going to hit a wall - on data, and maybe synthetic data is the answer, et cetera. I'm curious about your perspective on that. Is there an impending wall of cheap, accessible data that we're likely to hit? What do you think? How do we scale beyond that?

I think we've shown with this release that synthetic data helps a lot. In pre-training we trained on 15 trillion tokens, give or take. In post-training, we generated millions of annotated synthetic examples, a lot of them generated by the 405B. We obviously paid for annotations as well. I do think synthetic data is a potential path forward.
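
(A minimal sketch of synthetic post-training data generation as described: a large teacher model drafts responses to seed prompts, and a filter keeps only the accepted ones. `teacher_generate` and `judge` are hypothetical stand-ins - real pipelines use far more elaborate filtering and human annotation.)

```python
def make_sft_examples(teacher_generate, seed_prompts, judge):
    """Generate candidate (prompt, response) pairs with a large teacher
    model, keeping only responses that pass a quality filter."""
    examples = []
    for prompt in seed_prompts:
        response = teacher_generate(prompt, temperature=0.7)
        if judge(prompt, response):          # e.g. a reward model or heuristic check
            examples.append({"prompt": prompt, "response": response})
    return examples
```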

We know now - and the proof is in the models. I do think data is going to be a challenge at some point for us. And this is why I think companies are licensing a lot of data these days to get access. OpenAI is licensing data; we're certainly licensing data. I think having access to services that generate data to improve models is important. That inherently is an advantage for a lot of companies - Google has YouTube, which I'm sure is valuable to them - which implies that bigger companies have an advantage, which is nothing new.

We've been talking about this for a long time. In terms of a data wall - I don't know. I mean, we're not there yet. I would say, let's schedule this for a year out and see where we are next year. I'll save my calendar for exactly one year from now - and Meta AI. But let's talk in a year and see where we are. We haven't hit it yet; we're still scaling, we're still gathering a lot of data, we're generating data, and our models still continue to improve.

So, yeah, let's close it out with some rapid-fire questions. Sure, sounds great. In what year do you think models will surpass the 50% threshold on SWE-bench? Good question. If I've learned anything, it'll be faster than whatever answer I give you, because as soon as you zero in on any benchmark, people are going to go figure it out. So I don't have an answer. It'll be fast.

One of the questions we have been asking people is: in what year will an open source model surpass the other models on the frontier? And we have to retire that question now, thanks to you all. I mean, it's true. We're almost there. I think 405B is incredible; it's definitely in that class. Yeah, absolutely. Which is incredible.

Will Meta always open source Llama? I mean, I think Mark's pretty committed. You saw his letter. We've open sourced for years and years now - back to PyTorch, to FAIR, to the Llama models. This isn't something that's a flash in the pan for the company. The company's been committed to open source for a long time. So - never say never, but the company and Mark are really committed.

Amazing. Joe, thank you so much for being here today, and also for all the work that you're giving to the entire ecosystem. I think the entire AI community is very grateful for all the work you've done pushing out Llama, and for the advancements to come. It's a huge team. Check out the paper.

Look at all the acknowledgments. We spent all of yesterday reading it. We need, like, the Star Wars scrolling text of all the contributors, because it was an incredibly big team. I was thinking the same. My hat's off to the team. This absolutely took a village to get Llama out there, and I'm so proud and excited to represent the team here. So thank you.

Artificial Intelligence, Innovation, Technology, Open Source Models, Model Commoditization, Meta's Llama, Sequoia Capital