ENSPIRING.ai: Zapier's Mike Knoop launches ARC Prize to Jumpstart New Ideas for AGI - Training Data
The discussion revolves around the potential future of AI and primarily focuses on the ARC AGI Prize, which aims to push the boundaries of AI capabilities by measuring intelligence as the efficiency of acquiring new skills. Mike Knoop, a co-founder of Zapier, alongside François Chollet, established the ARC Prize as a competitive benchmark for AI, one that uniquely challenges AI systems to adapt and solve novel tasks without relying on memorized data or brute-force methods.
The video highlights how AI products, especially Zapier's, which prioritize simplicity for non-technical users, have progressed to perform numerous tasks and increase efficiency remarkably. It also discusses the distinct challenge and potential of the ARC AGI initiative, stressing the necessity of new ideas in AI development, as current models are limited by their reliance on existing data and memorization.
Key Vocabulary and Common Phrases:
1. mythical [ˈmɪθ.ɪ.kəl] - (adjective) - Imaginary or unreal to the point that it seems almost legendary. - Synonyms: (legendary, fanciful, imaginary)
Right now, I think what I see happening is there's sort of this mythical story of a very bad outcome once we get to superintelligence.
2. empirical [ɪmˈpɪr.ɪ.kəl] - (adjective) - Based on observation or experience rather than theory or pure logic. - Synonyms: (observational, experiential, factual)
It's not grounded in empirical evidence.
3. interwoven [ˌɪn.təˈrəʊ.vən] - (adjective) - Blended or combined together intricately. - Synonyms: (intertwined, integrated, mingled)
And the way that you guys were sort of early to that and the way that it's now kind of interwoven in the product has been really interesting to watch.
4. parrot [ˈpær.ət] - (verb) - To repeat something in a mechanical way without understanding it. - Synonyms: (imitate, mimic, echo)
...the ARC Prize, which is one of the most unique benchmarks in AI that measures a machine's ability to truly learn things intelligently versus just parroting patterns in the training data.
5. synthesize [ˈsɪn.θəˌsaɪz] - (verb) - To combine various components or ideas into a coherent whole. - Synonyms: (combine, integrate, consolidate)
These are things that emerge very early in childhood development — and use those core knowledge priors, recombine them, and synthesize them into new programs in order to solve tasks, with exacting accuracy, that the system has never seen before, never been exposed to in its training data.
6. unfettered [ʌnˈfet̬.əd] - (adjective) - Free or unrestricted; not confined. - Synonyms: (unrestricted, unrestrained, free)
That's what you're going to see. You're going to see, basically, the application layer of AI get amazingly good at accuracy, consistency, low hallucination rates, which is going to allow us to use it in a much more unfettered way, in a much more trusted way.
7. algorithm [ˈæl.ɡəˌrɪð.əm] - (noun) - A set of rules or a procedure for solving a problem, often used by computers. - Synonyms: (formula, procedure, process)
They have to tear it all down, rethink new algorithms, new architectures, new ideas — of course, new training data as well, often new amounts of scale — in order to beat that next game.
8. hallmark [ˈhɔːl.mɑːrk] - (noun) - A distinguishing feature or characteristic, especially a positive one. - Synonyms: (distinction, characteristic, emblem)
And that degree of novelty, that degree of never having been seen before, is what makes Arc a really, really strong benchmark for trying to distinguish between this more narrow AI that can be beaten largely through memorization techniques, and AGI, which is a system that can very, very rapidly and efficiently acquire skill at test time.
9. steward [ˈstjuːərd] - (verb) - To manage or look after something responsibly. - Synonyms: (manage, oversee, supervise)
But we're going to need — the main thing that is missing is user confidence in the exacting nature of what it can do and what it can't do. This is what classic core Zapier gives us. It's a deterministic engine to execute automation.
10. innovate [ˈɪn.ə.veɪt] - (verb) - To introduce new methods, ideas, or products. - Synonyms: (create, invent, pioneer)
And that means we'll never have AI systems that can invent and discover and sort of innovate alongside humans and really help pull forward and push forward the frontier in, I think, a lot of really interesting ways, like understand more about the universe and then discover new pharmaceutical things and discover new physics, discover how to build AI
Zapier's Mike Knoop launches ARC Prize to Jumpstart New Ideas for AGI - Training Data
Right now, I think what I see happening is there's sort of this mythical story of a very bad outcome once we get to superintelligence. It's a very theoretically driven story; it's not grounded in empirical evidence. It's basically based on reasoning our way to this outcome. And I think the only way that we can really, really effectively and truly set good policy is by looking at what the systems can and can't do, and then regulating — making decisions at that point — about what they can or can't do. I think anything else is cutting off potentially really, really good futures way too early.
We have with us today Mike Knoop, co-founder of Zapier. Mike has recently stepped up to co-found and sponsor the ARC Prize, one of the most unique benchmarks in AI, which measures a machine's ability to truly learn things intelligently versus just parroting patterns in the training data. We're excited to ask Mike for an update on how things are going with the ARC Prize two weeks in, and to hear his views on why we need radically different approaches and benchmarks to achieve true general intelligence. Mike, thanks for being here today.
So we're excited to talk about the Arc AGI initiative. Before we get into that, I'd love to spend a few minutes on your background at Zapier because I think Zapier has emerged as probably one of the best examples of what an existing application company can do with the power of AI. And the way that you guys were sort of early to that and the way that it's now kind of interwoven in the product has been really interesting to watch. Maybe can you just say a few words on what is Zapier and what has your approach to AI been at Zapier?
Yeah, Zapier is a workflow automation platform. We support 6,000 different integrations — everything from Salesforce to Gmail to basically any SaaS software you can imagine, we connect with it. I think the unique thing about Zapier is it's intended to be very easy to use for non-technical users. The majority of Zapier customers and users would not self-identify as programmers or as technical — as engineers or something like that. Even though I think of myself as an engineer and I find lots of interesting use cases for Zapier, the large majority of our users don't.
And I think that's what is quite special about Zapier, and why people tend to fall in love with it: the feeling of power and leverage that you get as a non-technical user, being able to have software do work for you, basically. I think that's, in an interesting way, the exact same promise of AI — that's what people want of AI. They want software that's just going to do more work for them. And so in many ways, I think the mission of Zapier and the mission and purpose of AI intersect.
And I've been, I guess — call me AI-curious — since all the way back in college. I gave a whole all-hands at Zapier, I can't remember what year, when the GPT-3 paper came out, and showed it to the whole company. So I'd been tracking and following along with the progress, but it really wasn't until January of 2022, when the chain-of-thought paper came out, that something really surprised me — because I thought I had priced in everything that AI language models could do up to that point.
And this idea of "let's think step by step" — this chain-of-thought technique, treating these language models as tools for reasoning instead of just a one-shot completion or chat engine — felt very special, and something I think most people didn't expect they could do, even though the technology had been out there for over a year at that point. That moment actually caused me to give up my exec team role — I was running all of our product engineering at the time — and I went back to basically being an individual contributor at the company, as an AI researcher alongside my co-founder Bryan. Happy to talk more about that journey.
But yeah, I think that's what caused Zapier to be relatively early in terms of AI. What are some of the things that you've put into the product that you're most proud of at Zapier in terms of AI features? At this point, I'd say there are probably two main places where we've gotten a lot of value from AI. The first is that over half the company now, at the individual level, uses AI on a daily basis. And I know this because we're actually measuring our own company's usage of Zapier's platform.
Over half the company is building zaps — building automations that use an AI step, like a ChatGPT step, in the middle, either to do content generation or data extraction over unstructured text. All sorts of really interesting use cases that we can talk about. In fact, one of the top internal use cases is probably getting us about a 100x labor enhancement rate.
Wow — that's been phenomenal. What is that? You've got to talk about that — a 100x improvement. Yeah, yeah, I mean, I think it's our personal high-water mark for what we've been able to achieve using AI internally, from an operational perspective. Zapier has these things on our website called zap templates. They're effectively recipes that help users figure out what Zapier can do and help them get started.
Historically, these templates have all been handmade, because making them requires a bit of right brain and a bit of left brain. They're very creative: you have to inspire the end user — the customer, the would-be user — with what Zapier could do for you. What's the outcome, what's the ROI you might get? Then there's also a very technical side: they have to be crafted and built as well. You have to map JSON fields from one integration app to another integration app to make sure it actually works. And together, that's what helps users get started and activates them into the product.
We had a backlog of maybe a million of these things that we knew we wanted to build but hadn't built yet, because they take so much effort. The rate of production for our contractors was about ten a day, up until last summer. And we had a person — a member of, I think, our partner marketing team, with a background, by the way, in freelance writing — who built a zap, a system of actually several zaps using OpenAI with some middle steps: an end-to-end system so that whenever a new integration got launched on Zapier, it would automatically try to figure out the most interesting zap templates that could be built.
It would write the inspirational use case behind each one — because we have millions of these things already today, there's lots of training data — and then also do the exacting field mapping. And we moved the human in that workflow from the do loop to the review loop. So instead of having those contractors actually generating them — thinking really hard about what the use cases should be and building the field mappings — now they're basically reviewing output from the system in a spreadsheet and saying: yes, this one is good, this one's bad, this one's good, this one's bad.
And the funny thing is, because the cost of producing these things is so low, we don't even try to fix the bad ones. We just throw them away and say: well, generate another batch and throw them on top. And so the rate of production is about 1,000 a day now. We've gone from ten a day to 1,000 a day per contractor, and we've been chipping away steadily at that backlog of a million while keeping up with the launch of new integrations on Zapier.
I think one of the main things that showed me was the place you want to look in a business if you're thinking about how to deploy AI: top of funnel or bottom of funnel. You want to get close to an important conversion rate for your business. And then, if you can identify any manual work your organization does that has high volume, where a human is doing the work, those are opportunities to introspect and say: okay, is there an opportunity to get that human out of the do loop, craft a system that can do the work, and put the human in the review loop instead — which is still quite needed at the maturity level of the technology today.
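To make that "do loop to review loop" shift concrete, here is a minimal Python sketch of the pattern as described — not Zapier's actual system. `generate_template` stands in for the AI generation step (in practice an LLM call inside a zap) and `human_review` for the spreadsheet review step; both names and the approval rate are invented for illustration.

```python
import random

def generate_template(integration: str) -> dict:
    """Stand-in for the AI 'do loop': draft a zap template for an integration."""
    return {
        "integration": integration,
        "use_case": f"Auto-generated use case for {integration}",
        "field_mapping": {"source_field": "dest_field"},
    }

def human_review(template: dict) -> bool:
    """Stand-in for the human 'review loop': approve or reject each draft."""
    return random.random() > 0.3  # pretend roughly 70% of drafts pass review

def produce_templates(integration: str, needed: int) -> list:
    approved = []
    while len(approved) < needed:
        draft = generate_template(integration)
        # Because generation is cheap, rejected drafts are simply discarded
        # and regenerated rather than hand-fixed -- the key economic shift.
        if human_review(draft):
            approved.append(draft)
    return approved

print(len(produce_templates("Salesforce", 5)))
```

The design point is economic: once generation is nearly free, it is cheaper to throw away bad outputs and regenerate than to repair them, which moves the human bottleneck from production to review.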
But it's still phenomenal from an ROI perspective. Are there any metrics you can share on the impact Zapier AI has had on the overall Zapier business? The biggest one today is we're just about to hit 10 million AI tasks per month — I think we're at a run rate of about 10 million AI tasks per month now. And I would love to be shown wrong, or corrected with examples, but I think at this point Zapier might be the biggest automated AI platform in the world, in the sense that there are a lot of researchers, entrepreneurs, and builders trying to build these agentic AI systems where the AI is working without a human in the loop.
And, yeah, I think at this point, with 10 million AI tasks a month, Zapier may be the biggest example of that in the world right now. Really cool. Can we talk about Arc AGI? Let's do it. Maybe start with a recap: what is Arc AGI, and why did you and François set out to establish this prize? Yeah, this was a follow-on to my AI curiosity. The reason I gave up my exec team role back in 2022 was that I wanted to know for myself: are we on path for AGI or not? It felt very important to know for Zapier's mission, but also, just as a human, I was very curious and wanted to know: is this going to happen? There's definitely some interesting scaling that's happening.
Is that sufficient to get to what I naively had in my head as this superintelligence AGI? Surprisingly, what I learned is the answer is no. I revisited — I actually got to first know François Chollet, who is my co-founder on ARC Prize; I first heard about him and got exposed to his research back during COVID, actually during 2020. He did another podcast where he was explaining his 2019 paper, On the Measure of Intelligence, where he tried to formalize a definition of what AGI actually is. I thought it was interesting at the time, but I kind of parked it, with lots of other stuff going on.
At Zapier, doing our own AI product building, I built my intuition for what language models could do but what the apparent limits were. I started getting more into AI evals and trying to understand where the limits were. What could we expect from a product-building perspective? Where were our products going to tap out? Where should we invest our engineering and research effort, versus just waiting for the technology to keep scaling and maturing? And the thing I found was that most AI evals were saturating up to human-level performance, and it was accelerating.
And when I went back to look at the Arc eval from 2019, I expected to see a similar trend, and instead what I found was basically the opposite. Not only had it not reached human performance yet, it was actually decelerating over time. This was really supremely surprising to me. And maybe it's worth defining — we use this term AGI, but what's the actual correct definition of it, right? There's a kind of popular definition in the world today. Actually, there are probably two schools of thought.
One school of thought that I see is that AGI is undefinable and we shouldn't even try. This is a quite popular perspective. And the other school of thought is that AGI is a system that can do the majority of economically useful work humans do. This was popularized by OpenAI and the Microsoft deal — it's actually in their deal together: once this is achieved, OpenAI retains all the future IP, which is very interesting. Someone else might get credit for coining that definition, but nonetheless, because of OpenAI's success, it has become accepted by a lot of people as a target and goal we should shoot for.
The challenge is — and I think it's a fine goal, by the way, and I think current model architecture may be within spitting distance of it — if it's the true goal, it probably says way more about what the majority of humans do for work than about what AGI actually is. François defines AGI as the efficiency of acquiring new skill. That's it. Here's a quick thought experiment you can use to grok this. We've had AI systems for many years now, five-plus years, that can beat humans at games like Go, chess, poker, even Diplomacy. The fact remains that you cannot take any one of those systems that was built to beat one of those games and simply retrain it, with new data and new experience, to beat humans at another game.
Instead, what researchers and builders and engineers have to do is go back to zero. They have to tear it all down, rethink new algorithms, new architectures, new ideas — of course, new training data as well, often new amounts of scale — in order to beat that next game. And yet this is in complete contrast to how you two learn, right? I could sit you both down here and teach you a new card game in probably about an hour. I could probably show you a new board game and get you up to proficiency within a couple of hours. And that fact, I think, is highly representative of what makes you generally intelligent.
It's your ability to very quickly and efficiently gain skill in order to accomplish some open-ended or novel task that you've never encountered before in your life. And that's what's special about Arc AGI. So Arc AGI is an eval that tries to take that definition and actually measure it. And it was designed specifically to resist the ability to memorize the benchmark, which is very different from most other AI evals out there. Every task is completely novel, and there's a private test set that no one's seen outside of a handful of people who have taken it to verify that all of the puzzles are solvable.
And that degree of novelty, that degree of never having been seen before, is what makes Arc a really, really strong benchmark for trying to distinguish between this more narrow AI that can be beaten largely through memorization techniques, and AGI, which is a system that can very, very rapidly and efficiently acquire skill at test time. What is the definition of efficiency? I imagine there's a compute component, a data component. What's the definition of "efficiently" in "efficiently acquire new skill"? Yeah —
I'll probably do a bad job trying to summarize François's research; if you want to read more, by the way, his On the Measure of Intelligence paper is the source of truth for all of this stuff. I think it's really, really good. And before I get to the answer, one other important thing to see is that Arc has been unbeaten since 2019, and I think its endurance to date is probably the strongest set of empirical evidence that the underlying concepts of the definition are correct — which is why I think it's worth paying attention to, and why it's such a special eval and special set of research.
So I think François Chollet would describe efficiency as the ability of a system to translate from core knowledge priors to being able to attack the search space or task space around it. A weakly generalizable system is only going to be able to take on tasks very near and adjacent to the core data, the core knowledge the system was trained on, whereas a highly generalizable system is going to have a much larger field of tasks and novelty that it's able to attack and effectively do with a small set of training data.
That's what we hope to see with the eventual solution for Arc AGI as well. The goal is to get 85% on the eval; today's state of the art is, I think, 39% as we record this. And I think what's special is, if someone can actually beat Arc at 85%, that would mean you've created a computer program that can be trained on this very small set of core knowledge priors — things like goal-directedness, objectness, symmetry, rotation; things that emerge very early in childhood development — and use those core knowledge priors, recombine them, and synthesize them into new programs in order to solve tasks, with exacting accuracy, that the system has never seen before, never been exposed to in its training data.
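For readers who haven't seen an Arc task, here is a toy Python illustration of the general shape of the public tasks: a few demonstration input/output grid pairs plus a held-out test input, where grids are small 2-D arrays of integers standing for colors. The task below (rule: rotate the grid 90 degrees clockwise) is invented and far simpler than real Arc puzzles.

```python
# A toy task in the spirit of the public ARC-AGI JSON format:
# "train" holds demonstration pairs, "test" holds inputs to solve.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]},  # a solver should produce [[3, 0], [0, 0]]
    ],
}

def rotate_cw(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

# A candidate program counts only if it reproduces *every* demonstration.
assert all(rotate_cw(p["input"]) == p["output"] for p in task["train"])
print(rotate_cw(task["test"][0]["input"]))  # [[3, 0], [0, 0]]
```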
That would be a really, really important thing, particularly at the application layer, where the number one problem today is hallucination — accuracy and consistency — which results in low user trust, which limits deployment of real AI. Right now, you have some peculiar rules for competing for the prize. I think there's a limit on how much compute you can use; you can't use the Internet; I don't know if you can use GPT-4 and closed models. Why put those limits in place? Yeah, so the two big ones — you're right.
The competition runs on Kaggle, and Kaggle enforces no Internet access and limited compute. Specifically, you get one P100 for 12 hours. And no Internet means you can't use frontier closed models that are available through APIs, like Claude Sonnet or Gemini or GPT-4o. Maybe I'll take them in order: I think the compute one is maybe more interesting. The reason for the compute limit is to target efficiency first and foremost, because if there weren't any compute limit at all, then you could simply define AGI as a system that can acquire skill with no degree of efficiency attached to it.
And if that were true, it would mean the system could brute-force basically every possible program, think through every possible future outcome, generate every single possible archetype of puzzle, and use that to win the challenge. And we know that's not actually what happens in human general intelligence. You can read more in François's paper about why, but the way I think about it is: you can introspect, even on yourself taking the Arc puzzles, and see that when you're trying to solve one of these, you're not brute-forcing every possible transformation, trying to recognize the pattern and apply it to the test.
Instead, you're using your intuition, your prior experience, to try to identify maybe three, four, five possibilities for what the pattern is, and then you check them, right, in your head. And I think this shows that the sort of efficiency humans have is not brute-forcing every possible solution and checking it — it's actually a real degree of efficiency. So the compute limit forces researchers to reckon with that definition. Now, it is worth acknowledging that we don't know exactly how much compute is necessary to beat Arc, and I expect we're going to keep upping the compute bar over time.
For example, we already more than 2x'd it from prior versions of the competition. Prior competitions usually got somewhere between two and five hours to run on the GPU; we bumped that up to twelve. Interestingly, all of the state-of-the-art techniques right now are maxing out that twelve-hour runtime as well. So I do expect we'll continue to increase it over time, but I think it is an important tool for forcing the generality out of the full solution that we're looking for. And then no Internet is a little more of a practical reason.
We're trying to reduce cheating, reduce contamination, reduce overfitting, and not leak the private test set — and largely just increase confidence that, when we reach the 85% grand prize mark, someone has actually beaten Arc, and we can say with some sense of authority and confidence that that's a true statement. One of François's and my goals for ARC Prize is to establish a public benchmark of progress towards AGI — or maybe of the lack of progress towards AGI — and have it be a trusted public tool that policymakers, students, entrepreneurs, venture capitalists, employees, everyone can look at to get a sense of how close or far we are from this important technology existing, and then use that insight to help drive more AI researchers to work again on exploring new ideas — something that has unfortunately fallen out of favor in the last several years as LLMs have taken off.
What have you seen, or maybe what do you expect to be true, about the efforts that are successful — or more successful — toward Arc AGI, that makes them different from what we're seeing out of the frontier models and the big research labs? Yeah. So it gets into the details of how an LLM works, because the bet most frontier AI research labs have been taking is: we're going to scale up language models, and more scale, more data is going to get us to AGI. And even though that's the dominant story, I actually don't think it's what most of the labs believe internally. Most of them are working on new ideas.
So I think there's an interesting story there. But it is definitely in their interest to promote a very strong narrative of: scale is all you need; don't compete with us; we're just going to steamroll you; nothing to see here. Yeah — I think there are true competitive dynamics that have emerged in the market that are unfortunately shaping a lot of attention, investment, and effort away from exploring new ideas. And if it is true that new ideas are needed — which I believe it is, and I think Arc AGI and ARC Prize show that at least some new idea is needed — then, due to the competitive dynamics that have emerged in the market over the last couple of years, we're kind of headed in the wrong direction, right?
All the frontier research has basically gone closed-source. The GPT-4 paper had no technical detail shared in it. The Gemini paper had no technical detail shared on its long-context innovation, things like that. And yet this is in direct contrast to the history of how we even got here today. The innovation — the chain of research that led from Ilya's sequence-to-sequence paper at Google, out to Jacobs University, back to Google, then to Alec Radford, and back to Ilya at OpenAI — is a six- or seven-year chain of research that only happened because of open sharing, open progress, and open science.
And I think it's a bit unfortunate that we don't really have that right now — again, somewhat just due to the market dynamics, the commercial success of language models forcing a lot of that frontier research closed. One of the goals of ARC Prize is to help balance a lot of those things. You were asking about the difference — yeah, what does that look like? What you said resonates, because it seems like a lot of the foundation model companies are going down very similar, somewhat clearly defined paths.
And I'm sure that internally there's all sorts of work being done to find the next breakthrough in architecture, but in terms of what's working today, they're all fairly similar paths — and I imagine all based on LLMs. Yes. And I imagine that what works for the sake of Arc AGI is going to be a little bit of a different shape. Yep. And I'm wondering if you're starting to see what shape that may take, and whether you have a sense of what may be different about this more general architecture than what we're seeing out of the foundation models.
Great. So I think a useful shortcut for how to think about language models is that they are effectively doing very high-dimensional memorization. They're able to train on and memorize tons and tons of examples and apply them in slightly adjacent contexts. I don't want to dismiss language models too much, because I think they are very special — something very magical that has lots of economic utility. Zapier is an existence proof of that fact alone. So I don't want to throw them under the bus too much.
I think there are some really good things they have unlocked as the technology goes, but there are limits to them, and the limits are in not being able to effectively leverage training data — composing it or combining it at test time — to go attack and accomplish novel tasks never seen before in the training data. That's what Arc shows: this is a skill these language models don't possess. I think it's maybe useful to look at the history of the high score so far, and maybe where we expect it to go.
When the eval was first introduced in 2019-2020, there was a small Kaggle competition that ran to get a baseline on it; that was 20%. From 20% to 30%, the techniques that worked were, effectively, researchers crafting a handcrafted domain-specific language by looking at the puzzles that were part of the public test set. There are two test sets — a public set, and then the private one that the state of the art is measured on. They looked at the public test set and tried to infer and write down programs, in Python code or C# or whatever.
What are the individual transformations that you do in your head to go from one puzzle to the next? They called this a DSL, and then they wrote a brute-force search to try to search through all possible permutations and combinations of those sub-programs in order to find the general pattern and then apply it at test time. And that got to about 30%.
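As a hedged sketch of what that handcrafted-DSL-plus-brute-force approach looks like in Python: define a few primitive grid transformations, exhaustively enumerate compositions of them, and accept the first program consistent with every demonstration pair. Real DSLs had dozens of much richer primitives; the three here are purely illustrative.

```python
from itertools import product

def identity(g):  return [row[:] for row in g]
def rotate_cw(g): return [list(r) for r in zip(*g[::-1])]
def flip_h(g):    return [row[::-1] for row in g]

PRIMITIVES = [identity, rotate_cw, flip_h]  # a toy DSL

def search_program(train_pairs, max_depth=3):
    """Brute-force every composition of primitives up to max_depth."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for fn in combo:
                    g = fn(g)
                return g
            # Accept only programs consistent with *all* demonstrations.
            if all(program(p["input"]) == p["output"] for p in train_pairs):
                return program
    return None

# Demonstrations whose hidden rule is a 180-degree rotation:
train = [{"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
         {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]}]
prog = search_program(train)
print(prog([[3, 0], [0, 0]]))  # [[0, 0], [0, 3]]
```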
What's gone from 30% up to close to 40% now is a slightly different technique. This is Jack Cole, and his approach is effectively using a code-based open source language model and doing test-time fine-tuning. He has some pre-training done on the code-gen model, and then at test time he takes the puzzles — the novel puzzles that have never been seen before — permutes variations of them, and trains this code-gen-based model to write the program, find a program that captures the pattern, and then applies it at test time. And that's gotten to 40%. I suspect — I bet — we have the ideas in the air already to get to the 50% mark, maybe even a little beyond, without a lot of new innovation.
I bet just ensembling — combining these existing idea sets that have already worked toward Arc — probably gets you about halfway. To get to the 85% mark or beyond, I think the ultimate solution — at least to solve Arc — probably looks more like a deep-learning-guided DSL generator: instead of hand-coding and handcrafting the DSL ahead of time by trying to infer from the public test set what those sub-programs would be, you need some way to generate that DSL dynamically by looking at the puzzles in real time, and to learn from past puzzles and apply that toward future puzzles.
This is another important thing humans do when they're going through the Arc set. Sometimes the first or second puzzles are actually a little trickier, because you're orienting yourself: what am I doing? What task am I doing? What does the possible solution space look like? As you get further into the task set, they tend to get a little easier, because the space of possible transformations is just finite, so you start recognizing patterns. And then you combine that with some sort of deep-learning-based program synthesis engine.
Something that does not brute-force all possible programs for how to combine those DSL primitives, but instead uses some sort of deep learning approach to shape which program traces you try to generate and test against the pattern. It goes back to this human introspection of how we take on the puzzles: we're not brute-forcing all possible programs in our head. Instead, we're trying to identify just a handful of likely candidates and then testing those deterministically in our head to find the one that works.
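And a hedged sketch of how the deep-learning-guided variant would differ: rather than enumerating everything, a learned model ranks candidate programs, and only the top-ranked few are checked deterministically against the demonstrations — mirroring the "handful of likely hypotheses" strategy just described. The scorer below is a placeholder for what would actually be learned; nothing in it comes from a real Arc solution.

```python
from itertools import product

def rotate_cw(g): return [list(r) for r in zip(*g[::-1])]
def flip_h(g):    return [row[::-1] for row in g]

PRIMITIVES = [rotate_cw, flip_h]

def prior_score(combo, task=None):
    """Placeholder for a learned model estimating how likely a candidate
    program is to solve this task. A real system would condition on the
    grids themselves; this stub just prefers shorter programs."""
    return 1.0 / len(combo)

def guided_search(train_pairs, max_depth=4, budget=5):
    # Rank candidates with the (learned) prior, then deterministically
    # verify only the top `budget` of them instead of brute-forcing all.
    candidates = [c for d in range(1, max_depth + 1)
                  for c in product(PRIMITIVES, repeat=d)]
    candidates.sort(key=prior_score, reverse=True)
    for combo in candidates[:budget]:
        def program(g, combo=combo):
            for fn in combo:
                g = fn(g)
            return g
        if all(program(p["input"]) == p["output"] for p in train_pairs):
            return program
    return None

train = [{"input": [[1, 2]], "output": [[2, 1]]}]  # hidden rule: mirror
prog = guided_search(train)
print(prog([[3, 4, 5]]))  # [[5, 4, 3]]
```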
It's really interesting that code generation and program synthesis underlie all of the methods you just talked about. And there's something very special there: program synthesis is very general — it allows you to actually get closer to that definition of generalized intelligence that you mentioned at the beginning — and it's very exacting. I think this is one of the reasons why the solution to Arc AGI is going to be useful very quickly.
So we were talking about this before: there's a history of toy AI benchmarks over the last 10-15 years that looked like Arc. There were games, there were puzzles, and they really never amounted to much — they all got handily beaten as scale emerged, and they really didn't add to our understanding of how to build useful AI systems.
One of the common questions I've been fielding the last couple of weeks is: what's different about Arc? Isn't that likely to just happen here again? And I think the reason we're likely to see something much more useful — assuming we get a really good solution to Arc from the first grand prize win — is the number one problem at the application layer. And we see this with Zapier too, with our new AI bots that we launched a couple of months ago.
It's been surprising to me how that has gotten adopted by our users. The way I kind of describe it is: there are a lot of concentric rings of use cases that you can use AI automation for. And what we're seeing is people restricting the use cases for the AI bots — where they're fully automated, totally hands-off — to the use cases where there's a very low need for user trust. Or, let me say that a different way:
if it goes wrong, it's not catastrophically bad. So they deploy it for use cases like personal productivity or team-based workflow automation — things where, if it's wrong, or it's right only nine out of ten times, or it takes me maybe a couple of days of working with the system on prompt engineering to steer it toward 95-99% reliability...
That's acceptable, because the risk of being wrong is just quite low. In order to get much higher up and expand the number of concentric rings to moderate-risk and then high-risk deployment scenarios, where we want these systems working autonomously, we're going to need — the main thing that is missing is user confidence in the exacting nature of what it can do and what it can't do. This is what classic core Zapier gives us. It's a deterministic engine to execute automation.
So once you build it and set it up, it's going to do the exact same thing every single time. But that's also what makes it fragile and hard to use. In contrast, these AI-cored, LLM-based systems that are totally autonomous have the opposite set of trade-offs: they're much easier to use — you can steer them, guide them, and fix them entirely through natural language — but because the accuracy is still inexact, confidence is low.
And I think that's what Arc gets us. A solution to Arc at 85% or 100% means that you've written a computer program that can generalize from very simple core knowledge priors to solve, with exacting accuracy — 100% reliability — these sight-unseen puzzles. And that will be a new tool in the programmer's toolkit, basically, for building products and systems that can achieve that same thing. We're two weeks in, I think, from when you launched the ARC Prize. What have you learned so far? What types of people are working and competing on this? Is it the pedigreed researchers or the big labs?
Is it scrappy hustler types? Who's competing? How many teams are submitting solutions? Yeah, let's see. The response after launch, by the way, was phenomenal — much bigger than we expected. I think we were trending on Twitter twice during the launch week, the number one Kaggle competition in the world, over a million social views across all the launch channels. Just very phenomenal. I'm really thankful for everyone who helped promote and share Arc.
Hopefully we actually can get a solution here in some short time. On the most interesting thing about the folks working towards Arc: there's probably a historical answer, and then what I've seen over the last two weeks. The historical answer is that most of the people who have worked on Arc are outsiders to the field. This is not actually the first year there's been a contest about it; there was a past competition called ARCathon.
It was much smaller — it was hosted out of Lab42, an AI lab in Switzerland. Last year there were actually 300 teams that worked on trying to beat Arc, and again, no one beat it. And to my knowledge, almost all of those teams were effectively individuals or outsiders in some way — not people at big AI labs. They're folks with backgrounds in mechanical engineering or video game programming or physics.
Folks who just got curious and interested in the problem at hand. And I actually think that's more likely than not where the breakthrough for Arc is going to come from. I think it's going to come from an outsider — somebody who thinks a little bit differently or has a different set of life experiences, who is able to cross-pollinate a couple of really important ideas across fields.
That's one of the reasons why I put as much money as we did into ARC Prize. I felt like the progress was idea-rate-limited, actually, and one of the best ways to increase the number of ideas is to try to blow up awareness, which is what the launch did. Over the last two weeks, I've seen probably two camps of people emerge, at least on Twitter. There's one camp of people who are in it for the mission.
They agree with the underlying concept; they think that we do need some new ideas; they're excited to try to figure out what those are. And then there's a second group of people who are like: I'm going to prove you wrong. LLMs are definitely enough, scale is definitely what we need, and I'm going to do my best to go beat this benchmark just using existing off-the-shelf technology and prove you wrong.
So I'm actually quite happy for both those camps to exist. One of those approaches is currently up in the leaderboards. Right on. So, yeah, we can break some news here. This Thursday we're launching — or I guess when this comes out, it'll have launched a couple of days in the past — a brand new public task leaderboard. We talked about how Arc doesn't allow Internet access and has compute limits.
I know personally how unsatisfying it is to not be able to use frontier models. I also want to know: how well can GPT-4o, how well can Claude Sonnet do against this benchmark? And the compute limit is also a bit of a barrier to entry: you have to use open source models, and there's quite a bit of engineering work you have to do before you can even start testing and experimenting.
So we're launching a new public task leaderboard. It's going to be a secondary leaderboard, and we're going to be committing about $150,000 to a reproducibility fund for it. It won't be officially part of the competition this year — we want to maintain that assurance against cheating, contamination, and overfitting with the private test set, which is also the test set that has the most empirical evidence against it over the last four years.
But the secondary leaderboard is going to allow folks to submit scores against the 400-task public set, and we'll verify and reproduce the scores locally to ensure good faith in the approach, and we'll publish that. And you're right: I think the top score, or one of the top scores, on that is this guy Ryan Greenblatt. He came out a couple of days after the competition launched with a pretty interesting, actually novel approach to beating it.
And he's using GPT-4o — but not just GPT-4o. I think the interesting thing is he created an outer loop around 4o, where he is using 4o to generate programs — sampling from GPT-4o these programs, these reasoning traces, to beat the tasks or identify the patterns — then testing these patterns against the demonstrations, and then finding the one that works and applying it. And that approach seems to be getting in the low forties — maybe 40%, 41%, somewhere in that range.
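As a rough illustration of that outer-loop idea — not Ryan Greenblatt's actual code — here is a hedged Python sketch: sample many candidate programs from a language model, execute each against the demonstration pairs, and keep only a program that reproduces every demonstration before applying it to the test input. `sample_program_from_llm` is a hypothetical stand-in for an API call to a model like GPT-4o.

```python
def sample_program_from_llm(task_description: str) -> str:
    """Hypothetical stand-in for an LLM call that returns Python source
    defining solve(grid). A real system would prompt a frontier model."""
    return "def solve(grid):\n    return [row[::-1] for row in grid]"

def outer_loop(task, n_samples=100):
    for _ in range(n_samples):
        source = sample_program_from_llm(str(task["train"]))
        namespace = {}
        try:
            # Assumes a trusted sandbox; never exec untrusted model output raw.
            exec(source, namespace)
            solve = namespace["solve"]
            # Deterministic check: keep the program only if it matches every demo.
            if all(solve(p["input"]) == p["output"] for p in task["train"]):
                return [solve(t["input"]) for t in task["test"]]
        except Exception:
            continue  # discard samples that crash or don't define solve()
    return None

task = {"train": [{"input": [[1, 2]], "output": [[2, 1]]}],
        "test": [{"input": [[3, 4]]}]}
print(outer_loop(task))  # [[[4, 3]]]
```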
And it's pretty interesting, because I think someone might look at that and, at first blush, say: well, isn't that evidence that scale is all you need? And I do think there is something interesting there, right? It shows that the more training data these things have, the more programs they can spit out. That might be kind of right, but it also shows that new ideas are still needed. This outer loop is novel.
That might actually be frontier LLM reasoning that Ryan published. And, similar to ARC Prize, we're going to open-source all the code for all the reproducible solutions, so folks can take these, apply them, and try to reproduce them, using closed or open source models. But yeah, I do think it's pretty interesting how much innovation we've gotten over the last few weeks just as a result of putting awareness against the public test set.
What about the folks at the big research labs — why are they not working towards this benchmark? Because when you explain the benchmark, it seems so clear that this is obviously the thing you want to solve; you don't want to solve the memorize-the-textbook use case. Why do you think the folks at big research labs aren't trying to solve this benchmark — or are they? So, I am aware of a handful of big AI labs that have tried in the past, several years ago.
This was perhaps at smaller scale, with weaker models and things like that. One of the things I would hope is that more actually do in the future — I'd love to see if we could make Arc AGI an actual measure on some of the model cards that get reported against future models. I think that would be a really cool thing, and we're willing to do it. So if anyone is listening to this and wants to make that happen, reach out — I'm more than happy to work with them and find some way to do that.
If I had to guess — well, let me not guess; let me say what I have more confidence in. Once I got exposed to François's work again and started thinking deeply about Arc AGI, I started surveying a lot of my friends and researchers in SF and the Bay Area about whether they had heard of François and whether they had heard of Arc. François has pretty good name recognition, because he's really, really big on Twitter — he's been big on Twitter for many years. Probably nine out of ten people I talked to knew who François Chollet was.
Maybe one in ten, two in ten, had heard about the Arc AGI eval, and probably half of those were confused, because there are like five other AI evals called Arc, and I had to disambiguate for them. So it had really low awareness. This was one of the first things I asked François about when I met him for the first time in person this year. I asked him: why do you think that is? Why do you think you have such high awareness but Arc has such low awareness?
And his answer, effectively, was that it's hard. The way that benchmarks gain popularity and notoriety is that we make progress towards them, right? Researchers are working against a benchmark; somebody has an idea, they have a breakthrough, they publish it in a paper; that paper gets picked up and cited by others; that generates awareness and attention. Other researchers say: ooh, interesting, something might be possible now on this really hard benchmark. And so you get this snowball effect of attention.
And because Arc has endured with very low rates of progress — in fact, decelerating progress over the last four years — I think anyone in a lab looking at that would just say: well, maybe the time is not right yet. Maybe we don't have the idea set in the world; maybe we don't have the scale we need yet in order to beat this thing. And it looks like a toy, and — I don't fully understand why — it doesn't get the necessary importance, or people don't see how it's qualitatively different.
I haven't spent that much time on it; I've got a million other benchmarks I could use. I think that's somewhat the dynamic that has existed in the past, and it's one of the reasons why we launched ARC Prize. There are lots of market tools you can use to shape markets and shape innovation, and I think there's a narrow spot where prizes can be outrageously effective: where the idea is small and progress is idea-rate-limited, where one person or a small team can make the breakthrough, and where the result is very quickly and easily inspectable and reproducible, so you can build on top of it rapidly.
All those boxes got checked — yeah, that's one of the reasons I decided to do ARC Prize. You've mentioned a curiosity around AI and AGI dating back to college, sparked again a few years ago in the context of Zapier, and nurtured ever since. Beyond the curiosity, I'm curious why this is important. Meaning: if you could paint the picture of what life looks like for the world post-AGI — where we've defined it as the ability to efficiently acquire new skills — what do you think that version of the future looks like? Why is this an important thing to solve? The thing that I feel I have a unique insight into at this point,
having spent a lot of time thinking about Arc and this AGI definition, is that I suspect the advent of AGI is going to look very different than most people expect — especially among the camp that says AGI is undefinable because it's so mythical and scary or big or awesome that we can't even hope to define it, that it's just going to be this magical, special thing. Something I believe quite deeply is that definitions are really important, because definitions allow us to create benchmarks, and benchmarks allow us as a society to measure progress and set goals towards things that we care about and want to happen.
And from this idea of efficiently acquiring skill — we've talked about it a handful of times today — one of the direct near-term things you get is systems that can do exacting-accuracy generalization from a small set of core priors and apply it toward novel solutions. That is, again, the number one problem that rate-limits AI adoption for more real-world use cases today. That's what you're going to see. You're going to see, basically, the application layer of AI get amazingly good at accuracy, consistency, low hallucination rates, which is going to allow us to use it in a much more unfettered way, in a much more trusted way, because of the underlying way in which it's built. The reason I think that's important is that we don't know what that set of capabilities is going to build on top of into the future.
Right — there are lots of unknowns in how AGI evolves beyond the actual inception moment of a system that can efficiently acquire skill. But I think it's going to be a much more gradual and incremental rollout, where there's a lot of contact with reality as we build and engineer these systems. That's going to give us, as a society, a lot of time to update based on those capabilities — what it can do, what it can't do — and to make decisions at that point about how we want to deploy this technology, and where we as a society might say we don't want to deploy it for some set of use cases.
I think that's one of the reasons why I've been such a proponent of open source AGI progress with ARC Prize. Right now, what I see happening is there's this mythical story of a very bad outcome once we get to superintelligence. It's a very theoretically driven story; it's not grounded in empirical evidence. It's basically based on reasoning our way to this outcome. And I think the only way that we can really, really effectively and truly set good policy is by looking at what the systems can and can't do, and then regulating — making decisions at that point — about what they can or can't do.
I think anything else is cutting off potentially really, really good futures way too early. And that's sort of what's happening, I think, with a lot of this early AI regulation. I'll try to paint the good side of that picture: it's, hey, maybe the risk of this bad outcome is so high in the future that we should pause here. But I think the risk of that is that you've trimmed off every possible good path of the future way too early. And the reason it's way too early is because we still need new ideas. We need new ideas from researchers, we need new ideas from students, we need new ideas from young people, we need new ideas from labs.
Otherwise, there's a chance we never actually reach the high degree of useful AGI that we actually want. And so that's my nuanced take, I think, on what the advent of AGI looks like. I think it's much less likely to be a moment in time and much more likely to be a stair-step of technology that we build on top of past technology. And that creates a lot of moments to update beliefs based on what it can and can't do.
Do you have any predictions on when we'll cross 85% on the ARC Prize? You know, before the competition started — the first data scientist we hired at Zapier gave me this idea a long time ago, and it stuck with me. He said: the longer it goes, the longer it goes. It's this idea that the longer something takes, the more you should update your prior that it's going to take longer. So coming into this year, my expectation was: at least three or four years, probably, before we get to the grand prize mark, based on the past track record. Having seen what we've seen over the last two or three weeks, though, I think it is quite likely we get to 50% during this competition period.
I would be very surprised — surprised in a good way — if we actually get to the 85% grand prize in this competition period. But I think it is not unlikely that we crest the 50% mark before the middle or end of November, which is when the contest period ends for 2024. And is there a good "why now"? Because people have been trying at this for five years now, and you're galvanizing interest around it, and now a lot more researchers around the world are interested in AI and solving hard problems.
But is there a good "why now" in terms of enabling techniques, technologies, et cetera, that's different now than five years ago, when François first defined the benchmark? If it is true that deep learning is an important part of the solution — a deep-learning-guided program synthesis engine, or a DSL that is generated on the fly through deep learning techniques — then the world has a lot of experience now in building, engineering, and scaling such systems over the last three or four years, and there's a lot more compute online, which brings the cost down into a territory where some of these things may simply have been out of practical cost range before.
For example, Ryan Greenblatt's solution right now is actually maxing out the cost limits we're going to have against the public leaderboard — it costs $10,000 to generate the 8,000 reasoning sample traces from GPT-4o that he then deterministically checks. That technique would not have been possible three or four years ago in any way. And if it is true that there's a minimum amount of scale necessary to beat Arc — hey, we've gotten more of that in the last three or four years than we had when the first competition ran. And then I think the other "why now" is just largely due to awareness — if it truly is one.
Actually, let me answer the opposite. The risk — and the reason we launched ARC Prize — is that it's actually not "why now"; the problem is that otherwise it's not going to happen. It's not as if the ideas are already in the world and we just need to get people to work on them. The "not why now" is, I think, a much more interesting story: this LLM-driven, focused attention on LLM solutions only, and the closed research due to the competitive dynamics.
All of these things have shifted and shaped attention away from new ideas and towards scale, towards LLMs, towards application-layer AI. And we think we need some reshaping back towards the new idea set. So hopefully the "why now" is that Arc now has lots of attention. To be seen. Do you think LLMs will be part of the solution? I'm curious what you think — it seems like, in the big research labs right now, all of the frontier research is around merging LLMs with the insights you get from inference-time compute and the Q-star, AlphaGo stuff.
I'm curious what you think of that direction of research. There's some pretty interesting research I've come across on transformers — that transformers as an architecture are capable of representing very deep deductive reasoning chains with 100% accuracy. I think this is interesting, and the challenge with them is actually that we just don't have the learning algorithm: backpropagation is an ineffective learning algorithm for teaching a transformer architecture a set of network weights that can do deductive reasoning with 100% accuracy.
And so I think it's possible that the systems — or at least the core concepts that underlie language models — have sufficient capability to do this type of reasoning, and we have not yet discovered the learning algorithm that can train the model in the right way. We haven't quite discovered the right outer loop around the transformer that is going to do the program synthesis engine or the DSL generator. I feel more confident in saying that deep learning is almost certainly going to be a part of the grand prize solution in some way.
I'm pretty confident it won't be just a pure deterministic program that solves it. I think transformers are effectively the technology with the highest degree of awareness and research literature, and there's a lot of hardware now going towards accelerating transformer-based AI models — there was actually an ASIC announced recently that's trying to accelerate the transformer architecture.
So to the degree that some amount of scale or compute is necessary to beat Arc, those are things that make me bullish on the transformer architecture. Though I would point out that the search space of alternative architectures is quite rich, right? We've had maybe nine or ten mainline architectures now — transformers, LSTMs, CNNs, RNNs, xLSTM, state space models.
That would suggest the search space of those architectures is quite rich, actually, and they all have slightly varying properties. I think it's certainly possible that someone comes up with some innovation there. I'm less confident, or bullish, that LLMs in their exact form are going to be part of the 85% solution — I'd think they'd probably be a subcomponent of the architecture rather than the entire application system itself. When somebody does ultimately hit the 85% threshold, what do you hope they do with the solution? What would you like to see out of that person, other than submitting it to the leaderboard?
So this is one thing we didn't talk about a ton, but one of ARC Prize's goals is to accelerate open progress towards AGI. We are going to require that, in order to claim the prize money, you publicly share reproducible code and put it into the public domain. This goes for both the public leaderboard and the official competition leaderboard. This is in the spirit of trying to reaccelerate open progress, so that we have research out in public, in small ways, that other researchers can build on top of — and hopefully spur us toward actually building AGI, and not getting stuck in the plateau that we're in today. I think that's probably my first hope.
I've actually seen a handful of people online who have said: hey, if you've got a solution to Arc AGI, I'll give you this million-dollar offer at my company. Yeah, exactly. Which, on one hand, I'm like: okay, that's kind of interesting. But on the other hand, I think that's great awareness, and it shows, actually, the importance of solving this.
I think more people are starting to become aware of the lack of progress — the lack of frontier progress. I think Arc is kind of becoming a lightning rod for folks who want an actual measure of this; I think that's a growing sentiment in the field today. Should we close out with some rapid-fire questions? Yeah, let's do it. Okay. Who do you admire most in AI?
I mean, François Chollet is a bit of a cop-out answer — I wouldn't have co-founded ARC Prize if I didn't admire and respect his work over the last four years. I think the two people I have learned the most from directly, and who have inspired my own beliefs and work, are Rich Sutton and Chollet. Both of them published papers in 2019, right? Rich Sutton published The Bitter Lesson, which I think is fairly well known in the industry at this point.
I think his idea set is quite right, with maybe the one asterisk that the one aspect that has not yet had scale applied to it is architecture search itself. We certainly have unbiased search and learning applied on the inference side and the training side, but every architecture still has a very human, handcrafted story and journey to it, which is an important insight, I think. And then there's On the Measure of Intelligence from Chollet in 2019 — and I guess maybe Sutton's was more of a blog post than a paper.
But both of these pieces of writing, I think, are very important, because history has proven them right. As time has gone on, language models and transformer scale have shown Sutton's ideas to be even more true than they were in 2019, and I think the endurance of Arc has shown François's definition of AGI to be more and more true as time has gone on. What is your most contrarian point of view in AI? I feel like most people don't agree with me on everything we talked about today, so — All right, we'll count it.
Scale is not all you need. New ideas are needed. What's your favorite AI app other than Zapier? Let me look and see. What do I have? I have a handful, I think. I'm not going to surprise you with anything.
So I've got ChatGPT, Perplexity, and Claude, and I'm a paying user of all three of those services. One interesting thing I'll add: I have gotten way more value out of language-model-based tooling over the last, call it, six months than I ever did at first, when I was starting to work on it at Zapier. And it's because one of the things they're really, really good at — the thing they're perhaps best at — is summarizing tons of unstructured text and being an educational tutor for you.
It has significantly ramped up my learning rate on actually building with AI — learning these fundamentally different architectures, starting to actually do model training. That's something Zapier hasn't done, but I started doing it myself over the last six months to get a sense of that type of work. And yeah, language models and AI tooling have definitely accelerated my learning process. Awesome. All right, last question. Let's do something optimistic, something that we can all dream about.
What change in the world are you most excited to see over the next five or ten years as a result of AI? I've always wanted to live in the future. I think that's maybe what has always driven me towards working on frontier tech — I've always bought the latest gadget, always tried the latest app. It led me to work on Zapier and on AI, and one of the reasons I'm working on AI right now is because I think it's the biggest thing I can potentially have an influence on in trying to pull forward that future.
I think one of the things that feels very limiting about AI right now is that, with the narrow form of AI that we have, if we never get to AGI, what that will mean is that we will always be rate-limited on developing things by the human that's in the loop. And that means we'll never have AI systems that can invent and discover and sort of innovate alongside humans and really help pull forward and push forward the frontier in, I think, a lot of really interesting ways — like understand more about the universe, and then discover new pharmaceutical things and discover new physics, discover how to build AI. We're always going to be rate-limited by the human today.
And I think if you care about living in the future and you want to pull forward the good aspects of the future, some form of AGI is necessary to do that. Awesome. Thank you, Mike. Thank you both for having me.
Artificial Intelligence, Innovation, Technology, Zapier, AGI, Automation, Sequoia Capital