ENSPIRING.ai: Factory's Matan Grinberg and Eno Reyes Unleash the Droids on Software Development - Training Data
This video features a discussion with Matan Grinberg and Eno Reyes, founders of Factory. They explore their backgrounds, with Grinberg discussing his initial career in theoretical physics and the cold calls that led to significant changes in his career path. Reyes shares his immigrant family background and the experiences that shaped his journey in the tech industry. Together, they co-founded Factory, which develops autonomous software engineering tools known as 'droids.' These droids are designed to automate repetitive software development tasks, aiming to align with the current needs of enterprise engineers.
The founders elaborate on the technological and strategic approaches Factory takes, including their use of existing large-scale foundation models and their focus on enhancing software development processes rather than performing flashy demonstrations. They stress how Factory's approach differs from other AI companies by prioritizing enterprise value and developer alignment within the realm of software automation, emphasizing meaningful real-world applications over technical finesse alone.
Key Vocabulary and Common Phrases:
1. stochastic [stəˈkæstɪk] - (adjective) - Having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely. - Synonyms: (random, probabilistic, unpredictable)
Agent is synonymous with unreliable, stochastic demoware vaporware.
2. affiliated [əˈfɪliˌeɪtɪd] - (adjective) - Officially attached or connected to an organization. - Synonyms: (associated, connected, linked)
An academic institution right next to Princeton University, but not technically affiliated with it.
3. ambitious [æmˈbɪʃəs] - (adjective) - Having a strong desire for success or achievement. - Synonyms: (aspiring, determined, enterprising)
Being a young, ambitious undergrad, I decided might as well see if I could snag him as an advisor.
4. leverage [ˈlɛvərɪdʒ] - (noun) - The exertion of force by means of a lever or an advantageous position. - Synonyms: (advantage, clout, influence)
You know, to Eno's point, it might be, you know, you're like a software curator or cultivator or orchestrator, but by positioning ourselves this way with the developer, wherever that role goes, we will be there side by side to allow them to have this higher leverage
5. intellectual [ˌɪntəˈlɛktʃuəl] - (adjective) - Relating to the intellect, using reasoning and understanding objectively. - Synonyms: (cognitive, rational, cerebral)
It was intellectual love at first sight, you could say.
6. autonomous [ɔːˈtɒnəməs] - (adjective) - Having the freedom to act independently. - Synonyms: (self-governing, independent, self-directed)
Factory is building autonomous software engineering agents, or droids.
7. incremental [ˌɪnkrəˈmɛntl] - (adjective) - Relating to or denoting an increase or addition, especially one of a series on a fixed scale. - Synonyms: (gradual, progressive, step-by-step)
Goals often centered around incremental improvements.
8. nuance [ˈnjuːˌɑːns] - (noun) - A subtle difference in or shade of meaning, expression, or sound. - Synonyms: (subtlety, distinction, refinement)
Right? But if it comes into your engineering organization with all your nuances and all your engineering best practices
9. exponential [ˌɛkspəˈnɛnʃəl] - (adjective) - (of an increase) becoming more and more rapid. - Synonyms: (accelerating, escalating, expanding)
The impacts of that being compounding exponential...
10. iterate [ˈɪtəˌreɪt] - (verb) - To perform or utter repeatedly. - Synonyms: (repeat, restate, reiterate)
What is the North Star? What are you iterating against?
Factory's Matan Grinberg and Eno Reyes Unleash the Droids on Software Development - Training Data
I would have thought this would be different 13 months later, but this is still very much the case where agent is synonymous with unreliable, stochastic demoware vaporware. And I think something very important for us is we want to build these systems that aren't just cool examples of what is to come, but rather valuable today. And not just valuable for a hacker on a side project, but valuable to enterprise engineers today.
Hi, and welcome to Training Data. We have with us today Matan Grinberg and Eno Reyes, founders of Factory. Factory is building autonomous software engineering agents, or droids, that can automate everything from the drudgery of maintaining your documentation to actually writing code for you. In doing so, they are building the ultimate compound lever.
Last week, Factory also announced some impressive results on the key AI coding benchmark SWE-bench, beating state of the art by a wide margin. Stay tuned at the end of the episode for more context on how they built it. We're here with Matan Grinberg and Eno Reyes, founders of Factory. Gentlemen, thank you for joining us. Thank you so much for having us. Yeah, thanks for having us.
Let's start with a little bit of personal background. And, Matan, maybe we'll start with you and then go to Eno. So, Matan, one thing that I believe you and I share in common is an affinity for a well-executed cold call. I know at least two cold calls that have had some bearing on your life. Why don't we start with the one that you did as an undergrad at Princeton to somebody who my partner, Shaun Maguire, tells me is quite a famous physicist. Can we start with that cold call?
Yeah, absolutely. So while I was at Princeton, I was studying string theory, and the most famous string theorist happened to be working at the Institute for Advanced Study, which is an academic institution right next to Princeton University, but not technically affiliated with it. Part of the allure of going to the IAS is that you don't have to take on graduate students, much less undergrads. That said, there's a professor there, Juan Maldacena, who is by far the kind of leader of the string theory movement.
And being a young, ambitious undergrad, I decided might as well see if I could snag him as an advisor. And so, with some advice from some graduate students, sent him an email, asked if we could meet. And the thing about Juan is the way he works with people. He'll take a meeting with anyone, basically, and he'll spend about two hours at the chalkboard with you. And in this two-hour chalkboard session, he'll subtly drop a problem that you basically have 24 hours to solve. Get back to him with the solution, and then you'll officially be a student of his.
Luckily, I was warned about this rite of passage, so I was paying close attention to any hints he was dropping. You had the problem. Indeed, yes, yes. So found the problem, ended up spending basically the entire night working on it, and luckily ended up having him as an advisor. We were able to then publish a paper together, which was very exciting. Yeah. Typical undergrad experience. Yes, yes, exactly.
So there's a second cold call that I want to ask you about. Before we get to that, why don't we go to Eno? So, Eno, you similarly went to Princeton. You have a CS degree. From there, you spent some time as a machine learning engineer at Hugging Face, which is where we first intersected, spent some time at Microsoft. But like a lot of great founders, your story before then started with some humble beginnings. Could you say a word about the stuff that doesn't appear on LinkedIn that has helped to shape who you are today?
Yeah, absolutely. You know, my family on my dad's side came from Mexico in the late sixties to San Francisco, and my grandparents were both working for a bit. But when my dad was born, they started a Mexican restaurant in Los Altos, and that was in the seventies. They moved it to Haight and Cole in the eighties. We're a very kind of, like, San Francisco immigrant story. They actually ended up moving to Georgia, where I grew up. But really, I think it's the drive that they had to give my dad a successful life in America.
And it was my dad and my mom that drove that same kind of mentality into me growing up. And I think it's really cool because this story is one that I think a lot of Americans share and something that makes it really exciting to be back in San Francisco, kind of building something to potentially make the world a better place for everyone. Very cool. That is the dream. Matan.
I want to get back to that other cold call because I think it leads directly into the forming of Factory. So our partner, Shaun Maguire, who I mentioned earlier, who I believe shares a similar academic background to your own, received an email from you a year or so ago that led to a walk, and very shortly thereafter, Factory was formed. So I am curious, what caused you to cold call Shaun Maguire? And this is less of a Shaun Maguire question, because we know plenty about Shaun Maguire. This is more of like a, you're on a very good path.
You're doing really good research. You're on track to get a PhD in physics, and something inspired you to go in a different direction. And I'm curious, what was it that inspired you that led to that cold call? And maybe tell us a quick story about what happened shortly thereafter. Yeah, absolutely. So, you know, I was, like you said, I was doing my PhD at Berkeley. About a year in, though, I realized that I was only doing theoretical physics and string theory because it was hard and not because I actually loved it, which is obviously a bad reason, a bad reason to do anything.
And I had such tunnel vision on this path that when I came to this realization, it was kind of earth shattering and looked at the paths ahead of me, and there were basically three options that seemed realistic. And so it was either going into quantitative finance, going into big tech, or going into startups. And by this time, I had already kind of switched my research at Berkeley from being purely physics to an ML in physics, and then slowly more ML, and then mostly AI. So it was kind of quickly, quickly cascading there.
At the time, I think I saw a video of Sean speaking, I think, over Zoom to some founders at Stanford or something, and I recognized his name from string theory research because I had read his papers way back in the day. And it was particularly shocking to me because I'm not sure how much time you've spent with string theorists, but normally they're quite introverted, not, you know, not. Not the most social. Yeah. And so Sean is, you know, this very different example.
And so to me, I kind of, like, I looked at his background, and it was just. It was just shocking to see someone who was, like, so deep and, like, a bona fide string theorist, then go in and, like, you know, start his own companies, invest in some of the best companies, join Sequoia, and be, you know, a partner there. And to me, it was just like, oh, my God, this seems like someone who is of my kind of background, of my, uh, nurturing, I guess. And so sent him an email, and I was just like, hey, you know, we both were string theorists. I don't want to do string theory anymore.
I'm thinking about AI. Would love to get your advice. Like you mentioned, that then turned into a walk. It actually was supposed to be a 30-minute walk. We ended up going from the Sequoia offices in Menlo Park all the way to Stanford and then back, and so it ended up being 3 hours. He missed a lot of his meetings that day, so it was pretty amusing. And basically, at the conclusion of the walk, he. So one thing was for sure. He was saying, you must drop out of your PhD. There's way too many exciting things to do.
And he kind of left me with the advice of, you should either join Twitter right now, because this is just after Elon took over, and he was saying it's, you know, only the most badass people are going to join Twitter right now. Two, you should join a company of mine as just, like, a general glue, is what he said. This was Foundry, by the way. Yep. Or three, if there's some ideas that you've been thinking about, you should start a company. And I was like, you know, very grateful of all the time that he spent. And, you know, we kind of left off there beautifully.
In parallel, Eno and I had just reconnected at a LangChain hackathon, and he was in Atlanta the weekend prior, and he basically got back the next day. So that next day, Eno and I got coffee, and I think we got coffee at noon, and then basically every hour since then until now, Eno and I have been working together, talking constantly about code generation and what became Factory, I guess. Did you guys know each other in undergrad? We had, like, the maximal overlap of mutual friends without ever having had a one-on-one conversation. Yeah, it's pretty funny.
We were in eating clubs at the time, opposite from each other, and we had just so many mutual friends. And it really wasn't until I moved to the Bay Area that we had a face-to-face convo. And it was a very fruitful conversation, for sure. It was intellectual love at first sight, you could say. Absolutely, I love that. And it's fitting that it started with a LangChain connection. So how did you guys decide, I'm curious.
I mean, you're both brilliant, and I think for a lot of founders starting out in AI right now, a lot of them find it hard to resist the siren call of training a foundation model. So, like, how do you decide to build in the application layer? I'm curious. And then, why software engineering? Yeah. So I think from my perspective, like, going deep from academia, I think throughout all the years of spending time on math and physics, the thing of beauty that I learned to be drawn to was things being fundamental, and spending time doing AI research.
It was so clear that code is fundamental to machine intelligence. And so I was just naturally attracted to the role that it plays there. And I think that kind of joined quite well with Eno's attraction to the space. You've referred to it a couple of times as a compound lever. Can you unpack that for us and let us know what that means? Yeah, so there's the famous Archimedes quote about software, or, well, his quote is rather, if you have a large enough lever, you can move the world.
And then I think that's been co-opted for software engineering, that software is a lever upon the world. And for us, we see AI, and in particular AI code generation, as a lever on software, the impacts of that being compounding, exponential. And, sorry, Eno, I think I cut you off. I think you were maybe mentioning how you got to the founding inspiration for Factory. Oh, yeah, absolutely. I mean, I think Matan's story is really indicative of kind of the energy at the time. At Hugging Face, working on training, optimizing, and deploying LLMs for enterprise use cases, I was actually working with Harrison on early LangChain integrations.
And it was so clear that the work that was happening in open source was directionally moving towards modeling human cognition with LLMs, where the LLM was just one piece of the system. The idea of chains, and I think Harrison calls them cognitive architectures, or the LangChain folks call it that. And seeing that happening, and seeing that within the codegen space the most advanced players were basically looking at autocomplete, it felt like there was a huge opportunity to take that to the next step and take some of those lessons that were happening both in the fringe research and open source communities and apply them towards kind of massive organizations.
I realize we haven't said explicitly yet, what is Factory? So, Matan, what is Factory? And then maybe what are a couple of the key decisions that you've made about the way Factory is built? And, for example, one of them is to start by benefiting from all the ongoing improvements in the foundation model layer. One of them might be the product itself, but can you just say, what is Factory? And what are some of the key decisions you've made that have shaped Factory today? Yeah, absolutely.
So Factory is a cutting-edge AI startup. Our mission is to bring autonomy to software engineering. What that means more concretely: we are automating tasks in the software development lifecycle, and in particular, tasks like code review, documentation, testing, debugging, refactoring. And as I list these off, you'll kind of hear quickly that these are the tasks that engineers don't particularly enjoy doing, and that's very much intentional. Right? Like, obviously, we are doing code generation, and that's really important.
But I think an equally important thing to, you know, generating some inspirational and, like, forward looking demos. It's also important to understand what engineers are actually spending their time on. And in most organizations, it's not actually fun development work. In most organizations, they're spending a lot of their time on things like review and testing and documentation. Normally, they'll do these things way too late, and then they're suffering because they're missing deadlines. Right.
And so our approach is we want these tools to be useful in the enterprise. And so to do that, we need to kind of meet engineers where they are with the tasks that they are very eager to automate away. We call these autonomous systems droids. And like Eno was alluding to earlier, there's a droid for each category of task. And in this kind of a paradigm where we want to frame these problems as games, it's very convenient that software development has a clearly defined software development lifecycle.
And so for each category of tasks, or each step in the software development lifecycle, we have a corresponding droid. So that's kind of a first pass there. I think there was a second part of your question that I missed. We'll get into the rest of it. Where did the name droid come from? It's a pretty catchy name. It's very memorable and distinct to Factory. Where'd that come from?
Yeah. Keep in mind, when factory started, this was, like you mentioned, about a year and a month ago. And, you know, I actually, I would have thought this would be different 13 months later, but this is still very much the case where agent is synonymous with unreliable, stochastic demo ware vaporware. And I think something very important for us is we want to build these systems that aren't just like, cool examples of what is to come, but rather valuable today. And not just valuable for a hacker on a side project, but valuable to enterprise engineers today.
We felt very strongly that agent just doesn't really capture what we're trying to deliver. And so, fun fact, we were originally incorporated as the San Francisco Droid Company. But upon legal advice, and given, I guess, the eagerness with which Lucasfilm pursues its trademarks, we changed our name to Factory. Fair enough. So is it fair to say, then, that a droid is sort of like a job-specific autonomous agent that actually works? Is that a reasonable way to think about it? Yes. Okay. Exactly.
You just said the words cognitive architecture, and I know my partner, Sonya Huang, well enough to know that this is her love language. So I'm sure that Sonya's mind just lit up with a whole bunch of questions for you. So I don't want to get in the way. Sonya, have at it. We just had Harrison on the podcast who talked about custom cognitive architectures as well, I guess. What are you doing on that front? And how do your implementations dovetail with the multi-droid strategy that you're taking? Yeah, absolutely.
I mean, it's a great question, and I think the way that we think about reasoning and cognition within the standpoint of these systems, there are clearly huge innovations happening on both layers, the foundation model layer, as well as on the orchestration or application layer. The way that you can think of our technical approach on this is that traditionally, labs like DeepMind and some of these orgs that are really focused on solving problems that you can model like a game where you have rules and an environment and feedback loops, you can build out systems which model the reasoning of humans and even outperform them.
They did this with the Alpha series of models: protein folding, Go, code. And for us, most of the reasoning work we do is similarly focused on inference-time reasoning, search through decisions, and what we kind of think of as maybe something of intuition, maybe something of planning. But we aren't training foundation models yet, and I think a lot of the innovation that's going to happen at the foundation model layer will be things like latency and context window and performance on some subset of tasks.
But anytime that you need action and environmental feedback and kind of long-term planning, it's going to be really difficult to build a single foundation model that does that. And I think it's really the application layer where those types of innovations are going to happen. Yeah, I thought the Princeton SWE-agent paper that came out last week or so was really interesting as an example of that, of how you can get incredible agentic reasoning performance on code tasks from small open source models. I thought that was a really nice proof point of what you're saying. We love the whole team that put that together, and the SWE-bench work, I think, is a popular benchmark in the space.
I think it's clear that a lot of the effort towards building these systems relies on not just any one benchmark or eval or set of tasks, but rather collaboration across a bunch of different areas. Whether it's the model layer, whether it's the tasks themselves, it's what data are you using to evaluate, and ultimately, like, the overall architecture. And, yeah, they're a really great team. We're super pumped to see their work. Okay, last question on this, and then I will pause myself. Any favorite cognitive architectures? Like, is it the tree-of-thought stuff, the chain-of-thought stuff? Any favorite cognitive architectures that you think are especially promising or fruitful in your field?
Yeah, I think that's a great question. I mean, I think kind of what I alluded to previously, when you have almost like, the game like problem space where there are kind of simulatable, analyzable, and optimizable boundaries, then that means that you can search through those decisions. And there's a bunch of techniques like Monte Carlo tree search, language agent tree search that people have talked about in research papers that I think are interesting approaches here. I think that in my mind, there isn't a singular cognitive architecture that makes sense for all tasks.
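To make the game-like search idea concrete, here is a minimal, hypothetical Python sketch of sampling rollouts over candidate actions and keeping the best-scoring trajectory, loosely in the spirit of the Monte Carlo and language agent tree search techniques mentioned above. The `propose_actions` and `score_outcome` functions are stand-ins for an LLM proposer and an evaluator such as a test run; nothing here is Factory's actual architecture.

```python
import math
import random

def propose_actions(state):
    # Stand-in for an LLM proposing candidate next steps (edits, tool calls).
    return [state + [f"edit-{len(state)}-{i}"] for i in range(3)]

def score_outcome(state):
    # Stand-in for environmental feedback, e.g. running tests or a critic model.
    return random.random()

def rollout_search(initial_state, depth=3, rollouts=16):
    # Sample trajectories through the decision space and keep the best one.
    best_score, best_plan = -math.inf, None
    for _ in range(rollouts):
        state = initial_state
        for _ in range(depth):
            state = random.choice(propose_actions(state))
        score = score_outcome(state)
        if score > best_score:
            best_score, best_plan = score, state
    return best_plan, best_score

plan, score = rollout_search([])
print(f"best plan: {plan} (score {score:.2f})")
```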
And a lot of the benefit of breaking down the software development lifecycle into semantically meaningful segments is that developers, when they have these workflows that move from one step to the next, they've defined the boundaries of the game, so to speak. A lot of the work we do is figuring out which cognitive architecture or what design makes sense for a given task. You reminded me of the Rich Sutton bitter lesson: search and learning are the two techniques that scale. Yeah, absolutely. And I think you definitely need both. And then you were talking about this a bit, how the sort of the reasoning layer on top of the foundation model is really the focus for a lot of the fundamental research and a lot of the fundamental work that you guys are doing. Matan, you had a line a couple months ago when we were talking that was.
And hopefully this doesn't come across as snarky, because it's not meant to, but it was something to the effect of, there are 800 engineers at OpenAI working on my margins for me. Can you say a word about that? Because I thought, first, that was incredibly well put. And then second, pretty good insight in terms of how you're building the business and really benefiting from the work of the foundation models. Can you just say a couple of words about that?
Yeah, absolutely. You know, there are a lot of companies, a lot of startups, that are pursuing training foundational models or fine-tuning models. And there are a lot of huge research labs like OpenAI and Anthropic who are also putting a ton of resources behind making these foundational models better, cheaper, faster. And from our perspective, right, like, we don't want to run a race that's, you know, not suited to our abilities. Right. Or we don't want to fight a battle that we know we won't win.
Training foundational models, we are not going to win that battle. And similarly, I also don't think it's a particularly unique battle at this point. I think these companies were incredibly unique and innovative, clearly based on what they're delivering. But now I think the stage is set in terms of training foundational models. And I think similarly, with a lot of the infrastructure for fine tuning and that sort of thing, what has not really come to fruition yet is actually making products with AI that people are using.
There's so much talk about all these foundational models, all this infrastructure, and there's still very few real products that use this AI. In the analogy that VCs like to talk about a lot, we have a ton of picks and shovels and no one's actually going for gold. And so the thesis behind how we're building this company is let's first use these beautiful black boxes that OpenAI, Anthropic, and Google are spending billions of dollars and hundreds of engineers to make. Let's use these black boxes and build a product that people are actually using.
And once we do that, then we can earn the right to do the fancier things like fine tuning and training. If you're unable to build a product that people are actually using with these incredible models, then chances are fine tuning and training will not save you, and it's probably just not a good product. And so that's kind of the approach that we're taking there. And so we do get a lot of improvements when new models come out. But, yeah, we are very much grateful for the work that's being done at these cutting edge research labs.
You've said a lot about how what you're doing is kind of like making AI immediately practical for engineers in, like, an enterprise setting. And so I want to throw another, I think it's a Matan quote, and I'm not sure if you were quoting somebody else, but you said, when you were talking to us last time, if Jeff Dean shows up at your office and he doesn't understand your code base, he won't be productive. Unpack that for us. Like, what does it take to kind of make a coding agent that's not just good for anybody that boots up a computer, but somebody that's a full-time engineer at a real software company?
Yeah, totally. And, yeah, so the analogy here is that Jeff Dean is the analog of a really, really good foundational model, let's say, like GPT-6, with incredible reasoning. Right? But if it comes into your engineering organization with all your nuances and all your engineering best practices, just having that good inference and good reasoning is not enough to actually contribute and automate these tasks reliably. Some given isolated task, sure, you can solve. Like, give it some LeetCode problem, give Jeff Dean a LeetCode problem, I'm sure he will solve it. But if you have some 20-year-old legacy code base, some part of it is dead code.
The other part of it, the last person who was contributing to it just retired and so no one else knows what's going on there. You need deep understanding of the engineering system, not just the code base, but like why you made certain decisions, how things are being communicated, what top of mind priorities are for the engineering organization. And it's kind of these less sexy but incredibly important details that we're really focused on in order to deliver this to the enterprise. What about the. I think a lot of these AI coding companies are kind of focused on the individual developers productivity.
How do you think about the individual level optimization versus maybe the system, the system wide optimization? I think the important thing to think about with respect to the whole is when a vp of engineering comes into the room, they're not really focused on whether or not an individual completed one task an hour faster. They're concerned about how many tasks are being completed and aggregate metrics of speed. But if that person completed that task an hour faster, but it's 40% worse code, right? It's churny code where people are going to rewrite on top of it.
Or that person took that task and they did it in an hour, but it took them 4 hours to plan it and they were blocking five other engineers. And so when you start to actually add the nuance of what it means to be successful in measuring an engineering org, you start to bump into a lot of challenges with understanding what needs to be improved and what is a bottleneck and what is just a secondary metric. I think a lot of the initial attempts at making AI coding tools are really focused on first order effects: how quickly is somebody tabbing to accept an autocomplete statement, or how quickly is somebody completing an individual task.
But I think that at Factory, a lot of what we're trying to do is understand from an engineering leader's perspective, how are you measuring performance and what are the metrics that you look at to understand, hey, we're doing really well as an org, or hey, we need to make improvements, and targeting those. And I think metrics like code churn, end-to-end open-to-merge time, time to first answer within the eng org, all of these things are much more impactful to an organization's speed of shipping code. And so that's kind of how we think about it.
I think this really ties into what Eno was just saying quite well, which is clearly, we were talking about products earlier as well. Clearly, the AI product that has penetrated the enterprise the most is Copilot, right? Unfortunately, with a tool like Copilot, the things that are kind of the metrics that are really held up as success are things like autocomplete acceptance rate. And the problem is exactly to your point. If you're a CTO or a VP of engineering, how do you then go to the executive team and say, hey, look, our autocomplete acceptance rate is this high.
They don't know what that means. They don't understand how that translates into business objectives. Also, Eno was alluding to this. There's a hidden danger to some of these autocomplete tools, which is orgs that use tools like this end up increasing their code churn by anywhere from 20% to 40%. There's some studies that look into this, there's some problems with these studies, but directionally, what's clear is that as the percentage of AI-generated code increases, if you're not doing anything different in your review process, code churn is going to go up.
And so our reason for focusing on org-wide metrics is that it divides out all of these concerns. If we look at things like, how fast are you completing your cycles, what is your code churn across the org, across these different repos, that divides out these smaller intermediate metrics and gives you a sense of, hey, we are shipping faster and we're churning less code. That's really how we talk about this with these engineering leaders. At the end of the day, the three kind of main axes we look at are saving engineering time, increasing speed, and improving code quality.
And ultimately, so these are the three. And again, there's kind of different complexity of metrics for different parts of the org. These are the three that we discuss with engineering leaders, but we want to arm them with information when they're talking to, let's say, their CFO. And so really, we kind of break that down into one main metric, which is engineering velocity. And that's really what all of these droids are targeted towards: increasing engineering velocity.
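As a rough illustration of the org-wide metrics discussed here, the sketch below computes an average open-to-merge cycle time and a simple code churn ratio from a list of pull request records. The record fields (`opened_at`, `merged_at`, `lines_added`, `lines_rewritten_within_30_days`) are assumptions made for the example, not Factory's schema or methodology; real numbers would come from source control history.

```python
from datetime import datetime, timedelta

# Hypothetical PR records; real data would come from a source control API.
pull_requests = [
    {"opened_at": datetime(2024, 6, 1, 9), "merged_at": datetime(2024, 6, 2, 15),
     "lines_added": 120, "lines_rewritten_within_30_days": 30},
    {"opened_at": datetime(2024, 6, 3, 10), "merged_at": datetime(2024, 6, 3, 18),
     "lines_added": 40, "lines_rewritten_within_30_days": 4},
]

def avg_open_to_merge(prs):
    # Average time from PR open to merge (one flavor of cycle time).
    deltas = [pr["merged_at"] - pr["opened_at"] for pr in prs]
    return sum(deltas, timedelta()) / len(deltas)

def code_churn(prs):
    # Fraction of newly added lines rewritten or deleted soon after merge.
    added = sum(pr["lines_added"] for pr in prs)
    rewritten = sum(pr["lines_rewritten_within_30_days"] for pr in prs)
    return rewritten / added

print("avg open-to-merge:", avg_open_to_merge(pull_requests))
print(f"code churn: {code_churn(pull_requests):.0%}")
```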
Let me try to recap a couple parts of the story thus far. So, in some ways, this is a compound lever, meaning AI is a lever on software, software is a lever on the world, and so building an autonomous software development system is one of the most impactful things you can possibly do with your lives, which is pretty cool. There are a few unique angles to the approach that you guys have taken, or maybe not unique, but distinctive, you know, one of which is the decision to ride on top of the foundation models, which means that you get to benefit from all their ongoing innovation.
It also frees you up to really focus on the reasoning and the agentic behavior on top of those foundation models, which is part of the reason why you can deploy your product as a series of droids, which are basically job specific autonomous agents that do something like test or review end to end in a way that is practically useful to an engineering organization. And instead of focusing on just producing more code, you're actually focused on the system wide output, which requires you to have really detailed context around not just the code base, but all of the systems and processes and nuance around the entire environment.
And having done so, you can increase velocity for an organization. I think that's a bunch of the story that we've talked about so far. Let's talk a bit about the results. Are there any good customer examples you can share of Factory in action and the results that you've been able to have for people? Yeah, I think some of the main things that we're seeing across the board, and we're not super public on case studies just yet, but something that we see across the board is, I think our average cycle time increase is around 22%.
On average, we are lowering code churn by 13%. And I guess we haven't even gotten into the specific droids, but tools like the test droid end up saving engineers around 40 minutes a day, which is pretty exciting. And yeah, I think kind of going back to what we were talking about in terms of benchmarks, one of the most exciting things about having thousands of developers who are actually using these tools is that we get this live set of benchmarks, and we get evals and feedback from these developers about how these droids are performing.
And so, like Eno mentioned, we are huge fans of SWE-bench and what that's done kind of for the general community in giving people an open source benchmark to really compare these models. But strategically, for us, having this deployed in the real world has allowed us to dramatically increase our iteration speed in terms of quality for these droids. What have you guys learned along the way, since you have a bunch of people using this in the real world? Have there been any big surprises? Engineers love ownership.
Yeah. All right. Say more. Absolutely. I mean, I think it really is that when you're building an autonomous product and the goal is to take over a task, you have to deal with developers who are fickle for good reason. They're constantly bombarded with developer tools and automations, and anything that's kind of being enforced from a top down perspective needs to be very flexible, making sure that when we're building these products, we think about what are the different preferences or ideas that people have about how this task should be done, and then building as much flexibility into that. I think a great example of this is the review process.
Everybody has a different idea of what they want code review to look like. Some people want superhuman linters. Some people want really deep kind of analysis of the code change. Some people don't even like code review. They get annoyed by it entirely. Matan has a great quote about what code review is like. I don't know if you want to share that. Yeah. In general, we've kind of internally realized that the code review process is very much like going to the DMV, in that no matter how clean the DMV is, no matter how fast the line is, no one loves code review, because at the end of the day, someone's criticizing you, someone's going in and looking at what you did and saying, better ways you could have done it.
So in general, the review process, it's the type of thing that, as an engineering leader, it's great to see moving the needle on these organization wide metrics. As a developer, it's maybe not the most fun thing, whereas something like the test droid, which is generating tests for you so you don't spend hours writing your unit test. That's incredible. As a developer. But, you know, for the engineering leader, it's slightly less obvious how that connects directly to business metrics. So I think this is part of why it's important for us to have this fleet of droids, because we're not just building this for the engineering leader, nor are we just building this for the developer, but rather for the engineering organization as a whole.
Part of what I heard there was that I don't have to go to the DMV anymore. You can just send me my driver's license in the mail. Yeah, basically. Yeah. Love it. It's a good way to sum it up. Have you guys seen Pat drive? I don't think they should be sending him a driver's license. I'm not at Seattle. We have Waymo for that. Speaking of Waymo, how far out do you think we are from having fully autonomous software engineers?
If you talk about Waymo, it felt like it was going to come really fast, and then it felt like we went through a valley of despair, and now the future is coming at us super fast again. Which inning are we in for the kind of fully autonomous software engineer cycle? And when do you think we'll have fully autonomous Jeff Deans? This is a great question, and I think one that we get a lot. I think one thing that's worth doing is kind of reframing what a fully autonomous software engineer will do. There have been many moments where technical progress has led to kind of, you know, labor dynamic changes and increases in the level of abstraction in which people work. And I think that historically, enabling people to operate or impact the construction of software at a higher level of abstraction, with less domain knowledge, has generally led to huge increases in demand for software.
I think that what we're seeing with the customers we're working with today is that when you free people up from these types of secondary tasks, like generating unit tests that map to a pull request, or writing and maintaining documentation on a code base that 95% of people know but that comes into play for the 5% that doesn't, they start to shift their attention to higher level thinking. They think about orchestration, they think about architecture, they think about what is this PR actually trying to do, and less about did they follow the style guide.
I think that what we're seeing is that this is happening today already because of AI tools. And over time, as they get better and better, we'll see that shift towards software engineers becoming a little bit more like architects or designers of software. In the future, I think there's going to be ten times more people involved in software creation where every individual has the impact of maybe 100 or 1000 people. It just may not look exactly like the individual steps of the development lifecycle that we see today. That reminded me of a quote that you guys have on your website which said, and I'm going to read this, it says, we hope to be a beacon of the coming age of creativity and freedom that on demand intelligence will unlock.
And that really resonated with me when I read it because it sort of implies a very positive and optimistic view of the world that we're heading into. I wonder if you guys want to say a couple more words on that or sort of what you think the relationship between man and machine will be in the fullness of time. This kind of goes back to our original approach, which is, it's very tempting to go after the sexiest parts of software development, in particular, building an app from scratch. But that's also the sort of thing that will make a developer defensive, because that's the part that they enjoy.
Right. And so in a world where you automate the development, then an engineer is just left reviewing testing and documenting, which is like a depressing hellscape if you were to ask any, any software engineer. Right. So for us, it's very important that we position ourselves aligned with the developer instead of, you know, going into these organizations and being antagonistic with them. Right. Like, by going in and automating the things that we don't want to do, or rather by going in there and automating the things that developers don't want to do, we are positioning ourselves with them. Right.
Five years from now, I don't think anyone really knows what software engineering will be or even if it'll be called that anymore. You know, to Eno's point, it might be, you know, you're like a software curator or cultivator or orchestrator, but by positioning ourselves this way with the developer, wherever that role goes, we will be there side by side to allow them to have this higher leverage. And so, yeah, completely agree to your point. Like, this is one of the most incredible things that is going to happen to, you know, our ability as humans to create. And I think for us, it's just incredibly important that we are aligned with the users of this product and not antagonistic, trying to replace them.
How far do you think we are from having these reliable, maybe call it intern-level, engineers? Is it a year out? Is it really here today? Is it a decade out? I mean, it depends on the task. For things like code review and testing, I think we're already there, where we're able to operate at a level that, for many... there's feedback from one organization that we got in particular, where we brought them the review droid. And this was pretty early on, and they said the review droid is the best reviewer on our team.
I think that every once in a while, you hear something like that, and it gives you a lot of kind of confidence that directionally, we're definitely moving towards something that is valuable. And for tasks like, hey, we've got to decompose our monorepo into a ton of microservices, the type of thing that you might give to a staff-level engineer armed with a team of engineers under them, I think that we won't see a binary moment of, oh, well, now this is done by an AI.
I think that their responsibilities will slowly start to get decomposed into the tasks of planning and implementing the refactor, going one file at a time. And when they start handing off those subtasks to AI, I think that role will start to be called something different. Because when you're no longer as focused on what is the individual line of code that I'm writing tomorrow and more focused on what is our mission or what is our goal as an engineering team, you really are more of an architect and less of an implementer and a concrete example of us eating the food that we're creating.
We were dreading for months creating a GitLab integration. Some of our customers use GitLab. We want to build cool AI stuff. We didn't want to spend time building a GitLab integration. We had our code droid fully spec out what the steps of building a GitLab integration would look like, and then it actually implemented every one of these sub tickets. We were, of course, monitoring it just to make sure it wasn't breaking anything. And we now have a GitLab integration.
And so this is something that we genuinely were considering getting an intern to do, because we just really didn't want to do a GitLab integration, and, you know, shout out GitLab. Yeah. But materially, the droid saved us hours of time. None of us had built a GitLab integration before. And also it's just relatively complicated to abstract away the source code manager. And so that was materially intern work that we did today. So to answer your question, it is now. It's just kind of slowly climbing up more and more, the level of complexity of these tasks. The future is really here? It is. I have a question about competition, and not specifically the competition in your space, but how you more generally think about navigating competition. I think you guys are the type of founders that a lot of companies in the application layer really look up to, because you're insanely ambitious, building a real company of meaning, you're making a lot of smart decisions, like riding on other people's models. I think the obvious kind of scary other side of that is every other competitor in the space has access to the same models as you.
I'm curious how, maybe just mentally, and then I guess overall, you think about approaching competition in this space. Do you think it's more elevated in this application-layer AI market than in other startup markets historically? And how do you think about navigating that? Totally. Yeah, I think that's a great question. And I think that's really, you know, our approach to that has defined how we've built out this team. And really, I think there are a lot of ways you can respond to competition and, like, mentally kind of justify your existence versus competitors.
I think for us, on the team side, we are just a team of people who are more obsessed than anyone else out there. And I think that is, like, something that just has compounding benefit. I am willing to bet everything that the people that we have assembled are just more obsessed than everyone else working in this space. I think a kind of a corollary to that is the only way you can win is by executing faster. Everything else is all just, like, sprinkles on top.
The only way you can really win is by executing faster and being more obsessed, and that is what our team is. And I think, I guess one last thing is having a group of people who respond to kind of external pressures as, like, more motivating and, you know, responding in that way, also being very mission driven. Right. Like, if, you know, a competitor does something big and then suddenly you're deflated. Well, if you're truly obsessed with a mission, it's irrelevant. If you're truly obsessed with our goal of bringing autonomy to software engineering, all of that is noise.
What we need to do is execute as fast as possible in this direction that we've set, and the rest will sort itself out. Love it. Really well said. Maybe a few final questions to close us out. If you weren't solving the autonomous software engineering problem, what problem would you be solving? I guess I'd have to be banned from coding agents for this, so perhaps robotics. I find robotics very interesting. I think a lot of the time, the team here, a lot of the team comes from backgrounds working on autonomy and robotics, and we talk about how what we're building really kind of resembles that in many ways.
I think multimodal, function-calling LLMs are here, and the robotics companies, with the decreased hardware costs that are coming out, are clearly making progress. So it feels like a fun area. So you'd be making physical droids. Exactly. It's on the roadmap. How about you, Matan? Yeah. I think this is one of my blind spots where I just suffer from severe tunnel vision. I genuinely cannot fathom working on anything else. I'm just genuinely obsessed with our mission to bring autonomy to software engineering.
If I wasn't working on this I'd figure out a way to work on this. I know that's a cop out answer, but I genuinely. It does not compute. That is in fact a cop out answer, but it is a fantastic cop out answer. So we will take it. One of the questions that I always like to ask is, who do you admire most in the world of AI? And tell you what, Matan, because of your background, we'll let you look at the superset of AI and physics, if you like. I would say a name that comes to mind when you say that is Jeff Dean.
I think we mentioned him earlier already, actually, but his impact in research is one huge side of that. I think TensorFlow and the work that whole team has done at DeepMind and related. But I've also heard he's a nice guy. And I think the thing is, having responsible leadership in the AI community is really important. And there's a lot of folks who are on Twitter all the time clashing. And I think that seeing folks who are outside of that side of it is pretty great.
Yeah. And I think, not to give you guys a double cop out, but at Factory we very highly emphasize collaboration. And I think, like, in AI in particular, everything has been done by groups of people. And so it's hard to really think about one individual. I think in physics, there are a lot more, like, solo geniuses doing something crazy. But I think a team recently that we really admire at Factory is Mistral, and how kind of quickly they basically came into open source and brought those models to basically the cutting edge in a super short amount of time.
And I think, you know, I speak not just for myself, but I think all of our team really admires both the mission that they have and the speed with which they executed on that. So, yeah, I would say Mistral also. All right, last question. If you had to offer one piece of advice to founders, or would-be founders, hoping to build in AI, what piece of advice would you offer them? We are in a land of picks and shovels and no one has struck gold yet, clearly. So I would say go for gold.
I would say try to build something that you think is going to get 10x better if OpenAI releases GPT-6 or 7. I think internally we think of our product as something that will multiply in value and uniqueness when new models are released. And I think for us, it's always... like, we were listening to the OpenAI announcement yesterday, and everyone is excited, everyone's pumped when a new model comes out, when open source does something great. If you're stressing about new model releases or demos, it might mean it's worth adjusting your product strategy.
Congratulations on launching Factory and beating state of the art on SWE-bench by such a wide margin last week. It's incredible. Just for our audience, can you maybe quickly recap what SWE-bench is? Yeah, absolutely. And thank you. All credit goes to the Factory team for making it happen. SWE-bench is a benchmark designed to test an AI system's ability to solve real-world software engineering tasks. So it's around 2300 issues which were taken from contributions made to twelve popular open source Python projects.
And typically these issues are bug reports or unexpected behavior that people reported on these open source projects. And the idea is all of these real world issues were addressed by other humans. And so you have a ground truth of what a human software engineer would do when faced with an issue. And the benchmark is trying to test, can your AI system go through each of these issues and generate a code change that properly addresses it and comparing it to the human solution with tests that a human wrote? And so there's a lot of asterisks, but it is a somewhat useful approximation of your system's ability to take natural language and then turn that into code.
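For readers unfamiliar with how a benchmark like this is scored, here is a simplified sketch of the evaluation loop: apply the system's proposed patch at the issue's base commit and run the human-written tests that the ground-truth fix was expected to make pass. The issue fields, repository URL, and patch here are placeholders for illustration, not the benchmark's actual schema or harness.

```python
import subprocess
import tempfile

# Illustrative issue record; the repo URL and field names are placeholders.
issues = [
    {"repo": "https://github.com/example/project.git",
     "base_commit": "abc123",
     "problem_statement": "TypeError raised when parsing an empty config",
     "fail_to_pass_tests": ["tests/test_config.py::test_empty_config"]},
]

def generate_patch(issue):
    # Stand-in for the AI system under evaluation, which would return a diff.
    return "diff --git a/config.py b/config.py\n..."

def evaluate(issue, patch):
    # Clone the project at the issue's base commit, apply the patch, run tests.
    try:
        with tempfile.TemporaryDirectory() as workdir:
            subprocess.run(["git", "clone", issue["repo"], workdir], check=True)
            subprocess.run(["git", "checkout", issue["base_commit"]], cwd=workdir, check=True)
            subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=workdir, check=True)
            result = subprocess.run(["pytest", *issue["fail_to_pass_tests"]], cwd=workdir)
            return result.returncode == 0
    except subprocess.CalledProcessError:
        # A failed clone, checkout, or patch application counts as unresolved.
        return False

resolved = sum(evaluate(issue, generate_patch(issue)) for issue in issues)
print(f"resolved {resolved} of {len(issues)} issues")
```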
And I think the previous high-water mark on SWE-bench was 14% or so from Cognition's Devin, until last week. And you put up a really impressive new result at 19%, which is such a wide margin. This is such a competitive field right now, and such a competitive benchmark that everyone is trying to beat, which makes your results even more impressive. Could you maybe share a little bit about your approach and how did you get there?
Definitely. One of the main reasons we were interested in SWE-bench is that there's a lot of companies and research labs that made submissions. You can see Microsoft Research, Amazon, IBM, ByteDance, and I think that's a testament to the SWE-bench team's effort in making this benchmark a household name, which is great. I think one of the reasons we were able to outcompete the kind of, like, well-funded tech giants and other AI codegen startups is that we're honestly not building the code droid for a benchmark, but rather to support real-world customers. And we've always said customers are the best benchmark.
I think this is some great evidence for the success of that approach. There are a few areas our technical report goes into, around planning and task decomposition, environmental grounding, codebase understanding. But overall, I think that the thing that matters most when your team's working on these types of general software problems is, what is the North Star? What are you iterating against? Having a real-world data set can make a huge difference. And we just had Harrison on the podcast last week actually talking about cognitive architectures.
To what extent did prompt engineering and cognitive architectures play a role here in your results? I would characterize our research as continuously pushing the question of how can we model each droid's architecture to more closely resemble the human cognitive process that takes place during the task. It's funny, we actually have internally been referring to the flow of data and LLM calls as the droid architecture, basically since the first droid.
And when Harrison first wrote about cognitive architectures, it really became apparent that the concept of a cognitive architecture is a great mental model for how to characterize systems that have complex LLM interactions and data flow. And so for us, I think the meta problem of designing a good cognitive architecture is balancing flexibility with rigidity in the actual workflow. You want very rigid entry points, and certain common trajectories, like error recovery, need to be really consistent, but then you want the flexibility and the dynamics during the majority of the problem solving process. And so it's a challenging balance.
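As a toy illustration of that balance, the sketch below wraps a flexible, agent-chosen middle loop in rigid entry validation and a fixed error-recovery path. The step names and functions are hypothetical stand-ins, not Factory's droid API.

```python
def validate_task(task):
    # Rigid entry point: reject malformed work before the agent runs at all.
    if not task.get("description"):
        raise ValueError("task needs a description")

def choose_next_step(task, history):
    # Flexible middle: in a real system an LLM would pick the next action
    # dynamically; here we walk a fixed plan to keep the example deterministic.
    plan = ["read_code", "draft_change", "run_tests"]
    return plan[len(history)] if len(history) < len(plan) else "done"

def run_step(step, simulate_failure=False):
    if simulate_failure:
        raise RuntimeError(f"{step} failed")
    return f"{step} ok"

def run_droid(task, max_steps=10):
    validate_task(task)
    history = []
    for _ in range(max_steps):
        step = choose_next_step(task, history)
        if step == "done":
            break
        try:
            history.append(run_step(step))
        except RuntimeError as err:
            # Rigid recovery trajectory: record the failure and continue along
            # a consistent, predictable path instead of improvising.
            history.append(f"recovered from: {err}")
    return history

print(run_droid({"description": "add GitLab integration"}))
```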
But I think it's one of the most interesting problems when building the droids: how do you know when to add structure and when to let the droid, so to speak, handle it? Really cool. So every droid has its own cognitive architecture that mirrors as closely as possible what the human equivalent of that task would be doing. Yeah, exactly. 19% is amazing compared to prior state of the art. It also still feels quite far away from, you know, reliable code droids that people will just trust to run wild in their code base.
What do you think is the threshold at which engineers will actually start to use these code droids reliably and just let them run? Are we there yet? What is the threshold? Yeah, for sure. I think that one thing to keep in mind is that the percentage on a benchmark like SWE-bench is one of many possible measures, because the answer, I think, really is that they are already using it in production. But the use cases that might highlight what the code droid is designed for may not necessarily have a ton of overlap with what is tested in a given benchmark.
So if you take HumanEval or some of the other coding benchmarks, that maybe tests your ability to pass a coding interview, but it doesn't really test real-world software engineering. SWE-bench, I think, actually does test a lot of real-world software engineering, but in the particular context of debugging or kind of unexpected-behavior identification. There are some feature requests, and there are a lot of not explicitly debugging style problems. But tasks like a migration, a refactor, a modernization, that take place over multiple changes and oftentimes have humans very heavily collaborating, are really a pretty different problem.
And our internal evaluations are much more focused on those customer tasks, and so we have way higher reliability rates for those style of tasks. And I also think that a huge part of the role of human AI interaction design is acknowledging where the systems are currently falling short and building into your interaction pattern accommodation for the weak points of the AI system. This isn't going to, 100% of the time, perfectly capture the intent of what you were doing.
So how do you kind of have failure trajectory handling? How do you introduce the ability to edit midway as the code droid is working to observe and have some interpretability into why a code droid is making the decision, so that when it does something, the human being actually can step in or at least understand what went wrong.
And so I think that those allow you to say, well, we may not be at 100% on something like SWE-bench, but we can still use this and get kind of real productive gains in the meantime. Totally makes sense, and I hear you that SWE-bench is not the be-all, end-all, but since you have a good crystal ball into this space, do you have a prediction at what point we'll get to 80 or 90% on SWE-bench?
I think that the pace right now is really, really fast. There's a kind of interesting question of, will we get to 80% to 90% on SWE-bench, or will there be a better benchmark that kind of comes out before we can really meaningfully start hill climbing past the 50, 60%? There's honestly a lot of tasks in SWE-bench which are, I wouldn't say impossible, but it almost feels like getting them right would almost only indicate that you're cheating. It's like they test for really, really specific claims or a string match.
And so I think that before we see 80% to 90% on SWE-bench, what we'll actually see is kind of like SWE-bench 2 and SWE-bench 3, which focus on trying to think deeply about how we can evaluate whether a piece of code is not just correct but also ideal or useful for a given code base. The SWE-bench folks actually have a lot of really great thoughts about how to make these benchmarks better, but I think probably in the next two, three years, we'll see that.
Yeah. And they're Princeton guys as well, right? Yeah. Yeah, they are. We actually shared a thesis advisor. Oh, no way. That's very cool. Well, Eno, Matan, thank you so much for the conversation. Congratulations again on these results and on launching Factory. We are so excited. Thank you. Thank you very much.
Artificial Intelligence, Technology, Innovation, Autonomous Agents, Software Engineering, Startups, Sequoia Capital