ENSPIRING.ai: Why Vertical LLM Agents Are The New $1 Billion SaaS Opportunities
In this discussion, Garry, Jared, and Diana introduce Jake Heller, founder of Casetext, who successfully created an AI-driven legal tool. The conversation revolves around the revolutionary impact of GPT-4 on the tech industry, as Jake's legal research company rose to prominence and was acquired for a significant sum after integrating this advanced AI technology. The narrative explores Jake's strategy and tenacity in pivoting his company toward AI innovation, achieving massive growth and a lucrative exit.
The speakers delve into Jake's journey, emphasizing his early engagement with AI technology and his foresight in betting the company on its potential. They discuss the challenges of transitioning a whole team to focus on AI development within a tight timeframe, and how shifts in industry perception changed the market dynamics. The video highlights the potential of AI to fundamentally change workflows within the legal sector and beyond, making it a pivotal topic for anyone interested in the future of AI and business strategy.
Key Vocabulary and Common Phrases:
1. godlike feeling [ˈɡɒdlaɪk ˈfiːlɪŋ] - (phrase) - An awe-inspiring or superior feeling, often relating to something extremely impressive. - Synonyms: (divine sensation, superior awe, overwhelming admiration)
This is our first ever experience talking to this godlike-feeling AI that was all of a sudden doing these tasks that would take me, when I practiced, a whole day.
2. revolution [ˌrɛvəˈluːʃən] - (noun) - A sudden or significant change in conditions, attitudes, or operation. - Synonyms: (transformation, upheaval, change)
You were one of the first people to actually realize this is a sea change, a revolution.
3. annotate [ˈænəˌteɪt] - (verb) - To add notes or comments to a text or diagram to explain or comment on it. - Synonyms: (comment, note, elucidate)
And the first versions of it were actually sort of annotated versions of case law, actually.
4. exonerate [ɪɡˈzɒnəˌreɪt] - (verb) - To relieve someone from blame, fault, or wrongdoing. - Synonyms: (absolve, clear, acquit)
...find the piece of evidence that was gonna exonerate my client.
5. altruistically [ˌæltruˈɪstɪkli] - (adverb) - In a manner that shows a selfless concern for the well-being of others. - Synonyms: (selflessly, benevolently, unselfishly)
They're adding content for free and altruistically. Lawyers bill by the hour.
6. incremental [ˌɪnkrəˈmɛntl] - (adjective) - Relating to or denoting an increase or addition, especially one of a series on a fixed scale. - Synonyms: (gradual, step-by-step, progressive)
A lot of what we did were, relatively speaking, kind of incremental improvements on the legal workflow.
7. existential [ˌɛɡzɪˈstɛnʃəl] - (adjective) - Relating to existence; concerning or relating to existence, especially human existence. - Synonyms: (philosophical, ontological, essential)
I mean, we saw people go through, like, existential crises, live on Zoom calls.
8. proprietary [prəˈpraɪəˌtɛri] - (adjective) - Owned by a private individual or corporation under a trademark or patent. - Synonyms: (exclusive, patented, owned)
In our case, proprietary datasets like the law itself and our annotations to the law that we added automatically.
9. totem [ˈtoʊtəm] - (noun) - An object or symbol representing a person, family, or community, often with spiritual significance. - Synonyms: (symbol, emblem, icon)
Like, one totem, or one example along the way, is when GPT-3.5 came out.
10. hallucinate [həˈluːsɪneɪt] - (verb) - To see or perceive something not present; often used metaphorically in AI contexts to describe the generation of false or inaccurate information. - Synonyms: (imagine, envision, perceive incorrectly)
You can’t hallucinate, you can’t even make the wrong kinds of assumptions.
Why Vertical LLM Agents Are The New $1 Billion SaaS Opportunities
This is our first ever experience talking to this godlike-feeling AI that was all of a sudden doing these tasks that would take me, when I practiced, a whole day, and it's being done in a minute and a half. The whole company, all 120 of us, did not sleep for those months before GPT-4. We felt like we had this amazing opportunity to run far ahead of the market. That's why you're the first man on the moon. Yeah.
Welcome back to another episode of the Lightcone. I'm Garry. This is Jared and Diana. Harj is out, but he'll be back on the next one. And today we have a very special guest, Jake Heller of Casetext. I think of Jake as a little bit like one of the first people on the surface of the moon. He created Casetext, I think, more than 11 or 12 years ago, actually. And in the first ten years, you went from zero to a $100 million valuation, and then, in a matter of two months after the release of GPT-4, that valuation went to a liquid exit to Thomson Reuters for $650 million. So you have a lot of lessons about how to create real value from really large language models. I think you were, of our friends in YC, one of the first people to actually realize this is a sea change, a revolution. And not only that, we're going to bet the company on it. And you were super right. So welcome, Jake. Happy to be here.
One of the cool things I think about Jake's story, and the reason why we wanted to bring him on today, is that if you just look at the companies that good founders are starting now, it's a lot of vertical AI agents. I mean, I was trying to count them: in S24, literally dozens of the YC companies in the last batch were building vertical-specific AI agents. And I think Jake is the founder who is currently running the most successful vertical AI agent. It's by far the largest acquisition, and it's actually deployed at scale in a lot of mission-critical situations. And the inspiration for this was we hosted this retreat a few months ago and Jake gave an incredible talk about how he built it. And we thought that it would be super useful for people who watch the Lightcone, who are interested in this area, to hear directly from one of the most successful builders in this area how he did it.
So how did you do it? Well, first of all, like a lot of these things, there's a certain amount of luck. Over the course of our decade-long journey, we started investing very deeply in AI and natural language processing, and we became close with a number of different research labs, including some of the folks at OpenAI. And when it came time for them to start testing early versions, we didn't realize it was GPT-4 at the time, but we got a very early kind of view of it. And so months before the public release of GPT-4, we as a company were all under NDA, all working on this thing. And I'll never forget the first time I saw it. It took maybe 48 hours for us to decide to take every single person at the company and shift what they were working on, from the projects we were working on at the time to 100% of the company all working on building this new product we called CoCounsel, based on the GPT-4 technology.
How many people was that? We were about 120 people at the time. So you took 120 people and completely changed what they were all working on? Yes, yes, yes. In 48 hours? Yes. And for the people watching: Casetext, originally, I mean, had always been in the legal space. You're a lawyer, and you built something for yourself. And the first versions of it were actually sort of annotated versions of case law, actually. Yeah, that's exactly right. So, in the very early origins of the company, the mission of the company, what we were always focused on, is how can we build something that brings the best of technology to the legal space? As a lawyer, I actually liked the job a lot. The part of my job that I hated the most was when I had to interact with the technology that lawyers have to use regularly to get the job done.
I remember thinking, and this is like 2012, when I was at a law firm: if I wanted to do something really trivial, I had, like, a new iPhone at the time, I could go on Google and find, like, movie times, or where's the closest open Thai restaurant with vegetarian options? That was super easy. But if I wanted to find the piece of evidence that was gonna exonerate my client and make it so he doesn't have to go to jail for the rest of his life, or the key legal case that'll help me win a billion-dollar lawsuit? Well, that's gonna be, like, five days in a row until 5:00 a.m. every day. I was like, there's gotta be a better way.
What is the process as a lawyer? You would have to read the stacks and stacks of documents. Pretty much, yeah. Right before I started practicing, before everything went virtual or online, you would literally be in a basement with bankers boxes full of documents, reading them one by one by one to try to find all the emails in a company like Pfizer or Google to see if there was potential fraud. And then if you wanted to find case law, slightly before my time, you'd literally go to the library and open up books and just start reading. And new products were coming out that were some of the first web-based research tools, but they were pretty clunky.
It was just hard to find the relevant information. You couldn't do Ctrl-F for any of this stuff. Basically not. Basically not. Yeah. And what was interesting about your background is you also happen to be the rare breed of having computer science training as well. So this must have driven you nuts. Yeah, exactly. I mean, at the law firm, I'll never forget, I was building browser plugins to go on top of the tools I was using just to make my life more efficient and effective. Actually, one of the reasons I left the law firm to start a company and apply to YC was I got in trouble with the general counsel, who thought, like, hey, why are you spending all your time doing this tech stuff? And who also made it very clear at the time that my law firm owned all that technology. So I decided to do something different.
So do you want to tell us a little bit about the first ten years of Casetext, the sort of long slog in the pre-LLM era? One of the lessons here, I think, that I took away from that time period is that when you start a company, you may not get the solution exactly right. You may have the right kind of general direction: you know there's a problem, you're trying to solve it, but it could take a very long time to figure out what the solution is. For us, for example, we saw that there was this kind of combined issue of bad technology in the legal sphere, but also that a lot of lawyers use content to do things like research and understand what the law is. And so we thought, okay, well, we can do the technology better, but how are we going to get this content? And we spent, like, a couple of years trying to get, as Garry said, lawyers to annotate case law and to provide information.
So it was like a UGC site, like user-generated content. Yeah, that was a big focus of ours, the kind of one-two punch of better technology but also better content. At the time, our heroes were, like, Stack Overflow and Wikipedia and GitHub and other kind of open source or UGC kind of websites, and it was a total failure. We could not get lawyers to contribute their time and information. And I think these are just different populations. The typical Wikipedia editor has more time on their hands than they know what to do with, and so they're adding, not all, but many of them are adding content for free and altruistically. Lawyers bill by the hour. Their time is incredibly valuable. They're always running out of time. They had no time to contribute to some UGC site, so we had to pivot.
We started investing very deeply. At the time, it was not called AI; it was just natural language processing and machine learning. And we saw that, first of all, we didn't need to crowdsource this UGC to replicate some of the best benefits of what our competitors had in these big content databases. Some of it you could basically do, even then, on an automated basis. Then also we were starting to create these user experiences that were a lot better than what our competitors could offer, based on what at the time seems kind of quaint AI stuff, like the same recommendation algorithms that power Pandora's and Spotify's recommended music. They basically look at how this song relates to that song.
People who listen to this also listen to this and this and this, right? Similarly, we looked at, okay, cases that cite to other cases: they all reference earlier opinions, and they kind of build out this network of citations. And we found ways that we could check a lawyer's work. They'd upload their work so far, and it would be like, well, everybody who talks about this case talks about this case too, and you missed that. So cool experiences like that. But the truth is, until the very end, until CoCounsel, a lot of what we did were, relatively speaking, kind of incremental improvements on the legal workflow. One of the things that's kind of weird about this is when there's just an incremental improvement, it's actually pretty easy to ignore. A lot of our clients, they would never say this literally, but you get this impression.
You walk into their room, their office, and you try to pitch them a product and you say, this is going to change everything about the way you practice. And they go, well, I make $5 million a year. I don't want anything to change. I do not want to introduce anything that has the opportunity to make my life at all worse, or potentially worse, or potentially more efficient, because they bill by the hour. It was really only much later, when ChatGPT came out. At the time, we were privately and secretly working on GPT-4. ChatGPT came out, and all of a sudden every lawyer in America, probably in the world, saw: oh my God, I don't know exactly how this is going to change my work, but it's going to change it very substantially.
Like, they could feel it. And the same guys and gals who were telling us, I make $5 million a year, why would I change anything about my life, were like, I make $5 million a year, this is going to change something, I need to be ahead of this. The technology itself, and we'll get into it in a second, really changed what we could build for lawyers, but also the market perception of what was necessary really changed as well. And for the first time in our ten years, even before we launched CoCounsel publicly based on GPT-4, they were calling us: we know you work on AI, we need to get on top of this, what can you show us, what can we work on? And I think it's because the change was not incremental anymore, it was fundamental. And all of a sudden they had to pay attention. They could not ignore it.
I guess the mental model I have for you is there's this concept of the idea maze. The founder goes into the beginning of the maze and they're just feeling around, actually in the arena, talking to customers, learning where the walls are, which path to go: should I go left or right? As is actually common for startup founders, in the idea maze you will actually reach a dead end, and then usually you have to pivot. And I think you have a very interesting story, because you were sort of towards the end of maybe one of the paths that weren't going to get you all the way to product market fit. But then LLMs dropped, and then it's like the maze got shaken up. Yeah. And then you were actually much closer to product market fit than absolutely anyone else. And so that's why... What a crazy time.
Yeah, it's exactly right. That's why you're the first man on the moon. Yeah, I think there's really something to that. And the thing is, each time we progressed through that maze, it felt like maybe now we're at product market fit. We were making real revenue before we launched CoCounsel, and we had real customers and they said really great things about us. I keep on thinking about this article written by Marc Andreessen in the early 2000s. I think it's called The Only Thing That Matters. In it, he describes what it feels like to have product market fit. He lists things like: your servers will go down, you can't hire support people and salespeople fast enough, you're going to eat free for a year at Buck's, the kind-of-famous Woodside diner where a lot of VCs will take you.
I read that early on in my career, and I was like, okay, well, that's hyperbolic. But when we launched CoCounsel, it was literally exactly that. Our servers were going down. We could not hire support people fast enough. We couldn't hire salespeople fast enough. I ate a lot at Buck's. Before, it was a really big day if we were in the ABA Journal or some other legal-specific publication; now we were on CNN and MSNBC. All of a sudden, everything changed. And that's what real product market fit looks like. Marc was, even in 2005 or whenever the article came out, exactly right about what it looked like in 2023. Can you talk about that crazy time?
Because it was only two months from when you launched CoCounsel to getting bought for $650 million. So what happened in those two months? Well, to be clear, the transaction only closed six months after we launched, but the conversations started two months in. So we started building CoCounsel. And just for background purposes, the idea we came up with, again, like 48 hours, like a weekend, after seeing GPT-4 is something that doesn't sound crazy today, but it felt crazy at the time, which is this AI legal assistant. By which we mean it's almost like a new member of the firm. You can just talk to it, not unlike how you might talk to something like ChatGPT today, and give it tasks like: I need you to read these million documents for me and tell me if there's any evidence of fraud happening in this company.
And then within a couple of hours, it's like, I've read all the documents, here's what the summary is. Or summarize documents, or do legal research and put together a whole memo after researching hundreds or thousands of cases, answering the lawyer's initial research question. And so in that sense, it was this really powerful extension of the workforce of these law firms. That was the concept from the beginning. And we made a very early initial version of it. Under our agreement with OpenAI, we could not be public about this product, but they did let us extend the NDA to a handful of our customers, and so we started having our customers use it. And so for months before GPT-4 was launched publicly, we had a number of law firms using it. They had no idea they were using GPT-4, but they were seeing something really special.
This is actually even before ChatGPT. So this is their first ever experience talking to this godlike-feeling AI that was all of a sudden doing these tasks that would take me, when I practiced, a whole day, and it's being done in a minute and a half. As you might imagine, it was nuts. First of all, the whole company, all 120 of us, did not sleep for those months before GPT-4 was publicly launched and we therefore could publicly launch the product. We felt like we had this amazing opportunity to run far ahead of the market. Something really beautiful happens when everybody's working super, super hard, which is that you iterate so quickly. And actually, I still see some companies out there that are stuck where we were in the first month of seeing GPT-4, right? And I think it's because they're just not as intensely focused and engaged as we were able to be during those, like, six months or so before the public launch of GPT-4.
To do this transition, you kind of had to shake the company. You went into deep founder mode, because there was a lot of pushback from employees, like: oh, this thing was working, why should we throw ourselves into the deep end of AI? Tell us about that founder mode moment for you. So, first of all, this is especially true if you're running a business for ten years, because they've seen you wander through that maze and bump into dead ends. And a lot of those folks have been there for most or all of that time, watching me as the founder saying, we're definitely going this direction, it's definitely going to work. And sometimes it doesn't. And you only get so many of those with employees.
Right? So this was maybe my last one that I had with some of these folks. And they're like, here Jake goes again with this crazy new technology and some idea we're gonna invest deeply in. And, yeah, it took some work to convince people. And if you imagine what some of the different roles are: if you're in a go-to-market role, if you're selling or marketing a product, and we're growing 70, 80% year over year, we're between $15 and $20 million in ARR, things weren't, like, terrible, right? That's great. Yeah, we were doing great. Yeah. But, like, so they were like, what? Why are we blowing this up? And the board, you know, some of the members, like, got this immediately, and some of them had to be persuaded, right? And about the founder mode moment: one thing that really worked for me is I led the way by example. I built the first version of it myself.
Even with a 120-person company with, like, a whole bunch of engineers and lawyers and stuff. Like, before that, you opened up your IDE and actually built the thing yourself. Oh, yeah. And part of it was the NDA only extended at first to me and my co-founder, and that was it. That was a blessing, then. Yeah, exactly. It turned out to be perfect. And even after the NDA got extended a little bit, we kept it pretty small at first. For the first little bit of time, I made up my mind within 48 hours that the whole company was going to do this. But we actually only told the company, I think, a week and a half after we first got access. And during that week and a half, we built the very first version, a prototype version of this. And again, I'll never forget this.
The timing is just so funny. Like, we saw it on, like, a Friday. We had it all weekend long; we were working with it. And then Monday was an executive offsite where everybody came, all my executives came, and they expected that we were going to be talking about how we're going to hit our sales target for the next quarter. And it was: we're talking about none of that. We are talking about something totally different right now. Let me show you something on my laptop. So, yeah, I built the first version myself, but going through that process, me and then a handful of other people, I think, was really helpful. And we also brought in customers early, and that helped convince a lot of people.
As soon as a skeptical sales or marketing or whatever person, or even an engineer, was on the other end of a Zoom call where a customer was reacting to the product in real time and giving us their honest reactions, and they saw the look on the customer's face... again, you have to imagine, it's almost hard to imagine, that the world was pre-ChatGPT, but some of these people were seeing that exact idea for the first time, and they were just blown away. And that really changed minds quickly. I mean, we saw people go through, like, existential crises, live on Zoom calls. Like, oh, you could see their expression change. Exactly. In all kinds of ways. It's like, what am I going to do? The very common reaction amongst the senior attorneys we showed it to was like, well, I get to retire soon, like, you know, I won't have to deal with this.
And some of this was really driven by GPT-4 coming out. Like, you had access to GPT-3, you had access even to GPT-2. I think we were in a close relationship with a lot of the labs, including OpenAI, and they kept on showing us stuff kind of early on in its development. And they're like, well, can you build something with this for legal? And every time we were like, no, this sucks. By the time it got to GPT-3 and 3.5, it was like, okay, well, this is plausible-sounding English, and it sounds kind of like a lawyer, so kudos to that. But it is just making stuff up wildly. It's very hard to connect it to a real use case, especially in legal, where it's so important that you actually get the facts right.
You can't hallucinate; you can't even make the wrong kinds of assumptions. And we had to do a lot of work with those earlier models to even get them close to usable, and they just weren't really. I mean, one totem, or one example along the way, is when GPT-3.5 came out, a study was run, and it showed that GPT-3.5 scored in the 10th percentile on the bar exam. So it did better than some people, actually. But those 10%, yeah, were probably the ones who were just filling it out randomly. Basically, when we got early access to GPT-4, we were like, let's run the study again. And we worked with OpenAI; we were like, we want to confirm this test is not in the training set. And it wasn't; it was a totally new test to it.
And on the test we ran, it did better than 90% of the test takers. So this is a big difference. And also, we started running some tests like: okay, here's four or five cases to read; using those cases, write a memo responding to this question. And we did a lot of prompt work to get it to essentially just do it accurately, to cite the actual things in the context that we gave it, and not make things up. And we were like, okay, well, this is very different than what we saw before. So it was a big moment for us.
And honestly, I'm not sure what the mindset was of the researchers we were working with, but it almost felt like, by the time we were having that meeting, it was one of those other meetings we had had in the past, where we were getting ready to say, this is not going to work for legal, keep on trying. I think they saw us go through maybe some form of the existential crisis on that call that our customers did. We were like, oh, wait, this is super, super, super different. I guess today we have o1. We have chain-of-thought reasoning. I think a lot of people look at it as it's not merely the text itself, but also the instructions that lead up to the workflow. But way at the beginning, nobody knew any of this stuff. How did you start?
You had your tests that you had written for previous versions of the model. They outperformed. But then there's this moment where you say, okay, well, now it's something, but what do we do next, and how do we do it? So the process that we started with then, it's actually not too dissimilar to what we're doing today. It started with a question of, okay, well, what problem are we trying to solve for the user? The user wants to do research, legal research, and they want a memo answering their question with citations to the original source. That's the end result. Then we're like, okay, well, how do we go from that end result? Working backwards, almost, what would it take to get there?
And what ends up happening a lot with the things that we built for CoCounsel, we call them skills, which felt very unique at the time; I think a lot of companies now call their AI capabilities skills. So when you're building these skills, it turns out it usually takes a lot of work to go from, say, the customer inputs something, say a set of documents or a question or what have you, to the end result that they're looking for. The way that we thought about it was: how would the best attorney in the world approach this problem? In the case of research, for example, the best attorney would get the request, say, from a partner, and then break that request down into actual search queries that run against these platforms. Sometimes they use special search syntax.
It actually looks almost like SQL. From the English-language query, you have to break it down into these different search queries, maybe a dozen different search queries, if you're being really diligent. And then they'd execute the search queries against these databases of law, and they'd come back with, say, 100 results each. And then the most diligent, best attorney would sit down and just read every single one of those results that come back: all the case law, statutes, regulations. And you'd start to do things like make notes and summarize and kind of compile an outline of what your response might be. Like, line by line or paragraph by paragraph? Actually, yeah, 100%. And you just start pulling out those insights you're getting from what you're reading. And then finally, based on all of that work and all the citations you've gathered, et cetera, you put together your research memo.
We were like, okay, well, each one of those steps along the way, for the vast majority of them, those were impossible to accomplish with previous technology, but now they're prompts. Think step by step. Yeah. Think step by step. Yeah, exactly. But we actually broke each of them down. So getting to the final result may be a dozen or two dozen different individual prompts, each of which might, by the way, be think-step-by-step themselves. For each of those prompts, as part of this chain of actions you take to get to the final result, we had a very clear sense of what good looks like. We had a series, like a battery, of tests before, but this got way more intense, where we'd write at first maybe a few dozen tests, and then a few hundred and a few thousand, for every single one of those prompts.
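To make the shape of this concrete, here is a minimal sketch of that kind of chained pipeline: one narrow prompt per step (decompose the question into queries, read each result, compile the memo) rather than one giant prompt. It assumes the OpenAI Python client; the model name, the prompt wording, the three-step breakdown, and the run_search helper are illustrative placeholders for this sketch, not Casetext's actual implementation.

```python
# Hypothetical sketch of a research "skill" decomposed into a chain of prompts.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # placeholder model name

def complete(prompt: str) -> str:
    """One narrowly scoped LLM call in the chain."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def decompose_question(question: str) -> list[str]:
    # Step 1: turn the English research question into a dozen search queries.
    out = complete(
        "You are a diligent legal research associate. Break the question below "
        "into up to 12 search queries for a legal research database, one per line.\n\n"
        f"Question: {question}"
    )
    return [q.strip() for q in out.splitlines() if q.strip()]

def summarize_result(question: str, case_text: str) -> str:
    # Step 2: read each result and note only what bears on the question.
    return complete(
        "Summarize how the following case bears on the research question, citing "
        "the case by name. If it is irrelevant, answer IRRELEVANT.\n\n"
        f"Question: {question}\n\nCase:\n{case_text}"
    )

def write_memo(question: str, notes: list[str]) -> str:
    # Step 3: compile the per-case notes into a memo with citations.
    return complete(
        "Using only the notes below, write a research memo answering the question. "
        "Cite only cases that appear in the notes; do not invent citations.\n\n"
        f"Question: {question}\n\nNotes:\n" + "\n\n".join(notes)
    )

def research(question: str, run_search) -> str:
    """run_search is assumed to query a legal database and return case texts."""
    notes = []
    for query in decompose_question(question):
        for case_text in run_search(query):
            note = summarize_result(question, case_text)
            if "IRRELEVANT" not in note:
                notes.append(note)
    return write_memo(question, notes)
```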
So if the job to be done at the very beginning of this research process, for example, is taking the English-language query and breaking it down into search queries, we had a very clear sense of what good search queries look like and wrote gold-standard answers: given this input, this is what the output looks like. Our prompt engineers, and I was one of them at the very beginning, we were all just in it together, would basically write the test first, and then write these English-language prompts to try to get it so that out of 1,200 times it got the right answer 1,199 times, or what have you. Sort of like the test-driven development approach from software engineering, applied to prompting? That's exactly right. And the funny thing is, I never really believed in test-driven development before prompting. I was like, the code works, it does it, it's fine, you'll see it.
But with prompting, I actually think it becomes even more important, because the nature of these LLMs is that they might go in crazy directions unexpectedly. And so you might very easily add in a set of instructions to solve one problem you're seeing with one set of tests, and then break something in another set of tests. And so that exact kind of theory of test-driven development applies 10x more, I'd say, in the world of prompting. There are a lot of naysayers saying that a lot of companies are just building GPT wrappers and there's not a lot of IP getting built, but actually there's a lot of finesse to what you're explaining. Can you tell us about all of that and how much more there is to be built? Oh, yeah. I mean, I think the thing is, when you're actually trying to solve a problem for a customer and actually doing the job, in our case, of what a young associate might do, and doing it really well, there are many layers of things you have to add in to actually get the job done.
And by the time you add that all up, you're not, like, a GPT wrapper. You're a full application that may include, in our case, proprietary datasets, like the law itself and our annotations to the law that we added automatically. It may include connections into customer databases; in our case, in legal, they have these very specific, legal-specific document management systems, so connecting into those is very important. It may include something as subtle as how well you OCR, and what OCR programs you use, and how you set those up. One of the tasks that CoCounsel does, for example, is reviewing large sets of documents. Once you start working with a lot of documents, you see stuff with handwriting all over it, and pages that are tilted in the scan. And there's this crazy thing that they do in law where they print four pages on one page to save room, and the OCR is going to read it straight across, but the pages actually go 1, 2, 3, 4.
By the time you've dealt with all the edge cases, frankly, not even before you hit the large language model, everything else up to the large language model, there may be dozens of things you've built into your application to actually make it work and work well. Then you get to the prompting piece: writing out tests and very specific prompts, the strategy for how you break down a big problem into step-by-step-by-step kind of thinking, how you feed in the information, how you format that information the right way. All of that also becomes your IP. And it's very hard to build, and therefore very hard to replicate. Which is all the business logic, which is what even all the very successful SaaS companies in a very specific domain have: you need very, very custom, esoteric, niche integrations, like plugging into this esoteric law database. Yeah, absolutely.
Two things I think about all the time. It's like, basically all SaaS for a while was just like a SQL wrapper, right? If you think about very successful companies like Salesforce, they built that business logic around basically just databases and connections between tables in a database. And sometimes it's bridging that gap between something that either a very technical person can do but most people can't, and making it accessible, or bridging that gap between something that almost works. You can do a lot of cool demos in ChatGPT without building a line of code, but that almost works, and works 70% of the time. But going to 100% of the time is a very different task. People will pay $20 a month for the 70%, and maybe $500 or $1,000 a month for something that actually works, depending on the use case. Right.
So there's a lot of value gained going that last mile, or 100 miles, or whatever it is. Yeah. Can you talk about how you went from 70% to 100%? Because I think the other knock on this technology that we hear a lot is like, oh, these LLMs hallucinate too much, they're not accurate enough for real-world use. But as you said earlier, the use case that you're working on is a mission-critical use case. There's, like, a lot at stake if the agent gives bad information to lawyers working on important court cases. How did you make it accurate enough for lawyers, who are conservative by nature, to trust it? This test-driven development framework, first of all, goes a long way, because you can start seeing patterns in why it's making a mistake, and then you add instructions against that pattern. And then sometimes it still doesn't do the right thing.
And then you really ask yourself, okay, well, was I being super clear in my instructions? Am I including information it shouldn't see, or too little information for it to really get the full context? And usually these things are pretty intelligent, so usually you can kind of root-cause why you're failing certain tests and then build to a place where you're actually passing those tests and just getting it right. And one of the things we learned is that if it passes, frankly, even like 100 tests, the odds that it will handle the next hundred, on any random distribution of user inputs, accurately are very high.
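As a rough illustration of that test-driven loop, here is a minimal sketch of a gold-standard test harness for a single prompt in the chain (the query-decomposition step from the earlier sketch). The test cases and the simple term-coverage scoring rule are invented for illustration; a real suite would have hundreds or thousands of cases and much stricter checks.

```python
# Hypothetical gold-standard test harness for one prompt in the chain.
from dataclasses import dataclass

@dataclass
class PromptTest:
    question: str               # the English-language research question
    expected_terms: list[str]   # terms a good set of search queries must cover

GOLD_TESTS = [
    PromptTest(
        question="Can a landlord in California withhold a security deposit for normal wear and tear?",
        expected_terms=["security deposit", "normal wear and tear", "California"],
    ),
    # ... in practice, hundreds or thousands of these per prompt
]

def run_suite(decompose_question) -> float:
    """Score the prompt under test against every gold case; return the pass rate."""
    passed = 0
    for test in GOLD_TESTS:
        queries = " ".join(decompose_question(test.question)).lower()
        if all(term.lower() in queries for term in test.expected_terms):
            passed += 1
        else:
            # Failures become the raw material for the next prompt revision:
            # find the pattern, add an instruction against it, re-run the suite.
            print(f"FAIL: {test.question}")
    return passed / len(GOLD_TESTS)
```

The point of keeping the whole suite around is the failure mode Jake describes: an instruction added to fix one failing test can quietly break tests elsewhere, and only re-running everything catches it.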
One of the things that strikes me that is tricky: many founders we work with are very tempted to just raw dog it. No evals, no test-driven development, just vibes-only prompt engineering. And maybe, I mean, you switched over to this very quickly, then. Was it just obvious from the beginning? You were like, we just can't do it that other way, we should not raw dog any of these prompts. Yeah. I think the biggest thing, first of all, is it depends on the use case. For a lot of the things that we were working on, for better or for worse, there was a right answer. And if you get the wrong answer, lawyers are not going to be happy about it.
I had been a lawyer myself, but I'd also been serving lawyers for a decade. Every time we made the smallest mistake in anything that we did, we heard about it immediately. I had that voice in my head. Maybe as you were going through this process, that was the learning from the ten years of slogging through the pre-LLM era: you were like, no, it has to be 100%. Oh, yeah. That's probably true of way more domains than we realize, actually. It could be, because the other thing that we were thinking about a lot is you can lose faith in these things really quickly, right? You have one bad experience, especially if your first experience is bad, and you're like, you know, maybe I'll check on this AI stuff a year from now, especially if you're, like, a busy lawyer, not a technologist.
So we knew you had to make that first encounter, the first week, really, really work for the lawyer, or else they're not going to invest in it deeply. So let's talk a bit about OpenAI o1, because it is a very different model. I mean, up to this point, with GPT-4 and all of that previous generation, the analogy in terms of the intelligence is sort of the System 1 thinking in the Daniel Kahneman sense, right? He has this whole economic theory; he won the Nobel Prize for it. System 1 thinking is just very fast; it's the kind of decisions that humans make very intuitively, based on patterns. And LLMs are fantastic at that, but they're terrible at the executive function, because what I'm hearing in all the stuff that you're describing is that you're basically giving the LLM executive function. It's like, how do you think?
Right. How do I manage you? It's really that slower thinking. And I think o1 is exciting. We haven't seen things built with it yet, because it just got announced a few days ago, right? I think it's getting to that System 2 thinking. And I think this has been a big area of research, which I saw a lot of at NeurIPS a year ago, where a lot of the researchers were excited to unlock this, because this is the missing piece toward AGI. Let's talk about what your thoughts are on o1 and how this changes things. So, first of all, I think o1 is a very impressive model. Like with other things, we gave it the kinds of tests that we knew earlier models were failing.
And the degree of, it's not just math, the degree of thoroughness, precision, and intelligence applied to some of these questions. And sometimes it's stuff you wouldn't expect to need a super smart model for. Like, in one of the tests that we run, we give it a lawyer's real legal brief, but we edited some of that lawyer's quotations of the case very slightly, to make it a wrong quotation or a wrong kind of summarization of the case. So it's, like, a 40-page legal brief, and you alter things: just adding a word like "not" can change the meaning of something entirely, right? And then we give the full text of the case as well to the AI, and we say, well, what did, you know, what did the lawyer get wrong about this case, if anything?
And literally every LLM before that would be like, nothing, it's perfectly right. It's just not a precise thinker about some of the very nuanced things that we altered about the brief to make it slightly wrong. And o1 got it. Like you said, it actually thinks for a while; it sits there for a minute and you're like, is this thing on? But then it starts answering, and it's like, oh, well, you changed an "and" to a "neither nor." So those are the kinds of tests that you'd kind of expect even, frankly, earlier LLMs to be able to pass, but they just could not. And all of a sudden o1 is doing even these things that take precise, detailed thinking.
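Here is a small sketch of how that kind of perturbation test can be constructed: take a faithful quotation, flip its meaning with a minimal edit, and count the run as a pass only if the model flags the altered language. The example texts, the specific negation rule, and the simple substring check are all illustrative assumptions; a real harness would grade the answer more carefully.

```python
# Hypothetical perturbation test: flip the meaning of one quotation in a brief
# and ask the model what, if anything, the lawyer got wrong about the case.

def perturb_quotation(quotation: str) -> str:
    # Minimal meaning-flipping edit, e.g. "may recover" -> "may not recover".
    return quotation.replace("may recover", "may not recover", 1)

def build_test(brief: str, case_text: str, quotation: str) -> dict:
    bad_quote = perturb_quotation(quotation)
    return {
        "prompt": (
            "Here is a legal brief and the full text of the case it cites.\n\n"
            f"BRIEF:\n{brief.replace(quotation, bad_quote)}\n\n"
            f"CASE:\n{case_text}\n\n"
            "What, if anything, did the lawyer get wrong about this case?"
        ),
        # The model passes only if it flags the altered language.
        "must_mention": "may not recover",
    }

def check_answer(answer: str, test: dict) -> bool:
    # Simplistic pass criterion for the sketch; a real suite would be stricter.
    return test["must_mention"] in answer
```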
Obviously, we don't have the internals on how o1 really works. We have this broad idea of chain of thought, seemingly. We know that if OpenAI had a giant corpus of internal monologue of people thinking through things step by step, o1 would be even a lot better. It sort of rhymes with the thing you did to put your first step on the moon. It rhymes with: break it down into chunks where you can get to 100% accuracy, instead of just throwing it all into the context window and hoping maybe magically it will work. Do you think that's what's happening there? I think there's a shot that they've maybe changed what their contractors are doing.
Instead of just doing input in, answer out, they're doing input in, how would I think about solving this problem, and then answer out. But then the interesting thing is, it's limited by the intelligence of the people writing those instructions. And one of the things that we're investigating, for what it's worth, with o1 is: can we prompt it to tell it what to think about during its thinking process, and inject, again, we've hired some of the best lawyers in the country, how some of the best lawyers in the country would think about solving this problem? We have no conclusive evidence one way or the other yet that this dramatically improves things.
It's so early, and just not enough time has passed yet. There's a chance that one of the new prompting techniques with o1 is teaching it not just how to answer the question, or what examples of a good answer look like, but how to think. And I think that's another really interesting opportunity here: injecting domain expertise, or just your own intelligence. I'm just so thankful, because I think you're sort of sharing the breadcrumbs, and there are a great many other spaces where this technology is just beginning. I mean, you go to pretty much any company, and people have no concept of what's just happened. They actually literally still repeat all of those sort of tired tropes of, oh, you better be fine-tuning, or whatever. I mean, these things are just not connected to what we're seeing day to day with startups and founders trying to create things for users.
What I'm kind of glad for is that we get to actually share this knowledge, because even the things we talked about: hey, you should probably do evals; there's a lot of alpha in getting to 100%, not just 70%. These are sort of the breadcrumbs that will actually go on to create all of the billion-dollar companies, maybe thousands of them, actually. We hope so. I think you're starting to see a lot of other fields, like law, really level up when you don't have to spend millions of dollars and six months literally in a basement reading document by document by document, right? When you actually can just get past that and get to the results, now you're thinking strategically and intelligently. And the unlock for these companies, I mean, they currently pay, again, millions of dollars in salaries for these jobs to be done, each of them, right? So for any company to come out with an AI that can do even 80% of that, the value is really there.
And I just want to encourage people to not kind of give up based on those tropes, right? Like, oh, it hallucinates too much, it's too inaccurate, it's too whatever. For every one of those, there's a path, and you can do it. And there's some good news in that: you know what, the jobs aren't going to go away. They'll just be more interesting. That's what I think. Yeah. Well, with that, we're out of time. But Jake, thank you so much for being with us. Thanks for having me. See you guys next time.
Innovation, Entrepreneurship, Technology, AI Legal Tools, Artificial Intelligence, Business Strategy, YC, Y Combinator, SaaS, Vertical AI, OpenAI, O1, ChatGPT, Jake Heller, Garry Tan, Diana Hu, Jared Friedman, Casetext, Thomson Reuters