The video provides a detailed explanation of how to fine-tune an open-source large language model (LLM) to become a subject matter expert in a specific field. It emphasizes that you do not need to be a developer or data scientist to perform the fine-tuning, since the process can be run from a personal laptop using a simplified workflow. The speaker introduces InstructLab, an open-source project that facilitates community-driven contributions to AI models, making it accessible for users to curate data and train models with domain-specific knowledge.
Three main steps are highlighted in this process: 1) curating data for the task or domain of interest, 2) generating synthetic data with an LLM to supplement the curated samples, and 3) integrating this knowledge into the model through a method called parameter-efficient fine-tuning. As a running example, a language model is trained to answer questions about the 2024 Oscars accurately, showing the step-by-step implementation of each stage and the resulting improvement in the model's responses.
Key Vocabulary and Common Phrases:
1. specialize [ˈspeʃəˌlaɪz] - (verb) - To focus on a specific area of study or work, becoming an expert in that field. - Synonyms: (focus, concentrate, major in)
Generative AI models are great, but have you ever wondered how to specialize them for a specific use case to be a subject matter expert in whatever your field might be?
2. inference [ˈɪnfərəns] - (noun) - In machine learning, the process of running a trained model on new inputs to produce outputs; more generally, deriving conclusions from known premises. - Synonyms: (conclusion, deduction, reasoning)
This means better responses with smaller prompts, potentially faster inference and lower compute cost, and a model that truly understands your domain.
3. taxonomy [tækˈsɒnəmi] - (noun) - A classification into ordered categories; a schema or structure for organizing information. - Synonyms: (classification, categorization, hierarchy)
We'll point to a taxonomy repository. This is how we're going to structure and organize our data.
4. YAML [ˈjæməl] - (noun) - A human-readable data serialization standard commonly used for configuration files. - Synonyms: (configuration file format, data format)
So this is a YAML-formatted question-and-answer document in plain text.
5. curation [kjʊˈreɪʃən] - (noun) - The process of carefully selecting and organizing information or data. - Synonyms: (selection, organization, arrangement)
Now I mentioned the first step is that curation of data.
6. hierarchical [ˌhaɪəˈrɑːrkɪkl] - (adjective) - Arranged in order of rank or priority. - Synonyms: (ranked, ordered, tiered)
This is kind of a hierarchical structure of different folders to organize the information.
7. synthetic [sɪnˈθɛtɪk] - (adjective) - Artificially produced rather than naturally occurring; here, describing data generated by a model instead of collected from real examples. - Synonyms: (artificial, man-made, generated)
So we're going to use a large language model that's running locally to help us create synthetic data from our initial examples that we've curated.
8. filtration [fɪlˈtreɪʃən] - (noun) - The process of removing impurities or unwanted parts through filtering. - Synonyms: (filtering, purification, refinement)
And what's great is that there's a filtration process because not all data is good data.
9. parameter-efficient [pəˈræmɪtər ɪˈfɪʃənt] - (adjective) - Updating only a small subset of a model's parameters during fine-tuning, saving compute and memory. - Synonyms: (resource-saving, optimized, sparing)
It's time for what's known as parameter-efficient fine-tuning within InstructLab.
10. quantized [ˈkwɒntaɪzd] - (adjective) - Converted to a lower numeric precision so a model takes less memory and runs faster, at a small cost in accuracy. - Synonyms: (compressed, reduced, low-precision)
I'm going to go ahead and serve the quantized version of this model so it can run locally on my machine.
Fine-Tuning Large Language Models with InstructLab
Generative AI models are great, but have you ever wondered how to specialize them for a specific use case, to be a subject matter expert in whatever your field might be? Well, in the next few minutes you'll learn how you can take an open-source large language model and fine-tune it from your laptop. And the best part is you don't have to be a developer or data scientist at all to do this. What you might have noticed after using LLMs is that they're great for general purposes, but for truly useful answers, they need to know the domain they're working in. And the data that's useful for your work is likely useful for an AI model too. So instead of needing to provide examples of behavior in every prompt, for example, "respond as an insurance claim adjuster with a professional tone and knowledge of common policies", you can actually bake this intuition into the model itself. This means better responses with smaller prompts, potentially faster inference and lower compute cost, and a model that truly understands your domain.
So let's start this fine-tuning process with the open-source project InstructLab. InstructLab is a research-based approach to democratizing and enabling community contributions to AI models, and it lets us do this in an accessible way on our laptops, just as we'll be doing today. Now, there are three steps that I want to show you in today's video. First is the curation of data for whatever you want your model to do or to know. Second, fine-tuning a model takes a lot of data, more than we have time or resources to create today, so we're going to use a large language model running locally to help us create synthetic data from the initial examples we've curated. And finally, we're going to bake this back into the model itself using a multi-phase tuning technique called LoRA.
Now, I mentioned the first step is that curation of data, so let me explain how this works in my IDE. I've already got InstructLab installed. The CLI is ilab, and we'll run ilab config init to set up our working directory. We'll set some defaults for the parameters of how we want to use this project, and we'll point to a taxonomy repository. This is how we're going to structure and organize our data. And we'll also point to a local model that we can serve locally to help us generate more examples. And just like that, we're ready to start using InstructLab. Now, I've actually got this taxonomy open and you can see it here on the left. This is a hierarchical structure of folders that organizes the information we want to provide to the model into skills and knowledge. So let's check out a skill. This is a YAML-formatted question-and-answer document in plain text, so anybody can contribute to this model; you don't have to be a data scientist or an ML engineer. I can provide this to essentially teach the model new things. This first example teaches it how to read markdown-formatted tables. We have context, we have a question, "which breed has the most energy?", and an answer. As you can see here, we've got a five out of five for the Labrador.
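For reference, here is a minimal sketch of what a skill contribution like that can look like on disk. The folder path and qna.yaml fields below are illustrative, and the table row beyond the Labrador is made up for the example; the exact schema varies between InstructLab releases, so check the taxonomy repository's documentation for the version you have installed.

    taxonomy/
    └── compositional_skills/
        └── extraction/
            └── tables/
                └── qna.yaml

    # qna.yaml - skill contribution (illustrative schema)
    version: 2
    task_description: Teach the model to read markdown-formatted tables.
    created_by: your-github-username        # placeholder
    seed_examples:
      - context: |
          | Breed    | Energy (1-5) |
          |----------|--------------|
          | Labrador | 5            |
          | Bulldog  | 2            |
        question: Which breed has the most energy?
        answer: The Labrador, with an energy rating of 5 out of 5.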
So we can use this as sample data to generate more examples like this and teach the model something new. Now, this is really cool, but I also want to show you teaching the model about new subjects as well. So the 2024 Oscars has happened, but the model that we're using today doesn't know that we need to fix that. So we're going to ask this model a specific question from this training data that we want to provide specifically, what film had the most Oscar nominations? Now I can do an ILAB model chat in order to talk to a model that I have running locally. This is Merlinite, 7 billion parameters. It's based off of the open source model Mistral. And we'll ask the question which film had the most Oscar nominations? And unfortunately the Irishman is incorrect. The answer is Oppenheimer and it's our job to make this correct. So what we're going to go ahead and do is use this local training information and curation that we've done from our local machine and create more synthetic training data. And also point to this seed document here at the bottom. This is markdown formatted information that we're going to pull during this data generation process that provides more context and information about the specific subject that we're going to teach the model. So let's get started.
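A knowledge contribution looks much like the skill example, but it also references the seed document the generator pulls from. Again, this is only a sketch: the domain, repo URL, commit placeholder, and field names are illustrative and version-dependent.

    # qna.yaml - knowledge contribution (illustrative schema)
    version: 3
    domain: pop_culture
    created_by: your-github-username          # placeholder
    seed_examples:
      - context: |
          The 96th Academy Awards honored films released in 2023.
        questions_and_answers:
          - question: What film had the most Oscar nominations in 2024?
            answer: Oppenheimer, with 13 nominations.
    document:
      repo: https://github.com/example/oscars-knowledge    # placeholder
      commit: <commit-sha>                                 # placeholder
      patterns:
        - oscars-2024.md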
Now it's time for the magic to happen. These large language models, as you might know, have been trained extensively on terabytes of data. What we're going to do is use a teacher model that we've already served locally to generate hundreds or potentially thousands of additional examples based on the seed data we provided. So let's kick this off. We're first going to run ilab taxonomy diff to make sure that everything is formatted as it should be, and we've got back that smiley face, so we're good to go. Now we'll start generating the data. I'll run ilab data generate, and specifically we're going to generate three instructions here using that locally served model, or we could point to one running remotely. It's going to find that Oscars question-and-answer pair we provided and generate more examples like it, enough training data to fully train this model.
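Concretely, those two commands look roughly like the following. The --num-instructions flag is how earlier InstructLab releases capped the number of generated examples; flag names have shifted between versions, so ilab data generate --help is the authority for your install.

    # validate the taxonomy changes (a smiley face means it passed)
    ilab taxonomy diff

    # generate synthetic examples from the curated seed data
    ilab data generate --num-instructions 3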
So this is really cool, because it's creating different variations of our initial training data that we can use to train the model in the end. And as you see here, we've generated three examples; one asks who was nominated for the Best Actor award, and we're getting back an answer listing those actors. And what's great is that there's a filtration process, because not all data is good data. With that newly generated data, it's time for what's known as parameter-efficient fine-tuning within InstructLab. So I'll run ilab model train, and what this is going to do is integrate the new knowledge and skills back into the model while updating only a subset of its parameters, which is why we're able to do this on a consumer laptop like mine.
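The training step itself is a single command. Depending on your release and hardware there are options for the device, number of iterations, and so on; the bare invocation below relies on the defaults set during ilab config init.

    # parameter-efficient fine-tuning on the generated data
    ilab model train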
So we've done some cooking-show magic here to speed things up, since that process might have taken a few hours depending on your hardware. But finally, our result is a newly fine-tuned model specialized with the knowledge we gave it. So let's see it in action. In a new terminal window, I'm going to serve the quantized version of this model so it can run locally on my machine. And we're going to ask it a question, which I think you can guess: what film had the most Oscar nominations in 2024? So let's open up a new window, run ilab model chat, and ask the model that question. What film had the most Oscar nominations?
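Serving and chatting with the result looks roughly like this; the model path below is a placeholder for wherever your build wrote the quantized checkpoint.

    # serve the quantized model locally (path is a placeholder)
    ilab model serve --model-path ./models/merlinite-7b-trained.gguf

    # in a second terminal, chat with the served model
    ilab model chat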
Oppenheimer. So it's really incredible to see the before and after of this fine-tuning process. And the coolest part: we're not AI/ML experts, we're just using open-source projects like InstructLab, among others out there, that can aid with the fine-tuning of large language models. Now, with this fine-tuned model, a popular way to provide external and up-to-date information would be to use RAG, or retrieval-augmented generation. But you can also imagine doing automated or regular builds with this fine-tuned model whenever the static resources change.
So, one more thing. We believe that the future of AI is open. But what does that really mean? Well, the InstructLab project is all about building a community of AI contributors, sharing your contributions upstream, and collaborating on domain-specific models. What does that look like? Imagine you're working at an insurance company. You could fine-tune a model on your company's past claims and best practices for handling accidents to help agents in the field and make their lives easier. Or maybe you're a law firm specializing in entertainment contracts; you could train a model on your past contracts to help review and process new ones more quickly. The possibilities are endless, and you're in control. With what we've done today, you've effectively taken an open-source large language model, trained it locally on specific data without using a third party, and now have a model with that knowledge baked in that you could use on premises, in the cloud, or share with others.
ARTIFICIAL INTELLIGENCE, TECHNOLOGY, INNOVATION, INSTRUCTLAB, MACHINE LEARNING, DATA CURATION, IBM TECHNOLOGY