The video explains how to apply domain-specific knowledge throughout the LLM lifecycle by involving a wider range of organizational roles and tools. It first describes the traditional approach, in which data engineers curate data from SQL or NoSQL databases that data scientists then use for model training. The limitation of this approach is that it leaves out domain-specific knowledge, which can be crucial to model performance.

A novel approach is proposed in which project managers and business analysts contribute domain-specific knowledge stored in everyday document formats, managed with a tool like InstructLab. InstructLab generates synthetic data that rephrases the same knowledge in many ways, strengthening the model during training. The video also covers deployment on AI platforms such as the Kubernetes-based OpenShift and highlights the importance of the MLOps lifecycle for smooth deployment and management.

Main takeaways from the video:

💡 Domain-specific knowledge from various organizational roles can significantly enhance the LLM lifecycle.

💡 Tools like InstructLab and platforms like OpenShift can streamline the creation of synthetic data and the management and deployment of models.

💡 MLOps practices and technologies like RAG help manage resources efficiently and keep models up to date without running the full lifecycle every time.

Key Vocabularies and Common Phrases:

1. curating [ˈkjʊr.eɪ.tɪŋ] - (verb) - Selecting, organizing, and presenting data or items for a specific purpose. - Synonyms: (organizing, compiling, gathering)

A traditional approach starts with a data engineer who curates data that is then used by a data scientist, who ultimately takes that data, trains the model, and then makes that model available for inference.

2. taxonomy [tækˈsɒn.ə.mi] - (noun) - A system for classifying and organizing information, often used in biology for the classification of organisms. - Synonyms: (classification, categorization, hierarchy)

InstructLab is an open source tool that allows the management of what we call a taxonomy.

3. synthetic data [sɪnˈθɛt.ɪk ˈdeɪ.tə] - (noun) - Artificially generated data that is used to mimic real-world scenarios, often utilized in model training. - Synonyms: (simulated data, artificial data, generated data)

InstructLab handles all of that and creates synthetic data through this process.

4. empowers [ɪmˈpaʊ.ərz] - (verb) - Gives someone the authority or power to do something or makes them stronger or more confident. - Synonyms: (enables, authorizes, strengthens)

This empowers the model, especially when we go through the training cycle, by giving the LLM more opportunities to accurately reply to your prompt.

5. inference [ˈɪn.fər.əns] - (noun) - The process of reaching a conclusion based on evidence and reasoning; in machine learning, the stage at which a trained model is run to produce predictions or responses. - Synonyms: (deduction, conclusion, reasoning)

Now, this is the infrastructure layer, but we also want to interact with our model, configure the inference, and apply metrics and all those things that need to be part of the MLOps lifecycle.

6. accelerators [əkˈsel.ə.reɪ.tərz] - (noun) - Devices or technologies employed to speed up processes, often used in computing to expedite hardware operations. - Synonyms: (enhancers, boosters, catalysts)

And this can take advantage of different AI accelerators, like NVIDIA, AMD, or Intel, for example.

7. lifecycle [ˈlaɪfˌsaɪ.kəl] - (noun) - The series of stages through which something (such as a project or product) passes during its lifetime. - Synonyms: (process, course, span)

Now, this is the infrastructure layer, but we also want to interact with our model, configure the inference, and apply metrics and all those things that need to be part of the MLOps lifecycle.

8. governance [ˈɡʌv.ər.nəns] - (noun) - The act, process, or power of governing; overseeing the control and direction of something. - Synonyms: (control, management, supervision)

Now, you may want to interact with that model: validate it, apply governance, or even just sandbox with it.

9. infrastructure [ˈɪn.frəˌstrʌk.tʃər] - (noun) - The basic physical and organizational structures needed for the operation of a society or enterprise. - Synonyms: (framework, foundation, base)

Now, this is the infrastructure layer, but we also want to interact with our model, configure the inference, and apply metrics and all those things that need to be part of the MLOps lifecycle.

10. reframing [riːˈfreɪ.mɪŋ] - (verb) - To change the way something is presented or considered. - Synonyms: (restructuring, reshaping, reinterpreting)

Think of synthetic data in this case as just another way of reframing the question.

Igniting LLM Performance - The Power of Domain Data!

Hey everybody. Today I want to talk to you about how to apply domain-specific knowledge to your LLM lifecycle. A traditional approach starts with a data engineer who curates data that is then used by a data scientist, who ultimately takes that data, trains the model, and then makes that model available for inference. One of the challenges with this, though, is that the data being used typically lives in a traditional database of some sort, either SQL or NoSQL, and it usually contains metrics, sales data, or anything else that's organized and curated by an organization. The challenge is capturing the domain-specific knowledge within an organization and applying it to this same process.
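
To make that traditional pipeline concrete, here is a minimal Python sketch of the curation step, assuming a hypothetical SQLite database with a sales_metrics table; a real pipeline would add validation, deduplication, and a proper dataset store.

```python
# Minimal sketch of the traditional pipeline: a data engineer pulls
# curated rows from a SQL store and shapes them into training examples.
# The database path, table, and column names are hypothetical.
import sqlite3

def curate_training_examples(db_path: str) -> list[dict]:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT region, quarter, revenue FROM sales_metrics"
    ).fetchall()
    conn.close()
    # Shape structured rows into prompt/response pairs for training.
    return [
        {
            "prompt": f"What was revenue in {region} for {quarter}?",
            "response": f"Revenue in {region} for {quarter} was ${revenue:,.0f}.",
        }
        for region, quarter, revenue in rows
    ]
```

Notice what's missing: nothing in this flow captures the unstructured, domain-specific knowledge that lives in people's heads and documents.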

Now let's look at the same approach, but use a variety of tools that can empower people like project managers and business analysts to contribute to the process. So here we have a project manager and a business analyst. They both have domain-specific knowledge about processes within their organization, but this knowledge might be stored in Word documents or text files, not the traditional data stores we typically use within a model lifecycle. We can change this, though. We can use a tool like InstructLab to manage the process.

InstructLab is an open source tool that allows the management of what we call a taxonomy. This taxonomy is just a Git repository, typically, where we can manage things like Markdown or text files and then apply that knowledge to our model. We could even take more traditional document formats like PDFs and have those transformed into the file structure that InstructLab expects. Once the data has been added to the taxonomy that InstructLab manages, we can start the more traditional process that we saw earlier, but we don't actually need a data scientist in this case. InstructLab handles all of that and creates synthetic data through this process. Now, I know synthetic data sounds kind of scary, but I want to approach it in a different way. Think of synthetic data in this case as just another way of reframing the question. Instead of one way of asking a question, we can have many different ways of asking the same question. This empowers the model, especially when we go through the training cycle, by giving the LLM more opportunities to accurately reply to your prompt.
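
The reframing idea can be illustrated with a toy sketch. To be clear, InstructLab itself uses a teacher model to generate these variations from seed examples in the taxonomy; the hard-coded templates below are only a stand-in to show the shape of the output.

```python
# Toy illustration of synthetic data as "reframing the question".
# InstructLab actually uses a teacher LLM to generate variations; these
# fixed templates are just a stand-in to show the shape of the output.
def reframe_question(seed_question: str, answer: str) -> list[dict]:
    templates = [
        "{q}",
        "Can you tell me {q}",
        "I was wondering: {q}",
        "For our organization, {q}",
    ]
    # Each rephrasing keeps the same grounded answer, so training sees
    # many prompts mapped to one correct response.
    return [
        {"question": t.format(q=seed_question), "answer": answer}
        for t in templates
    ]

pairs = reframe_question(
    "what is the approval process for a new vendor?",
    "New vendors are approved by procurement after a compliance review.",
)
```

Multiplying one seed question into many phrasings is what gives the model those extra opportunities during training.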

Once we've trained the model, we can then go ahead and deploy it into an AI platform. This could be Kubernetes-based, like OpenShift, for example. And this can take advantage of different AI accelerators, like NVIDIA, AMD, or Intel, for example. Now, this is the infrastructure layer, but we also want to interact with our model, configure the inference, and apply metrics and all those things that need to be part of the MLOps lifecycle. We can do this with an extension for OpenShift called Red Hat OpenShift AI, which provides all of those tools for managing the lifecycle of the model in production.
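
As a rough sketch of what interacting with a deployed model can look like from the client side: many serving runtimes used on Kubernetes platforms expose an OpenAI-compatible HTTP API. The endpoint URL, model name, and token below are placeholders, not real values.

```python
# Hypothetical client call to a model served behind an OpenAI-compatible
# endpoint, a common pattern for inference servers on Kubernetes.
# The URL, model name, and token are placeholders.
import json
import urllib.request

ENDPOINT = "https://model.example.internal/v1/chat/completions"  # placeholder

payload = {
    "model": "my-domain-tuned-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "What is our vendor approval process?"}
    ],
    "temperature": 0.2,  # an example of inference-time configuration
}
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",  # placeholder credential
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```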

Now, you may want to interact with that model: validate it, apply governance, or even just sandbox with it. This could be done with something like watsonx.ai, which can sit on top of OpenShift and interact with all of the models being served within this AI stack. Once this lifecycle has finished, we can restart the whole process and use the new data that has since been built up by our project managers and business analysts, going through the lifecycle once more. One thing to note, though, is that this can be really costly. We may not want to run this process over and over again every week; we may only have the budget to run it once a month or once every other month. Well, we can use technologies like RAG to make that new data available in the interim, before we go through this process again. Once we do, we can flush out our RAG database and start anew as new data is collected by our project managers and business analysts.
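
Here is a minimal, self-contained sketch of that interim RAG step. It uses a naive bag-of-words cosine similarity so it runs anywhere; a production setup would use an embedding model and a vector database instead.

```python
# Interim RAG sketch: new domain documents are indexed and retrieved at
# query time so the deployed model can use them before the next training
# run. Bag-of-words cosine similarity stands in for real embeddings.
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    qv = vectorize(query)
    return sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)[:k]

# Documents collected by project managers and analysts since the last run.
docs = [
    "New vendors are approved by procurement after a compliance review.",
    "Quarterly sales reports are due the first Friday after quarter close.",
]
context = retrieve("How do we approve a new vendor?", docs, k=1)
prompt = "Answer using this context:\n" + "\n".join(context)
# `prompt` is then sent to the deployed model; once the next training run
# folds these documents into the model itself, the RAG index can be flushed.
```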

All right. We've shown the complete model lifecycle and how to apply domain-specific knowledge from people within our organization, like project managers and business analysts; manage that data through a taxonomy with InstructLab; generate synthetic data that's then used to train the model; deploy it onto a Kubernetes-based platform like OpenShift; utilize AI services from tools like watsonx.ai; and ultimately use technologies like RAG to enhance that experience. Thank you so much for watching.

ARTIFICIAL INTELLIGENCE, TECHNOLOGY, INNOVATION, MLOPS LIFECYCLE, SYNTHETIC DATA, OPENSHIFT, IBM TECHNOLOGY