In the video, fundamental techniques for optimizing large language models (LLMs) are discussed through the analogy of managing employees in an electronics store. The focus is on context optimization and LLM optimization, both vital for tailoring model behavior and enhancing user interaction. Techniques such as prompt engineering and RAG (retrieval-augmented generation) are highlighted, illustrating how to guide LLMs to deliver precise, user-focused responses by managing the context window and sourcing answers from specific databases.

The practical application of these techniques in a business setting, such as an electronics store, serves to standardize employee-customer interactions and improve service delivery. Just as well-trained employees are needed to respond to customer queries effectively, LLMs can be fine-tuned to provide accurate, contextually relevant responses. This adaptation ensures the model fulfills specific domain requirements, mirroring real-world scenarios where specialized training is crucial.

Main takeaways from the video:

💡 RAG and prompt engineering complement each other, optimizing LLMs within context window constraints.

💡 Fine-tuning an LLM enhances its ability to provide domain-specific, accurate responses, analogous to training staff for specialized queries.

💡 High-quality data input is crucial for effective LLM fine-tuning, outweighing sheer quantity, ensuring precise and relevant model behavior.

Key Vocabularies and Common Phrases:

1. inundated [ɪ'nʌndeɪtɪd] - (adj.) - Overwhelmed by a large number of things or people to deal with. - Synonyms: (overwhelmed, flooded, swamped)

Now, our employee is doing well, but getting inundated by new information coming from all the new devices.

2. hallucination [həˌluːsɪˈneɪʃən] - (n.) - A false perception or belief that doesn't correspond to reality. - Synonyms: (illusion, delusion, mirage)

This can address things like hallucination as well, because you can really, in the prompt, say you need to give the answer only from these specified documents.

3. prompt engineering [prɒmpt ˈɛnʤɪˌnɪərɪŋ] - (n.) - The process of designing and structuring input prompts to guide an LLM to produce desired outputs. - Synonyms: (guidance formulation, input designing, model directing)

Similarly, in the context of an LLM, we have this thing called prompt engineering.

4. fine tuning [faɪn ˈtuːnɪŋ] - (n.) - Adjusting the parameters of a model to improve its performance on a specific task. - Synonyms: (model adaptation, parameter adjustment, skill honing)

Fine tuning allows you to actually update the model parameters based on your data to ensure that you influence the behavior of the model and also make it specialized in a specific domain as well.

5. context optimization [ˈkɒntɛkst ˌɒptɪmaɪˈzeɪʃən] - (n.) - Enhancing the capacity of an AI model by focusing on the specific context and constraints. - Synonyms: (context enhancement, frame improvement, scenario refinement)

We will do this from the perspective of context optimization and LLM optimization.

6. token window [ˈtəʊkən ˈwɪndəʊ] - (n.) - The capacity limit in terms of the number of tokens or text an AI model can process at one time. - Synonyms: (text limit, processing capacity, data window)

The token window is limited, so the more text you add to it, there can be more noise.

7. latency [ˈleɪtənsi] - (n.) - The delay before a transfer of data begins following an instruction for its transfer. - Synonyms: (delay, lag, wait time)

Once you have decided, okay, you have optimized it, but you're seeing you're getting more and more end users, latency is becoming a problem.

8. vernacular [vərˈnækjʊlər] - (n.) - The language spoken by ordinary people in a specific region. - Synonyms: (dialect, everyday language, colloquial speech)

Almost guarantees it in the vernacular that you wanted.

9. deduction [dɪˈdʌkʃən] - (n.) - The process of reasoning from general principles to specific instances. - Synonyms: (inference, conclusion, reasoning)

Passing it over for the model to make its deduction, generate the text and come back to it.

10. augment [ɔːɡˈmɛnt] - (v.) - To make something greater by adding to it; enhance. - Synonyms: (increase, amplify, enhance)

You can augment it with fine tuning at the appropriate stage as well.

Context Optimization vs LLM Optimization - Choosing the Right Approach

Imagine you've just opened an electronics store and you're hiring some employees. You need to make sure your clients have a good experience as they walk into the store, hopefully purchase more products, and you need to standardize all of it. How do you go about doing that? As part of this video, we're going to go over the fundamentals that will empower you to make the right decisions when it comes to updating and tweaking your LLMs for your requirements. We will do this from the perspective of context optimization and LLM optimization.

Context optimization is essentially the window, or the text, that the model is going to take into account when it generates text. And model optimization is actually updating the model itself based on specific requirements. Now, let's go back to our store. We have hired our first employee, a generalist, polite enough, but you won't just let him loose in the store. You want to give some guidelines to this person.

So always greet the prospective clients, make sure you're polite, and based on the question they're asking, give them the top three options. Maybe there are some sales going on that may be relevant to the client, et cetera. Similarly, in the context of an LLM, we have this thing called prompt engineering. Prompt engineering is giving very clear guidelines on what you expect from the model. You can do so by giving some text. You can also give some examples, like input and output pairs, so the model can understand what you are really looking for.
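
As a rough sketch of what those guidelines and input/output examples could look like in code, here is a minimal, hypothetical prompt for the store assistant; the guideline wording, the example pairs, and the call_llm helper are placeholders rather than anything shown in the video.

```python
# A minimal prompt-engineering sketch for the store-assistant analogy.
# The guideline text and the few-shot pairs are illustrative placeholders,
# and call_llm stands in for whatever LLM API you actually use.

GUIDELINES = (
    "You are a polite assistant in an electronics store. "
    "Always greet the customer, recommend at most three products, "
    "and mention any current sale that is relevant to the question."
)

FEW_SHOT = [
    ("I need a laptop for school.",
     "Hello! Happy to help. Here are three options: ..."),
    ("Do you have budget headphones?",
     "Hi there! Yes, here are three choices under $50: ..."),
]

def build_prompt(question: str) -> str:
    """Assemble guidelines, input/output examples, and the new question."""
    examples = "\n".join(f"Customer: {q}\nAssistant: {a}" for q, a in FEW_SHOT)
    return f"{GUIDELINES}\n\n{examples}\n\nCustomer: {question}\nAssistant:"

# response = call_llm(build_prompt("Which TV is good for gaming?"))  # hypothetical client
```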

You can also help the model break down a complex problem into sub-points and make sure it understands what you're going after in the long run, which is called chain-of-thought prompting. Now, our employee is doing well, but getting inundated by new information coming from all the new devices. That smile can turn into a frown really quickly, because it's hard to stay up to speed with all the technology changes coming in.

So you have come up with a strategy where you have created this manual. And this manual has all the updates for all the different gadgets coming in. So you're good, but you can't expect the employee to read that document every time a user asks a question. So you have devised a strategy where, based on the question, you pull out some of the pages from the manual and give them to the employee, who reads the answer and comes back with it.

That, in a way, is like RAG, retrieval-augmented generation, which allows you to connect this LLM to your data sources to make sure that you're getting the right answers. This can address things like hallucination as well, because you can really, in the prompt, say you need to give the answer only from these specified documents. So it's a really powerful tool as well.
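
To make the retrieve-then-answer idea concrete, here is a minimal RAG sketch; the in-memory "manual", the keyword-overlap retriever, and the call_llm parameter are toy stand-ins for a real vector database and LLM client, not something specified in the video.

```python
# A minimal retrieval-augmented generation (RAG) sketch. The in-memory "manual"
# and the keyword-overlap retriever are toy stand-ins for a real vector database;
# call_llm is a placeholder for your LLM client.

MANUAL_PAGES = [
    "Tablet X200: if it will not charge, try another cable, then reset the port.",
    "TV Z9: supports 120 Hz and low input lag, good for gaming.",
    "Phone A5: battery saver mode is under Settings > Battery.",
]

def search_manual(question: str, top_k: int = 2) -> list[str]:
    """Toy retriever: rank pages by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(MANUAL_PAGES,
                    key=lambda page: len(q_words & set(page.lower().split())),
                    reverse=True)
    return scored[:top_k]

def answer_with_rag(question: str, call_llm) -> str:
    pages = search_manual(question)
    context = "\n\n".join(pages)
    # Instructing the model to answer only from the retrieved pages is what
    # helps curb hallucination, as described above.
    prompt = ("Answer ONLY from the documents below; otherwise say you don't know.\n\n"
              f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)
```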

Now, going back to our store, business is doing really well and we need to hire more employees. That's great, but it was already hard with one employee. How do you make sure you standardize the behavior for all three of them? Being polite can mean different things to different people.

Secondly, your customers are getting more savvy, they're asking more specialized questions. They're asking things like how to fix things. So just reading off a guide is not going to do it. What you realize is you need them to go through a training school, be it from a sales perspective or technical perspective, to really make sure the questions are answered.

That is like fine tuning. Fine tuning allows you to actually update the model parameters based on your data to ensure that you influence the behavior of the model and also make it specialized in a specific domain. Now, remember, in the beginning I mentioned we are doing this through the lens of context optimization and LLM optimization.
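
For a sense of what "updating the model parameters based on your data" can look like, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers library; the base model, the two training examples, and the hyperparameters are illustrative assumptions, not details from the video.

```python
# A minimal supervised fine-tuning sketch with Hugging Face transformers.
# The model name, the tiny example set, and the hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Domain-specific input/output pairs, e.g. collected from the store's support logs.
examples = [
    "Q: My tablet won't charge. A: Try a different cable first, then check the port.",
    "Q: Which TV is best for gaming? A: Look for low input lag and 120 Hz support.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine tuning, the labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```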

So all that means is that RAG and PE are essentially taking all the information you need the model to know beforehand, passing it over for the model to make its deduction, generate the text, and come back with it. Fine tuning is actually optimizing the model itself to ensure that you're getting the right responses with the right kind of behavior that you need. This addresses the two key problems we keep hearing from practitioners about why they're reluctant to move LLMs into production: model behavior and real-time data access.

For model behavior: how do you really moderate the model output, both from a text perspective and from the vernacular and qualitative aspects, if you will? And for real-time data access: how quickly can you get the model to answer a question from real-time data, and how do you ensure that it's accurate and relevant to the user?

So let's summarize our discussion so far with five points. The first one is that this whole set of techniques is additive, so all three work together and complement each other. The first two, RAG and PE, operate within the context window optimization; fine tuning actually updates the model parameters. This is important because the token window is limited, so the more text you add to it, the more noise there can be.

So you need to be careful about what you're passing to the model. Similarly, on the model side, while fine tuning may be expensive, the more you invest in good-quality data and actually update the model with it, the more you can use a smaller LLM instead of a bigger LLM and save costs in the long run. The second point is: always start with prompt engineering. This is one of the most powerful and agile tools you have in your repertoire to ensure that, one, you understand whether even having an LLM-based solution is right.

And two, given the kind of data you have and your end users, is the baseline model accurate? All the work that you do, even the trial and error, can actually be reused for fine tuning, so it's really, really worth it. The third point is also important: people start worrying about context window optimization too soon. Focus more on accuracy before optimization. What I mean by that is, as you get closer to the right answer, especially in the context of window optimization, keep checking that you're getting the right answers, and only then start looking at different strategies for how you can reduce the window.
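
One simple window-reduction strategy, sketched below, is to add retrieved passages in relevance order only while they fit a token budget; the whitespace-based token estimate and the 1,000-token budget are assumptions for illustration.

```python
# A toy context-window budget check: add retrieved passages in relevance order
# until a token budget is reached. Splitting on whitespace is a rough proxy for
# real tokenization, and the 1,000-token budget is an arbitrary example.

def fit_to_budget(passages: list[str], budget_tokens: int = 1000) -> list[str]:
    kept, used = [], 0
    for passage in passages:          # assumed to be sorted by relevance already
        cost = len(passage.split())   # crude token estimate
        if used + cost > budget_tokens:
            break                     # stop before the window gets noisy or overfull
        kept.append(passage)
        used += cost
    return kept
```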

The fourth point is that people say data quantity is really key for fine tuning. Yes, that's important, but honestly I would take data quality over data quantity. This matters because you can get a good fine tuning started with just 100 examples. Of course that differs for every use case, but really focus on the quality. Some of the output that you get from prompt engineering is also going to be very valuable here.
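
One way to act on that is to curate a small, clean fine-tuning set out of logged prompt-engineering trials; in the sketch below, the record fields and filtering thresholds are assumptions for illustration.

```python
# A sketch of curating a small, high-quality fine-tuning set from logged
# prompt-engineering trials. The record fields ("question", "answer",
# "reviewer_approved") and the thresholds are illustrative assumptions.
import json

def curate(logged_records: list[dict], target_size: int = 100) -> list[dict]:
    seen, curated = set(), []
    for rec in logged_records:
        q, a = rec["question"].strip(), rec["answer"].strip()
        if not rec.get("reviewer_approved"):          # keep only human-vetted answers
            continue
        if len(a.split()) < 5 or q.lower() in seen:   # drop trivial or duplicate items
            continue
        seen.add(q.lower())
        curated.append({"input": q, "output": a})
        if len(curated) >= target_size:               # ~100 good examples can be enough
            break
    return curated

def save_jsonl(examples: list[dict], path: str) -> None:
    """Write one JSON object per line, a common fine-tuning data format."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```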

This brings me to my last point: you need to be able to quantify and baseline your success. Just saying that the answer is good enough is not going to cut it, especially when you try these techniques and their nuances; the permutations between the three can be huge. So you need to measure from an accuracy perspective and a precision perspective.

Again, going back to context optimization, if you're using RAG, not only is the answer important, it's also important what kind of documents you got back from the vector database. This will also help you reduce latency. There are a lot of really good solutions you can get if you start really quantifying everything and defining what success looks like for you.
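
A small sketch of that kind of quantification: besides checking the final answer, score how many of the documents the retriever returned are actually relevant to each labeled test question; the document IDs below are made up.

```python
# A sketch of quantifying RAG retrieval quality: precision and recall of the
# documents returned for a test question. The IDs are made up; retrieved and
# relevant would come from your vector database and a human-labeled set.

def retrieval_scores(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: one labeled test question.
retrieved = ["page_12", "page_40", "page_07"]   # what the retriever returned
relevant = {"page_12", "page_33"}               # what a human marked as relevant
p, r = retrieval_scores(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")      # precision=0.33 recall=0.50
```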

So, going to a diagram here, the key commonality between all three is that you're going to get increased accuracy and reduced hallucinations, so the model isn't making up answers. Start with your PE: prompt engineering will help you really ensure that you have the right solution, so really quick iteration, super valuable. RAG is going to help you connect your context window to external data sources.

You can give it some guidance as well. And fine tuning actually changes the model behavior; you can control it more, and it can become a specialized model for the specific domain that you have. In terms of the commonality between RAG and prompt engineering, context window optimization is key. Of course, both are constrained by that window.

So as you look for accuracy, you need to focus on how you can optimize that as well. Between prompt engineering and fine tuning, whilst they both influence the model, they do it in different ways. Prompting can give some guidance, such as "only respond in three points"; fine tuning almost guarantees it, in the vernacular that you wanted.

Finally, between RAG and fine tuning, both can incorporate your data sources, but really think of RAG as the short-term memory and fine tuning as the long-term memory for what you're trying to do. So if I were to summarize: context optimization is super valuable. It is one of the easiest approaches and the first route you should take to optimize an LLM. The second stage comes once you have optimized it but you're seeing more and more end users, latency is becoming a problem, and you realize, okay, I can narrow my use case a bit more.

That's where you use fine tuning. This will help you really specialize the model. It will no longer be a generalist, so there's a risk there. But as you focus your use case more, fine tuning is the right way to go. So, to summarize the discussion: as you've seen, all three techniques are really powerful.

But if you see it through the lens of context optimization, focusing on all the words and information you want to send to the model before it generates text, it is limited by the number of tokens. So the more you add there, the more latency there's going to be, perhaps even more downtime, and the more documents you bring in, the more noise it could actually create for the model, because it doesn't really understand that specific data.

On the other hand, if you know you have a specific vernacular and a very specialized domain, medical, financial, legal, et cetera, fine tuning is an option. You can actually update the parameters of the model using your data. As I mentioned before, you can take the input and output pairs you got from prompt engineering and make the model a more specialized expert.

It can also help you control the model behavior, which is super important when you talk about corporations using LLMs for their end-user solutions. As for prompt engineering and RAG, again, the context window is the best place to start and really ensure that you understand whether LLMs are the right way to go, and you can augment it with fine tuning at the appropriate stage as well.

EDUCATION, TECHNOLOGY, INNOVATION, CONTEXT OPTIMIZATION, FINE TUNING, PROMPT ENGINEERING, IBM TECHNOLOGY