The video delves into the significant role of inferencing in AI models, describing it as the stage where models apply learned patterns from the training phase to new, real-time data in order to make predictions or decisions. It emphasizes how AI models, through inferencing, are able to generalize from a stored representation to interpret new data efficiently. An example provided is a model that distinguishes between spam and non-spam emails, highlighting the practical applications of AI inference.
While training an AI model is resource-intensive due to the cost of computing processing time, inferencing is even costlier and constitutes a major portion of an AI model's operational life. Inferencing requires constant processing of new data, incurring ongoing costs in energy and infrastructure. This brings the environmental impact into focus: AI models, particularly large language models, can contribute significantly to carbon emissions, by some estimates surpassing even the average American car in terms of emissions over their lifetime.
Key Vocabulary and Common Phrases:
1. inferencing [ˈɪnfərənsɪŋ] - (noun) - The process by which an AI model applies information learned during its training stage to make predictions or perform tasks on new data. - Synonyms: (deducing, concluding, deriving)
What is inferencing? It's an AI model's time to shine, its moment of truth.
2. artificial neurons [ˌɑːrtɪˈfɪʃl ˈnjʊrɒnz] - (noun) - Basic units in a neural network, modeled after the human brain's neurons, used to process inputs and generate outputs. - Synonyms: (nodes, units, processors)
These are the weights that connect its artificial neurons.
3. spam detector model [spæm dɪˈtɛktər ˈmɒdəl] - (noun) - An AI model designed to identify spam by recognizing patterns commonly associated with unwanted emails. - Synonyms: (spam filter, junk mail detector, spam blocker)
We are going to build a spam detector model.
4. probability score [ˌprɒbəˈbɪlɪti skɔːr] - (noun) - A numerical value representing the likelihood of an event, such as an email being spam. - Synonyms: (likelihood ratio, chance measure, odds)
Now, the actionable result here might be a probability score indicating how likely the email is to be spam.
5. model compression [ˈmɒdəl kəmˈprɛʃən] - (noun) - Techniques aimed at reducing the size of AI models without significantly affecting their accuracy. - Synonyms: (compression, downsizing, reduction)
One is model compression.
6. pruning [ˈpruːnɪŋ] - (noun) - Removing unnecessary components in AI models, such as weights, to optimize performance. - Synonyms: (trimming, cutting, reducing)
Well, first of all, pruning that removes unnecessary weights from the model.
7. quantization [ˌkwɒntaɪˈzeɪʃən] - (noun) - The process of reducing the precision of a model's weights in AI, making computations faster and less resource-intensive. - Synonyms: (discretization, approximation, scaling)
And then for quantization, what that is talking about is reducing the precision of the model's weights.
8. middleware [ˈmɪdlwɛr] - (noun) - Software that connects different applications or components, helping them communicate, especially between hardware and software. - Synonyms: (software intermediary, integration layer, bridging software)
Middleware bridges the gap between the hardware and the software.
9. graph fusion [ɡræf ˈfjuːʒən] - (noun) - A process in AI model computation that reduces the number of nodes, minimizing communication and enhancing processing efficiency. - Synonyms: (graph optimization, node fusion, network merging)
One of those things is called graph fusion.
10. AI accelerators [ˌeɪ ˌaɪ əkˈsɛləreɪtərz] - (noun) - Specialized hardware devices designed to speed up AI computational tasks, particularly matrix operations. - Synonyms: (processing units, AI chips, computation boosters)
These AI accelerators can significantly speed up inferencing tasks.
AI Inference - The Secret to AI's Superpowers
What is inferencing? It's an AI model's time to shine, its moment of truth. A test of how well the model can apply information learned during training to make a prediction or solve a task. And with it comes a focus on cost and speed. Let's get into it.
So an AI model, it goes through two primary stages. What are those? The first of those is the training stage, where the model learns how to do stuff. And then we have the inferencing stage that comes after training. Now, we can think of this as the difference between learning something and then putting what we've learned into practice.
So during training, a deep learning model computes how the examples in its training set are related. What it's doing effectively here is it's figuring out relationships between all of the data in its training set, and it encodes these relationships into what's called a series of model weights. These are the weights that connect its artificial neurons. So that's training.
Now, during inference, a model goes to work on what we provide it, which is real-time data. So this is the actual data that we are inputting into the model. What happens during inferencing is the model compares the user's query with the information processed during training and all of those stored weights. And what the model effectively does is it generalizes based on everything it learned during training. So it generalizes from this stored representation to be able to interpret this new, unseen data, in much the same way that you and I can draw on prior knowledge to infer the meaning of a new word or make sense of a new situation.
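To make that concrete, here is a minimal sketch (in Python with NumPy, my own illustrative choice; the video doesn't show code) of inference as nothing more than applying stored weights to a new input. The weights here are hard-coded stand-ins for what the training stage would normally produce.

```python
import numpy as np

# Stand-in weights "learned" during training; a real training run would produce these.
W1 = np.array([[ 0.8, -0.3],
               [ 0.1,  0.9],
               [-0.5,  0.4]])      # 3 input features -> 2 hidden artificial neurons
b1 = np.array([0.1, -0.2])
W2 = np.array([[ 1.2],
               [-0.7]])            # 2 hidden neurons -> 1 output
b2 = np.array([0.05])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer(features):
    """Inference: apply the stored weights to new, unseen data."""
    hidden = np.maximum(0.0, features @ W1 + b1)   # ReLU activation
    return sigmoid(hidden @ W2 + b2)               # output squashed into the 0..1 range

# New, real-time data point the model has never seen before.
new_input = np.array([0.6, 0.1, 0.3])
print(infer(new_input))   # a single prediction derived entirely from the stored weights
```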
And what's the goal of this? Well, the goal of AI inference is to calculate an output, basically a result, an actionable result. So what sort of result are we talking about? Well, let's consider a model that attempts to accurately flag incoming email, and it's going to flag it based on whether or not it thinks it is spam. We are going to build a spam detector model. Right. So during the training stage, this model would be fed a large labeled data set. So we get in a whole load of data here.
And this contains a bunch of emails that have been labeled. Specifically, the labels are spam or not spam for each email. And what happens here is the model learns to recognize patterns and features commonly associated with spam emails. So these might include the presence of certain keywords. Yeah, those ones. So, unusual sender email addresses, excessive use of exclamation marks, all that sort of thing. Now, the model encodes these learned patterns into its weights here, creating a complex set of rules to identify spam.
Now, during inference, this model is put to the test. It's put to the test with new, unseen data in real time, like when a new email arrives in a user's inbox. The model analyzes the incoming email, comparing its characteristics to the patterns it's learned during training, and then makes a prediction. Is this new, unseen email spam or not spam?
Now, the actionable result here might be a probability score indicating how likely the email is to be spam, which is then tied into a business rule. So, for example, if the model assigns a 90% probability that what we're looking at here is spam, well, we should move that email directly to the spam folder. That's what the business rule would say. But if the probability the model comes back with is just 50%, the business rule might say to leave the email in the inbox, but flag it for the user to decide what to do.
So what's happening here is the model is generalizing. It can identify spam emails even if they don't exactly match any specific example from its training data, as long as they share similar characteristics with the spam patterns it's learned. Okay.
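Here is a minimal sketch of that spam-detector workflow. The library (scikit-learn), the toy data set, and the variable names are my own illustrative assumptions; only the 90% and 50% business-rule figures come from the narration above.

```python
# A minimal sketch of the spam-detector workflow described above,
# assuming scikit-learn and a toy labeled data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Training stage: a (tiny) labeled data set of spam / not-spam emails.
emails = [
    "WIN A FREE PRIZE!!! click now",
    "Limited offer!!! claim your reward",
    "Meeting moved to 3pm, see agenda attached",
    "Lunch tomorrow? Let me know",
]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = LogisticRegression().fit(X, labels)   # learned patterns end up in the model's weights

# Inference stage: a new, unseen email arrives in real time.
new_email = ["Claim your FREE reward now!!!"]
p_spam = model.predict_proba(vectorizer.transform(new_email))[0, 1]

# Business rule layered on top of the probability score.
if p_spam >= 0.90:
    action = "move to spam folder"
elif p_spam >= 0.50:
    action = "leave in inbox, but flag for the user"
else:
    action = "leave in inbox"

print(f"P(spam) = {p_spam:.2f} -> {action}")
```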
Now, when the topic of inferencing comes up, it is often accompanied by four preceding words. Let's cover those next. "The high cost of." Those are the words often added before inferencing. Training AI models, particularly large language models, can cost millions of dollars in computing processing time. But as expensive as training an AI model can be, it is dwarfed by the expense of inferencing.
Each time someone runs an AI model, there's a cost. A cost in kilowatt hours, a cost in dollars, a cost in carbon emissions. On average, something like 90% of an AI model's life is spent in inferencing mode. And therefore, most of an AI model's carbon footprint comes from serving models to the world, not from training them. In fact, by some estimates, running a large AI model puts more carbon into the atmosphere over its lifetime than the average American car.
Now, the high costs of inferencing, they stem from a number of different factors. So let's take a look at some of those. And first of all, there's just the sheer scale, the scale of operations. While training happens just once, inferencing happens millions or even billions of times over a model's lifetime. A chatbot might field millions of queries every day, each requiring a separate inference.
Second, there's the need, the need for speed. We want fast AI models; we're working with real-time data here, requiring near-instantaneous responses, which often necessitates powerful, energy-hungry hardware like GPUs. Third, we have to consider also just the general complexity of these AI models. As models grow larger and more sophisticated to handle more complex tasks, they require more computational resources for each inference. This is particularly true for LLMs with billions of parameters. And then finally, there are the infrastructure costs: data centers to maintain and cool, low-latency network connections to power. All these factors contribute to significant ongoing costs in terms of energy consumption, hardware wear and tear, and operational expenses.
Which brings up the question of whether there's a better way to do this, faster and more efficiently. How fast an AI model runs depends on the stack. What's the stack? Well, improvements made at each layer can speed up inferencing. And at the top of the stack is hardware. At the hardware level, engineers are developing specialized chips. These are chips made for AI, and they're optimized for the types of mathematical operations that dominate deep learning, particularly matrix multiplication. These AI accelerators can significantly speed up inferencing tasks compared to traditional CPUs and even to GPUs, and do so in a more energy-efficient way.
Now, at the bottom of the stack, I put software. And on the software side, there are several approaches to accelerate inferencing. One is model compression. Now, that involves techniques like pruning and quantization. So what do we mean by those? Well, first of all, pruning, which removes unnecessary weights from the model, reducing its size without significantly impacting accuracy. And then for quantization, what that is talking about is reducing the precision of the model's weights, such as from 32-bit floating point numbers to 8-bit integers. And that can really speed up computations and reduce memory requirements.
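Here is a minimal NumPy sketch of both ideas applied to a single weight matrix. The magnitude threshold and the simple symmetric int8 scheme are illustrative assumptions, not how any particular framework implements pruning or quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for one layer's weights

# Pruning: zero out weights whose magnitude falls below a threshold,
# shrinking the effective model without (ideally) hurting accuracy much.
threshold = 0.2
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

# Quantization: store weights at lower precision (here float32 -> int8)
# using a simple symmetric scale; memory drops 4x and computations get cheaper.
scale = np.abs(pruned).max() / 127.0
quantized = np.round(pruned / scale).astype(np.int8)

# At inference time the int8 weights are used directly (or dequantized on the fly).
dequantized = quantized.astype(np.float32) * scale
print("max reconstruction error:", np.abs(dequantized - pruned).max())
```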
Okay, so we've got hardware and software. What's in the middle? Middleware, of course. Middleware bridges the gap between the hardware and the software, and middleware frameworks can perform a bunch of things to help here. One of those things is called graph fusion. Graph fusion reduces the number of nodes in the communication graph, and that minimizes the round trips between CPUs and GPUs. And they can also implement parallel tensors, strategically splitting the AI model's computational graph into chunks that can be spread across multiple GPUs and run at the same time.

So, running a 70 billion parameter model requires something like 150 gigabytes of memory, which is nearly twice as much as an Nvidia A100 GPU holds. But if the compiler can split the AI model's computational graph into strategic chunks, those operations can be spread across GPUs and run at the same time.
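A quick back-of-the-envelope check of those numbers, plus a toy illustration of splitting a weight matrix into per-GPU chunks. The 2 bytes per parameter (16-bit weights) and the 80 GB A100 variant are my assumptions; they are roughly consistent with the ~150 gigabyte figure quoted above.

```python
import numpy as np

# Back-of-the-envelope memory check, assuming 16-bit (2-byte) weights
# and the 80 GB variant of the Nvidia A100.
params = 70e9               # 70 billion parameters
bytes_per_param = 2         # fp16 / bf16
gpu_memory_gb = 80          # A100 80 GB

model_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{model_gb:.0f} GB")                       # ~140 GB, before activations and overhead
print(f"GPUs needed just to hold the weights: {model_gb / gpu_memory_gb:.1f}")

# Conceptually, tensor-parallel execution splits each large weight matrix
# into chunks, one per GPU, so the pieces can be worked on at the same time.
big_layer = np.zeros((8, 1024))                 # stand-in for one huge weight matrix
chunks = np.array_split(big_layer, 2, axis=1)   # one chunk per GPU
print([c.shape for c in chunks])                # [(8, 512), (8, 512)]
```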
So that's inferencing. It's a game. A game of pattern matching that turns complex training into rapid-fire problem solving, one spammy email at a time.
ARTIFICIAL INTELLIGENCE, TECHNOLOGY, INNOVATION, INFERENCING, DEEP LEARNING, SPAM DETECTION, IBM TECHNOLOGY