ENSPIRING.ai: What are Word Embeddings?
The video explores the concept of word embeddings and their role in transforming language data into numerical vectors that capture semantic and contextual relationships between words. It highlights the necessity of converting text into numeric form because machine learning algorithms require numbers as inputs. The video elaborates on how word embeddings are crucial in various natural language processing (NLP) tasks, such as text classification, named entity recognition, and semantic similarity measurement, enhancing machines' understanding of human language.
The process of creating word embeddings involves models trained on large text corpora. This includes preprocessing steps like tokenization and utilizing context windows to learn word relationships. For instance, vectors for words like "apple" and "orange" are close in vector space, indicating similar meanings. The video explains two main approaches to generating word embeddings: frequency-based methods, like tf-idf, and prediction-based methods, which capture semantic relationships and manage various word senses.
Key Vocabularies and Common Phrases:
1. word embeddings [wɜːrd ɛmˈbɛdɪŋz] - (noun) - Numeric vector representations of words that capture semantic relationships and contextual information. - Synonyms: (vector representations, word vectors, semantic vectors)
word embeddings represent words as numbers, specifically as numeric vectors, in a way that captures their semantic relationships and contextual information.
2. semantics [sɪˈmæntɪks] - (noun) - The study of meaning in language; the meaning or an interpretation of the meaning of a word, sign, sentence, etc. - Synonyms: (meaning, interpretation, connotation)
So that means words with similar meanings are positioned close to each other, and the distance and direction between vectors encode the degree of similarity between words.
3. contextual [kənˈtɛksʧuəl] - (adjective) - Related to or dependent on the context of surrounding information. - Synonyms: (situational, circumstantial, relevant)
word embeddings represent words as numbers, specifically as numeric vectors, in a way that captures their semantic relationships and contextual information.
4. tokenization [ˌtoʊkənəˈzeɪʃən] - (noun) - The process of breaking up text into smaller parts called tokens, which can be words, phrases or symbols. - Synonyms: (segmentation, parsing, splitting)
The process begins with preprocessing the text, including tokenization and removing stop words and punctuation.
5. corpus [ˈkɔːrpəs] - (noun) - A large and structured set of texts used for statistical analysis and linguistic research. - Synonyms: (body of text, collection, dataset)
word embeddings are created by training models on a large corpus of text.
6. frequency-based embeddings [ˈfriːkwənsi-beɪst ɛmˈbɛdɪŋz] - (noun) - A method of creating word embeddings based on how often a word appears in a text. - Synonyms: (term frequency, occurrence-based, frequency analysis)
So let's take a look at some of these embedding methods, and we'll start with the first one, which is frequency.
7. tf-idf [ˈtiː ef ˈaɪ diː ɛf] - (noun) - Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate the importance of a word in a document relative to a corpus. - Synonyms: (term importance, weighting scheme, frequency measure)
One such frequency-based embedding is called TF-IDF, which stands for term frequency-inverse document frequency.
8. prediction-based embeddings [prɪˈdɪkʃən-beɪst ɛmˈbɛdɪŋz] - (noun) - word embeddings created based on predicting words in context, focusing on the semantic relationship and context between words. - Synonyms: (contextual embeddings, predictive vectors, semantic modeling)
Now, another embedding type is called prediction-based embeddings.
9. Cbow (Continuous Bag Of Words) [siː ˈboʊ] - (noun) - A neural network architecture for word vector creation that uses context words to predict a target word. - Synonyms: (context word model, contextual prediction, input-output model)
Now, Word2Vec has two main architectures: there's something called CBOW, and there's something called skip-gram.
10. transformers [trænsˈfɔːrmərz] - (noun) - A deep learning model designed for handling sequential data and known for its effectiveness in language tasks. - Synonyms: (deep models, sequence models, neural networks)
Now, while these two word embedding models continue to be valuable tools in NLP, the field has seen some significant advances with the emergence of new tech, particularly transformers.
What are Word Embeddings?
word embeddings represent words as numbers, specifically as numeric vectors, in a way that captures their semantic relationships and contextual information. So that means words with similar meanings are positioned close to each other, and the distance and direction between vectors encode the degree of similarity between words. But why do we need to transform words into numbers? The reason vectors are used to represent words is that most machine learning algorithms are simply incapable of processing plain text in its raw form. They require numbers as input to perform any task. And that's where word embeddings come in.
So let's take a look at how word embeddings are used and the models used to create them. And let's start with a look at some applications. Now, word embeddings have become a fundamental tool in the world of NLP. That's natural language processing. Natural language processing helps machines understand human language. word embeddings are used in various NLP tasks. So, for example, you'll find them used in text classification very frequently. Now, in text classification, word embeddings are often used in tasks such as spam detection and topic categorization. Another common task is NER. That's an acronym for named entity recognition, and that is used to identify and classify entities in text. An entity is something like the name of a person, a place, or an organization.
Now, word embeddings can also help with word similarity and word analogy tasks. So, for example, king is to queen as man is to woman. And then another example is in Q&A: question-answering systems can benefit from word embeddings for measuring semantic similarities between words or documents, for tasks like clustering related articles, finding similar documents, or recommending similar items.
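As a concrete illustration of these similarity and analogy tasks, here is a minimal sketch using pretrained GloVe vectors loaded through gensim's downloader. The package, the specific model name, and the first-run download are assumptions; any pretrained set of word vectors would work the same way.

```python
# Sketch: word analogies and similarity with pretrained vectors via gensim
# (assumes the gensim package is installed; the first call downloads a small
# pretrained GloVe model).
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
vectors = api.load("glove-wiki-gigaword-50")

# "king is to queen as man is to woman": king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Semantic similarity between two words (cosine similarity of their vectors)
print(vectors.similarity("coffee", "espresso"))
```

The analogy works because the vector offset between king and queen roughly matches the offset between man and woman in the learned space.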
Now, word embeddings are created by training models on a large corpus of text, maybe something like all of Wikipedia. The process begins with preprocessing the text, including tokenization and removing stop words and punctuation. A sliding context window identifies target and context words, allowing the model to learn word relationships. Then the model is trained to predict words based on their context, positioning semantically similar words close together in the vector space. And throughout the training, the model parameters are adjusted to minimize prediction errors.
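Here is a minimal sketch of that preprocessing and context-window step in Python. The tokenizer, the stop-word list, and the window size are illustrative choices, not the exact pipeline from the video.

```python
# Sketch of the preprocessing step described above: tokenize a tiny corpus,
# drop stop words and punctuation, and slide a context window to collect
# (target, context) training pairs.
import re

STOP_WORDS = {"the", "is", "a", "to", "of", "at"}  # illustrative list

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def context_pairs(tokens, window=2):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

tokens = tokenize("The dog is barking loudly at the mail carrier.")
print(list(context_pairs(tokens, window=2)))
```

These (target, context) pairs are what a prediction-based model is later trained on.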
So what does this look like? Well, let's start with a super small corpus of just six words. Here they are. Now, we'll represent each word as a three dimensional vector. So each dimension has a numeric value, creating a unique vector for each word. And these values represent the word's position in a continuous three dimensional vector space. And if you look closely, you can see that words with similar meanings or contexts have similar vector representations. So, for instance, the vectors for apple and for orange are close together, reflecting their semantic relationship. Likewise, the vectors for happy and sad have opposite directions, indicating their contrasting meanings.
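The toy example above can be reproduced in a few lines of Python. The vector values below are hand-picked for illustration rather than learned by a model; cosine similarity measures how closely two vectors point in the same direction.

```python
# Sketch of the six-word toy corpus: illustrative 3-dimensional vectors and
# cosine similarity showing that "apple"/"orange" are close while
# "happy"/"sad" point in opposite directions.
import numpy as np

embeddings = {
    "apple":  np.array([0.90, 0.10, 0.00]),
    "orange": np.array([0.80, 0.20, 0.10]),
    "happy":  np.array([0.10, 0.90, 0.30]),
    "sad":    np.array([-0.10, -0.90, -0.30]),
    "king":   np.array([0.20, 0.10, 0.90]),
    "queen":  np.array([0.25, 0.15, 0.85]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, -1.0 = opposite directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["apple"], embeddings["orange"]))  # close to 1
print(cosine(embeddings["happy"], embeddings["sad"]))     # close to -1
```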
Now, of course, in real life, it's not this simple. A corpus of six words isn't going to be too helpful in practice. And actual word embeddings typically have hundreds of dimensions, not just three, to capture more intricate relationships and nuances in meaning. Now, there are two fundamental approaches to how word embedding methods generate vector representations for words. So let's take a look at some of these embedding methods, and we'll start with the first one, which is frequency.
So, frequency-based embeddings. Now, frequency-based embeddings are word representations that are derived from the frequency of words in a corpus. They're based on the idea that the importance or the significance of a word can be inferred from how frequently it occurs in the text. One such frequency-based embedding is called TF-IDF, which stands for term frequency-inverse document frequency. TF-IDF highlights words that are frequent within a specific document but rare across the entire corpus. So, for example, in a document about coffee, TF-IDF would emphasize words like espresso or cappuccino, which might appear often in that document but rarely in others about different topics. Common words like "the" or "and", which appear frequently across all documents, would receive low TF-IDF scores.
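A quick way to see TF-IDF in action is with scikit-learn's TfidfVectorizer. This is a minimal sketch assuming scikit-learn is installed; the three documents are made up for illustration.

```python
# Minimal TF-IDF sketch: topic-specific words like "espresso" get higher
# weights in the coffee document than words that occur across every document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the espresso and cappuccino at this coffee shop are excellent",
    "the weather today is sunny and the park is crowded",
    "the team played well and the match ended in a draw",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse matrix: documents x vocabulary

# Weights for selected words in the first (coffee) document
row = tfidf[0].toarray().ravel()
for word in ("espresso", "cappuccino", "the", "and"):
    print(word, round(row[vectorizer.vocabulary_[word]], 3))
```

Running this shows espresso and cappuccino scoring higher than "the" and "and", which appear in every document and therefore carry little discriminative weight.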
Now, another embedding type is called prediction-based embeddings. And prediction-based embeddings capture semantic relationships and contextual information between words. So, for example, in the sentences "the dog is barking loudly" and "the dog is wagging its tail", a prediction-based model would learn to associate dog with words like bark, wag, and tail. This allows these models to create a single, fixed representation for dog that encompasses various dog-related concepts. Prediction-based embeddings excel at distinguishing between words with close meanings and can manage the various senses in which a word may be used.
Now, there are various models for generating word embeddings. One of the most popular is called Word2Vec, which was developed by Google in 2013. Now, Word2Vec has two main architectures: there's something called CBOW, and there's something called skip-gram. CBOW is an acronym for continuous bag of words. Now, CBOW predicts a target word based on its surrounding context words, while skip-gram does the opposite, predicting the context words given a target word.
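Here is a minimal training sketch using gensim's Word2Vec implementation, assuming the gensim package is installed. On a corpus this tiny the vectors won't be meaningful; the point is to show how the CBOW and skip-gram architectures are selected.

```python
# Sketch of training Word2Vec with gensim: sg=1 selects the skip-gram
# architecture, sg=0 selects CBOW.
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "is", "barking", "loudly"],
    ["the", "dog", "is", "wagging", "its", "tail"],
    ["the", "cat", "is", "sleeping", "on", "the", "sofa"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size on each side of the target word
    min_count=1,      # keep every word, even if it appears only once
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=50,
)

print(model.wv["dog"][:5])                    # first 5 dimensions of the "dog" vector
print(model.wv.most_similar("dog", topn=3))   # nearest neighbours in the toy space
```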
Now, another popular method is called GloVe, also an acronym; that one stands for Global Vectors for Word Representation. It was created at Stanford University in 2014, and it uses co-occurrence statistics to create word vectors. Now, these models differ in their approach: Word2Vec focuses on learning from the immediate context around each word, while GloVe takes a broader view by analyzing how often words appear together across the entire corpus, then uses this information to create word vectors.
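To make the "co-occurrence statistics" idea concrete, here is a sketch that only builds the co-occurrence counts GloVe starts from. The actual GloVe model then fits word vectors to these counts with a weighted least-squares objective, which is omitted here; the corpus and window size are illustrative.

```python
# Sketch: count how often each pair of words appears within a context window
# across the whole corpus. These counts are the raw statistics GloVe is
# trained on.
from collections import defaultdict

corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

def cooccurrence_counts(sentences, window=2):
    """Count pairs of words appearing within `window` positions of each other."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts

counts = cooccurrence_counts(corpus)
print(counts[("dog", "chased")], counts[("cat", "sat")])
```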
Now, while these two word embedding models continue to be valuable tools in NLP, the field has seen some significant advances with the emergence of new tech, particularly transformers. While traditional word embeddings assign a fixed vector to each word, transformer models use a different type of embedding, called a contextual embedding. Contextual embeddings are where the representation of a word changes based on its surrounding context. So, for example, in a transformer model, the word bank would have different representations in the sentences "I'm going to the bank to deposit money" and "I'm sitting on the bank of a river." This context sensitivity allows these models to capture more nuanced meanings and relationships between words, which has led to improvements across a wide range of NLP tasks.
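Here is a minimal sketch of the "bank" example using the Hugging Face transformers library. It assumes the transformers and torch packages are installed; the first run downloads the bert-base-uncased model, and the choice of that model is an illustrative assumption.

```python
# Sketch: the contextual embedding of "bank" differs between two sentences
# because a transformer's representation depends on the surrounding words.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    return hidden[tokens.index("bank")]

v1 = bank_vector("I'm going to the bank to deposit money.")
v2 = bank_vector("I'm sitting on the bank of a river.")

# Noticeably below 1.0, unlike a static embedding where both would be identical
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```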
So that's word embeddings. From simple numeric vectors to complex representations, word embeddings have revolutionized how machines understand and process human language, proving that transforming words into numbers is indeed a powerful tool for making sense of our linguistic world.
Artificial Intelligence, Technology, Science, Natural Language Processing, Word Embeddings, Machine Learning, IBM Technology