Word2Vec Embeddings
Introduction
In so-called classical NLP, words were treated as atomic symbols, e.g. hotel, conference, walk, and they were represented with one-hot encoded (sparse) vectors, e.g. hotel = [0 0 0 0 1 0 … 0] and motel = [0 0 1 0 0 0 … 0]. The size of these vectors is equal to the vocabulary size. The problem with this representation is that it carries no notion of similarity: motel and hotel look completely unrelated, as their dot product is 0.
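To make this concrete, here is a minimal sketch (with an assumed four-word toy vocabulary) showing that the one-hot vectors of any two distinct words are orthogonal:

```python
import numpy as np

# Assumed toy vocabulary for illustration only.
vocab = ["conference", "hotel", "motel", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a sparse one-hot vector of size |V| for the given word."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel, motel)
print("dot product:", hotel @ motel)  # 0.0 -- one-hot vectors encode no similarity
```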
One of the key ideas that made modern NLP successful is distributional semantics, which originated from Firth's work: a word's meaning is given by the words that frequently appear close by. When a word appears in a text, its context is the set of words that appear nearby (within a fixed-size window), and the many contexts of a word across the corpus build up its representation.

Distributional similarity representations: banking is represented by the words to its left and right across all sentences of our corpus.
This is the main idea behind word2vec word embeddings (representations) that we address next.
Before we deal with embeddings, though, it's important to address a conceptual question:
Is there some ideal word-embedding space that would perfectly map human language and could be used for any natural-language-processing task? Possibly, but we have yet to compute anything of the sort. More pragmatically, what makes a good word-embedding space depends heavily on your task: the perfect word-embedding space for an English-language movie-review sentiment-analysis model may look different from the perfect embedding space for an English-language legal–document-classification model, because the importance of certain semantic relationships varies from task to task. It’s thus reasonable to learn a new embedding space with every new task.
Features of Word2Vec embeddings
In 2012, Tomas Mikolov, then an intern at Microsoft, found a way to encode the meaning of words in a modest number of vector dimensions.
word2vec-generated embedding for the word banking in d=8 dimensions.
Here is a visualization of these embeddings re-projected into 3D space:

Semantic map produced by word2vec for US cities.
Another classic example that shows the power of word2vec representations to encode analogies is the king − man + woman ≈ queen example shown below.
Classic queen example, where king − man ≈ queen − woman, and we can visually see that in the red arrows. There are four analogies one can construct, based on the parallel red arrows and their direction. This is slightly idealized; the vectors need not be so similar to be the most similar among all word vectors. The similar direction of the red arrows indicates similar relational meaning.
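As a quick check of this analogy, here is a hedged sketch using gensim's pretrained Google News vectors (the model name and the download step are assumptions; any pretrained KeyedVectors would do):

```python
# Sketch of the analogy test with pretrained word2vec vectors.
# Assumes gensim is installed and the large "word2vec-google-news-300"
# model can be downloaded via gensim's downloader.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# Vector arithmetic: king - man + woman should rank "queen" near the top.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```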
So what is a more formal description of the word2vec algorithm? We will focus on one of the two computational algorithms, the skip-gram method (the other being CBOW, the continuous bag of words), and use the following diagrams as examples to explain how it works.
In the skip-gram model we predict the context given the center word. We need to calculate the probability $P(w_{t+j} \mid w_t)$ of each context word given the center word. We go through each position $t = 1, \dots, T$ in the corpus and, for a window of size $m$, maximize the likelihood

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

For example, the meaning of banking is captured by predicting the context (the words around it) in which banking occurs across our corpus.
The term prediction points to the regression section and the maximum likelihood principle. We start from the familiar cross-entropy loss and architect a neural estimator that will minimize the distance between the predicted distribution over context words $\hat{y}$ and the empirical one-hot distribution $y$. Equivalently, we minimize the average negative log-likelihood

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

where $\theta$ denotes all the parameters (the word vectors) of the model.
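The following small numpy sketch (with an assumed four-word distribution) illustrates why the two views coincide: with a one-hot target, the cross-entropy loss is exactly the negative log-probability of the true context word.

```python
import numpy as np

# Assumed predicted distribution over a 4-word vocabulary (illustrative values).
y_hat = np.array([0.1, 0.6, 0.05, 0.25])
# One-hot empirical distribution: the true context word is at index 1.
y = np.array([0.0, 1.0, 0.0, 0.0])

cross_entropy = -np.sum(y * np.log(y_hat))
neg_log_likelihood = -np.log(y_hat[1])
print(cross_entropy, neg_log_likelihood)  # both equal ~0.5108
```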
Training data generation for the sentence ‘Claude Monet painted the Grand Canal of Venice in 1908’
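A minimal sketch of how such training pairs could be generated for this sentence, assuming a window size of m = 2 and simple whitespace tokenization:

```python
# Generate (center word, context word) skip-gram pairs for the example sentence.
sentence = "Claude Monet painted the Grand Canal of Venice in 1908".lower().split()
m = 2  # assumed window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print(pairs[:6])
# [('claude', 'monet'), ('claude', 'painted'), ('monet', 'claude'),
#  ('monet', 'painted'), ('monet', 'the'), ('painted', 'claude')]
```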
So the question now becomes how to calculate $P(w_{t+j} \mid w_t; \theta)$. Word2vec uses two vectors per word: $v_w$ when $w$ is a center word and $u_w$ when $w$ is a context (outside) word. For a center word $c$ and a context word $o$,

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
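A small numpy sketch of this softmax computation (the matrix U of context vectors, the center vector v_c, and the dimensions are illustrative assumptions):

```python
import numpy as np

def p_context_given_center(U: np.ndarray, v_c: np.ndarray) -> np.ndarray:
    """Softmax over u_w . v_c for every word w in the vocabulary."""
    scores = U @ v_c                            # one dot product per word
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 3))   # assumed |V| = 8 context ("outside") vectors, d = 3
v_c = rng.normal(size=3)      # assumed center-word vector
probs = p_context_given_center(U, v_c)
print(probs, probs.sum())     # a distribution over the 8 vocabulary words, sums to 1
```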
Conceptual architecture of the neural network that learns word2vec embeddings. The text refers to the hidden layer dimension as the embedding dimension $d$.
The network accepts the one-hot encoded center word and, via an embedding layer (the $|V| \times d$ matrix $W$), maps it to a $d$-dimensional hidden representation; a second, hidden-to-output matrix followed by a softmax produces a probability for every word in the vocabulary to appear in the context of the center word. The parameters of both matrices are learned with stochastic gradient descent by minimizing the cross-entropy loss above. Training for large vocabularies can be quite computationally intensive. At the end of training we are then able to store the matrix $W$, whose rows are the learned word embeddings, and discard the rest of the network.
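In practice one rarely implements this from scratch; here is a hedged sketch using gensim's Word2Vec on the toy corpus of the example below (hyperparameters are illustrative) that shows how the stored embedding lookup table is all that remains after training:

```python
# Train skip-gram embeddings on a toy corpus and keep only the embedding matrix.
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "saw", "a", "cat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "climbed", "a", "tree"],
]

# sg=1 selects the skip-gram algorithm; vector_size, window, epochs are assumptions.
model = Word2Vec(corpus, vector_size=8, window=2, min_count=1, sg=1, epochs=200)

embeddings = model.wv          # the stored |V| x d lookup table (KeyedVectors)
print(embeddings["cat"])       # the 8-dimensional context-free vector for "cat"
```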
Mind the important difference between learning a representation from the context across the corpus and the application of that representation. Word2Vec embeddings are applied context-free: after training, a single vector is stored and used for each word, regardless of the sentence in which it appears (for instance, bank in "river bank" and in "bank account" share the same vector). Contextual representations will be addressed in a separate section.
Example
Consider the training corpus having the following sentences:
“the dog saw a cat”,
“the dog chased the cat”,
“the cat climbed a tree”
In this corpus the unique vocabulary consists of 8 words. Ordering them alphabetically (a, cat, chased, climbed, dog, saw, the, tree), each word can be represented by an 8-dimensional one-hot vector. Assume the hidden layer has 3 neurons, so the input-to-hidden weight matrix $W$ is $8 \times 3$ and the hidden-to-output matrix $W'$ is $3 \times 8$.
Suppose we want the network to learn the relationship between the words "cat" and "climbed". That is, the network should output a high probability for "cat" when "climbed" is presented at its input. In word-embedding terminology, the word "cat" is referred to as the target word and the word "climbed" is referred to as the context word. In this case, the input vector is the one-hot encoding of "climbed", $X = [0\ 0\ 0\ 1\ 0\ 0\ 0\ 0]^T$.
With the input vector representing "climbed", the output at the hidden layer neurons can be computed as $H^T = X^T W$; because $X$ is one-hot, $H$ is simply a copy of the fourth row of $W$ (the hidden layer is linear, with no non-linear activation).
Carrying out similar manipulations for the hidden-to-output layer, the activation (score) vector for the output layer neurons can be written as $U = H^T W'$, giving one score per vocabulary word.
Since the goal is to produce probabilities for the words in the output layer, the scores $U$ are converted into a probability distribution with the softmax function: $P(w_j \mid \text{"climbed"}) = \exp(u_j) / \sum_{k=1}^{8} \exp(u_k)$.
Thus we obtain probabilities for all eight words in the corpus; the one of interest is the probability of the chosen target word "cat". Given the target vector $[0\ 1\ 0\ 0\ 0\ 0\ 0\ 0]$, the error vector for the output layer is easily computed via the cross-entropy (CE) loss. Once the loss is known, the weights in the matrices $W$ and $W'$ can be updated using backpropagation. Thus, training proceeds by presenting different context-target word pairs from the corpus. The context can be more than one word, in which case the loss is the average loss across all pairs.
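Here is a minimal numpy sketch of this forward pass and one gradient update for the toy corpus, assuming the alphabetical vocabulary ordering and the 3-neuron hidden layer used above (the random initialization and learning rate are illustrative):

```python
import numpy as np

vocab = ["a", "cat", "chased", "climbed", "dog", "saw", "the", "tree"]
V, d, lr = len(vocab), 3, 0.1

rng = np.random.default_rng(42)
W = rng.normal(scale=0.1, size=(V, d))       # input-to-hidden (embedding) matrix
W_out = rng.normal(scale=0.1, size=(d, V))   # hidden-to-output matrix

x = np.zeros(V); x[vocab.index("climbed")] = 1.0   # one-hot input "climbed"
y = np.zeros(V); y[vocab.index("cat")] = 1.0       # one-hot target "cat"

h = x @ W                          # hidden activations = the row of W for "climbed"
u = h @ W_out                      # one score per vocabulary word
p = np.exp(u) / np.exp(u).sum()    # softmax probabilities
loss = -np.log(p[vocab.index("cat")])   # cross-entropy with a one-hot target

# Backpropagation: for softmax + CE the output-layer error is simply p - y.
e = p - y
grad_W_out = np.outer(h, e)        # gradient for the hidden-to-output matrix
grad_h = e @ W_out.T               # gradient flowing back into the hidden layer
W_out -= lr * grad_W_out
W -= lr * np.outer(x, grad_h)      # only the row for "climbed" actually changes

print(f"loss before update: {loss:.4f}")
```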
References
The simple example was from here.