Neural Language Models
These notes borrow heavily from the CS229N 2019 set of notes on Language Models.
Language modeling is the task of predicting (i.e., assigning a probability to) what word comes next. More formally, given a sequence of words $\mathbf x_1, …, \mathbf x_t$, the language model returns
$$p(\mathbf x_{t+1} | \mathbf x_1, …, \mathbf x_t)$$
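To make this interface concrete, here is a minimal count-based (bigram) sketch in Python: a function that returns a probability distribution over the next word given the history. The toy corpus and the `p_next` helper are purely illustrative and truncate the history to the last word; the RNN models below condition on the full history.

```python
from collections import Counter, defaultdict

# Minimal count-based sketch of the language-model interface:
# given the words seen so far, return p(next word | history).
# Illustrative only: the history is truncated to the last word (bigram model).
corpus = "the cat sat on the mat the cat ate".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(history):
    """Return a dict mapping each candidate next word to its probability."""
    counts = bigram_counts[history[-1]]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(p_next(["the", "cat"]))  # e.g. {'sat': 0.5, 'ate': 0.5}
```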
Language Model Example
How can we build language models, though?
We will use the RNN architecture shown next for a simple example:
RNN Language Model. Note the different notation; the following replacements must be made: $W_h \rightarrow W$, $W_e \rightarrow U$, $U \rightarrow V$
where the vocabulary is [‘h’, ‘e’, ‘l’, ‘o’] and the tokens are single letters, each represented in the input by a one-hot encoded vector.
RNN language model example - training ref. Note that in practice, in place of the one-hot encoded word vectors, we will have word embeddings.
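For reference, a minimal sketch of the one-hot encoding used in this example, assuming the vocabulary [‘h’, ‘e’, ‘l’, ‘o’]; the `one_hot` helper and variable names are illustrative.

```python
import numpy as np

# One-hot encoding of the characters of "hello" over the vocabulary
# ['h', 'e', 'l', 'o'] from the figure above. In practice these one-hot
# vectors are replaced by learned embeddings.
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

inputs = np.stack([one_hot(c) for c in "hell"])   # inputs:  h, e, l, l
targets = [char_to_idx[c] for c in "ello"]        # targets: e, l, l, o
print(inputs)   # 4 x 4 matrix of one-hot rows
print(targets)  # [1, 2, 2, 3]
```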
Let us assume that the network is being trained with the sequence “hello”. The letters come in one at a time, each letter going through the forward pass that produces at the output the $\mathbf y_t$ that indicates which letter is expected to arrive next. You can see, since we have just started training, that this network is not predicting correctly - this will improve over time as the model is trained with more sequence permutations from our limited vocabulary. During inference we will use the language model to generate the next token.
RNN language model example - generate the next token ref
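A rough sketch of such next-token generation with a character-level RNN is shown below. The layer sizes, the greedy `generate` helper and the use of an untrained network are illustrative assumptions, not details from the figure; sampling from the output distribution would work just as well as taking the argmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of next-token generation with a character-level RNN over the tiny
# vocabulary ['h', 'e', 'l', 'o']. The model and its sizes are illustrative;
# since the network is untrained, its predictions are essentially random.
vocab = ['h', 'e', 'l', 'o']
V, H = len(vocab), 16

rnn = nn.RNN(input_size=V, hidden_size=H, batch_first=True)
head = nn.Linear(H, V)  # maps the hidden state to logits over the vocabulary

def generate(prefix, n_steps=5):
    """Greedily extend a prefix string, one character per step."""
    out = list(prefix)
    for _ in range(n_steps):
        # one-hot encode the sequence generated so far: shape (1, seq_len, V)
        idx = torch.tensor([[vocab.index(c) for c in out]])
        x = F.one_hot(idx, num_classes=V).float()
        hidden_states, _ = rnn(x)            # hidden_states: (1, seq_len, H)
        logits = head(hidden_states[:, -1])  # predict from the last time step
        out.append(vocab[torch.argmax(logits, dim=-1).item()])
    return "".join(out)

print(generate("h"))
```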
More concretely, to train a language model we need a big corpus of text, which is a sequence of tokens $\mathbf x_1, …, \mathbf x_{T}$ where $T$ is the number of words / tokens in the corpus.
At every time step we feed one word to the RNN and compute the output probability distribution $\hat{\mathbf y}_t$, which by construction is a conditional probability distribution over every word in the dictionary given the words we have seen so far. The loss function at time step $t$ is the classic cross entropy loss between the predicted probability distribution and the distribution that corresponds to the one-hot encoded true next token.
$$J_t(\theta) = CE(\hat{\mathbf y}_t, \mathbf y_t) = - \sum_{w \in V} \mathbf y_t^{(w)} \log \hat{\mathbf y}_t^{(w)} = - \log \hat{\mathbf y}_t^{(\mathbf x_{t+1})}$$

where the sum runs over the vocabulary $V$ and the last equality holds because $\mathbf y_t$ is one-hot at the true next token $\mathbf x_{t+1}$.
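The following sketch checks this per-step loss numerically, assuming a small vocabulary and random logits (both illustrative): the built-in cross entropy equals $-\log \hat{\mathbf y}_t^{(\mathbf x_{t+1})}$.

```python
import torch
import torch.nn.functional as F

# Per-step loss J_t(theta): cross entropy between the predicted distribution
# y_hat_t (softmax over the vocabulary) and the one-hot distribution of the
# true next token. Sizes and values here are illustrative.
V = 4                                   # vocabulary size
logits = torch.randn(1, V)              # RNN output at step t, before softmax
true_next = torch.tensor([2])           # index of the true next token x_{t+1}

y_hat = F.softmax(logits, dim=-1)       # predicted distribution y_hat_t
manual = -torch.log(y_hat[0, true_next])      # -log y_hat_t^(x_{t+1})
builtin = F.cross_entropy(logits, true_next)  # same value

print(manual.item(), builtin.item())
```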
This is shown visually in the next figure for a hypothetical sequence of words.
RNN Language Model Training Loss. For each input word (at step $t$), the RNN predicts the next word and is penalized with a loss $J_t(\theta)$. The total loss is the average across the corpus.
In practice we don’t compute the total loss over the whole corpus; just as with DNN and CNN networks, we train over a finite span, compute gradients over that span, and iterate with a stochastic gradient descent optimization algorithm. Character-level language models from Facebook Research have achieved state-of-the-art NLP results.
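Below is a hedged sketch of this span-based training loop (truncated backpropagation through time), reusing the ‘hello’ character example. The span length, hidden size, learning rate and data are illustrative assumptions, not values from the notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of training over finite spans with SGD (truncated BPTT).
# Sizes, span length and learning rate are illustrative.
vocab = ['h', 'e', 'l', 'o']
V, H, SPAN = len(vocab), 16, 3

rnn = nn.RNN(input_size=V, hidden_size=H, batch_first=True)
head = nn.Linear(H, V)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

data = torch.tensor([vocab.index(c) for c in "hellohello"])

for start in range(0, len(data) - SPAN, SPAN):
    inp = data[start:start + SPAN]           # x_t ... x_{t+SPAN-1}
    tgt = data[start + 1:start + SPAN + 1]   # the true next tokens
    x = F.one_hot(inp, num_classes=V).float().unsqueeze(0)  # (1, SPAN, V)

    hidden_states, _ = rnn(x)                # (1, SPAN, H)
    logits = head(hidden_states).squeeze(0)  # (SPAN, V)
    loss = F.cross_entropy(logits, tgt)      # average of J_t over the span

    opt.zero_grad()
    loss.backward()       # gradients are computed only within the span
    opt.step()
    print(f"span starting at {start}: loss {loss.item():.3f}")
```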