Simple RNNs and their Backpropagation

Simple RNN #

The simple RNN architecture with just a single layer of neurons that receive the input $\bm x$ is shown below.

simple-rnn-single-layer Simple RNN with just a single layer of $n_{neurons}$ neurons mapping the input to the hidden state at each time step

A more practical simple RNN architecture is shown below.

rnn-hidden-recurrence

Simple RNN with recurrences between hidden units. This architecture can compute any computable function and is therefore Turing-complete (it can simulate a Universal Turing Machine).

Notice how the path from the input $\bm x_{t-1}$ affects the label $\bm y_t$, and also the conditional independence structure of $\bm y$ given $\bm x$. Please note that this is not a computational graph but rather one way to represent the hidden-state transfer between recurrences.

Dimensioning Simple RNNs #

In the table below, $m$ is the number of examples in the mini-batch.

| Variable | Dimensions |
| --- | --- |
| $\bm h_t$ | $n_{neurons} \times 1$ or $m \times n_{neurons}$ |
| $\bm x_t$ | $n_{input} \times 1$ or $m \times n_{input}$ |
| $\bm U$ | $n_{neurons} \times n_{input}$ |
| $\bm W$ | $n_{neurons} \times n_{neurons}$ |
| $\bm b$ | $n_{neurons} \times 1$ |
| $\bm V$ | $n_{output} \times n_{neurons}$ |
| $\bm o$ | $n_{output} \times 1$ |
| $\bm c$ | $n_{output} \times 1$ |

Please note that multiple such layers can be stacked on top of each other, each keeping its own hidden state.

Forward Propagation #

This network maps the input sequence to a sequence of the same length and implements the following forward pass:

$$\bm a_t = \bm W \bm h_{t-1} + \bm U \bm x_t + \bm b$$

$$\bm h_t = \tanh(\bm a_t)$$

$$\bm o_t = \bm V \bm h_t + \bm c$$

$$\hat{\bm y}_t = \mathtt{softmax}(\bm o_t)$$

$$L(\bm x_1, \dots, \bm x_\tau, \bm y_1, \dots, \bm y_\tau) = D_{KL}[\hat p_{data}(\bm y | \bm x) \,||\, p_{model}(\bm y | \bm x; \bm w)]$$

$$= - \mathbb E_{\bm y | \bm x \sim \hat p_{data}} \log p_{model}(\bm y | \bm x; \bm w) = - \sum_t \log p_{model}(y_t | \bm x_1, \dots, \bm x_t; \bm w)$$
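To make these equations concrete, here is a minimal NumPy sketch of a single forward step; the layer sizes, random initialization, and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hypothetical sizes -- chosen only for illustration.
n_input, n_neurons, n_output = 4, 8, 3

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_neurons, n_input))    # input-to-hidden
W = rng.normal(scale=0.1, size=(n_neurons, n_neurons))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(n_output, n_neurons))   # hidden-to-output
b = np.zeros((n_neurons, 1))
c = np.zeros((n_output, 1))

def softmax(o):
    e = np.exp(o - o.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward_step(x_t, h_prev):
    """One recurrence of the forward equations above."""
    a_t = W @ h_prev + U @ x_t + b     # a_t = W h_{t-1} + U x_t + b
    h_t = np.tanh(a_t)                 # h_t = tanh(a_t)
    o_t = V @ h_t + c                  # o_t = V h_t + c
    y_hat_t = softmax(o_t)             # y_hat_t = softmax(o_t)
    return h_t, y_hat_t

h = np.zeros((n_neurons, 1))           # h_0 = 0
x = rng.normal(size=(n_input, 1))      # one input vector x_1
h, y_hat = forward_step(x, h)
```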

Notice that RNNs can model very generic distributions $\log p_{model}(\bm x, \bm y ; \bm w)$. The simple RNN architecture above effectively models the posterior distribution $\log p_{model}(\bm y | \bm x ; \bm w)$ and, based on a conditional independence assumption, it factorizes into $\sum_t \log p_{model}(y_t | \bm x_1, \dots, \bm x_t ; \bm w)$.

Note that by connecting $\bm y_{t-1}$ to $\bm h_t$ via a matrix, e.g. $\bm R$, we can avoid this simplifying assumption and model an arbitrary distribution $\log p_{model}(\bm y | \bm x ; \bm w)$. In other words, just like in other DNN architectures, connectivity directly affects the representational capacity of the hypothesis set.

In many instances we have problems where only the label $y_\tau$ at the end of the sequence matters. Let's say that you are classifying speech or video inside the cabin of a car to detect the psychological state of the driver. The same architecture shown above can also represent such problems; the only difference is that only $\bm o_\tau$, $L_\tau$ and $y_\tau$ will be considered.

Let's see an example to better understand the forward propagation equations.

example-sentence Example sentence as input to the RNN

In the figure above you have a hypothetical document (a sentence) that is broken into what in natural language processing are called tokens. Let's say that a token is a word in this case. In the simpler case where we need a classification of the whole document, given that $\tau=6$, at $t=1$ we receive the first token $\bm x_1$ and, with an initial hidden state $\bm h_0 = 0$, we calculate the forward equations for $\bm h_1$, ignoring the output $\bm o_1$. We repeat the unrolling as each next input $\bm x_2, \dots$ comes in, until we reach the end-of-sentence token $\bm x_6$, at which point we calculate the output $y_6$ and the loss

$$- \log p_{model}(y_6 | \bm x_1, \dots, \bm x_6; \bm w)$$

where $\bm w = \{ \bm W, \bm U, \bm V, \bm b, \bm c \}$.
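Continuing the NumPy sketch above (reusing its `forward_step`), a hedged illustration of this whole-document classification; the token "embeddings" and the label are hypothetical placeholders.

```python
# Unroll over a tokenized sentence with tau = 6 and keep only the final output.
tau = 6
embeddings = rng.normal(size=(tau, n_input, 1))  # stand-ins for x_1 .. x_6
label = 2                                        # hypothetical class index for y_6

h = np.zeros((n_neurons, 1))                     # h_0 = 0
for t in range(tau):
    h, y_hat = forward_step(embeddings[t], h)    # intermediate outputs o_t are ignored

loss = -np.log(y_hat[label, 0])                  # -log p_model(y_6 | x_1,...,x_6; w)
```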

Back-Propagation Through Time (BPTT) #

Let's now see how training through backward propagation works for RNNs.

rnn-BPTT Understanding RNN memory through the BPTT procedure

Backpropagation is similar to that of feed-forward (FF) networks, simply because the unrolled architecture resembles a FF one. But there is an important difference, which we explain using the above computational graph for the unrolled recurrences $t$ and $t-1$. During computation of the variable $\bm h_t$ we use the value of the variable $\bm h_{t-1}$ calculated in the previous recurrence. So when we apply the chain rule in the backward phase of BP, for all nodes that involve such variables with recurrent dependencies, the end result is that _non-local_ gradients from previous backpropagation steps ($t$ in the figure) appear. This is effectively why we say that simple RNNs feature _memory_. This is in contrast to the FF network case where, during BP, only gradients local to each gate were involved, as we have seen in the DNN chapter.

The key point to notice in the backpropagation in recurrence $t-1$ is the junction between $\tanh$ and $\bm V \bm h_{t-1}$. This junction brings in the gradient $\nabla_{\bm h_{t-1}}L_t$ from the backpropagation of the $\bm W \bm h_{t-1}$ node in recurrence $t$ and, precisely because it is a junction, it is added to the backpropagated gradient from above in the current recurrence $t-1$, i.e.

$$\nabla_{\bm h_{t-1}}L_{t-1} \leftarrow \nabla_{\bm h_{t-1}}L_{t-1} + \nabla_{\bm h_{t-1}}L_t$$

Ian Goodfellow's book, section 10.2.2, provides the exact equations; please note that you only need to know the intuition behind computational graphs for RNNs. In practice BPTT is truncated, to avoid having to do one full forward pass and one full reverse pass through the training dataset (e.g. of a language model, which is usually very large) just to perform a single gradient update.
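To illustrate the gradient accumulation described above, here is a minimal BPTT sketch for the case where only the final step carries a loss. It continues the NumPy sketch from the forward-propagation section (reusing `softmax`, `U`, `W`, `V`, `b`, `c`); all names and shapes are assumptions, not the textbook's exact equations.

```python
def bptt(xs, label):
    """Forward + backward pass over a list of input column vectors xs."""
    # Forward pass, caching the hidden states needed by the backward pass.
    hs = [np.zeros((n_neurons, 1))]            # h_0 = 0
    for x_t in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x_t + b))
    y_hat = softmax(V @ hs[-1] + c)

    # Backward pass.
    dU, dW, db = np.zeros_like(U), np.zeros_like(W), np.zeros_like(b)
    do = y_hat.copy(); do[label] -= 1.0        # dL/do_tau for softmax + NLL loss
    dV, dc = do @ hs[-1].T, do
    dh_next = V.T @ do                          # gradient entering the last recurrence

    for t in reversed(range(len(xs))):
        dh = dh_next                            # gradient flowing in from downstream
                                                # (a local loss term would be added here
                                                #  if every step had its own loss)
        da = (1.0 - hs[t + 1] ** 2) * dh        # back through tanh
        dU += da @ xs[t].T
        dW += da @ hs[t].T
        db += da
        dh_next = W.T @ da                      # non-local gradient sent to step t-1
    return dU, dW, db, dV, dc
```

The `dh_next = W.T @ da` line is exactly the non-local gradient $\nabla_{\bm h_{t-1}}L_t$ that gets added at the junction in recurrence $t-1$.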

Vanishing or exploding gradients #

In the figure below we have drafted a conceptual version of what is happening with recurrences over time. It is called an infinite impulse response filter, for reasons that will become apparent shortly.

rnn-IIR Infinite Impulse Response (IIR) filter with weight $w$

With $D$ denoting a unit delay, the recurrence formula for this system is:

$$h_t = w h_{t-1} + x_t$$

where $w$ is a weight (a scalar). Let's consider what happens when an impulse $x_t = \delta_t$ is fed at the input of this system with $w=-0.9$.

$$h_0 = -0.9\, h_{-1} + \delta_0 = 1$$

$$h_1 = -0.9\, h_0 + \delta_1 = -0.9$$

$$h_2 = -0.9\, h_1 + \delta_2 = 0.81$$

$$h_3 = -0.9\, h_2 + \delta_3 = -0.729$$

With $w=-0.9$, the sequence $h_t$ (called the impulse response) follows a decaying exponential envelope, while with $|w| > 1.0$ it would follow an exponentially increasing envelope. If such recurrences continue, they result in vanishing or exploding responses long after the impulse appeared at the input at $t=0$.
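A few lines of Python reproduce this impulse response and its exploding counterpart; the step count and the value $w=1.1$ are arbitrary choices for illustration.

```python
# Impulse response of the scalar recurrence h_t = w h_{t-1} + x_t.
# With |w| < 1 the response decays geometrically; with |w| > 1 it blows up.
def impulse_response(w, steps=8):
    h, out = 0.0, []
    for t in range(steps):
        x_t = 1.0 if t == 0 else 0.0    # unit impulse delta_t
        h = w * h + x_t
        out.append(h)
    return out

print(impulse_response(-0.9))   # 1, -0.9, 0.81, -0.729, ... (decaying envelope)
print(impulse_response(1.1))    # 1, 1.1, 1.21, 1.331, ...   (growing envelope)
```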

Using this primitive IIR filter as an example, we can see that the weight plays a crucial role in the impulse response. In a similar fashion, the RNN hidden-state recurrence, in the backward pass of BP that extends from $t=\tau$ to $t=1$, can make the gradient either vanish or explode when $\tau$ is large. Instead of a scalar $w$ we have the matrix $\bm W$, and instead of $h$ we have gradients $\nabla_{\bm h_t}L_t$. This is discussed in this paper.

Thinking simplistically, the derivative of the $\tanh$ non-linearity, shown below, is between 0 and 1, suppressing the gradients and slowing down training.

tanh-derivative Derivative of the $\tanh$ non-linearity

Similarly, the successive application of the $\bm W$ matrix can cause exploding gradients since, simplistically (ignoring the non-linearity), the hidden state can be written as

$$\bm h_t = \bm W \bm h_{t-1}$$

which after $\tau$ steps becomes

$$\bm h_\tau = \bm W^\tau \bm h_0$$

If the magnitudes of the eigenvalues of $\bm W$ are less than 1.0, the matrix will create vanishing gradients, as it is involved in the $\nabla_{\bm h_t}L_t$ expression (see the equations in section 10.2.2 of the textbook). This issue is addressed by the LSTM type of cells.
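As a rough numerical illustration of the eigenvalue argument (not a derivation from the textbook equations), the sketch below applies $\bm W$ repeatedly to an arbitrary $\bm h_0$ and prints the norm of $\bm W^\tau \bm h_0$ for a matrix with eigenvalues below 1 and one with eigenvalues above 1; both matrices are made up for the demonstration.

```python
import numpy as np

def norm_after(W, h0, tau):
    h = h0
    for _ in range(tau):
        h = W @ h                       # h_t = W h_{t-1}  (non-linearity ignored)
    return np.linalg.norm(h)

rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 1))
W_small = 0.5 * np.eye(4)               # eigenvalues 0.5 -> vanishing
W_large = 1.5 * np.eye(4)               # eigenvalues 1.5 -> exploding

for tau in (1, 10, 50):
    print(tau, norm_after(W_small, h0, tau), norm_after(W_large, h0, tau))
```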