# Recurrent Neural Network¶

Note

This part of the documentation is largely referring to Aalto CS-E4890. < Source: Aalto CS-E4890 >

RNN is a specialized NN for processing sequential data $$x^{(1)}, \cdots, x^{(\mathcal{T})}$$. RNN employs parameter sharing just as CNN does. In CNN it is a kernel applied to a grid within images, and in RNN an n-gram sequence on sentences. RNNs are able to preserve information about the sequence order. < The unfolding the computational graph makes the RNN to resemble feedforward network structure. The back propagation is done normally. The parameters are shared over the training. Source: Aalto CS-E4890 > < The back propagation through time is expensive as one has to calculate loss L from each time step to every hidden state i.e., $$L = \sum_{t=1}^{\mathcal{T}} L^{(t)}$$ Source: Aalto CS-E4890 > < Training RNNs is unstable. Source: Aalto CS-E4890 >

## Vanishing & exploding gradient problem¶

If a computatinal graph is deep and a shared parameters are repeatedly multiplied, as in RNN, there may be either a vanishing or exploding gradient problem. Here’s a simple recurrence without input or activation function:

$h^{(t)} = W^T h^{(t-1)}$

This can be also presented as several multiplications by the same weight matrix

$h^{(t)} = (W^T)^t h^{(0)}$

W can be factorized as

\begin{split}\begin{align} \mathbf{W} = \mathcal{Q \Lambda Q^T}, & \\ & \text{( \mathcal{Q}: orthogonal matrix composed of eigenvectors of \mathbf{W}, } \\ & \text{ \Lambda: diagonal matrix of eigenvalues)} \end{align}\end{split}

We can thus conclude that:

$h^{(t)} = \mathcal{Q \Lambda^t Q^T} h^{(0)}$

Since the eigenvalues are raised to the power of t, the gradient will explode if the largest eigenvalue is >1, and vanish if the largest eigenvalue is < 1

You can solve this issue by clipping gradients; just clip the gradient if it is larger than the threshold. Clipping can be done element-wise or in vectorized way.

$if \|g\| > v, \quad g \leftarrow \frac{gv}{\|g\|}$