Fundamentals

Intro


< Representation of a linear classifier neural network. Source: cs231n >

\(\Theta^{(j)}\): matrix of weights controlling function mapping from layer j to layer j+1.

The value of each activation node is obtained as follows:

\[\begin{split}\text{3x4 matrix} \left\{ \begin{array}{lr} a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\ a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\ a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \end{array} \right.\end{split}\]
\[ \begin{align}\begin{aligned}h_\Theta(x) &= a_1^{(3)}\\a_1^{(3)} &= g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})\end{aligned}\end{align} \]

As you can see, each layer gets its own matrix of weights, \(\Theta^{(j)}\).


If a network has \(S_j\) units in layer j and \(S_{j+1}\) units in layer j+1, then \(\Theta^{(j)}\) will be of dimension \(S_{j+1} \times (S_j + 1)\). (The +1 accounts for the bias node, \(x_0\).)
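As a small sketch (the input values and random weights below are made up), the 3x4 weight matrix from the example above maps a 3-dimensional input plus a bias node to 3 activation units:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.5, 2.0])  # x_0 = 1 is the bias node
Theta1 = np.random.randn(3, 4)       # S_2 x (S_1 + 1) = 3 x 4
a2 = sigmoid(Theta1.dot(x))          # activations of layer 2, shape (3,)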

Score function

A function that takes an input and returns class scores. For a linear classifier it is \(Wx + b\).
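For instance, a minimal linear score function in numpy (sizes and values here are arbitrary):

import numpy as np

W = np.random.randn(3, 4)  # 3 classes, 4-dimensional input
b = np.random.randn(3)     # per-class bias
x = np.random.randn(4)     # one input example

scores = W.dot(x) + b      # one score per class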

Cost/Loss Function

A loss function measures how well a given set of weights/parameters fits the ground-truth labels in the training set. Popular loss functions are

  • svm (hinge) loss
  • softmax (cross-entropy) loss

relu, sigmoid, and tanh, which are often listed alongside them, are activation non-linearities rather than losses; they are discussed next.

< Left: Sigmoid non-linearity squashes real numbers to range between [0,1] Right: The tanh non-linearity squashes real numbers to range between [-1,1]. Source: Stanford cs231n >

Sigmoid: The sigmoid non-linearity has the mathematical form \(\sigma ( x ) = 1/ \left( 1+ e ^ { - x } \right)\). Its output is not zero-centered, so during gradient descent it may cause zig-zagging updates.

tanh: The tanh neuron is simply a scaled sigmoid neuron; in particular, the following holds: \(\tanh ( x ) = 2\sigma ( 2x ) - 1\).


< Left: Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1 when x > 0. Right: A plot from Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence with the ReLU unit compared to the tanh unit. >

Maxout: The Maxout neuron computes the function \(\max \left( w _ { 1} ^ { T } x + b _ { 1} ,w _ { 2} ^ { T } x + b _ { 2} \right)\). It’s a generalized form of ReLU and Leaky ReLU.
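A quick numpy sketch of these non-linearities (a sketch only; the maxout helper here is illustrative), which also checks the tanh/sigmoid identity above numerically:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1.dot(x) + b1, w2.dot(x) + b2)  # generalizes ReLU

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True: tanh(x) = 2*sigma(2x) - 1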

For a network with \(K\) output units, the regularized cross-entropy cost over \(m\) training examples is

\[\begin{split}J(\theta) = - \frac{1}{m} \sum^{m}_{i=1} \sum^{K}_{k=1} [y_k^i \log((h_\theta (x^i))_k) + (1-y_k^i)\log(1-(h_\theta(x^i))_k)] \\ + \frac{\lambda}{2m} \sum^{L-1}_{l=1} \sum^{S_l}_{i=1} \sum^{S_{l+1}}_{j=1} (\theta_{j,i}^{(l)})^2\end{split}\]

Here \(\lambda\) is the regularization strength; the second term penalizes large weights.
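A minimal sketch of this cost, assuming (these shapes are an assumption, not fixed by the text) that h holds the network outputs \((h_\theta(x^i))_k\) as an (m, K) matrix, Y the one-hot labels, and thetas the per-layer weight matrices with the bias weights in their first column:

import numpy as np

def cost(h, Y, thetas, lam):
    # h: (m, K) network outputs, Y: (m, K) one-hot labels
    m = Y.shape[0]
    cross_entropy = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    reg = (lam / (2 * m)) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)  # skip bias column
    return cross_entropy + reg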

Weight optimization

In practice it often works better to compute the numeric gradient using the centered difference formula: \([f(x+h) - f(x-h)] / 2h\) (Wiki).

Gradient Descent

Here is pseudo-code for vanilla gradient descent and minibatch (stochastic) gradient descent.

# Source: http://cs231n.github.io/optimization-1/
# Vanilla Gradient Descent
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

# Vanilla Minibatch Gradient Descent
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update
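To make the pseudo-code concrete, here is a tiny runnable example on a made-up least-squares problem, using the same update rule:

import numpy as np

# Toy data: y = 3*x + noise; fit a single weight with vanilla gradient descent.
np.random.seed(0)
x = np.random.randn(100)
y = 3 * x + 0.1 * np.random.randn(100)

w = 0.0
step_size = 0.1
for _ in range(100):
    weights_grad = np.mean(2 * (w * x - y) * x)  # gradient of the mean squared error
    w += - step_size * weights_grad              # perform parameter update

print(w)  # close to 3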

There are two ways to compute gradients:

  • numeric
  • analytic

The numerical approach is simple but gives only an approximation, not the exact gradient. The analytic approach is fast and exact, but it is error-prone because it is easy to make a mistake during the mathematical derivation. Therefore, in practice you use a gradient check (compare the numerical and analytic solutions).
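A sketch of such a gradient check on a deliberately simple function, comparing the centered-difference estimate against the analytic gradient:

import numpy as np

f = lambda x: np.sum(x ** 2)      # simple test function
analytic_grad = lambda x: 2 * x   # its known analytic gradient

def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)  # centered difference
    return grad

x = np.random.randn(5)
print(np.allclose(numerical_gradient(f, x), analytic_grad(x)))  # True: the check passes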

Epochs, batches and iterations

  • Epoch: A single pass through the entire dataset.
  • Batch: A subset of the dataset; the dataset is divided into batches and the parameters are updated once per batch.
  • Iteration: One parameter update on one batch; the number of iterations needed to complete an epoch is the dataset size divided by the batch size.
\[\text{dataset size} = \text{batch size} \times \text{iterations per epoch}\]
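For example, with 2,048 training examples and a batch size of 256, one epoch takes

\[2048 / 256 = 8 \text{ iterations}\]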

Back-propagation

It’s a recursive application of the chain rule along a computational graph to compute the gradients of all parameters. It is distinct from stochastic gradient descent, which uses those gradients to perform learning. It’s an algorithm that computes the chain rule of calculus, with a specific order of operations that is highly efficient [Goodfellow-et-al]. The gradients are computed layer by layer, starting from the output layer and working back toward the input layer.

In a Stanford lecture on backpropagation, the TA shows that the analytic gradient computation can be represented as a computational graph.


< Source: Stanford cs231n >

Example


< In the figure, the upper numbers are the node values and the lower numbers their gradients/derivatives. The node values are filled in by the forward pass and the gradients by back propagation. Source: Stanford cs231n >

Here \(f(x, y, z) = (x + y)z\) with \(q = x + y\), evaluated at \(x = -2\), \(y = 5\), \(z = -4\):

\[\begin{split}\begin{align} \frac{\partial f}{\partial x} &= \frac{\partial q}{\partial x} \frac{\partial f}{\partial q} = 1 \times -4 = -4 \\ \frac{\partial f}{\partial y} &= \frac{\partial q}{\partial y} \frac{\partial f}{\partial q} = 1 \times -4 = -4 \\ \frac{\partial f}{\partial z} &= q = x + y = -2 + 5 = 3 \end{align}\end{split}\]

So what do we do with the local gradients in the computational graph? We take the upstream gradient flowing down and multiply it by the local gradient to get the gradient with respect to each input.

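The same computation written out as code, with the values from the worked example above:

# Forward pass
x, y, z = -2, 5, -4
q = x + y           # q = 3
f = q * z           # f = -12

# Backward pass: multiply each local gradient by the upstream gradient
df_dq = z           # df/dq = z = -4
df_dz = q           # df/dz = q = 3
df_dx = 1 * df_dq   # the + gate passes the gradient through: -4
df_dy = 1 * df_dq   # -4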

Here’s a bit more complicated example.


< Source: Stanford cs231n >

I will show the back propagation step-by-step.

\[\begin{split}\frac{\partial f}{\partial f} = 1 \\ \frac{\partial q}{\partial x} = \frac{\partial (1/x)}{\partial x} = \frac{-1}{x^2} = \frac{-1}{1.37^2} = -0.53 \\ \frac{\partial w}{\partial x} = \frac{\partial (c + x)}{\partial x} = 1, \quad 1 \times -0.53 = -0.53 \\ \frac{\partial e}{\partial x} = \frac{\partial e^x}{\partial x} = e^x = e^{-1} = 0.37, \quad 0.37 \times -0.53 = -0.2 \\ \frac{\partial r}{\partial x} = \frac{\partial (-x)}{\partial x} = -1, \quad -1 \times -0.2 = 0.2 \\ \frac{\partial t}{\partial x} = \frac{\partial (c + x)}{\partial x} = 1, \quad 1 \times 0.2 = 0.2 \\ \frac{\partial y}{\partial x} = \frac{\partial (c + x)}{\partial x} = 1, \quad 1 \times 0.2 = 0.2 \\ \frac{\partial u}{\partial x} = \frac{\partial (x_0 x)}{\partial x} = x_0 = -1, \quad -1 \times 0.2 = -0.2 \\ \frac{\partial p}{\partial x} = \frac{\partial (w_0 x)}{\partial x} = w_0 = 2, \quad 2 \times 0.2 = 0.4 \\ \frac{\partial s}{\partial x} = \frac{\partial (x_1 x)}{\partial x} = x_1 = -2, \quad -2 \times 0.2 = -0.4 \\ \cdots\end{split}\]

However, there is more than one way to draw a computational graph; you can choose the level of granularity. In the figure below, for example, the four nodes on the right are collapsed into a single sigmoid gate:


< Source: Stanford cs231n >
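For instance, the sigmoid gate's local gradient is \((1 - \sigma(x))\,\sigma(x)\); plugging in the forward value 0.73 from the example reproduces the 0.2 obtained by chaining the four individual nodes:

sigma = 0.73                      # forward output of the sigmoid gate in the example
local_grad = (1 - sigma) * sigma  # derivative of the sigmoid with respect to its input
print(local_grad)                 # ~0.197, matching the 0.2 obtained node by node above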

Patterns in back propagation

In the example you can observe patterns in the back propagation. The add gate distributes the gradient: it passes the upstream gradient unchanged to each of its inputs. The mul gate is a scalar switcher: it swaps the values of its inputs and multiplies each by the upstream gradient; in the example above, the gradient for w0 is \(0.2 \times -1\) (the upstream gradient times the other input, \(x_0 = -1\)). The max gate is interesting: it routes the gradient only to the input that had the larger value.


< Source: Stanford cs231n >

So to summarize:

Gate        Rule
add gate    Gradient distributor
max gate    Gradient router
mul gate    Scalar switcher

Vectorized example


< Source: Stanford cs231n >

The idea is the same as in the scalar example. For instance, to get the gradient of \(W\) you follow the scalar-switcher rule: it is the outer product of the upstream gradient and the input \(x\).

import numpy as np
# gradient w.r.t. W: outer product of the upstream gradient (2q) and the input x
np.array([0.44, 0.52]).reshape(2, 1).dot(np.array([0.2, 0.4]).reshape(1, 2))
[Goodfellow-et-al] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Activation functions

In neural networks, we use non-linear activation functions. This excellent Stack Overflow answer explains why we use non-linear activation functions.

The purpose of the activation function is to introduce non-linearity into the network.

In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables

non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line–the word for this is affine).

another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function (see definition just above).
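A quick numpy check of that last point (the random weights here are purely illustrative): two stacked linear layers collapse into a single linear map.

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)
W2 = np.random.randn(2, 4)
x = np.random.randn(3)

two_linear_layers = W2.dot(W1.dot(x))   # two stacked linear layers, no non-linearity
single_layer = (W2.dot(W1)).dot(x)      # one equivalent linear layer
print(np.allclose(two_linear_layers, single_layer))  # True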

Learning rate vs. Momentum

When performing gradient descent, the learning rate controls how much the current gradient affects the next step, while momentum controls how much past steps affect the next step. [Quora-What-is-the-difference-between-momentum-and-learning-rate]

[Quora-What-is-the-difference-between-momentum-and-learning-rate] https://www.quora.com/What-is-the-difference-between-momentum-and-learning-rate

< Momentum and other gradient descent techniques visualized. Source >
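A minimal sketch of a momentum update on a made-up one-dimensional quadratic loss, to show where the two knobs enter:

def grad(w):
    return 2 * w             # gradient of a simple quadratic loss, w**2

w, v = 5.0, 0.0              # weight and velocity (accumulated past steps)
learning_rate = 0.1          # scales the contribution of the current gradient
momentum = 0.9               # scales how much past steps carry over into this one

for _ in range(100):
    v = momentum * v - learning_rate * grad(w)  # blend past direction with current gradient
    w += v                                      # parameter update

print(w)  # approaches the minimum at 0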


Batch normalization

Batch normalization is an adaptive reparameterization method. During training it works as follows:

  1. Normalize the output of the activation function:
\[z = \frac { x - m } { s }\]
  2. Multiply the normalized output by an arbitrary (learnable) parameter, g:
\[z * g\]
  3. Add an arbitrary (learnable) parameter, b, to the resulting product:
\[( z * g ) + b\]

At test time

https://www.youtube.com/watch?v=5qefnAek8OA

During training, \(\mu\) and \(\sigma^2\) below are computed per mini-batch; at test time they are replaced by exponentially weighted (running) averages accumulated during training.

\[\begin{split}\begin{aligned} \mu &= \frac{1}{m} \sum_i z^{(i)} \\ \sigma^2 &= \frac{1}{m} \sum_i \left( z^{(i)} - \mu \right)^2 \\ z_{\text{norm}}^{(i)} &= \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}} \\ \tilde{z}^{(i)} &= \gamma z_{\text{norm}}^{(i)} + \beta \end{aligned}\end{split}\]
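A minimal sketch of these equations (the function names and the 0.9 averaging factor are arbitrary choices): per-batch statistics during training, with running averages kept for use in their place at test time.

import numpy as np

def batchnorm_train(z, gamma, beta, running_mu, running_var, eps=1e-5, momentum=0.9):
    mu = z.mean(axis=0)                         # per-feature batch mean
    var = z.var(axis=0)                         # per-feature batch variance
    z_norm = (z - mu) / np.sqrt(var + eps)      # normalize
    out = gamma * z_norm + beta                 # scale and shift (learnable gamma, beta)
    # exponentially weighted averages, kept for use at test time
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mu, running_var

def batchnorm_test(z, gamma, beta, running_mu, running_var, eps=1e-5):
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta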