# Fundamentals¶

## Intro¶

< Representation of a linear classifier neural network. Source: cs231n >

$$\Theta^{(j)}$$: matrix of weights controlling function mapping from layer j to layer j+1.

The values for each of the activation nodes is obtained as follows:

$\begin{split}\text{3x4 matrix} \left\{ \begin{array}{lr} a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\ a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\ a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \end{array} \right.\end{split}$
\begin{align}\begin{aligned}h_\Theta(x) &= a_1^{(3)}\\a_1^{(3)} &= g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})\end{aligned}\end{align}

As you can see, each layer gets its own matrix of weights, $$\Theta^{(j)}$$.

If a network has $$S_j$$ units in layer $$j$$ and $$S_{j+1}$$ units in layer $$j+1$$, then $$\Theta^{(j)}$$ will be of dimension $$S_{j+1} \times (S_j + 1)$$. (The $$+1$$ accounts for the bias node $$x_0$$.)
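A minimal numeric sketch of this dimension rule and of one forward step; the weight and input values here are illustrative, not from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3 input units (S_1 = 3) and 3 hidden units (S_2 = 3), so Theta1
# has shape (S_2, S_1 + 1) = (3, 4); the extra column multiplies the bias.
x = np.array([1.0, 0.5, -0.5])           # x_1 .. x_3
Theta1 = np.full((3, 4), 0.1)            # illustrative weights
a2 = sigmoid(Theta1.dot(np.r_[1.0, x]))  # prepend bias x_0 = 1
print(a2.shape)                          # (3,)
```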

## Score function¶

A function that takes an input and returns class scores. For a linear classifier it is $$Wx + b$$.
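A minimal sketch of the linear score function, assuming 3 classes and 4-dimensional inputs (the numbers are illustrative):

```python
import numpy as np

# Linear score function f(x) = Wx + b for 3 classes, 4-dim inputs.
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])  # (3, 4) weights
b = np.array([1.1, 3.2, -1.2])          # (3,) biases
x = np.array([56.0, 231.0, 24.0, 2.0])  # (4,) input vector

scores = W.dot(x) + b                   # one score per class
print(scores.shape)                     # (3,)
```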

## Cost/Loss Function¶

A loss function measures how well a given set of weights/parameters fits the ground-truth labels in the training set. Popular loss functions are

• svm (hinge loss)
• softmax (cross-entropy loss)

Note that relu, sigmoid, and tanh, described next, are activation functions (non-linearities), not loss functions.

< Left: Sigmoid non-linearity squashes real numbers to range between [0,1] Right: The tanh non-linearity squashes real numbers to range between [-1,1]. Source: Stanford cs231n >

Sigmoid: The sigmoid non-linearity has the mathematical form $$\sigma ( x ) = 1/ \left( 1+ e ^ { - x } \right)$$. Its outputs are not zero-centered, so during gradient descent it can cause zig-zagging parameter updates.

tanh: The tanh neuron is simply a scaled sigmoid neuron; in particular the following holds: $$\tanh ( x ) = 2\sigma ( 2x ) - 1$$.
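A quick numeric check of this identity (a sketch, with an arbitrary test point):

```python
import math

# Verify tanh(x) = 2*sigmoid(2x) - 1 at x = 0.7
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
x = 0.7
lhs, rhs = math.tanh(x), 2 * sigmoid(2 * x) - 1
print(abs(lhs - rhs) < 1e-12)  # True
```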

< Left: Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1 when x > 0. Right: A plot from Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence with the ReLU unit compared to the tanh unit. >

Maxout: The Maxout neuron computes the function $$\max \left( w _ { 1} ^ { T } x + b _ { 1} ,w _ { 2} ^ { T } x + b _ { 2} \right)$$. It’s a generalized form of ReLU and Leaky ReLU.

$\begin{split}J(\theta) = - \frac{1}{m} \sum^{m}_{i=1} \sum^{K}_{k=1} [y_k^i \log((h_\theta (x^i))_k) + (1-y_k^i)\log(1-(h_\theta(x^i))_k)] \\ + \frac{\lambda}{2m} \sum^{L-1}_{l=1} \sum^{S_l}_{i=1} \sum^{S_{l+1}}_{j=1} (\theta_{j,i}^{(l)})^2\end{split}$

Here $$\lambda$$ is the regularization term.
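A minimal sketch of this cost in code, assuming `h` holds the network outputs row-per-example, `y` holds one-hot labels, and `thetas` is a list of weight matrices whose first column multiplies the bias (so it is excluded from regularization, matching the $$i \ge 1$$ index above):

```python
import numpy as np

def cost(h, y, thetas, lam):
    # h, y: (m, K) predicted probabilities and one-hot labels
    m = y.shape[0]
    data_term = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    # skip column 0 of each Theta: bias weights are not regularized
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

h = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
print(round(cost(h, y, [np.ones((2, 3))], lam=0.0), 4))  # 0.3285
```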

## Weight optimization¶

In practice it often works better to compute the numeric gradient using the centered difference formula: $$[f(x+h) - f(x-h)] / 2 h$$ (Wiki).
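A sketch of the centered difference formula applied coordinate-wise; the test function is illustrative:

```python
import numpy as np

# Numeric gradient via [f(x+h) - f(x-h)] / 2h, one coordinate at a time.
def numeric_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 2)  # analytic gradient is 2x
g = numeric_gradient(f, np.array([1.0, -2.0, 3.0]))
print(g)  # ≈ [2. -4.  6.]
```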

Here is pseudocode for vanilla gradient descent and minibatch gradient descent.

```python
# Source: http://cs231n.github.io/optimization-1/
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update

# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
```


There are two ways to find gradients:

• numeric
• analytic

The numerical solution is simple but gives only an approximation rather than the exact gradient. The analytical solution is fast and exact, but it is error-prone, since one can make a mistake during the mathematical derivation. Therefore, in practice you would use a gradient check (a comparison of the numerical and analytical solutions).
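A sketch of such a gradient check, comparing a hand-derived gradient against a centered-difference estimate via the relative error (the function and tolerance are illustrative):

```python
import numpy as np

# Relative error between analytic and numeric gradients.
def rel_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

f = lambda x: np.sum(x ** 3)
x, h = np.array([1.0, -2.0, 3.0]), 1e-5
analytic = 3 * x ** 2  # derived by hand: d/dx sum(x^3) = 3x^2
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(3)])
print(rel_error(analytic, numeric))  # tiny, so the derivation checks out
```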

## Epochs, batches and iterations¶

• Epoch: A single pass through the entire dataset.
• Batch: A subset of the dataset; the dataset is divided into batches.
• Iteration: One parameter update on one batch; it takes as many iterations as there are batches to complete an epoch.
$\text{dataset size} = \text{batch size} \times \text{iterations per epoch}$
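For example, with illustrative numbers:

```python
import math

# 50,000 training examples with batch size 256: the last batch is
# partial, so round up to get the iterations per epoch.
dataset_size, batch_size = 50_000, 256
iterations_per_epoch = math.ceil(dataset_size / batch_size)
print(iterations_per_epoch)  # 196
```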

## Back-propagation¶

It’s a recursive application of the chain rule along a computational graph to compute the gradients of all parameters. It is distinct from stochastic gradient descent, which uses those gradients to perform learning. It’s an algorithm that computes the chain rule of calculus, with a specific order of operations that is highly efficient [Goodfellow-et-al]. It modifies the connection weight parameters layer by layer, starting from the output layer and progressing toward the input layer.

In a Stanford lecture about backpropagation, the TA shows that analytic gradient computation can be represented as a computational graph.

< Source: Stanford cs231n >

### Example¶

< In the figure, the upper digits are the values of the nodes and the lower its gradient/derivative. The node values are filled by forward pass and the gradients by back propagation. Source: Stanford cs231n >

\begin{split}\begin{align} \frac{\delta f}{\delta x} &= \frac{\delta q}{\delta x} \frac{\delta f}{\delta q} = 1 \times -4 = -4 \\ \frac{\delta f}{\delta y} &= \frac{\delta q}{\delta y} \frac{\delta f}{\delta q} = 1 \times -4 = -4 \\ \frac{\delta f}{\delta z} &= q = x + y = -2 + 5 = 3 \end{align}\end{split}

So what do we do with the local gradients in the computational graph? We take the upstream gradient flowing down and multiply it by the local gradient to obtain the gradient with respect to the input.

Here’s a bit more complicated example.

< Source: Stanford cs231n >

I will show the back propagation step-by-step.

$\begin{split}\frac{\delta f}{\delta f} = 1 \\ \frac{\delta q}{\delta x} = \frac{\delta 1/x}{\delta x} = \frac{-1}{x^2} = \frac{-1}{1.37^2} = -0.53 \\ \frac{\delta w}{\delta x} = \frac{\delta c + x}{\delta x} = 1, 1 \times -0.53 = -0.53 \\ \frac{\delta e}{\delta x} = \frac{\delta e^x}{\delta x} = e^x = e^{-1} = 0.37, 0.37 \times -0.53 = -0.2 \\ \frac{\delta r}{\delta x} = \frac{\delta -x}{\delta x} = -1, -1 \times -0.2 = 0.2 \\ \frac{\delta t}{\delta x} = \frac{\delta c + x}{\delta x} = 1, 1 \times 0.2 = 0.2 \\ \frac{\delta y}{\delta x} = \frac{\delta c + x}{\delta x} = 1, 1 \times 0.2 = 0.2 \\ \frac{\delta u}{\delta x} = \frac{\delta x_0x}{\delta x} = x_0 = -1, -1 \times 0.2 = -0.2 \\ \frac{\delta p}{\delta x} = \frac{\delta w_0x}{\delta x} = w_0 = 2, 2 \times 0.2 = 0.4 \\ \frac{\delta s}{\delta x} = \frac{\delta x_1x}{\delta x} = x_1 = -2, -2 \times 0.2 = -0.4 \\ \cdots\end{split}$

However, there is more than one way to draw a computational graph. One can choose the level of granularity, as in the figure below, where a single sigmoid gate substitutes for the four nodes on the right:

< Source: Stanford cs231n >
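The collapsed sigmoid-gate circuit can be sketched in code with the lecture's input values; the sigmoid gate's local gradient simplifies to $$(1 - f)f$$:

```python
import math

# f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), with the
# lecture's values for the weights and inputs.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass
dot = w0 * x0 + w1 * x1 + w2      # 1.0
f = 1.0 / (1.0 + math.exp(-dot))  # ≈ 0.73

# backward pass: the whole sigmoid gate has local gradient (1 - f) * f
ddot = (1 - f) * f                # ≈ 0.2
dw0, dx0 = x0 * ddot, w0 * ddot   # mul gate: switch the inputs
dw1, dx1 = x1 * ddot, w1 * ddot
dw2 = 1.0 * ddot                  # add gate: pass the gradient through
print(round(dw0, 2), round(dx0, 2))  # close to the figure's -0.2 and 0.4
```

The figure rounds `ddot` to 0.2 before back-propagating, so the exact values here differ slightly in the last digit.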

### Patterns in back propagation¶

In the example you can observe patterns in the back propagation. The add gate distributes the upstream gradient unchanged to its inputs. The mul gate switches the scalar inputs and multiplies by the upstream gradient; so in the above example the local gradient for $$w_0$$ is $$0.2 \times -1$$. The max gate is interesting: it routes the gradient only to the input with the maximum value.

< Source: Stanford cs231n >

So to summarize:

| Gate | Rule |
| --- | --- |
| max gate | Gradient router |
| mul gate | Scalar switcher |
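These gate rules can be illustrated with a tiny sketch, using an arbitrary upstream gradient and inputs:

```python
# Upstream gradient of 2.0 flowing into each gate, inputs x = 3.0, y = 5.0
upstream, x, y = 2.0, 3.0, 5.0

# add gate (x + y): distributes the upstream gradient to both inputs
dx_add, dy_add = upstream, upstream

# mul gate (x * y): switches the inputs before multiplying
dx_mul, dy_mul = y * upstream, x * upstream

# max gate (max(x, y)): routes the gradient only to the larger input
dx_max, dy_max = (upstream, 0.0) if x > y else (0.0, upstream)
print(dx_mul, dx_max, dy_max)  # 10.0 0.0 2.0
```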

### Vectorized example¶

< Source: Stanford cs231n >

The idea is the same as in the scalar example. For instance, to get the gradient of $$W$$, you follow the scalar switcher rule.

```python
import numpy as np

# gradient w.r.t. W: upstream gradient (column) times input (row)
np.array([0.2, 0.4]).reshape(2, 1).dot(np.array([0.44, 0.52]).reshape(1, 2))
```

[Goodfellow-et-al] Goodfellow, I., Bengio, Y., and Courville, A., *Deep Learning*, MIT Press, 2016.

## Activation functions¶

In NN, we use non-linear activation functions. This excellent Stackoverflow answer explains why we use non-linear activation functions.

> The purpose of the activation function is to introduce non-linearity into the network.
>
> In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables.
>
> Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line; the word for this is affine).
>
> Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function (see definition just above).

## Learning rate vs. Momentum¶

When performing gradient descent, the learning rate controls how much the current gradient affects the next step, while momentum controls how much past steps affect the next step. [Quora-What-is-the-difference-between-momentum-and-learning-rate]
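A sketch of the SGD-with-momentum update rule, minimizing the illustrative function $$w^2$$; the velocity accumulates past gradients, so `mu` controls how much past steps matter while `lr` scales the current gradient:

```python
# One momentum update: v accumulates past gradients, w moves by v.
def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad
    return w + v, v

w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=2.0 * w)  # gradient of w**2 is 2w
print(w)  # moving toward the minimum of w**2 at 0
```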

< Momentum and other gradient descent techniques visualized. Source >

## Batch normalization¶

Batch normalization normalizes a layer's pre-activations $$z$$ over a mini-batch, then rescales and shifts the result with learned parameters $$\gamma$$ (scale) and $$\beta$$ (shift):

\begin{split}\begin{aligned} \mu & = \frac { 1 } { m } \sum _ { i } z ^ { ( i ) } \\ \sigma ^ { 2 } & = \frac { 1 } { m } \sum _ { i } \left( z ^ { ( i ) } - \mu \right) ^ { 2 } \\ z _ { \text { norm } } ^ { ( i ) } & = \frac { z ^ { ( i ) } - \mu } { \sqrt { \sigma ^ { 2 } + \varepsilon } } \\ \tilde { z } ^ { ( i ) } & = \gamma z _ { \text { norm } } ^ { ( i ) } + \beta \end{aligned}\end{split}
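A sketch of the forward pass of these equations over a mini-batch, with identity scale and shift values for illustration:

```python
import numpy as np

# Batch-norm forward pass over a mini-batch of pre-activations z.
def batchnorm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                     # per-feature mean
    var = z.var(axis=0)                     # per-feature variance
    z_norm = (z - mu) / np.sqrt(var + eps)  # normalize
    return gamma * z_norm + beta            # scale and shift

rng = np.random.default_rng(0)
z = rng.normal(5.0, 3.0, size=(64, 10))    # mini-batch, 10 features
out = batchnorm_forward(z, gamma=1.0, beta=0.0)
print(round(out.mean(), 4), round(out.std(), 4))  # ≈ 0.0 and ≈ 1.0
```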