# Fundamentals

## Intro

\(\Theta^{(j)}\): the matrix of weights controlling the function mapping from layer *j* to layer *j+1*.

The values of the *activation* nodes in layer *j+1* are obtained by applying the activation function \(g\) element-wise to a weighted sum of the previous layer's activations (with the bias unit prepended): \(a^{(j+1)} = g(\Theta^{(j)} a^{(j)})\).

As you can see, each layer gets its own matrix of weights, \(\Theta^{(j)}\).

If a network has \(S_j\) units in layer *j* and \(S_{j+1}\) units in layer *j+1*, then \(\Theta^{(j)}\) will be of dimension \(S_{j+1} \times (S_j + 1)\); the +1 accounts for the bias node \(x_0\).
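As a sketch of those dimensions (tanh is chosen arbitrarily as the activation; the layer sizes are made up), one forward-propagation step might look like:

```
import numpy as np

def forward_layer(theta, a, g=np.tanh):
    """One layer of forward propagation.

    theta: (S_{j+1}, S_j + 1) weight matrix for layer j
    a:     (S_j,) activations of layer j (without the bias unit)
    """
    a_with_bias = np.concatenate(([1.0], a))  # prepend bias unit x_0 = 1
    return g(theta @ a_with_bias)             # (S_{j+1},) activations of layer j+1

# Layer j has 3 units, layer j+1 has 2 units -> theta is 2 x (3 + 1)
theta = np.zeros((2, 4))
a_next = forward_layer(theta, np.array([0.5, -0.2, 0.1]))
```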

## Score function

A function that takes an input and returns class scores. For a linear classifier it is \(f(x) = Wx + b\).
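A minimal sketch in NumPy (the class count, feature count and values are made up for illustration):

```
import numpy as np

def score(W, x, b):
    """Linear classifier score function: returns one score per class."""
    return W @ x + b

W = np.array([[0.2, -0.5], [1.5, 1.3], [0.0, 0.25]])  # 3 classes, 2 features
x = np.array([1.0, 2.0])
b = np.array([0.0, 0.1, -0.2])
scores = score(W, x, b)  # shape (3,), one score per class
```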

## Cost/Loss Function

A loss function measures how good a given set of weights/parameters is compared to the ground-truth labels in the training set. Popular loss functions are

- SVM (hinge) loss
- softmax (cross-entropy) loss

(ReLU, sigmoid and tanh are activation functions, not losses; see the activation functions section below.) The full loss usually also includes a regularization term, where \(\lambda\) is the regularization strength.
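As an illustration, a minimal (unregularized, single-example) softmax loss might look like this; the scores are made-up numbers:

```
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss for one example.

    scores: (C,) unnormalized class scores
    y:      index of the correct class
    """
    shifted = scores - scores.max()          # shift for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y])

loss = softmax_loss(np.array([3.2, 5.1, -1.7]), y=0)
```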

## Weight optimization

Practically, it often works better to compute the numeric gradient using the centered difference formula \([f(x+h) - f(x-h)] / 2h\) (see Wikipedia) than the one-sided formula \([f(x+h) - f(x)] / h\).
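For a scalar function this is a one-liner; \(f(x) = x^2\) below is chosen arbitrarily, and its true gradient at 3 is 6:

```
def numeric_gradient(f, x, h=1e-5):
    """Centered-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

g = numeric_gradient(lambda x: x ** 2, 3.0)  # true gradient is 2x = 6
```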

### Gradient Descent

Here is pseudo-code for vanilla GD and minibatch SGD.

```
# Source: http://cs231n.github.io/optimization-1/
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update
```

```
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update
```

There are two ways to compute gradients:

- numeric
- analytic

The numerical approach is simple but gives only an approximation, not the exact gradient. The analytical approach is fast and gives the exact gradient, but is error-prone, as one could make a mistake during the mathematical derivation. Therefore, in practice you'd use a **gradient check**: compute both and compare the numerical and analytical solutions.
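A minimal scalar gradient check could look like the following sketch (the tolerance and the test function \(f(x) = x^3\) are chosen arbitrarily):

```
def gradient_check(f, analytic_grad, x, h=1e-5, tol=1e-6):
    """Compare an analytic gradient to a centered-difference estimate."""
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    rel_error = abs(analytic_grad - numeric) / max(abs(analytic_grad), abs(numeric), 1e-12)
    return rel_error < tol

# f(x) = x**3 has analytic gradient 3 * x**2
ok = gradient_check(lambda x: x ** 3, analytic_grad=3 * 2.0 ** 2, x=2.0)
```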

## Epochs, batches and iterations

- Epoch: a single pass through the entire dataset.
- Batch: a subset of the dataset; a single dataset can be divided into batches.
- Iteration: one parameter update on a single batch; the number of iterations needed to complete an epoch equals the number of batches.
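For example, with made-up numbers:

```
num_examples = 2000   # dataset size
batch_size = 500      # examples per batch

# 2000 / 500 = 4 batches, so 4 iterations complete one epoch
iterations_per_epoch = num_examples // batch_size
```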

## Back-propagation

It’s a recursive application of the chain rule along a computational graph to compute the gradients of all parameters. It is distinct from stochastic gradient descent, which *uses* those gradients to perform learning. It’s an algorithm that computes the chain rule of calculus, with a specific order of operations that is highly efficient [Goodfellow-et-al]. The gradients are computed layer-by-layer, starting from the output layer and progressing toward the input layer.

In a Stanford lecture (CS231n) about backpropagation, the TA shows that analytic gradient computation can be represented as a computational graph.

### Example¶

So what do we do with the **local gradients** in the computational graph? We take the upstream gradient flowing backward and multiply it by the local gradient to get the gradient with respect to the input.
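As a small concrete sketch (the graph \(f = (x + y) \cdot z\) and its values are a standard illustration, not the one from the figure):

```
# Forward pass through the graph f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # add gate: q = 3
f = q * z          # mul gate: f = -12

# Backward pass: multiply each local gradient by the upstream gradient
df_df = 1.0                 # gradient of f w.r.t. itself
df_dq = z * df_df           # mul gate: local gradient w.r.t. q is z
df_dz = q * df_df           # mul gate: local gradient w.r.t. z is q
df_dx = 1.0 * df_dq         # add gate: local gradient is 1 (distributor)
df_dy = 1.0 * df_dq
```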

Here’s a bit more complicated example.

I will show the back propagation step-by-step.

However, there isn’t only one way to draw a computational graph. One can choose the level of granularity: for example, in the figure below, a single sigmoid gate replaces the four primitive nodes on the right that compose it.

### Patterns in back propagation

In the example you can observe **patterns** in the back propagation. The **add** gate distributes the upstream gradient unchanged to all of its inputs. The **mul** gate swaps its input scalars and multiplies them by the upstream gradient; in the above example the local gradient for w0 is \(0.2 \times -1\). The **max** gate is interesting: it routes the gradient only to the input that was the max.

So to summarize:

Gate | Rule
---|---
add gate | gradient distributor
max gate | gradient router
mul gate | scalar switcher

### Vectorized example

The idea is the same as in the scalar example. For instance, to get the gradient of \(W\), you follow the **scalar switcher** rule.

```
import numpy as np

# The gradient of W: outer product of the cached values and the
# upstream gradient from the lecture's example
np.array([0.2, 0.4]).reshape(2, 1).dot(np.array([0.44, 0.52]).reshape(1, 2))
```

[Goodfellow-et-al]: Ian Goodfellow, Yoshua Bengio and Aaron Courville, *Deep Learning*.

## Activation functions

In NN, we use non-linear activation functions. This excellent Stackoverflow answer explains why we use non-linear activation functions.

> The purpose of the activation function is to introduce non-linearity into the network. In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables.
>
> Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as an output that renders to a straight line; the word for this is affine).
>
> Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function (see definition just above).
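The "collapses to a linear function" point can be checked numerically; here is a sketch with random weight matrices and no activation between the layers:

```
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1 weights
W2 = rng.standard_normal((2, 4))   # layer 2 weights
x = rng.standard_normal(3)

# Two linear layers with no activation in between...
two_layer = W2 @ (W1 @ x)
# ...are equivalent to a single linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x
```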

## Learning rate vs. Momentum

When performing gradient descent, the **learning rate** controls how much the current gradient affects the next step, while **momentum** controls how much past steps affect the next step. [Quara-What-is-the-difference-between-momentum-and-learning-rate]
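A minimal SGD-with-momentum update sketch (the classical formulation; the hyperparameter values and gradient are made up):

```
import numpy as np

def momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    """One SGD-with-momentum update: past steps persist through velocity."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = momentum_step(w, grad=np.array([0.5, -0.5]), velocity=v)
```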

[Quara-What-is-the-difference-between-momentum-and-learning-rate] | https://www.quora.com/What-is-the-difference-between-momentum-and-learning-rate |