# Linear Regression¶

Consider a hypothesis function as a linear function of x.

\begin{align} h_{\theta}(x) = \theta_0 +\theta_1x_1 + \theta_2x_2 && \text{(\theta_i's are parameters/weights)} \end{align}

To simplify our notation, we introduce the convention of let the intercept be 1: $$x_0 = 1$$.

$h(x) = \sum_{i=0}^n \theta_i x_i = \theta^Tx$

How do we learn the parameters? Make $$h(x)$$ close to $$y$$. In other words, make the cost as small as possible. Cost function $$J(\theta)$$,

$J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x^i) - y^i)^2$

## Gradient Descent¶

This is the tool to minimize your cost. Following is how it works.

Repeat until convergence {

\begin{align} \theta_j &:= \theta_j - \alpha \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1), &\text{(for j = 0 and j = 1)} \nonumber \end{align}

}

Be careful not to update $$\theta$$ separately. Update them all together at the end of each loop. i.e.,

\begin{align}\begin{aligned}temp_0 &:= \theta_0 - \alpha \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1)\\temp_1 &:= \theta_1 - \alpha \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1)\\\theta_0 &:= temp_0\\\theta_1 &:= temp_1\end{aligned}\end{align}

### Bivariate Gradient Descent¶

$\begin{split}\begin{split} \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1) &= \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1) \frac{1}{2m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 \\ &= \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1) \frac{1}{2m} \sum^{m}_{i=1}(\theta_0 + \theta_1 x^i - y^i)^2 \end{split}\end{split}$
$\begin{split}\begin{array}{ll} \theta_0 \Rightarrow j = 0 : \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1) = \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) \\ \theta_1 \Rightarrow j = 1 : \frac{\delta}{\delta\theta_j}J(\theta_0, \theta_1) = \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) \end{array}\end{split}$

Repeat until convergence {

$\begin{split}\begin{array}{ll} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_0^i \\ \nonumber \theta_1 &:= \theta_1 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_1^i \end{array}\end{split}$

}

Here $$x_0^i$$ is 0.

### Multivariate gradient descent¶

Repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_j^i$

}

### Normal Equation¶

In linear gression, instead of a loop as above, gradient descent can be expressed in a non-loop equation:

$\theta = (X^TX)^{-1}X^Ty$

## Regularization¶

• Overfitting: low bias, high variance
• Underfitting: high bias, low variance

You could think bias in respect of the training dataset, and variance the test set. If a model overfits the bias would be almost 0 because it perfectly fits the training set. However, it wouldn’t be a good model for new dataset so results in high variance.

Watch Andrew Ng’s lecture.

### Example¶

Say we want to make the following function more quadratic:

$H = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$

Without actually removing $$\theta_3x^3 + \theta_4x^4$$ or changing the form of $$H$$, we can modify our cost function:

$\text{min}_\theta \frac{1}{2m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 + 1000 \theta_3^2 + 1000 \theta_4^2$

We could also regularize all of our theta parameters in a single summation

$\text{min}_\theta \frac{1}{2m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 + \lambda \sum^{n}_{j=1}\theta_j^2$

The square in the second sum comes from the first sum.

### Regularization - Gradient Descent¶

Repeat until convergence {

\begin{split}\begin{align} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_0^i &&\text{(Don't penalize the intercept \theta_0)} \nonumber \\ \theta_j &:= \theta_j - \alpha \Bigg[ \bigg( \frac{1}{m} \sum^{m}_{i=1}\Big(h_\theta(x^i) - y^i\Big) x_j^i \bigg) + \frac{\lambda}{m}\theta_j \Bigg] && j \in {1,2,...,n} \end{align}\end{split}

}

$$\frac{\lambda}{m}$$ is a regularization performer.

The above can be represented as:

$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_j^i$

$$1 - \alpha\frac{\lambda}{m}$$ is always less than 1. Thus, regularized.

### Regularization - Normal Equation¶

\begin{split}\begin{align} X &= \begin{bmatrix} (x^1)^T \\ \vdots\\ (x^m)^T \end{bmatrix}, & \text{size is (m)\times(n+1)} \end{align}\end{split}
\begin{split}\begin{align} \vec{y} &= \begin{bmatrix} y^1 \\ \vdots\\ y^m \end{bmatrix}, & \text{size is (m)\times(1)} \end{align}\end{split}
$\theta = (X^TX + \lambda L)^{-1}X^T y$

where L is a pseudo-diagonal matrix of

\begin{split}\begin{align} L &= \begin{bmatrix} 0 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \hdotsfor{5}\\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}, & \text{size is (n+1)\times(n+1)} \end{align}\end{split}

If $$m \leq n$$, then $$X^TX$$ is non-invertable and so is $$(X^TX + \lambda L)$$.