Linear Regression

Consider a hypothesis function that is linear in \(x\):

\[\begin{align} h_{\theta}(x) = \theta_0 +\theta_1x_1 + \theta_2x_2 && \text{($\theta_i$'s are parameters/weights)} \end{align}\]

To simplify the notation, we adopt the convention that the intercept feature is always \(x_0 = 1\), so \(\theta_0\) acts as the intercept term.

\[h(x) = \sum_{i=0}^n \theta_i x_i = \theta^Tx\]

How do we learn the parameters? We choose \(\theta\) so that \(h(x)\) is close to \(y\); in other words, we make the cost as small as possible. Define the cost function \(J(\theta)\) as

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^i) - y^i)^2\]
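As a minimal NumPy sketch (the function and variable names are my own, not from the notes), with `X` an \(m \times (n+1)\) matrix whose rows are the training examples:

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x for each row of X;
    # X already contains the x_0 = 1 intercept column.
    return X @ theta

def cost(theta, X, y):
    # J(theta) = 1/(2m) * sum of squared errors
    m = len(y)
    errors = hypothesis(theta, X) - y
    return np.sum(errors ** 2) / (2 * m)
```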

Gradient Descent

Gradient descent is an iterative method for minimizing the cost. It works as follows.

Repeat until convergence {

\[\begin{align} \theta_j &:= \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1), &\text{(for $j = 0$ and $j = 1$)} \nonumber \end{align}\]

}

Be careful not to update the parameters one at a time; update them all simultaneously at the end of each iteration, i.e.,

\[ \begin{align}\begin{aligned}temp_0 &:= \theta_0 - \alpha \frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)\\temp_1 &:= \theta_1 - \alpha \frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)\\\theta_0 &:= temp_0\\\theta_1 &:= temp_1\end{aligned}\end{align} \]
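A small Python sketch of the simultaneous update; `grad_0` and `grad_1` are illustrative names for the two partial derivatives, both evaluated at the current parameters before either is overwritten:

```python
def simultaneous_update(theta_0, theta_1, grad_0, grad_1, alpha):
    # Compute both steps from the OLD parameter values...
    temp_0 = theta_0 - alpha * grad_0
    temp_1 = theta_1 - alpha * grad_1
    # ...then commit them together.
    return temp_0, temp_1
```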

Bivariate Gradient Descent

\[\begin{split}\begin{split} \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 \\ &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum^{m}_{i=1}(\theta_0 + \theta_1 x^i - y^i)^2 \end{split}\end{split}\]
\[\begin{split}\begin{array}{ll} \theta_0 \Rightarrow j = 0 : \frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) \\ \theta_1 \Rightarrow j = 1 : \frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)\, x^i \end{array}\end{split}\]

Repeat until convergence {

\[\begin{split}\begin{array}{ll} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_0^i \\ \theta_1 &:= \theta_1 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_1^i \end{array}\end{split}\]

}

Here \(x_0^i = 1\), matching the convention above.

Multivariate Gradient Descent

Repeat until convergence {

\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_j^i\]

}
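A minimal NumPy sketch of this loop (illustrative names; it covers the bivariate case above as well, since `X` then just has two columns). The first column of `X` is assumed to be all ones:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    # X: (m, n+1) design matrix, first column all ones (x_0 = 1)
    # y: (m,) target vector
    m, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_iters):
        errors = X @ theta - y                       # h_theta(x^i) - y^i for all i
        theta = theta - alpha * (X.T @ errors) / m   # simultaneous update of every theta_j
    return theta
```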

Normal Equation

In linear regression, instead of iterating as above, the optimal \(\theta\) can be computed directly in closed form with the normal equation:

\[\theta = (X^TX)^{-1}X^Ty\]
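A NumPy sketch; `np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerically safer choice:

```python
import numpy as np

def normal_equation(X, y):
    # Solves (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```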

Regularization

  • Overfitting: low bias, high variance
  • Underfitting: high bias, low variance

You can think of bias with respect to the training set and variance with respect to the test set. If a model overfits, its bias is nearly zero because it fits the training set almost perfectly; however, it generalizes poorly to new data, which shows up as high variance.

Watch Andrew Ng’s lecture.

Example

Say we want to make the following function more quadratic:

\[H = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4\]

Without actually removing \(\theta_3x^3 + \theta_4x^4\) or changing the form of \(H\), we can modify our cost function:

\[\text{min}_\theta \frac{1}{2m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 + 1000 \theta_3^2 + 1000 \theta_4^2\]

We could also regularize all of our theta parameters in a single summation:

\[\text{min}_\theta \frac{1}{2m} \Bigg[ \sum^{m}_{i=1}(h_\theta(x^i) - y^i)^2 + \lambda \sum^{n}_{j=1}\theta_j^2 \Bigg]\]

Squaring the parameters in the penalty mirrors the squared error in the first sum, and placing the penalty inside the \(\frac{1}{2m}\) factor is what produces the \(\frac{\lambda}{m}\theta_j\) term in the gradient updates below.
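A NumPy sketch of this regularized cost (names are my own; the intercept \(\theta_0\) is excluded from the penalty, matching the sum starting at \(j = 1\)):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    # 1/(2m) * [squared errors + lambda * sum_{j>=1} theta_j^2]
    m = len(y)
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)
```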

Regularization - Gradient Descent

Repeat until convergence {

\[\begin{split}\begin{align} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_0^i &&\text{(don't penalize the intercept $\theta_0$)} \nonumber \\ \theta_j &:= \theta_j - \alpha \Bigg[ \bigg( \frac{1}{m} \sum^{m}_{i=1}\Big(h_\theta(x^i) - y^i\Big) x_j^i \bigg) + \frac{\lambda}{m}\theta_j \Bigg] && j \in \{1,2,\dots,n\} \nonumber \end{align}\end{split}\]

}

Here \(\lambda\) is the regularization parameter; the extra term \(\frac{\lambda}{m}\theta_j\) is what performs the regularization.

The above can be represented as:

\[\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum^{m}_{i=1}(h_\theta(x^i) - y^i) x_j^i\]

Since \(\alpha\), \(\lambda\), and \(m\) are all positive, \(1 - \alpha\frac{\lambda}{m}\) is always less than 1, so each iteration shrinks \(\theta_j\) toward zero before applying the usual gradient step. This shrinkage is the regularization effect.
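A NumPy sketch of the regularized loop (assumed names; only \(\theta_1, \dots, \theta_n\) receive the \(\frac{\lambda}{m}\theta_j\) term):

```python
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.01, lam=1.0, n_iters=1000):
    # Batch gradient descent with an L2 penalty on theta_1..theta_n.
    m, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_iters):
        errors = X @ theta - y
        grad = (X.T @ errors) / m
        grad[1:] += (lam / m) * theta[1:]   # leave the intercept unpenalized
        theta = theta - alpha * grad
    return theta
```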

Regularization - Normal Equation

\[\begin{split}\begin{align} X &= \begin{bmatrix} (x^1)^T \\ \vdots\\ (x^m)^T \end{bmatrix}, & \text{size is $m\times(n+1)$} \end{align}\end{split}\]
\[\begin{split}\begin{align} \vec{y} &= \begin{bmatrix} y^1 \\ \vdots\\ y^m \end{bmatrix}, & \text{size is $m\times 1$} \end{align}\end{split}\]
\[\theta = (X^TX + \lambda L)^{-1}X^T y\]

where \(L\) is a diagonal matrix whose first entry is 0, so the intercept \(\theta_0\) is not penalized:

\[\begin{split}\begin{align} L &= \begin{bmatrix} 0 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}, & \text{size is $(n+1)\times(n+1)$} \end{align}\end{split}\]

If \(m \leq n\), then \(X^TX\) is non-invertible; however, for \(\lambda > 0\) the regularized matrix \((X^TX + \lambda L)\) is invertible, so the normal equation above still works.
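A NumPy sketch of the regularized normal equation (illustrative names; for \(\lambda > 0\) the solve succeeds even when \(X^TX\) alone is singular):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    # theta = (X^T X + lambda L)^{-1} X^T y, with L = identity
    # except L[0, 0] = 0 so the intercept is not penalized.
    n_features = X.shape[1]
    L = np.eye(n_features)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```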