Logistic Regression

Used for classification. We want \(0 \leq h_\theta (x) \leq 1\).

\[h_\theta(x) = g(\theta^T x), \quad \text{where } g(z) = \frac{1}{1 + e^{-z}} \;\Rightarrow\; h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\]

\(g(z) = \frac{1}{1 + e^{-z}}\) is called the sigmoid or logistic function.

< Logistic function plot >
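
As a minimal sketch of the hypothesis (assuming NumPy; the function names `sigmoid` and `hypothesis` are illustrative, and `X` is the design matrix with one example per row):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x), computed for every row of X."""
    return sigmoid(X @ theta)
```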

The idea of logistic regression is to threshold the hypothesis at 0.5 (a sketch of this decision rule follows the list):

  • Predict \(y = 1\) if \(h_\theta (x) \geq 0.5\), i.e. whenever \(\theta^T x \geq 0\)
  • Predict \(y = 0\) if \(h_\theta (x) < 0.5\), i.e. whenever \(\theta^T x < 0\)
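
A hedged sketch of this decision rule (assuming NumPy; `predict` and the `threshold` parameter are illustrative names):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0.
    Because g(z) >= 0.5 exactly when z >= 0, this is the same as checking theta^T x >= 0."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every row of X
    return (probs >= threshold).astype(int)
```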

Linearity of Logistic Regression

From Stack Overflow.

The logistic regression model is of the form,

\[\mathrm{logit}(p_i) = \mathrm{ln}\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_p x_{p,i}.\]

It is called a generalized linear model not because the estimated probability is linear, but because the logit of the estimated probability response is a linear function of the parameters.
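
Solving the logit equation for \(p_i\) recovers the sigmoid form of the hypothesis above:

\[\frac{p_i}{1-p_i} = e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i}} \quad\Rightarrow\quad p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i})}}\]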

Cost Function

\[\begin{split}\text{cost}(h_\theta(x), y) = \left\{ \begin{array}{lr} - \log(h_\theta(x)) & \text{if $y=1$ }\\ - \log(1 - h_\theta(x)) & \text{if $y=0$ } \end{array} \right.\end{split}\]

We use a different cost function for logistic regression than for linear regression: plugging the sigmoid hypothesis into the squared-error cost would make \(J(\theta)\) non-convex ("wavy"), giving gradient descent too many local optima.
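
For example, if \(y = 1\) the cost is \(-\log(h_\theta(x))\):

\[-\log(h_\theta(x)) \rightarrow 0 \text{ as } h_\theta(x) \rightarrow 1, \qquad -\log(h_\theta(x)) \rightarrow \infty \text{ as } h_\theta(x) \rightarrow 0\]

so a confident wrong prediction is penalized very heavily, and symmetrically for \(y = 0\).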

\[J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \text{cost}(h_\theta(x^i), y^i)\]

Simplified Logistic Regression Cost Function

\[ \begin{align}\begin{aligned}\text{cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1-y) \log(1-h_\theta(x))\\J(\theta) = - \frac{1}{m} \left[ \sum^{m}_{i=1} y^i \log(h_\theta (x^i)) + (1-y^i)\log(1-h_\theta(x^i)) \right]\end{aligned}\end{align} \]
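
A minimal NumPy sketch of this cost, vectorized over the \(m\) training examples (the function name `cost` is illustrative):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y^i*log(h^i) + (1 - y^i)*log(1 - h^i) ]."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x^i) for every example
    return -(1.0 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
```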

Logistic Regression Cost Function Gradient Descent

To minimize \(J(\theta)\), repeat until convergence:

\[\theta_j := \theta_j - \frac{\alpha}{m} \sum^{m}_{i=1} (h_\theta (x^i) - y^i)x_j^i\]

This update rule looks identical to the one for linear regression, but it is not the same algorithm, because \(h_\theta(x)\) is now the sigmoid \(\frac{1}{1 + e^{-\theta^T x}}\) rather than \(\theta^T x\).

Vectorized Gradient Descent Update

\[\theta := \theta - \frac{\alpha}{m}X^T (g(X\theta) - \vec{y})\]
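
A sketch of the vectorized update loop (assuming NumPy; `alpha`, `num_iters`, and the zero initialization are illustrative choices, not part of the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Repeat theta := theta - (alpha/m) * X^T (g(X theta) - y) until convergence."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # g(X theta), predictions for all examples
        theta -= (alpha / m) * (X.T @ (h - y))   # simultaneous update of every theta_j
    return theta
```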

Regularization

\[J(\theta) = - \frac{1}{m} \sum^{m}_{i=1} [y^i \log(h_\theta (x^i)) + (1-y^i)\log(1-h_\theta(x^i))] + \frac{\lambda}{2m} \sum^{n}_{j=1}\theta_j^2\]
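
A sketch of the regularized cost and its gradient (assuming NumPy; as the sum over \(j \geq 1\) indicates, the intercept \(\theta_0\) is not penalized, and `lam` stands in for \(\lambda\)):

```python
import numpy as np

def regularized_cost_and_grad(theta, X, y, lam):
    """Regularized J(theta) and its gradient; theta[0], the intercept, is not penalized."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -(1.0 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h)) \
        + (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    grad = (1.0 / m) * (X.T @ (h - y))
    grad[1:] += (lam / m) * theta[1:]
    return J, grad
```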