Fundamentals

Why are we using $$e$$? [1]

$$e$$ is a special number: the derivative of the exponential $$e^t$$ is $$e^t$$ itself. Let’s look at a non-$$e$$ exponential, $$2^t$$, and its derivative.

\begin{split}\begin{align} \frac{\delta}{\delta t} 2^t &= \lim_{\delta t \rightarrow 0} \frac{2^{t+\delta t} - 2^t }{\delta t} \\ &= 2^t \big( \lim_{\delta t \rightarrow 0} \frac{2^{\delta t} - 1 }{\delta t} \big) \end{align}\end{split}
t = 0.1
for i in range(1, 10):
    dt = t ** i  # step size shrinks by a factor of 10 each iteration
    d = (2 ** dt - 1) / dt  # the limit factor (2^dt - 1) / dt
    print(dt, d)

# 0.1 0.7177346253629313
# 0.010000000000000002 0.6955550056718883
# 0.0010000000000000002 0.6933874625807411
# 0.00010000000000000002 0.6931712037649972
# 1.0000000000000003e-05 0.6931495828199628
# 1.0000000000000004e-06 0.6931474207938491
# 1.0000000000000004e-07 0.6931472040783146
# 1.0000000000000005e-08 0.6931471840943001
# 1.0000000000000005e-09 0.6931470952764581


So we see that as $$\delta t$$ becomes infinitesimally small, the factor $$\frac{2^{\delta t} - 1}{\delta t}$$ converges to about 0.693, so $$\frac{\delta}{\delta t} 2^t \approx 0.693 \cdot 2^t$$. It’s good to know that it converges, but it would be handy to find a pattern, and that’s where $$e$$ comes in!

Let’s rewrite $$2^t$$:
\begin{split}\begin{align} 2^t &= e^{t \log 2} \quad \text{(since } 2 = e^{\log 2} \text{)} \\ \frac{\delta}{\delta t} 2^t &= \frac{\delta}{\delta t} e^{t \log 2} \\ &= (\log 2) \, e^{t \log 2} \quad \text{(as } \frac{\delta}{\delta t} e^{at} = ae^{at} \text{)} \\ &= (\log 2) \, 2^t \\ (\log 2 &= 0.6931471805599453) \end{align}\end{split}

So we get the same result by using $$e$$, and the approach is more general: for any base $$a > 0$$ we can write $$a^t = e^{t \log a}$$ and obtain the derivative $$(\log a) \, a^t$$ directly from the logarithm!
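As a quick sanity check, here is a minimal sketch that repeats the numerical experiment for a few other bases and compares the limit factor with the natural logarithm. The helper name `limit_factor` and the step size `1e-6` are my own choices for illustration, not part of the original notes.

```python
import math

def limit_factor(base, dt=1e-6):
    """Difference quotient (base**dt - 1) / dt, which approximates log(base)."""
    return (base ** dt - 1) / dt

# Compare the numerical limit factor with the natural log for several bases.
for base in (2, 3, 10, math.e):
    approx = limit_factor(base)
    exact = math.log(base)
    print(f"base={base:.4f}  quotient={approx:.6f}  log(base)={exact:.6f}")
```

For every base the two columns should agree to several decimal places, and for base $$e$$ the factor is close to 1, which is why the derivative of $$e^t$$ is $$e^t$$ itself.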

Matrix Differentiation

Let’s demonstrate matrix differentiation with the following example.

$f ( w ) = ( 1/ | X | ) \| \mathbf { y } - \mathbf { X } \mathbf { w } \| _ { 2} ^ { 2} + \lambda \| \mathbf { w } \| _ { 2} ^ { 2}$

I will divide the two terms into separate expressions $$f_1$$ and $$f_2$$.

\begin{split}\begin{align} f_1 ( w ) &= ( 1/ | X | ) \| \mathbf { y } - \mathbf { X } \mathbf{ w } \| _ { 2} ^ { 2} \\ f_2 ( w ) &= \lambda \| \mathbf { w } \| _ { 2} ^ { 2} \end{align}\end{split}

To get $$\nabla f_1(w)$$ I will use the chain rule.

\begin{split}\begin{align} h(w) &= \mathbf { y } - \mathbf { X } \mathbf { w } \\ f_1 ( w ) &= ( 1/ | X | ) \| h \| _ { 2} ^ { 2} \\ \end{align}\end{split}

Now let’s get the derivative! Remember that $$\| \cdot \|_2^2$$ (the norm sign with a subscript 2 and a superscript 2) denotes the squared Euclidean norm, so $$\| h \|_2^2 = h^T h$$.

\begin{split}\begin{align} \delta f_1 ( w ) &= ( 1/ | X | ) 2 h^T \delta h \\ \delta h &= -\mathbf { X } \delta w \\ \delta f_1 ( w ) &= ( -2/ | X | ) ( \mathbf { y } - \mathbf { X } \mathbf{ w } )^T \mathbf { X } \delta w \\ \nabla f_1 ( w ) &= ( -2/ | X | ) \mathbf { X }^T ( \mathbf { y } - \mathbf { X } \mathbf{ w } ) \end{align}\end{split}

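The transpose $$\mathbf { X }^T$$ makes the dimensions agree: $$\mathbf { X }^T ( \mathbf { y } - \mathbf { X } \mathbf { w } )$$ has the same shape as $$\mathbf { w }$$, as a gradient with respect to $$\mathbf { w }$$ must. Next, solve for $$\nabla f_2(w)$$ in the same way.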

\begin{align} \delta f_2 ( w ) &= 2 \lambda \mathbf { w }^T \delta \mathbf { w } \end{align}

Finally add them together

\begin{split}\begin{align} \frac{\delta f( w )}{\delta w} &= \frac{\delta f_1( w )}{\delta w} + \frac{\delta f_2( w )}{\delta w} \\ &= \frac{\delta f_1}{\delta h} \frac{\delta h}{\delta w} + \frac{\delta f_2( w )}{\delta w} \\ &= ( \frac{-2}{| X |} ) \mathbf { X }^T ( \mathbf { y } - \mathbf { X } \mathbf{ w } ) + 2 \lambda \mathbf { w } \end{align}\end{split}
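The result can be checked numerically. The sketch below compares the analytic gradient with central finite differences on random data. It assumes that $$| X |$$ means the number of rows of $$\mathbf{X}$$; the function names, the random data, and the use of NumPy are my own choices for illustration.

```python
import numpy as np

def f(w, X, y, lam):
    """Objective: (1/|X|) * ||y - Xw||^2 + lam * ||w||^2."""
    n = X.shape[0]  # assumption: |X| is the number of rows of X
    r = y - X @ w
    return (r @ r) / n + lam * (w @ w)

def grad_f(w, X, y, lam):
    """Analytic gradient: (-2/|X|) X^T (y - Xw) + 2 * lam * w."""
    n = X.shape[0]
    return (-2.0 / n) * X.T @ (y - X @ w) + 2.0 * lam * w

def numeric_grad(w, X, y, lam, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step, X, y, lam) - f(w - step, X, y, lam)) / (2 * eps)
    return g

# Random test problem (hypothetical data, fixed seed for reproducibility).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.1

max_diff = np.max(np.abs(grad_f(w, X, y, lam) - numeric_grad(w, X, y, lam)))
print(max_diff)
```

If the derivation is right, `max_diff` should be tiny, limited only by floating-point rounding, because the objective is quadratic in $$w$$.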

Matrix expression example 1

We have a function

$f(w) = w^{T} \mathbf{A} w + \mathbf{b}^{T} w + c$

Given $$\mathbf{A} = (1/N) \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}$$, $$\mathbf{b} = -(2/N) \mathbf{X}^{T} \mathbf{y}$$ and $$c = (1/N) y^{T} y$$, substitute them into the function and expand.

\begin{split}\begin{align} f(w) &= w^{T} \big( (1/N) \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I} \big) w + \big(- (2/N) \mathbf{X}^{T} \mathbf{y} \big)^{T} w + (1/N) y^{T} y \\ &= \frac{1}{N} w^{T}\mathbf{X}^{T} \mathbf{X} w+ w^{T}\lambda w + \frac{-2}{N} \mathbf{y}^{T} \mathbf{X} w + \frac{1}{N} y^{T} y \end{align}\end{split}

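Since $$\| \mathbf{y} - \mathbf{X} w \|_2^2 = w^{T} \mathbf{X}^{T} \mathbf{X} w - 2 \mathbf{y}^{T} \mathbf{X} w + y^{T} y$$ and $$\| w \|_2^2 = w^{T} w$$, we realize that, taking $$N = | X |$$, it is the expanded form of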

$f ( w ) = ( 1/ | X | ) \| \mathbf { y } - \mathbf { X } \mathbf { w } \| _ { 2} ^ { 2} + \lambda \| \mathbf { w } \| _ { 2} ^ { 2}$
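A small numerical check makes the equivalence concrete. This is a sketch on random data, again assuming $$N = | X |$$ is the number of rows of $$\mathbf{X}$$; the variable names and the NumPy usage are mine.

```python
import numpy as np

# Random test problem (hypothetical data, fixed seed for reproducibility).
rng = np.random.default_rng(1)
N, d = 15, 4
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)
lam = 0.5

# Quadratic form with A, b, c as defined in the text.
A = X.T @ X / N + lam * np.eye(d)
b = -(2.0 / N) * X.T @ y
c = y @ y / N
quadratic = w @ A @ w + b @ w + c

# Regularized least-squares objective written with norms.
r = y - X @ w
ridge = (r @ r) / N + lam * (w @ w)

print(quadratic, ridge)
```

The two printed values should agree up to floating-point rounding.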

KL-divergence

A measure of how one probability distribution diverges from a second, expected probability distribution. [2]
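For two discrete distributions $$P$$ and $$Q$$ over the same outcomes, the standard definition is $$D_{KL}(P \| Q) = \sum_i p_i \log (p_i / q_i)$$. Here is a minimal sketch of that formula; the function name `kl_divergence` and the example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i == 0 contribute 0 by convention.
    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q))  # divergence of P from Q
print(kl_divergence(q, p))  # differs from the line above: KL is not symmetric
print(kl_divergence(p, p))  # zero when the distributions are identical
```

Note that swapping the arguments changes the value, so the KL divergence is not a distance in the usual sense.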

Figure: the Kullback-Leibler divergence for two Gaussian probability distributions. On the top left is an example of two Gaussian PDFs, and to the right of that is the area which, when integrated, gives the KL divergence.

Here’s another example,

Reference

 [2] https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#/media/File:KL-Gauss-Example.png