4a Neural Networks - Representation

Aug 13, 2017 14:04 · 524 words · 3 minutes read

Neural Networks

Performing linear regression with many features is unwieldy. If you wanted a hypothesis over three features that included all quadratic terms, you would need six such terms; for 100 features, that jumps to over 5,000.

The number of terms in a quadratic hypothesis grows as roughly $n^2/2$, i.e. $\mathcal{O}(n^2)$, while including cubic terms would bring it to $\mathcal{O}(n^3)$.
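As a rough sanity check (a minimal sketch; the exact count depends on whether both squared and cross terms are included), the quadratic-term count $n(n+1)/2$ can be computed directly:

```python
# Number of quadratic terms x_i * x_j with i <= j, for n raw features.
def quadratic_terms(n):
    return n * (n + 1) // 2

print(quadratic_terms(3))     # 6
print(quadratic_terms(100))   # 5050
print(quadratic_terms(2500))  # 3126250 -- a 50x50 greyscale image has 2500 pixel features
```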

If our input is a 50x50 pixel greyscale image, that's already 2,500 features, so this quickly becomes impractical. Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.

Model Representation

Our neurons are computation units that take inputs and convert them to outputs. Our input is the features $(x_1, \dots, x_n)$, and the output is the result of our hypothesis function.

$x_0$ is sometimes known as the "bias unit" and is equal to 1.

As in logistic regression, we use the logistic function \(\frac{1}{1 + e^{-\theta^T x}}\), here called the sigmoid activation function.
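A minimal Python sketch of this activation (assuming numpy), written so that it also applies element-wise to vectors, which the vectorised form later on relies on:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation, applied element-wise to scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                        # 0.5
print(sigmoid(np.array([-1.0, 0.0, 2.0])))
```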

Our "theta" parameters are called "weights".

The first layer is called the "input layer" and the final layer the "output layer", which gives the final value computed by the hypothesis.

We can have intermediate layers of nodes between the input and output layers, called "hidden layers". Their nodes are labelled $a_0^{(2)} \dots a_n^{(2)}$ and are called "activation units".

\[a_i^{(j)} = \text{"activation" of unit i in layer j} \\ \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer j to layer j+1} \]

The values of the "activation" nodes are as follows:

\[ a_1^{(2)} = g( \Theta_{1 \ 0}^{(1)} x_0 + \Theta_{1 \ 1}^{(1)} x_1 + \Theta_{1 \ 2}^{(1)} x_2 + \Theta_{1 \ 3}^{(1)} x_3 ) \\ a_2^{(2)} = g( \Theta_{2 \ 0}^{(1)} x_0 + \Theta_{2 \ 1}^{(1)} x_1 + \Theta_{2 \ 2}^{(1)} x_2 + \Theta_{2 \ 3}^{(1)} x_3 ) \\ a_3^{(2)} = g( \Theta_{3 \ 0}^{(1)} x_0 + \Theta_{3 \ 1}^{(1)} x_1 + \Theta_{3 \ 2}^{(1)} x_2 + \Theta_{3 \ 3}^{(1)} x_3 ) \\ \]

\[ h_\Theta(x) = a_1^{(3)} = g( \Theta_{1 \ 0}^{(2)} a_0^{(2)} + \Theta_{1 \ 1}^{(2)} a_1^{(2)} + \Theta_{1 \ 2}^{(2)} a_2^{(2)} + \Theta_{1 \ 3}^{(2)} a_3^{(2)} ) \]

We compute our activation nodes by using a $3 \times 4$ matrix of parameters: each row of parameters is applied to our inputs to obtain the value of one activation node. Our hypothesis output is the logistic function applied to the weighted sum of our activation-node values, where the weights come from the next parameter matrix. Each layer gets its own matrix of weights $\Theta^{(j)}$.
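A minimal sketch of forward propagation for this three-input, three-hidden-unit, one-output network (the weight values are random placeholders purely for illustration), which also shows why $\Theta^{(1)}$ is $3 \times 4$ and $\Theta^{(2)}$ is $1 \times 4$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2])       # input features x_1, x_2, x_3
Theta1 = np.random.randn(3, 4)       # layer 1 -> layer 2: 3 hidden units, 3 inputs + bias
Theta2 = np.random.randn(1, 4)       # layer 2 -> layer 3: 1 output unit, 3 hidden units + bias

a1 = np.concatenate(([1.0], x))      # prepend the bias unit x_0 = 1
a2 = sigmoid(Theta1 @ a1)            # hidden activations a_1^(2), a_2^(2), a_3^(2)
a2 = np.concatenate(([1.0], a2))     # prepend the bias unit a_0^(2) = 1
h  = sigmoid(Theta2 @ a2)            # hypothesis h_Theta(x) = a_1^(3)
print(h)
```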

If a network has \(s_j\) units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$. The $+1$ comes from the addition of "bias nodes", $x_0$ and $\Theta_0^{(j)}$.

Vectorisation

We can vectorise this by moving the sigmoid function outside of the multiplication by $\Theta$. We can write $z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$, and then compute $a^{(j)} = g( z^{(j)} )$ where $g$ can be applied element-wise to our vector $z^{(j)}$.
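A minimal sketch of this vectorised forward propagation as a loop over weight matrices (the function name forward_propagate is an assumption for illustration; a bias unit $a_0^{(j)} = 1$ is prepended at each layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Compute h_Theta(x) given the weight matrices Theta^(1), Theta^(2), ..."""
    a = x                                # a^(1) = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # add the bias unit for this layer
        z = theta @ a                    # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                   # a^(j+1) = g(z^(j+1)), applied element-wise
    return a

# The three-input, three-hidden-unit network from above, with random weights for illustration.
thetas = [np.random.randn(3, 4), np.random.randn(1, 4)]
print(forward_propagate(np.array([1.0, 0.5, -1.2]), thetas))
```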

Multiclass Classification

To classify data into multiple classes, we let our hypothesis function return a vector of hypotheses. The following vector would mean the third class:

\[\begin{bmatrix} 0 \newline 0 \newline 1 \newline 0 \newline \end{bmatrix}\]
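A small sketch of how these vectors are used (the class labels here are assumptions for illustration): the training labels are "one-hot" vectors like the one above, and the predicted class is the index of the largest hypothesis output:

```python
import numpy as np

classes = ["pedestrian", "car", "motorcycle", "truck"]   # illustrative labels only

y = np.array([0, 0, 1, 0])           # label vector meaning the third class
h = np.array([0.1, 0.2, 0.9, 0.3])   # hypothetical network output h_Theta(x)

print(classes[np.argmax(y)])         # "motorcycle" -- the third class
print(classes[np.argmax(h)])         # predicted class, here also "motorcycle"
```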