5a Neural Networks - Learning

Aug 13, 2017 14:05 · 854 words · 5 minutes read

Defining a few variables:

\[ L = \text{total number of layers in the network} \\ s_l = \text{number of units in layer } l \\ K = \text{number of output units/classes} \]

In neural networks, we have multiple output nodes, and we denote by $h_\Theta (x)_k$ the hypothesis that results in the $k^{th}$ output.

Cost Function

Our cost function is a generalised version of the one we used for logistic regression, which was:

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)} \log( h_\theta ( x^{(i)} ) ) + (1-y^{(i)} ) \log (1 - h_\theta( x^{(i)} ) ) ] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 \]

For neural networks, we have nested summations to account for the multiple output nodes. In the regularisation part, we must account for multiple theta matrices. In each theta matrix, the number of columns equals the number of nodes in the current layer plus one (for the bias unit), and the number of rows equals the number of nodes in the next layer (which excludes the bias unit).

\[ J(\Theta) = - \frac{1}{m} [ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log( ( h_\Theta( x^{(i)} ) )_k ) + (1-y_k^{(i)} ) \log( 1 - ( h_\Theta( x^{(i)} ) )_k ) ] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)} )^2 \]

  • The double sum simply adds up the logistic regression costs calculated for each cell in the output layer
  • The triple sum simply adds up all the squares of all the individual $\Theta$s in the entire network
  • The $i$ in the triple sum does not refer to training example $i$.
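To make the nesting concrete, here is a minimal NumPy sketch of this cost function. The names (`nn_cost`, `sigmoid`) and the one-hot label matrix `Y` are illustrative conventions, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(thetas, X, Y, lam):
    """Regularised neural network cost.

    thetas : list of theta matrices; thetas[l] has shape (s_{l+1}, s_l + 1)
    X      : (m, n) design matrix, one training example per row
    Y      : (m, K) one-hot label matrix
    lam    : regularisation parameter lambda
    """
    m = X.shape[0]

    # Forward propagation through every layer.
    a = X
    for theta in thetas:
        a = np.hstack([np.ones((m, 1)), a])  # prepend the bias unit
        a = sigmoid(a @ theta.T)
    h = a  # (m, K) hypothesis values

    # The double sum: logistic cost over every example and output unit.
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m

    # The triple sum: squares of all weights, skipping the bias column.
    reg = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + lam / (2 * m) * reg
```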

Backpropagation Intuition

"Backpropagation" is neural-network terminology for minimising our cost function, just like gradient descent for logistic and linear regression.

We want to compute $\min_\Theta J( \Theta )$ - so we want to minimise our cost function $J$ using an optimal set of parameters in theta.

In backprop, we're going to compute $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$.

For the last layer it's very easy - we can compute the vector of delta values (remember $a^{(l)}_j$ is the activation of node $j$ in layer $l$) where $L$ is the number of layers:

\[\delta^{(L)} = a^{(L)} - y\]

So the "error values" for the last layer are just the difference between the results we get and the correct outputs in $y$.

For the previous layers it's trickier. We use an equation that steps us back from right to left:

\[ \delta^{(l)} = ( ( \Theta^{(l)} )^T \delta^{(l+1)} ) .* g'(z^{(l)}) \]

The delta values of layer $l$ are calculated by multiplying the transposed theta matrix of layer $l$ with the delta values of layer $l+1$. This is then element-wise multiplied with $g'(z^{(l)})$, where $g'$ is the derivative of the activation function $g$:

\[g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})\]

The full equation can be written as:

\[ \delta^{(l)} = (( \Theta^{(l)} )^T \delta^{(l+1)} ) .* a^{(l)} .* (1 - a^{(l)}) \]
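As a quick helper, $g'$ can be computed directly from the activations, which are already available after forward propagation. A small sketch (names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    # g'(z) = g(z) * (1 - g(z)); when a = g(z) is already cached from
    # forward propagation, this is simply a * (1 - a).
    a = sigmoid(z)
    return a * (1 - a)
```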

Backpropagation Algorithm

  • Start with training set $\lbrace ( x^{(1)}, y^{(1)} ), \cdots, ( x^{(m)}, y^{(m)} )\rbrace$
  • Set $\Delta^{(l)}_{i,j} := 0$ for all $(l, i, j)$
  • Iterate over training examples $t = 1$ to $m$:
    • Set $a^{(1)}$ to the training example ( $a^{(1)} := x^{(t)}$ )
    • Perform forward propagation to compute the "activation" of each layer $a^{(l)}$ for $l = 2,3,\cdots,L$
    • Compute the error for the output layer. Create a vector of zeros with one element per output unit (the size of $a^{(L)}$), and set the element corresponding to the label $y^{(t)}$ to $1$. Use this to compute the vector $\delta^{(L)} = a^{(L)} - y^{(t)}$
    • Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots, \delta^{(2)}$ using $\delta^{(l)} = ( ( \Theta^{(l)} )^T \delta^{(l+1)} ) .* a^{(l)} .* (1 - a^{(l)})$
    • Accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} ( a^{(l)} )^T$
  • Perform regularisation:
    • $D^{(l)}_{i,j} := \frac{1}{m} \left( \Delta^{(l)}_{i,j} + \lambda \Theta^{(l)}_{i,j} \right)$ if $j \ne 0$
    • $D^{(l)}_{i,j} := \frac{1}{m} \left( \Delta^{(l)}_{i,j} \right)$ if $j = 0$

The $\Delta$ matrices act as accumulators, adding up our values as we go along; the resulting $D^{(l)}_{i,j}$ terms are the partial derivatives $\frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta)$.
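Here is a sketch of one full pass of the algorithm, assuming sigmoid activations throughout and integer class labels. The names are illustrative, not the course's starter code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, X, y, num_labels, lam):
    """One pass of backpropagation over the whole training set.

    thetas     : list of theta matrices; thetas[l]: (s_{l+1}, s_l + 1)
    X          : (m, n) training inputs
    y          : (m,) integer class labels in 0..num_labels-1
    num_labels : K, the number of output units
    lam        : regularisation parameter lambda

    Returns D, a list of gradient matrices matching thetas.
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(t) for t in thetas]  # the accumulators

    for t in range(m):
        # Forward propagation, caching every layer's activations.
        a = [np.concatenate([[1.0], X[t]])]  # a^{(1)} with bias unit
        for l, theta in enumerate(thetas):
            act = sigmoid(theta @ a[-1])
            if l < len(thetas) - 1:
                act = np.concatenate([[1.0], act])  # bias for hidden layers
            a.append(act)

        # One-hot target vector for this example.
        y_vec = np.zeros(num_labels)
        y_vec[y[t]] = 1.0

        # Output-layer error and its contribution to the last accumulator.
        d = a[-1] - y_vec
        Deltas[-1] += np.outer(d, a[-2])

        # Step the error back through the hidden layers, dropping the
        # bias component of each delta vector.
        for l in range(len(thetas) - 2, -1, -1):
            d = (thetas[l + 1].T @ d)[1:] * a[l + 1][1:] * (1 - a[l + 1][1:])
            Deltas[l] += np.outer(d, a[l])

    # Average, and regularise everything except the bias column (j = 0).
    D = []
    for theta, Delta in zip(thetas, Deltas):
        grad = Delta / m
        grad[:, 1:] += (lam / m) * theta[:, 1:]
        D.append(grad)
    return D
```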

Random Initialisation

If we initialise all theta weights to zero, every node in a layer will compute the same value and receive the same gradient, so they all update to the same value repeatedly during backpropagation - the symmetry is never broken.

Instead, we randomly initialise our weights - each $\Theta^{(l)}_{i,j}$ is set to a random value in the interval $[ -\epsilon, \epsilon ]$:

\[ \epsilon = \frac{ \sqrt{6} }{ \sqrt{L_{output} + L_{input}} } \\ \Theta^{(l)} = 2 \epsilon \cdot \text{rand}( L_{output}, L_{input} + 1 ) - \epsilon \]
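A sketch of this initialisation for a single theta matrix (the function name is mine; `np.random.rand` draws uniformly from $[0, 1)$):

```python
import numpy as np

def rand_initialise(l_in, l_out):
    """Random weights for a layer with l_in inputs and l_out outputs.

    Returns an (l_out, l_in + 1) matrix (the +1 covers the bias column)
    with entries drawn uniformly from [-eps, eps].
    """
    eps = np.sqrt(6) / np.sqrt(l_in + l_out)
    return np.random.rand(l_out, l_in + 1) * 2 * eps - eps
```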

Putting it together

First pick a network architecture: the layout of the neural network, including how many hidden units in each layer, and how many layers total.

  • Number of input units = dimension of features $x^{(i)}$
  • Number of output units = number of classes
  • Number of hidden units per layer = usually, the more the better (balanced against the cost of computation, which increases with more hidden units)
  • Defaults: 1 hidden layer. If more than 1 hidden layer, then the same number of units in every hidden layer

Training a Neural Network

  1. Randomly initialise the weights
  2. Implement forward propagation to get $h_\Theta( x^{(i)} )$
  3. Implement the cost function
  4. Implement back propagation to compute partial derivatives
  5. Use gradient checking to confirm that backpropagation works, then disable gradient checking
  6. Use gradient descent or a built-in optimisation function to minimise the cost function with the weights in theta
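For step 5, gradient checking approximates each partial derivative with a two-sided finite difference and compares it against what backpropagation produced. A sketch, assuming the theta matrices have been unrolled into one flat vector:

```python
import numpy as np

def numerical_gradient(cost_fn, theta, eps=1e-4):
    """Finite-difference approximation of dJ/dtheta.

    cost_fn : maps a flat parameter vector to the scalar cost J
    theta   : flat parameter vector (unrolled theta matrices)
    """
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        grad[i] = (cost_fn(theta + bump) - cost_fn(theta - bump)) / (2 * eps)
    return grad
```

If the numerical and backprop gradients agree to within a few decimal places (e.g. via `np.allclose`), the implementation is probably correct. Each check costs two full cost evaluations per parameter, which is why gradient checking must be disabled before training proper.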