10a Large Scale Machine Learning
Aug 13, 2017 14:10 · 635 words · 3 minutes read
Large Scale Machine Learning
Learning with Large Datasets
We mainly benefit from a large dataset when our algorithm has high variance with small $m$ (i.e. it overfits). If it has high bias, more data won't help.
Datasets can approach sizes of $m=100{,}000{,}000$, and at that scale each gradient descent step has to sum over all the examples. There are several approaches to avoid this:
Stochastic Gradient Descent
This is an alternative to classic gradient descent that is more efficient and scalable to large data sets.
The cost function is written per training example: $\text{cost}(\Theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\Theta}(x^{(i)}) - y^{(i)})^2$, so the $\frac{1}{2m}$ factor of the usual cost becomes just $\frac{1}{2}$.
The algorithm is as follows:
- Randomly shuffle the dataset
- For $i = 1,\dots,m$, compute \(\Theta_j := \Theta_j - \alpha\,( h_{\Theta}( x^{(i)} ) - y^{(i)} )\, x_j^{(i)}\) (for \(j = 0,\dots,n\))
This means that we try and fit one training example at a time, and can make progress in gradient descent without having to scan all $m$ training examples first.
Stochastic gradient descent is unlikely to converge at the global minimum; instead it wanders around near it. Usually the result is close enough. It typically takes 1-10 passes through the dataset to get near the global minimum.
So you still need to scan the whole data set a couple of times, but it makes far faster progress than standard gradient descent.
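A minimal sketch of this loop for linear regression, assuming a NumPy design matrix `X` (with a leading column of ones for the intercept) and target vector `y`; the function name, fixed learning rate, and number of passes are illustrative, not from the original notes:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, passes=5, seed=0):
    """One parameter update per training example, as described above."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(passes):            # typically 1-10 passes over the data
        order = rng.permutation(m)     # randomly shuffle the dataset
        for i in order:
            error = X[i] @ theta - y[i]       # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]     # update every theta_j at once
    return theta
```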
Mini-Batch Gradient Descent
This can sometimes be even faster than stochastic gradient descent. Instead of using all $m$ examples as in batch (normal) gradient descent, or just one example as in stochastic gradient descent, we use an in-between number of examples $b$. Typical values range from 2 to 100.
For $b = 10$ and $m = 1000$:
- Repeat:
- For $i = 1, 11, 21, 31, ..., 991$:
- \(\theta_j := \theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}\)
We're summing over ten examples at a time.
The advantage over one example at a time is that we can use vectorised implementations over the $b$ examples.
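A vectorised sketch of this update under the same assumptions as before (NumPy `X` and `y`, illustrative function name and default values):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, epochs=5, seed=0):
    """Update theta once per mini-batch of b examples, using vectorised NumPy ops."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # (1/b) * sum of gradient terms
            theta -= alpha * grad
    return theta
```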
Stochastic Gradient Descent Convergence
Choosing a learning rate $\alpha$ and debugging stochastic gradient descent can be tricky.
One way is to compute the cost on each example just before its update, and every 1000 or so examples plot the average of those last 1000 costs.
With a smaller learning rate you may get a slightly better solution: the parameters oscillate and jump around the global minimum, and those jumps are smaller when the learning rate is smaller.
One strategy for actually converging to the global minimum is to slowly decrease $\alpha$ over time, for example $\alpha = \frac{\text{const}_1}{\text{iterationNumber} + \text{const}_2}$. However, this does mean fiddling with two extra parameters.
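A sketch combining both ideas above (averaging the per-example cost over each window of 1000 updates, and decaying $\alpha$); the function name and constants are placeholders:

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=1.0, const2=50.0, passes=5, seed=0):
    """SGD that logs the average cost per 1000 updates and decays the learning rate."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    costs, avg_costs, iteration = [], [], 0
    for _ in range(passes):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            costs.append(0.5 * error ** 2)            # cost on this example, before the update
            alpha = const1 / (iteration + const2)     # slowly decreasing learning rate
            theta -= alpha * error * X[i]
            iteration += 1
            if iteration % 1000 == 0:
                avg_costs.append(np.mean(costs[-1000:]))  # one point to plot
    return theta, avg_costs
```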
Online Learning
With a continuous stream of data, we can run an endless loop that gets $(x,y)$, where $x$ collects user actions as features and $y$ is the behaviour we want to predict.
$\theta$ can be updated for each $(x,y)$ as it arrives, and the example can then be discarded; because $\theta$ is continuously updated, the model adapts to new pools of users over time.
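An illustrative version of such a loop for a logistic regression model of user behaviour; `get_next_example()` is a hypothetical stream source, not part of the original notes:

```python
import numpy as np

def online_logistic_update(theta, x, y, alpha=0.01):
    """Apply one online-learning update for logistic regression, then discard (x, y)."""
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))   # predicted probability that y = 1
    return theta - alpha * (h - y) * x

# theta = np.zeros(n_features)
# while True:                          # endless loop over the incoming stream
#     x, y = get_next_example()        # hypothetical: user features and observed behaviour
#     theta = online_logistic_update(theta, x, y)
```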
Map Reduce and Data Parallelism
We can split the work of batch gradient descent by dispatching the gradient computation for different subsets of the data to different machines, training the algorithm in parallel.
You can split the training set into $z$ subsets, one for each machine you have. On each machine, compute the average of the gradient terms $(h_{\Theta}(x^{(i)}) - y^{(i)})\,x_j^{(i)}$ over its subset; call the result $\mathrm{part}_k$ for machine $k$.
Once all have completed, take the results from the dispatched jobs and reduce them by calculating:
\(\Theta_j := \Theta_j - \alpha \frac{1}{z}( \mathrm{part}_1 + \dots + \mathrm{part}_z )\) for all $j = 0,\dots,n$.
This is just taking the per-machine gradient averages, averaging them across the $z$ machines, multiplying by the learning rate, and updating theta.
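A single-machine simulation of the map and reduce steps (splitting the data into $z$ chunks, computing per-chunk averages, then averaging those); in a real deployment each chunk would be dispatched to a separate machine, and the names and defaults here are illustrative:

```python
import numpy as np

def mapreduce_gradient_step(X, y, theta, alpha=0.01, z=4):
    """One batch gradient descent step where the gradient is the average
    of z per-chunk average gradients (the 'map' results)."""
    parts = []
    for Xc, yc in zip(np.array_split(X, z), np.array_split(y, z)):
        parts.append(Xc.T @ (Xc @ theta - yc) / len(yc))   # map: per-chunk average gradient
    return theta - alpha * np.mean(parts, axis=0)          # reduce: average the z parts
```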
A learning algorithm is map-reduceable if it can be expressed as computing sums of functions over the training set. For example, linear regression and logistic regression.
For neural networks, you can compute forward and back propagation on subsets of your data on many machines. A master server can then combine the resulting gradients and update the parameters.