3c Regularisation

Aug 13, 2017 14:03 · 496 words · 3 minutes read

Regularisation

Regularisation addresses the problem of overfitting.

Underfitting is where the form of the hypothesis maps poorly to the trend of the data. It's usually caused by a function that is too simple or uses too few features.

For example, if we take $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, then we're making the assumption that a linear model will fit the training data well.

At the other extreme, overfitting is caused by a hypothesis that fits the available data but does not generalise well to new data. It's usually caused by a high-order function that creates a lot of unnecessary curves unrelated to the data.

There are two main options to address overfitting:

  1. Reduce the number of features
    • Manually select which features to keep
    • Use a model selection algorithm
  2. Regularisation
    • Keep all the features, but reduce the magnitude of the parameters $\theta_j$.

Regularisation works well when we have a lot of slightly useful features.

Regularised Linear Regression

Cost Function

The intuition developed here for linear regression carries over to regularised logistic regression too.

If we have overfitting from our hypothesis function, we can reduce the weight of some terms by increasing their cost. Say we want to make the following function more quadratic:

\[\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4\]

We want to reduce the influence of the $x^3$ and $x^4$ terms without actually getting rid of them. So, we modify our cost function:

\[ \min_\theta \; \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2 \]

The extra terms inflate the cost of the $\theta_3$ and $\theta_4$ terms - in order for the cost to be small, these values will also have to be small.

We can also regularise all of our theta parameters:

\[ \min_\theta \; \frac{1}{2m} \left[ \sum_{i=1}^m ( h_\theta(x^{(i)}) - y^{(i)} )^2 + \lambda \sum_{j=1}^n \theta_j^2 \right] \]

The $\lambda$ parameter is the regularisation parameter that determines how much the costs of our $\theta$ parameters are inflated. This can be used to smooth the output of the hypothesis function to reduce overfitting.
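To make this concrete, here's a minimal NumPy sketch of the regularised cost. The function name is my own, and it assumes `X` is an $(m, n+1)$ design matrix whose first column is all ones, `y` is the target vector, and `theta` holds the parameters:

```python
import numpy as np

def regularised_cost(theta, X, y, lam):
    """Regularised linear regression cost J(theta) (names are illustrative)."""
    m = len(y)
    residuals = X @ theta - y                          # h_theta(x^(i)) - y^(i) for each example
    mse_term = (residuals @ residuals) / (2 * m)       # (1/2m) * sum of squared errors
    reg_term = lam * np.sum(theta[1:] ** 2) / (2 * m)  # lambda/(2m) * sum of theta_j^2, j >= 1
    return mse_term + reg_term
```

Note that `theta[1:]` skips $\theta_0$, matching the sum from $j = 1$ to $n$ in the formula above.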

Gradient Descent

We don't want to penalise $\theta_0$, so we only modify the formula for $j \in \{ 1, 2, \dots, n \}$ and leave the $\theta_0$ update unchanged:

\[ \theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \]

For $j \geq 1$, the update is the same as before, with an added regularisation term:

\[ \theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \]
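A vectorised sketch of one update step, under the same assumptions as the cost sketch above (the function name is illustrative):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularised gradient descent update for linear regression."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m   # (1/m) * sum of (h - y) * x_j, for every j at once
    reg = (lam / m) * theta
    reg[0] = 0.0                       # theta_0 is not penalised
    return theta - alpha * (grad + reg)
```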

Regularised Logistic Regression

Cost Function

We can regularise the logistic regression cost function by adding a regularisation term to the end:

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log( h_\theta (x^{(i)}) ) + (1 - y^{(i)}) \log( 1 - h_\theta( x^{(i)} ) ) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 \]

The second sum explicitly excludes the bias term $\theta_0$ by running from $j = 1$ to $n$.
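A minimal NumPy sketch of this cost, again assuming a design matrix `X` with a leading column of ones (helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularised_logistic_cost(theta, X, y, lam):
    """Regularised logistic regression cost J(theta)."""
    m = len(y)
    h = sigmoid(X @ theta)                                 # h_theta(x^(i)) for each example
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg_term = lam * np.sum(theta[1:] ** 2) / (2 * m)      # excludes theta_0
    return cross_entropy + reg_term
```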

Gradient Descent

Just like with linear regression, we don't want to regularise $\theta_0$, so for $j \in \{ 1, 2, \dots, n \}$ the update is:

\[ \theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^m ( h_\theta( x^{(i)} ) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \]
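And a sketch of one update step; the only change from the linear regression version is that the hypothesis is now the sigmoid (names are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_step(theta, X, y, alpha, lam):
    """One regularised gradient descent update for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)             # hypothesis is the sigmoid of the linear term
    grad = X.T @ (h - y) / m           # (1/m) * sum of (h - y) * x_j, for every j at once
    reg = (lam / m) * theta
    reg[0] = 0.0                       # leave theta_0 unregularised
    return theta - alpha * (grad + reg)
```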