1b Linear Regression
Aug 13, 2017 14:01 · 927 words · 5 minutes read
Linear Regression
Linear regression is used when we are trying to predict results within a continuous output, meaning we are trying to map input variables to some continuous function.
e.g. Given some data about real estate prices, trying to predict the price of a house.
Linear Regression with One Variable
This is a nice introduction to linear regression, before moving on to the more generic form.
Hypothesis Function
Our hypothesis function is our attempt to create a function $h_\theta$ that reliably maps our input data $x$ to our output data $y$. The hypothesis function has the general form:
\[h_\theta(x) = \theta_0 + \theta_1x\]
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes a fancy average of all the results of the hypothesis with inputs from $x$'s compared to the actual output $y$'s.
\[J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2\]
Breaking it down, it's just $\frac{1}{2}\bar{x}$ where $\bar{x}$ is the mean of the squares of $h_\theta(x_i)-y_i$.
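As a sketch in NumPy (assuming `x` and `y` are 1-D arrays of the same length):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1): half the mean of the squared errors."""
    predictions = theta0 + theta1 * x   # h_theta(x_i) for every example
    return np.mean((predictions - y) ** 2) / 2

x = np.array([0.0, 1.0, 2.0])
y = 1 + 2 * x                      # data generated by y = 1 + 2x
perfect = cost(1.0, 2.0, x, y)     # a perfect fit gives zero cost
```

A perfect hypothesis gives a cost of exactly zero; any error makes the cost strictly positive.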
Gradient Descent
We have a hypothesis function and a way of measuring how accurate it is. Now we need a way to automatically improve our hypothesis function.
Imagine that we graph the cost function against the values of $\theta$, with $\theta_0$ on the $x$ axis, $\theta_1$ on the $y$ axis, and the cost $J(\theta_0, \theta_1)$ on the vertical $z$ axis. We have optimised our hypothesis function when we reach the very bottom of this surface.
We do this by taking the derivative (the slope of the graph) and moving downhill, in the direction opposite the gradient. The size of each step is controlled by $\alpha$, the learning rate.
\[ repeat\ until\ convergence: \{ \\ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x_i) - y_i) \\ \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m ((h_\theta(x_i) - y_i) x_i) \\ \} \]
where $m$ is the size of the training set, $\theta_0$ and $\theta_1$ are updated simultaneously, and $x_i$, $y_i$ are the values of the given training set.
The equation from which this is derived is:
\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \\ for\ j=0\ and\ j=1\]
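The update loop above can be sketched in NumPy. This is a minimal illustration, not an optimised implementation; a fixed iteration count stands in for a real convergence test:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Repeat the theta_0 / theta_1 updates until (approximately) converged."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y   # h_theta(x_i) - y_i
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        # update both parameters simultaneously, from the same errors
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x                      # true parameters: theta0 = 1, theta1 = 2
t0, t1 = gradient_descent(x, y)
```

Note that both gradients are computed from the same `errors` array before either parameter changes, which is what "updated simultaneously" means in practice.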
Linear Regression with Multiple Variables
Also known as "multivariate linear regression".
- $x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example
- $x^{(i)}$ = the column vector of all the feature inputs of the $i^{th}$ training example
- $m$ = the number of training examples
- $n = \left|x^{(i)}\right|$ (the number of features)
Hypothesis Function
Using matrix multiplication, the multivariate hypothesis function for one training example can be represented as:
\[ h_\theta(x) = \begin{bmatrix} x_0 \hspace{1em} x_1 \hspace{1em} ... \hspace{1em} x_n \end{bmatrix} \begin{bmatrix} \theta_0 \newline \theta_1 \newline \vdots \newline \theta_n \end{bmatrix} \\ = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n \\ = x \theta \]
So:
\[h_\theta(x) = x\theta\]
If we collect all $m$ training examples, each with $n$ features (plus the bias feature $x_0 = 1$), and record them in an $m$ by $n+1$ matrix, we can vectorise the hypothesis function:
\[ h_\theta(X) = \\ \begin{bmatrix} x_0^{(1)} \hspace{1em} x_1^{(1)} \hspace{1em} ... \hspace{1em} x_n^{(1)} \newline x_0^{(2)} \hspace{1em} x_1^{(2)} \hspace{1em} ... \hspace{1em} x_n^{(2)} \newline \vdots \newline x_0^{(m)} \hspace{1em} x_1^{(m)} \hspace{1em} ... \hspace{1em} x_n^{(m)} \newline \end{bmatrix} * \begin{bmatrix} \theta_0 \newline \theta_1 \newline \vdots \newline \theta_n \end{bmatrix} \\ = \begin{bmatrix} x_0^{(1)} * \theta_0 + x_1^{(1)} * \theta_1 + ... + x_n^{(1)} * \theta_n \newline \vdots \newline x_0^{(m)} * \theta_0 + x_1^{(m)} * \theta_1 + ... + x_n^{(m)} * \theta_n \newline \end{bmatrix} \\ = X \theta \]
Giving:
\[h_\theta(X) = X \theta\]
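For example, with two training examples and two features each (plus the bias column $x_0 = 1$):

```python
import numpy as np

# Design matrix X: one row per training example, first column is x_0 = 1.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
theta = np.array([0.5, 1.0, 2.0])

h = X @ theta   # one prediction per training example
```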
Cost Function
If you substitute the vectorised version of the Hypothesis Function into the standard Cost Function, it simplifies to:
\[J(\theta) = \frac{1}{2m} ( X \theta - \vec{y} )^T ( X \theta - \vec{y} )\]
Where $\vec{y}$ denotes the vector of all $y$ values.
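As a sketch, the vectorised cost is one matrix product (assuming `X` carries the bias column):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2m) * (X theta - y)^T (X theta - y)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])   # y = 1 + 2x, so theta = [1, 2] fits exactly
```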
Gradient Descent for Multiple Variables
This is exactly the same as single-variable gradient descent - you compute the formula for each feature and repeat until convergence. In other words:
\[ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m ((h_\theta(x_i) - y_i) x_j^{(i)}) \ \ \ \ \text{for j := 0..n} \]
and then update all $\theta_j$ simultaneously. Repeat until convergence.
The matrix notation (vectorised) of Gradient Descent is:
\[ \theta := \theta - \frac{\alpha}{m} X^T ( X \theta - \vec{y} ) \]
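The vectorised update is a one-liner in NumPy. A minimal sketch, again with a fixed iteration count standing in for a convergence check:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """theta := theta - (alpha/m) * X^T (X theta - y), repeated."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # one matrix expression updates every theta_j simultaneously
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = 1 + 2 * X[:, 1]            # true parameters: [1, 2]
theta = gradient_descent(X, y)
```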
Feature Normalisation
Gradient descent works best when our input values are in roughly the same range.
We want all our values roughly in the range:
\[-1 \le x \le 1 \]
Two techniques for this are feature scaling and mean normalisation. Feature scaling divides the input values by the range (maximum minus minimum), giving a new range of exactly 1. Mean normalisation subtracts the average value, giving a new average of 0. The two are usually applied together.
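A sketch of both techniques combined, column by column (dividing by the standard deviation instead of the range is a common alternative; the returned `mu` and `span` are needed to normalise future inputs the same way):

```python
import numpy as np

def normalise(X):
    """Mean normalisation then feature scaling, per feature column."""
    mu = X.mean(axis=0)                    # subtract: new average is 0
    span = X.max(axis=0) - X.min(axis=0)   # divide: new range is 1
    return (X - mu) / span, mu, span

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
X_norm, mu, span = normalise(X)
```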
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that doesn't fit the data well.
If we have $h_\theta(x) = \theta_0 + \theta_1 x_1$, we can add new features derived from $x_1$, such as $x_2 = x_1^2$, to get the quadratic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$. The same trick works for cubic terms or the square root function.
Feature scaling is very important with polynomial regression.
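Building the extra columns is straightforward; the comment shows why scaling matters (the bias column $x_0 = 1$ is included as before):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 10.0])

# New features are just powers of x1, stacked as extra design-matrix columns.
X = np.column_stack([np.ones_like(x1), x1, x1 ** 2])

# Without scaling the ranges diverge quickly: x1 spans 9, while x1^2 spans 99.
```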
Normal Equation
The "normal equation" finds the optimal $\theta$ analytically, without iteration.
\[\theta = (X^T X)^{-1} X^T y\]
You don't need to do feature scaling or choose $\alpha$ with the normal equation. However, it can be slow when the number of features is very large, since computing $(X^T X)^{-1}$ costs roughly $O(n^3)$.
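A sketch in NumPy; note it solves the linear system $(X^T X)\theta = X^T y$ rather than forming the inverse explicitly, which is numerically more stable but gives the same $\theta$:

```python
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y instead of inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])   # y = 1 + 2x
theta = normal_equation(X, y)
```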
Vectorized Formulae
Hypothesis Function:
\[h_\theta(X) = X \theta\]
Cost Function:
\[J(\theta) = \frac{1}{2m} ( X \theta - \vec{y} )^T ( X \theta - \vec{y} )\]
Gradient Descent:
\[\theta := \theta - \frac{\alpha}{m} X^T ( X \theta - \vec{y} )\]
Normal Equation:
\[\theta = (X^T X)^{-1} X^T y\]