8a Clustering
Aug 13, 2017 14:08 · 530 words · 3 minutes read
Clustering
Unsupervised Learning
Clustering is a form of unsupervised learning, because it uses an unlabeled training set. In other words, we don't have a labeled training set - just a dataset of features where we can find structure.
K-Means Algorithm
K-Means is popular and widely used for clustering data into subsets.
- Randomly initialise K cluster centroids to K points in the dataset
- For each data point, assign it to the closest cluster centroid
- Compute average for all points inside each cluster, and assign each cluster centroid to those averages
Rerun steps 2 and 3 until no points change groups
```
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
    for i = 1 to m:
        c(i) := index (from 1 to K) of cluster centroid closest to x(i)
    for k = 1 to K:
        mu(k) := average (mean) of points assigned to cluster k
```
We can write the Cluster Assignment step as
\[c^{(i)} = \arg\min_k ||x^{(i)} - \mu_k||^2\]
That is, each $c^{(i)}$ contains the index of the centroid that has minimal distance to $x^{(i)}$.
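This step vectorises neatly in NumPy. A minimal sketch (the data points and centroid positions below are made up for illustration):

```python
import numpy as np

# Hypothetical small dataset: m=4 points in R^2, K=2 centroids.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
mu = np.array([[0.0, 0.5], [5.5, 5.0]])

# Squared distance from each point to each centroid: shape (m, K).
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

# c[i] = index of the centroid closest to x(i).
c = dists.argmin(axis=1)
print(c)  # → [0 0 1 1]
```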
In the Move Centroid step, we compute the cluster centroid $\mu_k$ as
\[\mu_k = \frac{1}{n_k} \left[ x^{(k_1)} + x^{(k_2)} + \dots + x^{(k_{n_k})} \right] \in \mathbb{R}^n \]
where $x^{(k_1)}, x^{(k_2)}, \dots, x^{(k_{n_k})}$ are the $n_k$ training examples currently assigned to cluster $k$.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialise that centroid to a new point.
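Putting the two steps together, here is a minimal NumPy sketch of the full algorithm (the function name, convergence check, and empty-cluster re-initialisation strategy are my own choices):

```python
import numpy as np

def k_means(X, K, max_iters=100, rng=None):
    """A minimal K-means sketch. X: (m, n) data matrix, K: number of clusters."""
    rng = np.random.default_rng(rng)
    m, n = X.shape
    # Initialise centroids to K distinct training examples.
    mu = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(max_iters):
        # Cluster assignment step: c[i] = index of closest centroid to x(i).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: mean of the points assigned to each cluster.
        new_mu = mu.copy()
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:
                new_mu[k] = members.mean(axis=0)
            else:
                # Empty cluster: re-initialise to a random training example.
                new_mu[k] = X[rng.integers(m)]
        if np.allclose(new_mu, mu):
            break  # no centroid moved, so assignments won't change
        mu = new_mu
    return c, mu
```

On two well-separated blobs this recovers the natural grouping, e.g. `k_means(X, 2)` assigns each blob to its own cluster.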
Optimisation Objective
Given:
- $c^{(i)} = $ index of cluster $(1,2,...K)$ to which example $x^{(i)}$ is currently assigned
- $\mu_k = $ cluster centroid $k ( \mu_k \in \mathbb{R}^n)$
- $\mu_{c^{(i)}} = $ cluster centroid to which example $x^{(i)}$ has been assigned
We can define our cost function:
\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \dfrac{1}{m}\sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2\]
Our optimisation objective is to minimise this cost function over all the parameters. That is, we're finding the values of $c^{(1)},\dots,c^{(m)}$ and $\mu_1,\dots,\mu_K$ that minimise the average squared distance of every training example to its corresponding cluster centroid.
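The cost (also called the distortion) is a one-liner in NumPy. A small sketch with made-up numbers (the function name `distortion` is my own):

```python
import numpy as np

def distortion(X, c, mu):
    """Cost J: average squared distance from each point to its assigned centroid."""
    return ((X - mu[c]) ** 2).sum(axis=1).mean()

# Tiny made-up example: two points, one centroid at their midpoint.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
mu = np.array([[1.0, 0.0]])
c = np.array([0, 0])
print(distortion(X, c, mu))  # → 1.0
```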
With K-means, the cost function can never increase from one iteration to the next - it should always decrease or stay the same. If it increases, there's a bug in the implementation.
Random Initialisation
The recommended method for randomly initialising cluster centroids:
- Have $K < m$ - that is, ensure the number of clusters is less than the number of training examples
- Randomly pick $K$ training examples (ensure they're unique)
- Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples
K-means can get stuck in local optima. To decrease the chance of this happening, run the algorithm multiple times with different random initialisations.
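A sketch of this multiple-restarts strategy: run K-means many times and keep the clustering with the lowest cost $J$. (The helper `run_once` repeats a compact K-means so the example is self-contained; all names are my own choices.)

```python
import numpy as np

def run_once(X, K, rng):
    """One K-means run from a fresh random initialisation; returns (J, c, mu)."""
    m = len(X)
    mu = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(100):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any()
                           else X[rng.integers(m)] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[c]) ** 2).sum(axis=1).mean()
    return J, c, mu

def k_means_restarts(X, K, n_init=50, seed=0):
    """Run K-means n_init times and keep the result with the lowest cost J."""
    rng = np.random.default_rng(seed)
    return min((run_once(X, K, rng) for _ in range(n_init)),
               key=lambda t: t[0])
```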
Choosing the number of clusters
Choosing $K$ can be quite arbitrary and ambiguous.
The elbow method: plot the cost $J$ against the number of clusters $K$. The cost should decrease as we increase the number of clusters, then flatten out. Choose $K$ at the point where the curve starts to flatten - the "elbow".
However, often the curve is very gradual so there's no clear elbow.
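A sketch of the elbow method on synthetic data (the mini K-means helper is repeated so the example is self-contained, and the two-blob dataset is made up so that the elbow appears at $K = 2$):

```python
import numpy as np

def k_means_cost(X, K, rng):
    """One K-means run from random initialisation; returns the final cost J."""
    m = len(X)
    mu = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(100):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any()
                           else X[rng.integers(m)] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return ((X - mu[c]) ** 2).sum(axis=1).mean()

# Made-up dataset: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(10, 0.5, (20, 2))])

costs = []
for K in range(1, 6):
    # Best of a few random initialisations at each K.
    costs.append(min(k_means_cost(X, K, rng) for _ in range(5)))
# The cost drops sharply from K=1 to K=2, then flattens: the elbow is at K=2.
```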
Another way is to observe how well it performs on some downstream purpose - choose $K$ that proves to be most useful for some goal you're trying to achieve.
Discussion of the drawbacks of K-Means
From StackExchange: a discussion showing various situations in which K-means converges correctly but produces unexpected, counterintuitive results.