9a Anomaly Detection
Aug 13, 2017 14:09 · 884 words · 5 minutes read
Anomaly Detection
Problem Motivation
We have a dataset with $m$ examples.
We're then given a new example $x_{test}$, and we want to know whether this example is abnormal/anomalous.
- $p(x)$ - a model of the probability of seeing example $x$; anomalous examples should have low $p(x)$
- $\epsilon$ - a threshold below which we flag an example as anomalous
If our anomaly detector is flagging too many examples, we need to decrease the threshold $\epsilon$.
Gaussian Distribution
This is the bell-shaped distribution that can be described by $\mathcal{N}(\mu, \sigma^2)$. It's parameterised by:
- the mean $\mu$, which describes the center of the curve
- the standard deviation $\sigma$, which describes the width of the curve
If we have an $x \in \mathbb{R}$ that has a Gaussian probability distribution with mean $\mu$ and variance $\sigma^2$, then:
\[x \sim \mathcal{N}(\mu, \sigma^2)\]
We can estimate $\mu$ from a given dataset by taking the average of all the examples:
\[\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}\]
We can estimate $\sigma^2$ with the squared error formula:
\[\sigma^2 = \frac{1}{m} \sum_{i=1}^m ( x^{(i)} - \mu )^2\]
Once we have $\mu$ and $\sigma^2$ we can compute $p(x)$ using the formula for the Gaussian distribution:
\[p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ - \frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 }\]
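The parameter estimates and the density formula above can be sketched in NumPy. This is a minimal sketch, assuming a feature matrix `X` of shape `(m, n)` (the function and variable names are my own):

```python
import numpy as np

def estimate_gaussian(X):
    """Estimate mu and sigma^2 for each column of X (m examples x n features)."""
    mu = X.mean(axis=0)                       # mu = (1/m) * sum of x^(i)
    sigma2 = ((X - mu) ** 2).mean(axis=0)     # sigma^2 = (1/m) * sum of (x^(i) - mu)^2
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the Gaussian density p(x; mu, sigma^2) element-wise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```

Note the $\frac{1}{m}$ normalisation for $\sigma^2$ (rather than the unbiased $\frac{1}{m-1}$), matching the formula above.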
Algorithm
Given a training set of $\lbrace x^{(1)}, ..., x^{(m)} \rbrace$ where each example is a vector, $x \in \mathbb{R}^n$, we perform the following steps:
- Choose features $x_j$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1, ..., \mu_n$, $\sigma^2_1, ..., \sigma^2_n$ using the above formulae.
Then given a new example $x$, we can compute the probability $p(x)$ using:
\[p(x) = p(x_1; \mu_1, \sigma^2_1) \, p(x_2; \mu_2, \sigma^2_2) \cdots p(x_n; \mu_n, \sigma^2_n)\]
which can be written as:
\[\prod_{j=1}^n p( x_j; \mu_j, \sigma_j^2 )\]
If $p(x) < \epsilon$, then $x$ is an anomaly.
Note that this is not a multivariate Gaussian distribution, so it assumes the features are all independent. If there are correlations between the features, see the multivariate Gaussian distribution section below.
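The full detection step, taking the product of per-feature densities and comparing against $\epsilon$, can be sketched as follows (a minimal NumPy sketch; `mu` and `sigma2` are per-feature vectors and the function names are my own):

```python
import numpy as np

def p(x, mu, sigma2):
    """p(x) = product over features j of the univariate Gaussians p(x_j; mu_j, sigma2_j)."""
    densities = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(densities)

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous if its probability falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon
```

In practice it is common to sum log-densities instead of multiplying raw densities, since the product of many small numbers underflows to zero for large $n$.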
Developing and Evaluating an Anomaly Detection System
To evaluate our system, we can use labeled data that's categorised into anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).
From the data, take a large proportion of non-anomalous data for the training set, on which to train $p(x)$.
Take a smaller proportion of mixed anomalous and non-anomalous examples for the cross-validation and test sets.
e.g. Given a set where 0.2% of the data is anomalous:
- Take 60% of the good ($y=0$) examples for the training set
- Take 20% of the examples for the cross-validation set (with half of the anomalous examples)
- Take 20% of the examples for the test set (with the other half of the anomalous examples)
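The 60/20/20 split above can be sketched as a small helper (a sketch under my own assumptions: `X_good` holds the $y=0$ examples, `X_anom` the $y=1$ examples, and the anomalies are divided evenly between the CV and test sets):

```python
import numpy as np

def split_data(X_good, X_anom, seed=0):
    """60% of normal examples for training; CV and test each get 20% of the
    normal examples plus half of the anomalous examples."""
    rng = np.random.default_rng(seed)
    X_good = X_good[rng.permutation(len(X_good))]  # shuffle before splitting
    m = len(X_good)
    train = X_good[: int(0.6 * m)]
    cv_good = X_good[int(0.6 * m): int(0.8 * m)]
    test_good = X_good[int(0.8 * m):]
    half = len(X_anom) // 2
    cv = np.vstack([cv_good, X_anom[:half]])
    test = np.vstack([test_good, X_anom[half:]])
    return train, cv, test
```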
To evaluate the algorithm:
- Fit model $p(x)$ on training set
- On cross validation example $x$, predict:
- If $p(x) \lt \epsilon$ (anomaly), then $y=1$
- If $p(x) \ge \epsilon$ (normal), then $y=0$
Possible evaluation metrics include the $F_1$ score. Note we use the cross-validation set to choose parameter $\epsilon$.
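Choosing $\epsilon$ by scanning candidate thresholds and keeping the one with the best $F_1$ score on the cross-validation set might look like this (a minimal sketch; `p_cv` holds $p(x)$ for each CV example and `y_cv` the labels, and the scanning scheme is my own choice):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and return the epsilon with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)          # 1 = flagged as anomaly
        tp = np.sum((preds == 1) & (y_cv == 1))   # true positives
        fp = np.sum((preds == 1) & (y_cv == 0))   # false positives
        fn = np.sum((preds == 0) & (y_cv == 1))   # false negatives
        if tp == 0:
            continue  # precision/recall undefined or zero; skip this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

$F_1$ is preferred over plain accuracy here because the classes are heavily skewed: predicting $y=0$ for everything would already score very high accuracy.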
Anomaly Detection vs Supervised Learning
Use anomaly detection when:
- We have a very small number of positive examples ($y=1$; around $0$-$20$ examples is common) and a large number of negative ($y=0$) examples
- We have many different types of anomalies and it's hard for an algorithm to learn from the positive examples what the anomalies look like. Future anomalies may look nothing like any of the anomalous examples we've seen so far
Use supervised learning when:
- We have a large number of both positive and negative examples - the training set is more evenly divided into classes
- We have enough positive examples for the algorithm to get a sense of what positive examples look like. The future positive examples are likely similar to the ones in the training set
Choosing what features to use
The features will greatly affect how well your anomaly detection algorithm works.
We can check our features are Gaussian by plotting a histogram of our data and checking for the bell-shaped curve. If a feature doesn't look Gaussian, we can try several transforms on it:
- $log(x)$
- $log(x+1)$
- $log(x+c)$ for some constant $c$
- $\sqrt{x}$
- $x^{1/3}$
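A quick way to compare these transforms is to check how symmetric each one makes the feature. This is a sketch under my own assumptions (a synthetic right-skewed feature and sample skewness as a crude symmetry measure; a histogram is the usual visual check):

```python
import numpy as np

# A right-skewed feature (e.g. exponentially distributed) is not bell-shaped;
# a log or root transform often makes it look more Gaussian.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

candidates = {
    "x": x,
    "log(x + 1)": np.log1p(x),
    "sqrt(x)": np.sqrt(x),
    "x^(1/3)": np.cbrt(x),
}

def skewness(v):
    """Sample skewness: 0 for a symmetric distribution, positive for a right tail."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

for name, v in candidates.items():
    print(f"{name:>10}: skewness = {skewness(v):+.2f}")
```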
Our goal is for $p(x)$ to be large for normal examples and small for anomalous examples.
A common problem is when $p(x)$ is similar for both types of example. In such cases, examine the anomalous examples that are being given high probability and try to find new features that will better distinguish the data.
Multivariate Gaussian Distribution
The multivariate Gaussian distribution is an extension of anomaly detection, and may (or may not) catch more anomalies.
Instead of modelling $p(x_1), p(x_2), ...$ separately, we will model $p(x)$ all in one go. Our parameters will be $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$
We then compute $p(x)$ using:
\[p(x;\mu,\Sigma) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\]
This means we can model elliptical Gaussian contours at any orientation, allowing us to fit data that doesn't suit the axis-aligned contours of the original model.
Anomaly Detection using Multivariate Gaussian Distribution
We compute $\mu$ as before, and $\Sigma$ as the covariance matrix:
\[\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T\]
We then compute $p(x)$ using the formula in the previous section, and flag an anomaly if $p(x) \lt \epsilon$.
The multivariate Gaussian model can automatically capture correlations between different features of $x$. However, the original model is computationally cheaper (no matrix to invert), and it performs well with a small training set size ($\Sigma$ in multivariate might not be invertible otherwise).
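Fitting the multivariate model and evaluating its density can be sketched as follows (a minimal NumPy sketch; the function names are my own, and `np.linalg.solve` is used instead of explicitly inverting $\Sigma$ for numerical stability):

```python
import numpy as np

def fit_multivariate(X):
    """mu = column means; Sigma = covariance matrix with 1/m normalisation."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / len(X)    # (1/m) * sum of (x - mu)(x - mu)^T
    return mu, Sigma

def multivariate_p(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
```

Note that $\Sigma$ is singular (non-invertible) when $m \le n$ or when features are redundant (linearly dependent), which is why the original per-feature model is safer for small training sets.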