9a Anomaly Detection
Aug 13, 2017 14:09 · 884 words · 5 minutes read
Anomaly Detection
Problem Motivation
We have a dataset with $m$ examples.
We're then given a new example $x_{test}$, and we want to know whether this example is abnormal/anomalous.
- $p(x)$ - a model of the probability of seeing example $x$; anomalous examples should have low $p(x)$
- $\epsilon$ - a threshold below which we flag an example as anomalous
If our anomaly detector is flagging too many examples, we need to decrease the threshold $\epsilon$.
Gaussian Distribution
This is the bell-shaped distribution that can be described by $\mathcal{N}(\mu, \sigma^2)$. It's parameterised by:
- the mean $\mu$, which describes the center of the curve
- the standard deviation $\sigma$, which describes the width of the curve
If we have an $x \in \mathbb{R}$ that has a Gaussian probability distribution with mean $\mu$ and variance $\sigma^2$, then:
\[x \sim \mathcal{N}(\mu, \sigma^2)\]
We can estimate $\mu$ from a given dataset by taking the average of all the examples:
\[\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}\]
We can estimate $\sigma^2$ with the squared error formula:
\[\sigma^2 = \frac{1}{m} \sum_{i=1}^m ( x^{(i)} - \mu )^2\]
Once we have $\mu$ and $\sigma^2$ we can compute $p(x)$ using the formula for the Gaussian distribution:
\[p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} e^{ - \frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 }\]
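The parameter estimates and the density formula above can be sketched in NumPy. This is a minimal sketch, assuming a feature matrix `X` of shape `(m, n)` (the function and variable names are my own):

```python
import numpy as np

def estimate_gaussian(X):
    """Estimate mu and sigma^2 for each column of X (m examples x n features)."""
    mu = X.mean(axis=0)                       # mu = (1/m) * sum of x^(i)
    sigma2 = ((X - mu) ** 2).mean(axis=0)     # sigma^2 = (1/m) * sum of (x^(i) - mu)^2
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the Gaussian density p(x; mu, sigma^2) element-wise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```

Note the $\frac{1}{m}$ normalisation for $\sigma^2$ (rather than the unbiased $\frac{1}{m-1}$), matching the formula above.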
Algorithm
Given a training set of $\lbrace x^{(1)}, ..., x^{(m)} \rbrace$ where each example is a vector, $x \in \mathbb{R}^n$, we perform the following steps:
- Choose features $x_j$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1, ..., \mu_n$, $\sigma^2_1, ..., \sigma^2_n$ using the above formulae.
Then given a new example $x$, we can compute the probability $p(x)$ using:
\[p(x) = p(x_1; \mu_1, \sigma^2_1) \, p(x_2; \mu_2, \sigma^2_2) \cdots p(x_n; \mu_n, \sigma^2_n)\]
which can be written as:
\[\prod_{j=1}^n p( x_j; \mu_j, \sigma_j^2 )\]
If $p(x) < \epsilon$, then $x$ is an anomaly.
Note that this is not a multivariate Gaussian distribution, so it assumes the features are all independent. If there are correlations between the features, see the multivariate Gaussian distribution section below.
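The full detection step, taking the product of per-feature densities and comparing against $\epsilon$, can be sketched as follows (a minimal NumPy sketch; `mu` and `sigma2` are per-feature vectors and the function names are my own):

```python
import numpy as np

def p(x, mu, sigma2):
    """p(x) = product over features j of the univariate Gaussians p(x_j; mu_j, sigma2_j)."""
    densities = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(densities)

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous if its probability falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon
```

In practice it is common to sum log-densities instead of multiplying raw densities, since the product of many small numbers underflows to zero for large $n$.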
Developing and Evaluating an Anomaly Detection System
To evaluate our system, we can use labeled data that's categorised into anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).
From the data, take a large proportion of non-anomalous data for the training set, on which to train $p(x)$.
Take a smaller proportion of mixed anomalous and non-anomalous examples for the cross-validation and test sets.
e.g. Given a set where 0.2% of the data is anomalous:
- Take 60% of the good ($y=0$) examples for the training set
- Take 20% of the examples for the cross-validation set (with half of the anomalous examples)
- Take 20% of the examples for the test set (with the other half of the anomalous examples)
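The 60/20/20 split above can be sketched as a small helper (a sketch under my own assumptions: `X_good` holds the $y=0$ examples, `X_anom` the $y=1$ examples, and the anomalies are divided evenly between the CV and test sets):

```python
import numpy as np

def split_data(X_good, X_anom, seed=0):
    """60% of normal examples for training; CV and test each get 20% of the
    normal examples plus half of the anomalous examples."""
    rng = np.random.default_rng(seed)
    X_good = X_good[rng.permutation(len(X_good))]  # shuffle before splitting
    m = len(X_good)
    train = X_good[: int(0.6 * m)]
    cv_good = X_good[int(0.6 * m): int(0.8 * m)]
    test_good = X_good[int(0.8 * m):]
    half = len(X_anom) // 2
    cv = np.vstack([cv_good, X_anom[:half]])
    test = np.vstack([test_good, X_anom[half:]])
    return train, cv, test
```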
To evaluate the algorithm:
- Fit model $p(x)$ on training set
- On cross validation example $x$, predict:
- If $p(x) \lt \epsilon$ (anomaly), then $y=1$
- If $p(x) \ge \epsilon$ (normal), then $y=0$
Possible evaluation metrics include the $F_1$ score. Note we use the cross-validation set to choose parameter $\epsilon$.
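Choosing $\epsilon$ by scanning candidate thresholds and keeping the one with the best $F_1$ score on the cross-validation set might look like this (a minimal sketch; `p_cv` holds $p(x)$ for each CV example and `y_cv` the labels, and the scanning scheme is my own choice):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and return the epsilon with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)          # 1 = flagged as anomaly
        tp = np.sum((preds == 1) & (y_cv == 1))   # true positives
        fp = np.sum((preds == 1) & (y_cv == 0))   # false positives
        fn = np.sum((preds == 0) & (y_cv == 1))   # false negatives
        if tp == 0:
            continue  # precision/recall undefined or zero; skip this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

$F_1$ is preferred over plain accuracy here because the classes are heavily skewed: predicting $y=0$ for everything would already score very high accuracy.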
Anomaly Detection vs Supervised Learning
Use anomaly detection when:
- We have a very small number of positive examples ($y=1$; around $0$-$20$ examples is common) and a large number of negative ($y=0$) examples
- We have many different types of anomalies and it's hard for an algorithm to learn from the positive examples what the anomalies look like. Future anomalies may look nothing like any of the anomalous examples we've seen so far
Use supervised learning when:
- We have a large number of both positive and negative examples - the training set is more evenly divided into classes
- We have enough positive examples for the algorithm to get a sense of what positive examples look like. The future positive examples are likely similar to the ones in the training set
Choosing what features to use
The features will greatly affect how well your anomaly detection algorithm works.
We can check our features are Gaussian by plotting a histogram of our data and checking for the bell-shaped curve. If a feature doesn't look Gaussian, we can try several transforms on it:
- $log(x)$
- $log(x+1)$
- $log(x+c)$ for some constant $c$
- $\sqrt{x}$
- $x^{1/3}$
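A quick way to compare these transforms is to check how symmetric each one makes the feature. This is a sketch under my own assumptions (a synthetic right-skewed feature and sample skewness as a crude symmetry measure; a histogram is the usual visual check):

```python
import numpy as np

# A right-skewed feature (e.g. exponentially distributed) is not bell-shaped;
# a log or root transform often makes it look more Gaussian.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

candidates = {
    "x": x,
    "log(x + 1)": np.log1p(x),
    "sqrt(x)": np.sqrt(x),
    "x^(1/3)": np.cbrt(x),
}

def skewness(v):
    """Sample skewness: 0 for a symmetric distribution, positive for a right tail."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

for name, v in candidates.items():
    print(f"{name:>10}: skewness = {skewness(v):+.2f}")
```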
Our goal is for $p(x)$ to be large for normal examples and small for anomalous examples.
A common problem is when $p(x)$ is similar for both types of example. In such cases, examine the anomalous examples that are being given high probability and try to find new features that will better distinguish the data.
Multivariate Gaussian Distribution
The multivariate Gaussian distribution is an extension of anomaly detection, and may (or may not) catch more anomalies.
Instead of modelling $p(x_1), p(x_2), ...$ separately, we will model $p(x)$ all in one go. Our parameters will be $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$
We then compute $p(x)$ using:
\[p(x;\mu,\Sigma) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\]
This means we can model elliptical Gaussian contours at any orientation, allowing us to fit data that doesn't suit the axis-aligned contours of the original model.
Anomaly Detection using Multivariate Gaussian Distribution
We compute $\mu$ as before, and $\Sigma$ as the covariance matrix:
\[\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T\]
We then compute $p(x)$ using the formula in the previous section, and flag an anomaly if $p(x) \lt \epsilon$.
The multivariate Gaussian model can automatically capture correlations between different features of $x$. However, the original model is computationally cheaper (no matrix to invert), and it performs well with a small training set size ($\Sigma$ in multivariate might not be invertible otherwise).
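Fitting the multivariate model and evaluating its density can be sketched as follows (a minimal NumPy sketch; the function names are my own, and `np.linalg.solve` is used instead of explicitly inverting $\Sigma$ for numerical stability):

```python
import numpy as np

def fit_multivariate(X):
    """mu = column means; Sigma = covariance matrix with 1/m normalisation."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / len(X)    # (1/m) * sum of (x - mu)(x - mu)^T
    return mu, Sigma

def multivariate_p(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
```

Note that $\Sigma$ is singular (non-invertible) when $m \le n$ or when features are redundant (linearly dependent), which is why the original per-feature model is safer for small training sets.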