6b System Design

Aug 13, 2017 14:06 · 491 words · 3 minutes read

Machine Learning System Design

Prioritising what to work on

There are a lot of ways to approach a machine learning problem, but it's difficult to tell in advance which will actually be helpful.

Error analysis

The recommended approach is:

  • Start with a simple algorithm, implement it quickly, and test it early
  • Plot learning curves to decide if more data, more features, etc. will help
  • Error analysis: manually examine the errors on examples in the cross validation set to try and spot a trend

It's important to reduce error results to a single numerical value, otherwise it's difficult to compare the performance of different versions of the algorithm.
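
As a rough illustration, here is a minimal Python sketch of that loop using scikit-learn (an assumed library, not part of these notes). The data is a random placeholder, and the learning-curve scores are printed rather than plotted to keep the sketch short:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, train_test_split

# Placeholder data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Step 1: a simple algorithm, implemented quickly and tested early
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# A single numerical value for cross validation performance
print("CV error:", 1 - clf.score(X_cv, y_cv))

# Step 2: learning curves - does more data help?
sizes, train_scores, cv_scores = learning_curve(LogisticRegression(), X, y, cv=5)
print(sizes)                      # training set sizes tried
print(train_scores.mean(axis=1))  # training score as data grows
print(cv_scores.mean(axis=1))     # validation score as data grows
```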

Data may require preprocessing before it's useful - e.g. stemming words, so that variants like "discount" and "discounts" are treated as the same feature.
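
A quick stemming pass might look like this (a minimal sketch using NLTK's PorterStemmer, an assumed dependency not mentioned in the notes):

```python
# Stemming as a preprocessing step: map inflected words to a common stem
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["discount", "discounts", "discounted", "discounting"]
print([stemmer.stem(w) for w in words])  # all four map to 'discount'
```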

Error metrics for skewed classes

In problems where one of the classes is very rare, say 0.5%, we can get 99.5% accuracy (only a 0.5% error rate) just by classifying everything as the other class - without detecting a single rare case.
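
A tiny demonstration of the trap (toy numbers, assumed for illustration):

```python
# A "classifier" that always predicts the majority class looks excellent
# by accuracy alone, despite detecting no positives at all.
import numpy as np

y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1                      # 0.5% positive class (e.g. cancer)
y_pred = np.zeros(1000, dtype=int)  # predict "no cancer" for everyone

accuracy = (y_pred == y_true).mean()
print(f"accuracy = {accuracy:.1%}")  # 99.5%
```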

For this, we use Precision and Recall.

  • Predicted: 1, Actual: 1 --- True Positive
  • Predicted: 0, Actual: 0 --- True Negative
  • Predicted: 0, Actual: 1 --- False Negative
  • Predicted: 1, Actual: 0 --- False Positive

Precision: of all the patients for whom we predicted $y=1$, what fraction actually has cancer?

\[ \frac{ \text{True Positives} }{ \text{ Total number of predicted positives } } = \frac{ \text{True Positives} }{ \text{ True Positives + False Positives } } \]

Recall: of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?

\[ \frac{ \text{True Positives} }{ \text{ Number of actual positives } } = \frac{ \text{True Positives} }{ \text{ True Positives + False Negatives } } \]

These metrics give us a better sense of how our classifier is doing - we want both to be high.
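
A minimal sketch of computing both metrics from the four outcome counts above (the prediction arrays are hypothetical):

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were right?
recall = tp / (tp + fn)     # of actual positives, how many did we catch?
print(f"precision = {precision}, recall = {recall}")  # 0.75 and 0.75
```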

Trading off Precision and Recall

We might want to predict $y=1$ only when we're very confident - one way is to raise our threshold:

  • Predict 1 if: $h_\theta(x) \geq 0.7$
  • Predict 0 if: $h_\theta(x) \lt 0.7$

This way we only predict cancer if we estimate at least a 70% chance.

This gives higher precision but lower recall.

Alternatively we can lower our threshold:

  • Predict 1 if: $h_\theta(x) \geq 0.3$
  • Predict 0 if: $h_\theta(x) \lt 0.3$

This way we make a safer prediction, flagging anyone who might have cancer. This gives lower precision but higher recall.
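
A small sketch of this trade-off, sweeping the threshold over hypothetical model scores (all values assumed for illustration):

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    1,    0,    1,    0,    0,    0])

for threshold in (0.7, 0.5, 0.3):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # precision falls and recall rises as the threshold drops
    print(threshold, tp / (tp + fp), tp / (tp + fn))
```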

To turn these two metrics into one number, we take the F score - the harmonic mean of precision and recall. The simple average is a bad choice because one extreme value dominates it: a classifier with $P = 1$ and $R = 0.01$ averages to about $0.5$, yet its F score is roughly $0.02$.

\[\text{F Score} = 2 \frac{PR}{P+R}\]

We should pick the threshold that maximises the F score on our cross validation set, so as not to bias our test set.
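
A minimal sketch of that selection, evaluating the F score at candidate thresholds over hypothetical cross validation predictions (values assumed for illustration):

```python
import numpy as np

cv_scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
cv_labels = np.array([1,    1,    1,    0,    1,    0,    0,    0])

def f_score(threshold):
    y_pred = (cv_scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (cv_labels == 1))
    p = tp / max(np.sum(y_pred == 1), 1)  # precision (guard against 0/0)
    r = tp / np.sum(cv_labels == 1)       # recall
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Try each observed score as a candidate threshold, keep the best
candidates = np.unique(cv_scores)
best = max(candidates, key=f_score)
print(f"best threshold = {best}, F score = {f_score(best):.3f}")
```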

Data

In certain cases, an inferior algorithm with more data can outperform a superior algorithm with less data.

We must choose our features to have enough information: given input $x$, would a human expert be able to confidently predict $y$?

Rationale for large data: if we have a low-bias algorithm (one with many features or hidden units, making it complex), then the larger the training set we use, the less overfitting we will have.