6b System Design
Aug 13, 2017 14:06 · 491 words · 3 minutes read
Machine Learning System Design
Prioritising what to work on
There are a lot of ways to approach a machine learning problem, but it's difficult to tell in advance which of them will actually help.
Error analysis
The recommended approach is:
- Start with a simple algorithm, implement it quickly, and test it early
- Plot learning curves to decide if more data, more features, etc. will help
- Error analysis: manually examine the errors on examples in the cross validation set to try and spot a trend
It's important to get error results as a single numerical value, otherwise it's difficult to assess the algorithm's performance.
Data may require preprocessing before it's useful - eg stemming on words.
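As a sketch of that kind of preprocessing, here is a toy suffix-stripping stemmer. (Real pipelines would use something like the Porter stemmer; the `crude_stem` function and its suffix list are made up purely for illustration.) It collapses word variants onto one stem, so they map to a single feature:

```python
def crude_stem(word):
    """Strip a few common English suffixes (toy illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["discount", "discounts", "discounted", "discounting"]
stems = {crude_stem(w) for w in words}
# All four variants collapse to the single stem "discount".
```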
Error metrics for skewed classes
In problems where one of the classes is very rare, say 0.5% of examples, we can get 99.5% accuracy (only a 0.5% error rate) just by classifying everything as the majority class, so accuracy alone is misleading.
For such skewed classes, we use Precision and Recall instead.
- Predicted: 1, Actual: 1 --- True Positive
- Predicted: 0, Actual: 0 --- True Negative
- Predicted: 0, Actual: 1 --- False Negative
- Predicted: 1, Actual: 0 --- False Positive
Precision: of all the patients for whom we predicted $y=1$, what fraction actually has cancer?
\[ \frac{ \text{True Positives} }{ \text{ Total number of predicted positives } } = \frac{ \text{True Positives} }{ \text{ True Positives + False Positives } } \]
Recall: of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?
\[ \frac{ \text{True Positives} }{ \text{ Number of actual positives } } = \frac{ \text{True Positives} }{ \text{ True Positives + False Negatives } } \]
These metrics give us a better sense of how our classifier is doing - we want both to be high.
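The definitions above translate directly into code. A minimal sketch, assuming labels and predictions are 0/1 lists (`precision_recall` is a hypothetical helper, not from any particular library):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
p, r = precision_recall(y_true, y_pred)  # 3 TP, 1 FP, 1 FN
```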
Trading off Precision and Recall
We might want to predict $y=1$ only when we are very confident - one way is to raise our threshold:
- Predict 1 if: $h_\theta(x) \geq 0.7$
- Predict 0 if: $h_\theta(x) \lt 0.7$
This way we only predict cancer if the patient has at least a 70% predicted probability.
This gives higher precision but lower recall.
Alternatively we can lower our threshold:
- Predict 1 if: $h_\theta(x) \geq 0.3$
- Predict 0 if: $h_\theta(x) \lt 0.3$
This way we get a very safe prediction that rarely misses an actual cancer case. This causes lower precision but higher recall.
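The trade-off can be seen by sweeping the threshold over a set of model scores. The scores, labels, and the `prec_rec_at` helper below are all made up for illustration:

```python
# Hypothetical model outputs h_theta(x) with their true labels.
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   1,    1,   0,   1,   0,   0]

def prec_rec_at(threshold):
    """Precision and recall when predicting 1 iff score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

high = prec_rec_at(0.7)  # confident: higher precision, lower recall
low = prec_rec_at(0.3)   # safe: lower precision, higher recall
```

On this toy data, the 0.7 threshold gives perfect precision but misses positives, while the 0.3 threshold catches every positive at the cost of some false positives.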
To turn these two metrics into one number, we take the F score. The plain average is a bad choice because a single high value dominates it - for example, a classifier that always predicts $y=1$ gets a recall of 1, and so a deceptively decent average.
\[\text{F Score} = 2 \frac{PR}{P+R}\]
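A quick sketch of why the F score behaves better than the mean (the precision and recall values here are illustrative):

```python
def f_score(p, r):
    """F score: harmonic-mean-style combination of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# An "always predict y = 1" classifier on a 2%-positive class:
p, r = 0.02, 1.0
avg = (p + r) / 2   # 0.51 - looks deceptively decent
f1 = f_score(p, r)  # about 0.039 - correctly low
```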
We should tune the threshold using precision and recall on our cross validation set, so as not to bias our test set.
Data
In certain cases, an inferior algorithm with more data can outperform a superior algorithm with less data.
We must choose our features to have enough information: given input $x$, would a human expert be able to confidently predict $y$?
Rationale for large data: if we have a low bias algorithm (many features/hidden units, making it complex), then the larger the training set we use, the less we will overfit.