Guilty Grandmas
Understanding precision, recall, and accuracy.
The Madness of Machine Learning Metrics
If you are interested in machine learning, then you’ve seen the words precision, accuracy, and recall. Maybe you’ve heard of True Negatives and False Positives, and perhaps some misguided soul has tried to explain everything with a picture of arrows on a bullseye, which unfortunately has no bearing on the ML definitions.
It took me the longest time to fully internalize these concepts, and I think that was in large part because of how they were explained to me.
Guilty Grandmas
I think the first hurdle is wrapping your head around these obscure terms. What the heck is a false negative anyway?
Let’s say you are Chad, the local district judge. And on your docket today, you have 10 naughty grandmas. Actually, half of them are guilty and the other half are innocent, but you don’t know that. You’re just the anthropomorphization of a machine learning algorithm designed for comedic illustration.
For each grandma, there are only 4 possible verdicts you can reach:
- True Positive: Send a guilty grandma to jail.
- False Positive: Send an innocent grandma to jail.
- True Negative: Let an innocent grandma free.
- False Negative: Let a guilty grandma free.
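If it helps to see these four buckets as code, here is a minimal Python sketch (the names and verdicts are made up purely for illustration) that tallies them up:

```python
# Each case pairs the ground truth (is she actually guilty?)
# with the judge's decision (did we send her to jail?).
cases = [
    {"name": "Edith",  "guilty": True,  "jailed": True},   # True Positive
    {"name": "Agnes",  "guilty": False, "jailed": True},   # False Positive
    {"name": "Gladys", "guilty": False, "jailed": False},  # True Negative
    {"name": "Mabel",  "guilty": True,  "jailed": False},  # False Negative
]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for case in cases:
    if case["jailed"]:
        counts["TP" if case["guilty"] else "FP"] += 1
    else:
        counts["FN" if case["guilty"] else "TN"] += 1

print(counts)  # {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```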
The Metrics
How can we evaluate your success at convicting grandmas? Well, two metrics in particular are often used to understand how a machine learning model performs. And interestingly, they are often at odds with one another. As you are in the final stages of training your model, you might have to ask yourself: would I rather optimize for high precision, or high recall?
Precision
Did you convict any innocent grandmas?
Well, technically precision asks the less memorable question, ‘what percent of jailed grandmas were guilty?’ If you want to actually calculate precision, you can use:
\[\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]

or, to put it more simply:

\[\frac{\text{Jailed Grandmas who were Guilty}}{\text{Jailed Grandmas}}\]
Recall

Did you put all the guilty grandmas in jail?

More formally, recall asks, ‘what percent of guilty grandmas were jailed?’ To calculate it, use:
\[\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]

or

\[\frac{\text{Jailed Grandmas who were Guilty}}{\text{Guilty Grandmas}}\]
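If you prefer code to fractions, here is a quick sketch of both metrics as plain Python functions (not any particular library’s API):

```python
def precision(tp: int, fp: int) -> float:
    """Of all the grandmas we jailed, what fraction was actually guilty?"""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of all the actually guilty grandmas, what fraction did we jail?"""
    return tp / (tp + fn)
```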
The Balancing Act

As we’ve seen, often you have to choose whether to optimize for precision or recall. So let’s take this to extremes with a concrete example.
High Precision, Low Recall
What would it mean for a model to be ultra high precision, but low recall?
Essentially, whenever you do send a grandma to jail you are always right, but you miss a lot of actually guilty grandmas in the process.
TP is short for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative. An x marks which of the four outcomes each verdict falls into.
Grandma | Guilty? | Jailed? | TP | TN | FP | FN |
---|---|---|---|---|---|---|
Edith | guilty | yes | x | | | |
Mabel | guilty | no | | | | x |
Florence | guilty | no | | | | x |
Eleanor | guilty | no | | | | x |
Beatrice | guilty | no | | | | x |
Agnes | innocent | no | | x | | |
Mildred | innocent | no | | x | | |
Gertrude | innocent | no | | x | | |
Dorothy | innocent | no | | x | | |
Gladys | innocent | no | | x | | |
- Precision: What percent of jailed grandmas were guilty? – 100%.
- Recall: What percent of guilty grandmas were jailed? – 20%.
- Accuracy: What percent of verdicts were correct? – 60%.
Low Precision, High Recall
What about low precision, high recall?
Well here, the judge successfully condemns all the guilty people, but in the process condemns many innocents as well.
Grandma | Guilty? | Jailed? | TP | TN | FP | FN |
---|---|---|---|---|---|---|
Edith | guilty | yes | x | | | |
Mabel | guilty | yes | x | | | |
Florence | guilty | yes | x | | | |
Eleanor | guilty | yes | x | | | |
Beatrice | guilty | yes | x | | | |
Agnes | innocent | yes | | | x | |
Mildred | innocent | yes | | | x | |
Gertrude | innocent | yes | | | x | |
Dorothy | innocent | yes | | | x | |
Gladys | innocent | no | | x | | |
- Precision: What percent of jailed grandmas were guilty? – 56%.
- Recall: What percent of guilty grandmas were jailed? – 100%.
- Accuracy: What percent of verdicts were correct? – 60%.
Meaningful Metrics Matter
Did scientists really need to come up with two new metrics no one has ever heard of just to explain how accurate a judge is?
Accuracy
Well, you may have noticed that despite our two wildly different examples, accuracy was the exact same.
That’s because accuracy doesn’t care whether a mistake means sending an innocent grandma to jail or letting a guilty grandma go free; both count as the same kind of mistake as far as accuracy is concerned. And in both of our examples, the judge correctly decided whether a grandma was guilty or innocent 60% of the time.
\[\frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Observations}}\]

or

\[\frac{\text{Correctly Judged Grandmas}}{\text{All Grandmas}}\]
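Plugging the counts from the two tables above into these formulas (a quick sanity check in Python, with the numbers taken straight from the tables) shows exactly that: precision and recall swing wildly while accuracy stays put.

```python
# Confusion-matrix counts read directly from the two tables above.
scenarios = {
    "high precision, low recall": {"TP": 1, "TN": 5, "FP": 0, "FN": 4},
    "low precision, high recall": {"TP": 5, "TN": 1, "FP": 4, "FN": 0},
}

for name, c in scenarios.items():
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    accuracy = (c["TP"] + c["TN"]) / sum(c.values())
    print(f"{name}: precision={precision:.0%}, recall={recall:.0%}, accuracy={accuracy:.0%}")

# high precision, low recall: precision=100%, recall=20%, accuracy=60%
# low precision, high recall: precision=56%, recall=100%, accuracy=60%
```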
To Visit or Not to Visit?

Ok sure, in my contrived example, these metrics seem really important, but what about in the real world?
Well, let’s say you’ve got a cough and you are about to visit your immunocompromised grandma in a high-security prison, so you take a COVID test. Thankfully, it says you are negative! But then you read the back of the box and see that the test has 100% precision but only 20% recall.
Can you really be certain you don’t have COVID? Well, that 100% precision means a positive result would have been ironclad…but your result was negative, and a recall of only 20% means the test misses 80% of the people who actually have COVID. So there is still a decent chance that you have it.
Technically, to know the exact percentage you would need to know your a priori probability of having COVID given that you have a cough (perhaps I’ll write a follow-up post on this)…but for now, maybe you shouldn’t visit your guilty grandma after all.
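For the curious, here is a rough back-of-the-envelope version of that calculation. The 30% prior is a completely made-up number for illustration, and 100% precision is taken to mean the test never produces false positives:

```python
# Hypothetical prior: chance you have COVID given your cough (made-up number).
prior = 0.30

recall = 0.20            # P(test positive | you have COVID)
miss_rate = 1 - recall   # P(test negative | you have COVID)

# 100% precision => no false positives => everyone without COVID tests negative.
true_negative_rate = 1.0

# Bayes' rule: P(COVID | negative test)
p_negative = prior * miss_rate + (1 - prior) * true_negative_rate
p_covid_given_negative = prior * miss_rate / p_negative
print(f"{p_covid_given_negative:.0%}")  # roughly 26% with these assumptions
```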
Machine Learning
In truth, medical statistics has its own specialized terminology for these ideas (recall goes by sensitivity, and precision by positive predictive value). So for our final example, let’s look at a real-world machine learning problem.
At my work, we have created a curated Science Discovery Engine that contains, among other things, 52,000 Earth Science datasets. And we’d like a subset of these datasets to be available in a specialized portal dedicated to Environmental Justice.
There are 8 possible types of EJ datasets (climate change, extreme heat, food availability, etc.), so we plan to train a classification model that tags each of the 52,000 broader Earth Science datasets as either not-EJ or with one or more of the 8 EJ indicators.
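To make that concrete, here is a minimal sketch of what such a multi-label tagger could look like. It assumes scikit-learn, TF-IDF features over dataset descriptions, and entirely made-up example data; it is not how the actual Science Discovery Engine model is built:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up dataset descriptions and EJ indicator tags; an empty set means not-EJ.
descriptions = [
    "Daily maximum surface temperature anomalies over urban areas",
    "County-level grocery store access and food desert boundaries",
    "Ocean floor bathymetry survey of the mid-Atlantic ridge",
]
labels = [
    {"climate change", "extreme heat"},
    {"food availability"},
    set(),  # not-EJ
]

# Turn the tag sets into a binary indicator matrix, one column per EJ indicator.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# One binary classifier per indicator, trained on TF-IDF text features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(descriptions, y)

# Tag a new dataset: anything with at least one predicted indicator goes to the portal.
new_description = ["Heat wave exposure estimates for vulnerable urban populations"]
print(mlb.inverse_transform(model.predict(new_description)))
```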
High Precision, Low Recall
So what would the end user experience be like if we optimized the model for high precision but low recall?
This is the same as only sending guilty grandmas to jail, but missing many in the process. So we could be confident that only EJ content was on the portal, but we would also be missing lots of relevant content.
High Recall, Low Precision
If we instead optimized for high recall, then this would be the same as convicting all the guilty grandmas, but putting innocent grandmas in jail in the process.
The portal would be guaranteed to contain all the EJ datasets from the broader corpus, but it would also have unrelated content polluting the results.
What’s Better?
Well it depends on the goal of the platform and the end user. If you want to ensure that every bit of relevant data is available, then you go with high recall. But if you want to guarantee that no unrelated datasets appear in the portal, then you should optimize for high precision.
Minimum Metrics
But real life doesn’t deal in absolutes. Typically you neither want nor need to completely maximize one metric at the cost of the other. Instead, you set a minimum acceptable threshold for each metric, and as the model approaches the limits of its performance, tradeoffs between the two start to appear.
But how do you choose these minimums? Well, in the Environmental Justice portal there are at least two factors to consider: user trust and data availability. An incorrect classification could lead to a user or the community losing trust in the EJ portal and abandoning it. But likewise, if the portal is missing too many relevant datasets, users won’t be able to find information they need and will abandon even the most accurate portal.
So you might decide that the portal needs to contain at least 85% of the NASA EJ data to be useful, and that only 1 out of every 10 classifications can be wrong or users won’t trust the portal. The following table shows how enforcing these two minimums might compare to only maximizing for precision.
Strategy | Precision | Recall |
---|---|---|
Only Maximize Precision | 98% | 70% |
Min. 90% Prec, 85% Recall | 92% | 85% |
In the first row, we are able to drive precision all the way to 98%, but can only achieve 70% recall at that high precision number.
However, if we require a minimum precision of 90% and a minimum recall of 85%, we can meet our 85% recall goal and still push precision up to 92%.
Enforcing a minimum recall limited our precision ceiling, but we were able to ensure that the portal contained at least 85% of the relevant data at an acceptable error rate, resulting in a more useful end product.
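In practice, one common way to enforce minimums like these is to sweep the model’s decision threshold on a validation set and keep the threshold with the best precision that still clears the recall floor. Here is a rough sketch assuming a scikit-learn-style setup; the labels, scores, and helper name are hypothetical:

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_precision=0.90, min_recall=0.85):
    """Return (precision, recall, threshold) with the highest precision that
    still satisfies both minimums, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    best = None
    # The last precision/recall entries have no matching threshold, so drop them.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and r >= min_recall:
            if best is None or p > best[0]:
                best = (p, r, t)
    return best

# Hypothetical validation labels (1 = EJ dataset) and model scores.
y_true   = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_scores = [0.95, 0.90, 0.88, 0.85, 0.80, 0.75, 0.20, 0.70, 0.65, 0.50, 0.40, 0.10]

print(pick_threshold(y_true, y_scores))
# With this toy data: picks the 0.75 threshold (precision 100%, recall ~86%).
```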
Cover image generated using DALL-E 3