Guilty Grandmas
Understanding precision, recall, and accuracy.
The Madness of Machine Learning Metrics
If you are interested in machine learning, then you’ve seen the words precision, accuracy, and recall. Maybe you’ve heard of True Negatives and False Positives, and perhaps some misguided soul has tried to explain everything with a picture of arrows on a bullseye, which unfortunately has no bearing on the ML definitions.
It took me the longest time to fully internalize these concepts, and I think that was in large part because of how they were explained to me.
Guilty Grandmas
I think the first hurdle is wrapping your head around these obscure terms. What the heck is a false negative anyway?
Let’s say you are Chad, the local district judge. And on your docket today, you have 10 naughty grandmas. Actually, half of them are guilty and the other half are innocent, but you don’t know that. You’re just the anthropomorphization of a machine learning algorithm designed for comedic illustration.
For each grandma, there are only 4 possible verdicts you can reach:
- True Positive: Send a guilty grandma to jail.
- False Positive: Send an innocent grandma to jail.
- True Negative: Let an innocent grandma free.
- False Negative: Let a guilty grandma free.
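If it helps to see these four buckets as code, here is a minimal Python sketch (the names and verdicts are made up purely for illustration) that tallies them up:

```python
# Each case pairs the ground truth (is she actually guilty?)
# with the judge's decision (did we send her to jail?).
cases = [
    {"name": "Edith",  "guilty": True,  "jailed": True},   # True Positive
    {"name": "Agnes",  "guilty": False, "jailed": True},   # False Positive
    {"name": "Gladys", "guilty": False, "jailed": False},  # True Negative
    {"name": "Mabel",  "guilty": True,  "jailed": False},  # False Negative
]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for case in cases:
    if case["jailed"]:
        counts["TP" if case["guilty"] else "FP"] += 1
    else:
        counts["FN" if case["guilty"] else "TN"] += 1

print(counts)  # {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```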
The Metrics
How can we evaluate your success at convicting grandmas? Well, two metrics in particular are often used to understand how a machine learning model performs. And interestingly, they are often at odds with one another. As you are in the final stages of training your model, you might have to ask yourself: would I rather optimize for high precision, or high recall?
Precision
Did you convict any innocent grandmas?
Well, technically precision asks the less memorable question, ‘what percent of jailed grandmas were guilty?’ If you want to actually calculate precision, you can use:
\[\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]

or, to put it more simply:

\[\frac{\text{Jailed Grandmas who were Guilty}}{\text{Jailed Grandmas}}\]
Recall

Did you put all the guilty grandmas in jail?

More formally, recall asks, ‘what percent of guilty grandmas were jailed?’ To calculate it, use:
\[\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]

or

\[\frac{\text{Jailed Grandmas who were Guilty}}{\text{Guilty Grandmas}}\]
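If you prefer code to fractions, here is a quick sketch of both metrics as plain Python functions (not any particular library’s API):

```python
def precision(tp: int, fp: int) -> float:
    """Of all the grandmas we jailed, what fraction was actually guilty?"""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of all the actually guilty grandmas, what fraction did we jail?"""
    return tp / (tp + fn)
```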
The Balancing Act

As we’ve seen, often you have to choose whether to optimize for precision or recall. So let’s take this to extremes with a concrete example.
High Precision, Low Recall
What would it mean for a model to be ultra high precision, but low recall?
Essentially, whenever you do send a grandma to jail you are always right, but you miss a lot of actually guilty grandmas in the process.
TP is short for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative. An x marks which of the four outcomes each verdict falls into.
Grandma | Guilty? | Jailed? | TP | TN | FP | FN |
---|---|---|---|---|---|---|
Edith | guilty | yes | x | | | |
Mabel | guilty | no | | | | x |
Florence | guilty | no | | | | x |
Eleanor | guilty | no | | | | x |
Beatrice | guilty | no | | | | x |
Agnes | innocent | no | | x | | |
Mildred | innocent | no | | x | | |
Gertrude | innocent | no | | x | | |
Dorothy | innocent | no | | x | | |
Gladys | innocent | no | | x | | |
- Precision: What percent of jailed grandmas were guilty? – 100%.
- Recall: What percent of guilty grandmas were jailed? – 20%.
- Accuracy: What percent of verdicts were correct? – 60%.
Low Precision, High Recall
What about low precision, high recall?
Well here, the judge successfully condemns all the guilty people, but in the process condemns many innocents as well.
Grandma | Guilty? | Jailed? | TP | TN | FP | FN |
---|---|---|---|---|---|---|
Edith | guilty | yes | x | | | |
Mabel | guilty | yes | x | | | |
Florence | guilty | yes | x | | | |
Eleanor | guilty | yes | x | | | |
Beatrice | guilty | yes | x | | | |
Agnes | innocent | yes | | | x | |
Mildred | innocent | yes | | | x | |
Gertrude | innocent | yes | | | x | |
Dorothy | innocent | yes | | | x | |
Gladys | innocent | no | | x | | |
- Precision: What percent of jailed grandmas were guilty? – 56%.
- Recall: What percent of guilty grandmas were jailed? – 100%.
- Accuracy: What percent of verdicts were correct? – 60%.
Meaningful Metrics Matter
Did scientists really need to come up with two new metrics no one has ever heard of just to explain how accurate a judge is?
Accuracy
Well, you may have noticed that despite our two wildly different examples, accuracy was the exact same.
That’s because accuracy doesn’t care whether a mistake means sending an innocent grandma to jail or letting a guilty grandma go free; both count as the same kind of mistake as far as accuracy is concerned. And in both of our examples, the judge correctly decided whether a grandma was guilty or innocent 60% of the time.
\[\frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Observations}}\]

or

\[\frac{\text{Correctly Judged Grandmas}}{\text{All Grandmas}}\]
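Plugging the counts from the two tables above into these formulas (a quick sanity check in Python, with the numbers taken straight from the tables) shows exactly that: precision and recall swing wildly while accuracy stays put.

```python
# Confusion-matrix counts read directly from the two tables above.
scenarios = {
    "high precision, low recall": {"TP": 1, "TN": 5, "FP": 0, "FN": 4},
    "low precision, high recall": {"TP": 5, "TN": 1, "FP": 4, "FN": 0},
}

for name, c in scenarios.items():
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    accuracy = (c["TP"] + c["TN"]) / sum(c.values())
    print(f"{name}: precision={precision:.0%}, recall={recall:.0%}, accuracy={accuracy:.0%}")

# high precision, low recall: precision=100%, recall=20%, accuracy=60%
# low precision, high recall: precision=56%, recall=100%, accuracy=60%
```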
To Visit or Not to Visit?

Ok sure, in my contrived example, these metrics seem really important, but what about in the real world?
Well, let’s say you’ve got a cough and you are about to visit your immunocompromised grandma in a high-security prison, so you take a COVID test. Thankfully, it says you are negative! But then you read the back of the box and see that the test has 100% precision but only 20% recall.
Can you really be certain you don’t have COVID? Well, that 100% precision means a positive result would have been ironclad…but your result was negative, and a recall of only 20% means the test misses 80% of the people who actually have COVID. So there is still a decent chance that you have it.
Technically, to know the exact percentage you would need to know your a priori probability of having COVID given that you have a cough (perhaps I’ll write a follow-up post on this)…but for now, maybe you shouldn’t visit your guilty grandma after all.
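For the curious, here is a rough back-of-the-envelope version of that calculation. The 30% prior is a completely made-up number for illustration, and 100% precision is taken to mean the test never produces false positives:

```python
# Hypothetical prior: chance you have COVID given your cough (made-up number).
prior = 0.30

recall = 0.20            # P(test positive | you have COVID)
miss_rate = 1 - recall   # P(test negative | you have COVID)

# 100% precision => no false positives => everyone without COVID tests negative.
true_negative_rate = 1.0

# Bayes' rule: P(COVID | negative test)
p_negative = prior * miss_rate + (1 - prior) * true_negative_rate
p_covid_given_negative = prior * miss_rate / p_negative
print(f"{p_covid_given_negative:.0%}")  # roughly 26% with these assumptions
```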
Machine Learning
In truth, medical statistics has its own specialized terminology for these ideas (recall goes by sensitivity, and precision by positive predictive value). So for our final example, let’s look at a real-world machine learning problem.
At my work, we have created a curated Science Discovery Engine that contains, among other things, 52,000 Earth Science datasets. And we’d like a subset of these datasets to be available in a specialized portal dedicated to Environmental Justice.
There are 8 possible types of EJ datasets (climate change, extreme heat, food availability, etc.), so we plan to train a classification model that tags each of the 52,000 broader Earth Science datasets as either not-EJ or with one or more of the 8 EJ indicators.
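To make that concrete, here is a minimal sketch of what such a multi-label tagger could look like. It assumes scikit-learn, TF-IDF features over dataset descriptions, and entirely made-up example data; it is not how the actual Science Discovery Engine model is built:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up dataset descriptions and EJ indicator tags; an empty set means not-EJ.
descriptions = [
    "Daily maximum surface temperature anomalies over urban areas",
    "County-level grocery store access and food desert boundaries",
    "Ocean floor bathymetry survey of the mid-Atlantic ridge",
]
labels = [
    {"climate change", "extreme heat"},
    {"food availability"},
    set(),  # not-EJ
]

# Turn the tag sets into a binary indicator matrix, one column per EJ indicator.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# One binary classifier per indicator, trained on TF-IDF text features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(descriptions, y)

# Tag a new dataset: anything with at least one predicted indicator goes to the portal.
new_description = ["Heat wave exposure estimates for vulnerable urban populations"]
print(mlb.inverse_transform(model.predict(new_description)))
```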
High Precision, Low Recall
So what would the end user experience be like if we optimized the model for high precision but low recall?
This is the same as only sending guilty grandmas to jail, but missing many in the process. So we could be confident that only EJ content was on the portal, but we would also be missing lots of relevant content.
High Recall, Low Precision
If we instead optimized for high recall, then this would be the same as convicting all the guilty grandmas, but putting innocent grandmas in jail in the process.
The portal would be guaranteed to contain all the EJ datasets from the broader corpus, but it would also have unrelated content polluting the results.
What’s Better?
Well it depends on the goal of the platform and the end user. If you want to ensure that every bit of relevant data is available, then you go with high recall. But if you want to guarantee that no unrelated datasets appear in the portal, then you should optimize for high precision.
Minimum Metrics
But real life doesn’t deal in absolutes. Typically you neither want nor need to completely maximize one metric at the cost of the other. Instead, you set a minimum acceptable threshold for each metric, and as the model approaches the limits of its performance, tradeoffs between the two start to appear.
But how do you choose these minimums? Well, in the Environmental Justice portal there are at least two factors to consider: user trust and data availability. An incorrect classification could lead to a user or the community losing trust in the EJ portal and abandoning it. But likewise, if the portal is missing too many relevant datasets, users won’t be able to find information they need and will abandon even the most accurate portal.
So you might decide that the portal needs to contain at least 85% of the NASA EJ data to be useful, and that only 1 out of every 10 classifications can be wrong or users won’t trust the portal. The following table shows how enforcing these two minimums might compare to only maximizing for precision.
Strategy | Precision | Recall |
---|---|---|
Only Maximize Precision | 98% | 70% |
Min. 90% Prec, 85% Recall | 92% | 85% |
In the first row, we are able to drive precision all the way to 98%, but can only achieve 70% recall at that high precision number.
However, if we require a minimum precision of 90% and a minimum recall of 85%, we can meet our 85% recall goal and still push precision up to 92%.
Enforcing a minimum recall limited our precision ceiling, but we were able to ensure that the portal contained at least 85% of the relevant data at an acceptable error rate, resulting in a more useful end product.
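In practice, one common way to enforce minimums like these is to sweep the model’s decision threshold on a validation set and keep the threshold with the best precision that still clears the recall floor. Here is a rough sketch assuming a scikit-learn-style setup; the labels, scores, and helper name are hypothetical:

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_precision=0.90, min_recall=0.85):
    """Return (precision, recall, threshold) with the highest precision that
    still satisfies both minimums, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    best = None
    # The last precision/recall entries have no matching threshold, so drop them.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and r >= min_recall:
            if best is None or p > best[0]:
                best = (p, r, t)
    return best

# Hypothetical validation labels (1 = EJ dataset) and model scores.
y_true   = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_scores = [0.95, 0.90, 0.88, 0.85, 0.80, 0.75, 0.20, 0.70, 0.65, 0.50, 0.40, 0.10]

print(pick_threshold(y_true, y_scores))
# With this toy data: picks the 0.75 threshold (precision 100%, recall ~86%).
```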
Cover image generated using DALL-E 3