koaning.io: Mean Squared Terror

Vincent Warmerdam

2020-06-29 This blogpost was edited the day after it got posted. During my morning walk I realised that my code had a glaring bug in the statistics. I was reporting training set figures, not the validation set. I re-ran the system with the appropriate numbers and while the conclusions did not change, some of the figures/numbers did.

When I started in data, many years ago, people were actively working on heuristics. These are rules that could look at data and then make appropriate decisions. In schematics, this was the norm;

Figure 1: Those were the days.

You had to understand the data to come up with these rules and this was very time consuming. But then machine learning changed everything. Instead of writing rules that fit a problem, we provide labels and have a machine figure out the rules instead.

Figure 2: The latest and greatest.

The latter approach has merit to it but it requires a new way of working. It also tends to be compute intensive because we’re interested in finding the best possible settings.

Figure 3: All those settings.

To find the best settings for all of these models we’ve even got gridsearch via cross validation; an industry standard to make sure that we pick the best model.

Figure 4: All this expensive compute better be worth it.

Don’t get me wrong, there’s merit to the methodology. We want to demonstrate consistent results in our models. But we’re representing all the possible properties of a model as a single number. That’s a massive reduction of information, and frankly, an overly simplified way to think about models.

Dangerzone

I’ve written before that “GridSearch is not Enough[tm]” but in this document I wanted to give an example of how the technique can also be a problem. GridSearch is insufficient at best, it can also be dangerous at worst.

The main reason is that I’ve seen people put blind faith into the methodology causing a blind spot for other issues with the model. I’ve also spoken to data scientists who think that machine learning algorithms are unbiased tools that find the “truth” in the data and that grid search is a methodological purity. This is a dangerous attitude so I figured I’d create a counter-example.

Experiment

So let’s do a small experiment. The goal is to see the effect of machine learning on a biased dataset. Given a biased dataset, will grid search amplify the bias in the dataset or reduce it?

I’ve draw a dataset using drawdata.xyz. Let’s pretend that this usecase represents predicting salaries based on skill level. The skill level is plotted on the x-axis and the salary/reward is plotted on the y-axis. There’s two groups of datapoints and one represents a marginalized group. This group is under represented, only appears with low salaries and it also has a lower salary than it’s non-marginalised peers. The current illustration is a blunt representation of what might happen in a real life dataset (the bias can be more subtle) but the question is; what pattern will be picked up after grid search?

Below we’ll have a look at the results of three different models. We’ll also list the mean squared error that was reported back to us from the gridsearch.

Linear Model with Dummy Variable

Figure 5: Mean Squared Error, Train: 1166.64 Test: 1767.36

Nearest Neighbor Regression with Scaling

Figure 6: Mean Squared Error, Train: 909.66 Test: 2337.65

Grouped Linear Model

Figure 7: Mean Squared Error, Train: 993.57 Test: 1343.09

Notice how the model that has the lowest mean squared error is also the model that causes the most bias between groups for higher skill levels. This increase in bias cannot be blamed merely on the data. It’s the choice of the model that increases this bias which is the responsibility of the algorithm designer.

Now, you could argue that this is a theoretical dataset. No professional would include something like gender into a model predicting salary. Right? We can repeat the same exercise with something that correlates with a sensitive attribute instead but you can expect similar result. You can get a smaller mean squared error when you can learn the bias in the dataset. No matter how subtle.

This is what I find scary about gridsearch. People have been learned to trust it even though it can pick a model you should not consider. Notice how the grid search would tell us that there should be a big payment gap between the two groups even if they display high levels of skill. In algorithmic terms some might call this behavior “optimal” while in real life, you would call this sexism, ageism or racism.

Figure 8: Mean Squared Terror: over 9000.

The thing to remember here is that 100% unbiased datasets don’t really exist. This phenomenon of overfitting on a metric is something that gridsearch contributes to. Combine that with algorithm designers who don’t have domain knowledge in the domain that they’re applying your algorithm to and you might just get this.

Figure 9: Mean Squared Terror: np.inf

Appendix

This post was prompted by a discussion on twitter as well as a discussion on linkedin. It’s dissapointing to see data professionals be dismissive of how bias is amplified by algorithms. It’s not something the profession should be casual about.

Mean Squared Terror