Common Sense reduced to Priors
The goal of this document is to summarise a lesson we've learned in the last year. We've done a lot of work on algorithmic bias (and open-sourced it). We reflected and concluded that constraints are an amazing idea that deserves to be used more often in machine learning.
This point drove us to write the following formula on a whiteboard;
\[ \text{model} = \text{data} \times \text{constraints} \]
After writing it down, we noticed that we’ve seen this before but in a different notation.
\[ p(\theta | D) \propto p(D | \theta) \, p(\theta) \]
It’s poetic: maybe … just maybe … priors can be interpreted as constraints that we wish to impose on models. It is knowledge that we have about how the model should work even if the data wants to push us in another direction.
So what we'd like to do in this blogpost is explore the idea of constraints a bit more: first by showcasing how our open source package deals with it, and then by showing how a probabilistic approach might be able to use Bayes' rule to go an extra mile.
We will use a dataset from scikit-lego. It contains traffic arrests in Toronto and it is our job to predict if somebody is released after they are arrested. It has attributes for skin color, gender, age, employment, citizenship, past interactions and date. We consider date, employment and citizenship to be proxies that go into the model while we keep gender, skin color and age separate as sensitive attributes that we want to remain fair on.
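For reference, the data ships with scikit-lego; a small loading sketch (we assume the `load_arrests` helper and its `as_frame` flag here, check the sklego docs for your version):

```python
from sklego.datasets import load_arrests

# Toronto arrests data as a pandas DataFrame (assumed flag, see the docs).
df = load_arrests(as_frame=True)
```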
Here’s a preview of the dataset.
released | colour | year | age | sex | employed | citizen | checks |
---|---|---|---|---|---|---|---|
Yes | False | 2002 | True | False | Yes | Yes | 3 |
No | True | 1999 | True | False | Yes | Yes | 3 |
Yes | False | 2000 | True | False | Yes | Yes | 3 |
No | True | 2000 | False | False | Yes | Yes | 1 |
Yes | True | 1999 | False | True | Yes | Yes | 1 |
The dataset is interesting for many reasons. Not only is fairness a risk; there is also a balancing issue. The balancing issue can be dealt with by adding a `class_weight` parameter, while the fairness can be dealt with in many ways (exhibit A, exhibit B). A favorable method (we think so) is to apply a hard constraint. Our implementation of `EqualOpportunityClassifier` does this by running a logistic regression constrained by the distance to the decision boundary in two groups.
```python
from sklearn.linear_model import LogisticRegression
from sklego.linear_model import EqualOpportunityClassifier

unfair_model = LogisticRegression(class_weight='balanced')
fair_model = EqualOpportunityClassifier(
    covariance_threshold=0.9,  # strictness of threshold
    positive_target='Yes',     # name of the preferable label
    sensitive_cols=[0, 1, 2],  # columns in X that are considered sensitive
)

unfair_model.fit(X, y)
fair_model.fit(X, y)
```
Logistic regression works by minimising the negative log likelihood.
\[ \begin{array}{cl}{\operatorname{minimize}} & -\sum_{i=1}^{N} \log p\left(y_{i} | \mathbf{x}_{i}, \boldsymbol{\theta}\right)\end{array} \]
But what if we add constraints here? That's what the `EqualOpportunityClassifier` does.
\[ \begin{array}{cl}{\operatorname{minimize}} & -\sum_{i=1}^{N} \log p\left(y_{i} | \mathbf{x}_{i}, \boldsymbol{\theta}\right) \\ {\text { subject to }} & {\frac{1}{POS} \sum_{i=1}^{POS}\left(\mathbf{z}_{i}-\overline{\mathbf{z}}\right) d_{\boldsymbol{\theta}}\left(\mathbf{x}_{i}\right) \leq \mathbf{c}} \\ {} & {\frac{1}{POS} \sum_{i=1}^{POS}\left(\mathbf{z}_{i}-\overline{\mathbf{z}}\right) d_{\boldsymbol{\theta}}\left(\mathbf{x}_{i}\right) \geq-\mathbf{c}}\end{array} \]
It minimises the same log loss while constraining the correlation between the specified `sensitive_cols` and the distance to the decision boundary of the classifier, computed over those examples that have a `y_true` of 1.
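To make the constraint concrete, here is a rough sketch (our own, not the actual scikit-lego internals) of the statistic being bounded for a candidate set of weights:

```python
import numpy as np

def fairness_covariance(theta, intercept, X, Z, y):
    # Covariance between the centred sensitive attributes Z and the distance
    # to the decision boundary, taken over the positive (y == 1) examples only.
    pos = y == 1
    d = X[pos] @ theta + intercept          # distance to the decision boundary
    z_centred = Z[pos] - Z[pos].mean(axis=0)
    return z_centred.T @ d / pos.sum()      # one value per sensitive column
```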
The main difference between the two approaches is that in the Logistic Regression scheme we drop the sensitive columns while the other approach actively corrects for them. The table below shows the cross-validated summary of the mean test performance of both models.
model | eqo_color | eqo_age | eqo_sex | precision | recall |
---|---|---|---|---|---|
LR | 0.6986 | 0.7861 | 0.8309 | 0.9187 | 0.6345 |
EOC | 0.9740 | 0.9929 | 0.9892 | 0.8353 | 0.9893 |
One way of measuring fairness could be to measure equal opportunity, which is abbreviated above as eqo. The idea is that we have a sensitive attribute, say race, for which we don't want unfairness with regard to the positive outcome \(y = 1\). Then equal opportunity is defined as follows;
\[ \text{equality of opportunity} = \min \left(\frac{P(\hat{y}=1 | z=1, y=1)}{P(\hat{y}=1 | z=0, y=1)}, \frac{P(\hat{y}=1 | z=0, y=1)}{P(\hat{y}=1 | z=1, y=1)}\right) \]
Extra details can be found here.
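As a minimal sketch (our own helper, not from scikit-lego), the eqo numbers in the table above could be computed along these lines for a binary sensitive attribute:

```python
import numpy as np

def equal_opportunity(y_true, y_pred, z):
    # Hypothetical helper mirroring the formula above; y_true and y_pred are
    # 0/1 arrays, z is a binary sensitive attribute.
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    pos = y_true == 1
    p_z1 = y_pred[pos & (z == 1)].mean()  # P(y_hat = 1 | z = 1, y = 1)
    p_z0 = y_pred[pos & (z == 0)].mean()  # P(y_hat = 1 | z = 0, y = 1)
    return min(p_z1 / p_z0, p_z0 / p_z1)
```

You can also confirm the difference between the two models by looking at their coefficients.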
model | intercept | employed | citizen | year | checks |
---|---|---|---|---|---|
LR | -1.0655 | 0.7913 | 0.7537 | -0.0101 | -0.5951 |
EOC | 0.5833 | 0.7710 | 0.6826 | -0.0196 | -0.5798 |
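These numbers can be read straight off the fitted models; a small sketch, assuming the `EqualOpportunityClassifier` exposes the usual scikit-learn `intercept_` and `coef_` attributes and that the column order matches how `X` was assembled:

```python
# Inspect the fitted parameters of both models (column order follows X).
print(unfair_model.intercept_, unfair_model.coef_)
print(fair_model.intercept_, fair_model.coef_)
```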
There are a few things to note at this stage;
This brings us back to the formula that we started with.
\[ \text{model} = \text{data} \times \text{constraints} \]
In our case the constraints we want concern fairness.
\[ p(\theta | D) \propto \underbrace{p(D | \theta)}_{\text{data}} \underbrace{p(\theta)}_{\text{fairness?}} \]
So can we come up with a prior for that?
To explore this idea we set out to reproduce our results from earlier in PyMC3. We started with an implementation of logistic regression but found that it did not match our earlier results. The results of the trace are listed below. We show the distribution of the weights as well as a distribution over the unfairness.
```python
import pymc3 as pm

with pm.Model() as unbalanced_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])

    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))

    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())

    pm.Bernoulli('released', p, observed=df['released'])

    unbalanced_trace = pm.sample(tune=1000, draws=1000, chains=6)
```
Figure 1: Standard Logistic Regression in PyMC3.
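The plot behind Figure 1 comes from PyMC3's standard plotting helpers; roughly (a sketch, the exact plotting call is not shown in the post):

```python
pm.traceplot(unbalanced_trace)   # posterior of intercept, weights and mu_diff
pm.summary(unbalanced_trace)     # tabular overview of the same trace
```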
You might notice that the results of the traceplot do not agree with the coefficients that we found earlier. This was because our original logistic regression had a `balanced` class-weight setting, which is not included in our PyMC3 approach. Luckily for us, PyMC3 has a feature to address this: `pm.Potential`.
pm.Potential
The idea behind the potential is that you add a prior on a combination of parameters instead of just having it on a single one. For example, this is how you’d usually set parameters;
```python
mu = pm.Normal('mu', 0, 1)
sigma = pm.HalfNormal('sigma', 1)
```
By setting the `sigma` prior to be `HalfNormal` we prevent it from ever becoming negative. But what if we'd like to set another prior, namely that \(\mu \approx \sigma\)? This is what `pm.Potential` can be used for.
```python
pm.Potential('balance', pm.Normal.dist(0, 0.1).logp(mu - sigma))
```
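To make that concrete, here is a self-contained toy sketch (ours, not from the post) of a model in which the potential ties `mu` and `sigma` together:

```python
import numpy as np
import pymc3 as pm

y_obs = np.random.normal(loc=2.0, scale=2.0, size=100)  # toy data where mu ~ sigma

with pm.Model() as tied_model:
    mu = pm.Normal('mu', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)

    # Soft constraint: penalise samples where mu and sigma drift apart.
    pm.Potential('balance', pm.Normal.dist(0, 0.1).logp(mu - sigma))

    pm.Normal('y', mu=mu, sigma=sigma, observed=y_obs)
    tied_trace = pm.sample(tune=1000, draws=1000)
```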
Adding a potential has an effect on the likelihood of a tracepoint.
Figure 2: Example of a tracepoint that is both less (left) and more likely (right) given the potential.
This in turn will make the posterior look different.
Figure 3: The effect that the potential might have.
So we made a second version of the logistic regression.
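This version re-weights the Bernoulli log likelihood via a potential. The `sample_weights` it uses are only defined in the full function further down the post; for readability, this is how they are computed there:

```python
# Inverse class-frequency weights (taken from the full function below).
class_weights = len(df) / df['released'].value_counts()
sample_weights = df['released'].map(class_weights)
```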
```python
with pm.Model() as balanced_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])

    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))

    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())

    pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))

    balanced_trace = pm.sample(tune=1000, draws=1000, chains=6)
```
These results were in line with our previous result again.
But that `pm.Potential` can also be used for other things!
Figure 4: From trace to posterior.
Suppose we have our original trace that generates our posterior.
Figure 5: Two belief systems …
Now also suppose that we have a function that describes our potential.
Figure 6: Two belief systems … merged!
Then these two can be combined. Our prior can span beyond a single parameter! Our belief of how the model should behave before seeing data can influence the entire posterior. So we’ve literally come up with a prior that has a huge potential for fairness.
```python
X_colour, X_non_colour = split_groups(X, key="colour")

...

dist_colour = intercept + pm.math.dot(X_colour, weights)
dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
```
Note the `0.01` value on the bottom line. This value can be interpreted as strictness for fairness. The lower it is, the less wiggle room the sampler has to explore areas that are not fair.
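To see why the scale acts as a strictness knob, note that the potential simply adds the log density of a zero-centred normal to the joint log probability, which is a quadratic penalty on the group difference:

\[ \log \mathcal{N}(\mu_{\text{diff}} | 0, \sigma) = -\frac{\mu_{\text{diff}}^2}{2\sigma^2} + \text{const}, \qquad \sigma = 0.01 \;\Rightarrow\; \frac{1}{2\sigma^2} = 5000. \]

So halving \(\sigma\) quadruples the penalty on any gap between the two groups.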
The results can be seen below.
```python
with pm.Model() as dem_par_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])

    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))

    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())

    pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
    pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))

    dem_par_trace = pm.sample(tune=1000, draws=1000, chains=6)
```
Figure 7: Potential Fairness in PyMC3.
There's still an issue. We've got a flexible approach here: compared to the scikit-learn pipeline we can have more flexible definitions of fairness and more flexible models (hierarchical models, non-linear models), but at the moment our model does not guarantee fairness.
But then Matthijs came up with a neat little hack.
Figure 8: Posterior Belief and Potential Direction
We use our potential to push samples in a direction. This push must be continuous if we want the gradients in our sampling method to be of any use to us. But after this push is done, we would like to make a hard cutoff on our fairness. So why don't we just filter out the sampled points in our trace that we don't like?
Figure 9: After the data is pushed we do a hard filter.
This way, we still get a distribution over the model's parameters out of the sampler, but this distribution is guaranteed to never assign any probability mass in regions where we deem the predictions to be 'unfair'.
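The `trace_filter` helper used in the function below is not spelled out in the post; a minimal sketch of what such a filter could look like, assuming we simply keep the posterior samples whose `mu_diff` lies inside the cutoff:

```python
import numpy as np

def trace_filter(trace, cutoff):
    # Hypothetical sketch: keep only the samples whose fairness statistic
    # 'mu_diff' falls within the hard cutoff.
    keep = np.abs(trace['mu_diff']) < cutoff
    return {name: trace[name][keep] for name in trace.varnames}
```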
```python
from scipy.special import expit


def hard_constraint_model(df):
    def predict(trace, df):
        X = df[['year', 'employed', 'citizen', 'checks']].values
        return expit((trace['intercept'][:, None] + trace['weights'] @ X.T).mean(axis=0))

    X = df[['year', 'employed', 'citizen', 'checks']].values
    X_colour, X_non_colour = X[df['colour'] == 1], X[df['colour'] == 0]

    class_weights = len(df) / df['released'].value_counts()
    sample_weights = df['released'].map(class_weights)

    with pm.Model() as dem_par_model:
        intercept = pm.Normal('intercept', 0, 1)
        weights = pm.Normal('weights', 0, 1, shape=X.shape[1])

        p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))

        dist_colour = intercept + pm.math.dot(X_colour, weights)
        dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
        mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())

        pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
        pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))

        dem_par_trace = pm.sample(tune=1000, draws=1000, chains=6)

    return trace_filter(dem_par_trace, 0.16), predict
```
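Putting it together, a usage sketch (assuming the `df` DataFrame from earlier):

```python
fair_trace, predict = hard_constraint_model(df)
df['p_released'] = predict(fair_trace, df)
```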
Note that in the results, the fairness metric has a hard cutoff.
Figure 10: Enforce Fairness in PyMC3.
The approach that we propose here is relatively generic. You can make hierarchical models and you have more flexibility in your definition of fairness. You start with constraints, which you need to translate into a potential, after which you can apply a strict filter.
Figure 11: The general recipe.
We've just come up with an approach where our potential represents fairness. Since we filter the trace afterwards, we have an algorithm with properties we like. We don't want to suggest this approach is perfect though, so here are some valid points of critique;
Despite not guaranteeing fairness, we're pretty excited about this way of thinking about models. The reason why is best described with an analogy in fashion and is summarized in this photo;
Figure 12: This is what `OneSizeForAll().fit()` looks like. It never fits perfectly.
We think scikit-learn is an amazing tool. It sparked the familiar `.fit()`/`.predict()` interface that the ecosystem has grown accustomed to and it introduced a wonderful concept via its `Pipeline` API. But all this greatness comes at a cost; people seem to be getting lazy.
Every problem gets reduced to something that can be put into a `.fit()`/`.predict()` pipeline. The clearest examples of this can be found on the Kaggle platform. Kaggle competitions are won by reducing a problem to a single metric, optimising it religiously and not worrying about the application. They're not won by understanding the problem, modelling towards it or by wondering how the algorithm might have side-effects that you don't want.
It is exactly on this axis that this approach gives us hope. Instead of calling `model.fit()` you get to enact `tailor.model()` because you're forced to think in constraints. This means that we actually get to model again! We can add common sense as a friggin' prior! How amazing is that!
To add a cherry on top: in our example we're using fairness as a driving argument, but the reason to be excited goes beyond that.
The act of thinking about constraints immediately makes you seriously consider the problem before modelling and that … that’s got a lot of potential.
This document is written by both myself and Matthijs Brouns and it is posted on two blogs. The code used here can be found in this github repository.
It’s also resulted in a conference talk at PyMCon.
For attribution, please cite this work as
Brouns & Warmerdam (2020, Feb. 22). koaning.io: Priors of Great Potential. Retrieved from https://koaning.io/posts/priors-of-great-potential/
BibTeX citation
```bibtex
@misc{brouns2020priors,
  author = {Brouns, Matthijs and Warmerdam, Vincent},
  title = {koaning.io: Priors of Great Potential},
  url = {https://koaning.io/posts/priors-of-great-potential/},
  year = {2020}
}
```