Human Preprocessing

In Python, the most popular data analysis tool is pandas, while the most popular tool for building models is scikit-learn. We love the data wrangling tools that pandas offers, and we appreciate the benchmarking capabilities of scikit-learn.

The fact that these tools don't fully interact is slightly awkward. The data going into a model has a big effect on its output.

So how might we more easily combine the two?

Pipe

In pandas there's an amazing trick that you can do with the .pipe method. We'll give a quick overview of how it works, but if you're new to this idea you may appreciate this resource or this blogpost.
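In short, df.pipe(func, **kwargs) is just another way of writing func(df, **kwargs), which makes it easy to chain dataframe-to-dataframe functions together. Here's a minimal sketch with a hypothetical double function:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

def double(dataf, col):
    # Return a copy with the given column doubled.
    return dataf.assign(**{col: dataf[col] * 2})

# These two calls are equivalent.
double(df, col='a')
df.pipe(double, col='a')

With that covered, let's load a dataset to work with.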

from hulearn.datasets import load_titanic

df = load_titanic(as_frame=True)
X, y = df.drop(columns=['survived']), df['survived']
X.head(4)

The goal of the titanic dataset is to predict whether or not a passenger survived the disaster. The X variable represents a dataframe with the variables that we're going to use to predict survival (stored in y). Here's a preview of what X looks like.

pclass  name                      sex     age  fare     sibsp  parch
3       Braund, Mr. Owen Harris   male    22   7.25     1      0
3       Heikkinen, Miss. Laina    female  26   7.925    0      0
3       Allen, Mr. William Henry  male    35   8.05     0      0
1       McCarthy, Mr. Timothy J   male    54   51.8625  0      0

Let's say we want to do some preprocessing. Maybe the length of somebody's name says something about their status, so we'd like to capture that. We could add this feature with a single line of code.

X['nchar'] = X['name'].str.len()

This line of code has a downside though: it changes the original dataset. If we do a lot of this, our code will turn into something unmaintainable rather quickly. To prevent this, we might want to turn the code into a function.

def process(dataf):
    # Make a copy of the dataframe so we don't overwrite the original data.
    dataf = dataf.copy()
    # Make the changes.
    dataf['nchar'] = dataf['name'].str.len()
    # Return the new dataframe.
    return dataf

We now have a nice function that makes our changes, and we can use it like so:

X_new = process(X)

We can do something more powerful though.

Parameters

Let's make some more changes to our process function.

def preprocessing(dataf, n_char=True, gender=True):
    dataf = dataf.copy()
    if n_char:
        dataf['nchar'] = dataf['name'].str.len()
    if gender:
        dataf['gender'] = (dataf['sex'] == 'male').astype("float")
    return dataf.drop(columns=["name", "sex"])

This function works slightly differently now. The most important change is that the function now accepts arguments that alter its internal behavior. The function also drops the non-numeric columns at the end.

We've changed the way we define our function, but we're also changing the way we apply it.

# This is equivalent to preprocessing(X)
X.pipe(preprocessing)

The benefit of this notation is that when we have more functions that handle data processing, the code remains a clean overview.

With .pipe()

(df
  .pipe(set_col_types)
  .pipe(preprocessing, n_char=True, gender=False)
  .pipe(add_time_info))

Without .pipe()

add_time_info(preprocessing(set_col_types(df), n_char=True, gender=False))

Let's be honest, this looks messy.

PipeTransformer

It would be great if we could use the preprocessing function as part of a scikit-learn pipeline that we can benchmark. In fact, it'd be great if we could use any function that works in a pandas .pipe-line!

For that we've got another feature in our library, the PipeTransformer.

from hulearn.preprocessing import PipeTransformer

def preprocessing(dataf, n_char=True, gender=True):
    dataf = dataf.copy()
    if n_char:
        dataf['nchar'] = dataf['name'].str.len()
    if gender:
        dataf['gender'] = (dataf['sex'] == 'male').astype("float")
    return dataf.drop(columns=["name", "sex"])

# Important: don't forget to declare `n_char` and `gender` here.
tfm = PipeTransformer(preprocessing, n_char=True, gender=True)

The tfm variable now represents a component that can be used in a scikit-learn pipeline.
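Because tfm follows the scikit-learn transformer API, you should also be able to apply it on its own; a quick sketch, assuming the standard fit/transform methods:

# Runs our pandas function under the hood and returns
# the preprocessed dataframe.
X_tfm = tfm.fit_transform(X)

We can also perform a cross-validated benchmark on the parameters of our preprocessing function.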

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV


pipe = Pipeline([
    ('prep', tfm),
    ('mod', GaussianNB())
])

params = {
    "prep__n_char": [True, False],
    "prep__gender": [True, False]
}

grid = GridSearchCV(pipe, cv=3, param_grid=params).fit(X, y)

Once trained, we can fetch grid.cv_results_ to get a glimpse of the results of our pipeline.
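For example, here's a small sketch that loads the results into a dataframe and keeps only the columns we care about:

import pandas as pd

results = pd.DataFrame(grid.cv_results_)
results[['param_prep__gender', 'param_prep__n_char', 'mean_test_score']]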

param_prep__gender  param_prep__n_char  mean_test_score
True                True                0.785714
True                False               0.778711
False               True                0.70028
False               False               0.67507

It seems that the gender of the passenger has more of an effect on their survival than the length of their name.

Utility

The use case here has been a relatively simple demonstration on a toy dataset, but hopefully you can see that this opens up a lot of flexibility for your machine learning pipelines. You can keep the preprocessing interpretable while keeping everything running, just by writing pandas code.

There are a few small caveats to be aware of.

Don't remove data

Pandas pipelines allow you to filter away rows; scikit-learn, on the other hand, assumes this does not happen. Please be mindful of this.
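As a hypothetical example of the kind of function to avoid:

def bad_preprocessing(dataf):
    # Filtering rows means the output no longer lines up with
    # the y labels that scikit-learn keeps on the side.
    return dataf[dataf['age'] > 18]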

Don't sort data

You need to keep the order of the rows in your dataframe the same, because otherwise they will no longer correspond to the y variable that you're trying to predict.
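Again, a hypothetical example of what to avoid:

def bad_sort(dataf):
    # After sorting, row i of the output no longer corresponds
    # to y[i], which silently corrupts the labels.
    return dataf.sort_values('fare')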

Don't use lambda

There are two ways that you can add a new column to a pandas dataframe.

# Method 1
dataf_new = dataf.copy() # Don't overwrite data!
dataf_new['new_column'] = dataf_new['old_column'] * 2

# Method 2
dataf_new = dataf.assign(new_column=lambda d: d['old_column'] * 2)

In many cases you might argue that method #2 is safer because you do not need to worry about calling dataf.copy() yourself. In our case, however, we cannot use it: the grid search in scikit-learn no longer works if you use lambda functions, because it cannot pickle them.
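You can see the underlying problem without scikit-learn; a minimal sketch:

import pickle

def named_function(dataf):
    return dataf

pickle.dumps(named_function)  # works fine
pickle.dumps(lambda d: d)     # raises PicklingError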

Don't cheat!

The functions that you write are supposed to be stateless, in the sense that they don't learn from the data that goes in. You could theoretically bypass this with global variables, but by doing so you're doing yourself a disservice: you'd be cheating the statistics by leaking information.
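To make the pitfall concrete, here's a hypothetical example of the kind of global-variable trick to avoid:

FARE_MEAN = None

def leaky_preprocessing(dataf):
    global FARE_MEAN
    if FARE_MEAN is None:
        # 'Remembering' a statistic across calls means information
        # from one cross-validation fold can leak into the others.
        FARE_MEAN = dataf['fare'].mean()
    dataf = dataf.copy()
    dataf['fare_ratio'] = dataf['fare'] / FARE_MEAN
    return dataf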