scikit-playtime
Rethinking machine learning pipelines a bit.
What does scikit-playtime do?
I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B), but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.
Imagine that you are dealing with the titanic dataset.
Here's what the dataset looks like.
| survived | pclass | name | sex | age | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 7.25 | 1 | 0 |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 71.2833 | 1 | 0 |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 7.925 | 0 | 0 |
| 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 53.1 | 1 | 0 |
| 0 | 3 | Allen, Mr. William Henry | male | 35 | 8.05 | 0 | 0 |
The goal of this dataset is to predict who survived, so survived is the target column for a classification task. But in order to make the right predictions you would need to encode the features in the right way. So to do that, you might construct a preprocessing pipeline like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from skrub import SelectCols

pipe = make_union(
    SelectCols(["age", "fare", "sibsp", "parch"]),
    make_pipeline(
        SelectCols(["sex", "pclass"]),
        OneHotEncoder(),
    ),
)
This pipeline takes the age, fare, sibsp and parch features as-is; these are already numeric, so they do not need to be changed. The sex and pclass features, on the other hand, are categorical, so it helps to one-hot encode them first.
Here's what the HTML render of the pipeline looks like.
FeatureUnion(transformer_list=[
    ('selectcols', SelectCols(cols=['age', 'fare', 'sibsp', 'parch'])),
    ('pipeline', Pipeline(steps=[
        ('selectcols', SelectCols(cols=['sex', 'pclass'])),
        ('onehotencoder', OneHotEncoder())]))])
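You would use it like any other scikit-learn transformer. A minimal sketch, assuming the titanic data lives in a pandas dataframe (the CSV path here is an assumption on my part):

import pandas as pd

# Hypothetical path: point this at wherever your titanic CSV lives.
df = pd.read_csv("titanic.csv")

# Fit the preprocessing pipeline and turn the raw dataframe
# into a numeric feature matrix.
X = pipe.fit_transform(df)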
The pipeline works, and it's fine, but you could wonder whether it is actually easy. You need to know scikit-learn fairly well to build a pipeline this way, and you need to be comfortable with Python too. There's also some nesting happening here, so for a novice, or somebody who just wants to make a quick model, there's some stuff that gets in the way. All of this is fine when you consider that scikit-learn needs to allow for elaborate pipelines ... but if you just want something dead simple, you may appreciate another syntax instead.
Enter playtime.
Playtime offers an API that allows you to declare the aforementioned pipeline by doing this instead:
from playtime.formula import feats, onehot
formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass")
This formula object simply accumulates components, and you can access the generated pipeline through its .pipeline property.
It's pretty much the same pipeline, but it's a lot easier to go ahead and declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.
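Because .pipeline is a regular scikit-learn estimator, it also composes with a model at the end. A quick sketch, assuming the titanic dataframe df from before:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# formula.pipeline is a plain scikit-learn pipeline, so it drops
# straight into make_pipeline; the formula only selects the columns
# it needs, so we can pass the full dataframe.
model = make_pipeline(formula.pipeline, LogisticRegression())
model.fit(df, df["survived"])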
Let's also do text.
Right now we're just exploring base features and one-hot encoding ... but why stop there? We can also encode the name of the passenger using a bag of words representation!
from playtime.formula import feats, onehot, bag_of_words
formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass") + bag_of_words("name")
FeatureUnion(transformer_list=[
    ('selectcols', SelectCols(cols=['age', 'fare'])),
    ('pipeline-1', Pipeline(steps=[
        ('selectcols', SelectCols(cols=('pclass', 'sex'))),
        ('onehotencoder', OneHotEncoder())])),
    ('pipeline-2', Pipeline(steps=[
        ('functiontransformer', FunctionTransformer(func=<function column_pluck at 0x287266700>, kw_args={'column': 'name'})),
        ('countvectorizer', CountVectorizer())]))])
Again, as a user you don't need to worry about the internals of the pipeline; you just declare how you want to model.
About that bag_of_words representation
The CountVectorizer in scikit-learn is great for making bag of words representations, but it assumes an iterable of texts as input. That means we can't use the SelectCols object from skrub here, because it always returns a dataframe, even if we only select a single column.
Again, this is a detail that a modeller should not be concerned with, so playtime fixes this internally on your behalf. Part of this involves leveraging narwhals, which even allows us to support both polars and pandas in one go.
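The repr above hints at how this works: a FunctionTransformer plucks the column out as a list of strings before it reaches CountVectorizer. Here's a minimal sketch of that idea; the column_pluck below is my own approximation, not playtime's actual implementation:

import narwhals as nw
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

def column_pluck(df, column):
    # narwhals wraps pandas and polars dataframes behind one API,
    # so either way we can pull out a single column as a plain list.
    return nw.from_native(df)[column].to_list()

# Pluck the text column first, then hand the list of strings
# to CountVectorizer.
text_pipe = make_pipeline(
    FunctionTransformer(column_pluck, kw_args={"column": "name"}),
    CountVectorizer(),
)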
Let's also do timeseries.
So far we've shown how you might use one-hot encoded variables and bag of words representations to preprocess data for a machine learning use-case. This covers a lot of ground already, but why stop here?
We're still exploring all the ways that you might encode data, but just to give one more example, let's consider timeseries. We could generate some features that can help predict seasonal patterns. Internally we're using this technique, but again, here's all you need:
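Something along these lines; the seasonal helper name below is an assumption on my part, so double-check the playtime docs for the exact API:

# Hypothetical sketch: the exact helper name may differ in playtime.
from playtime.formula import seasonal

formula = seasonal("timestamp")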
Again, this formula contains a pipeline that we can pass to a model.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load data that has a timestamp column and a `y` target column
df = pd.read_csv('datasets/me-temperatures.csv')

# Use a linear model for these seasonal features
pipe = make_pipeline(formula.pipeline, Ridge())

# Make the predictions
pred = pipe.fit(df, df['y']).predict(df)

# Plot the predictions to show the effect
pltr = df.assign(pred=pred)
plt.figure(figsize=(12, 5))
plt.plot(np.arange(0, pltr.shape[0]), pltr['y'])
plt.plot(np.arange(0, pltr.shape[0]), pltr['pred'], linewidth=4);
The future
Feel like playing around with this? You can do this right now by installing via pip:
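Assuming the package is published under the project name:

python -m pip install scikit-playtime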
That said, please consider this an experimental project where things may break. There is still much to explore here, and that exploration will happen in public. In the future this project will explore:
- How we might come up with more clever featurisation methods. We may be able to capture plenty more common feature patterns with simple functions that we can chain together.
- How different operators might help improve things. Maybe the * operator can be used to generate a cross product between features, and maybe the | operator can be used to pass features to actual scikit-learn components like PCA().
- How we might consider methods that can accept a playtime pipeline and do more elaborate modelling on top. Maybe we can be clever about how we generate multi-output models for timeseries tasks. Think about quantiles or multi-label use-cases.
Thanks
This project was originally part of my work over at calmcode labs, but my employer probabl has been very supportive and has allowed me to work on it during working hours. That was super cool, and I wanted to make sure I recognise them for it.