Model Mining
In this example we demonstrate how visual data mining techniques can uncover meaningful patterns in your data. These patterns can then be translated into a machine learning model using the tools found in this package.
You can find a full tutorial of this technique on calmcode, but the main video can be viewed below.
The Task
We're going to make a rule-based model for the creditcard dataset. The main feature of this dataset is that it suffers from a heavy class imbalance. Instead of training a machine learning model, let's explore the data with a parallel coordinates chart. If you scroll all the way to the bottom of this tutorial you'll see an example of such a chart; it shows the "train" set.
We explored the data just like in the video, which led us to define the following model.
```python
from hulearn.classification import FunctionClassifier
from hulearn.experimental import CaseWhenRuler

def make_prediction(dataf):
    # Predict 1 when any rule fires, otherwise fall back to the default (0).
    ruler = CaseWhenRuler(default=0)

    (ruler
     .add_rule(lambda d: (d['V11'] > 4), 1)
     .add_rule(lambda d: (d['V17'] < -3), 1)
     .add_rule(lambda d: (d['V14'] < -8), 1))

    return ruler.predict(dataf)

clf = FunctionClassifier(make_prediction)
```
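To get a feel for the rule semantics, here's a minimal sketch on a hypothetical three-row dataframe (the toy values are made up for illustration): the first matching rule determines the prediction, and the default kicks in when no rule fires.

```python
import pandas as pd

# Hypothetical toy data: one row triggers the V11 rule, one the V17 rule,
# and one triggers nothing and falls back to the default (0).
toy = pd.DataFrame({'V11': [5.0, 0.0, 0.0],
                    'V17': [0.0, -4.0, 0.0],
                    'V14': [0.0, 0.0, 0.0]})

# `fit` is a no-op for FunctionClassifier, but scikit-learn expects it.
clf.fit(toy, [1, 1, 0])
print(clf.predict(toy))  # [1 1 0]
```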
Full Code
First we load the data.
```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the creditcard dataset from OpenML. The target column "Class"
# holds the strings '0'/'1'; we turn it into the boolean "group" column
# that the rest of this tutorial uses.
df_credit = fetch_openml(
    data_id=1597,
    as_frame=True
)['frame'].assign(group=lambda d: d['Class'] == '1')

credit_train, credit_test = train_test_split(df_credit, test_size=0.5, shuffle=True)
```
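A quick check confirms the class imbalance mentioned earlier: roughly 492 fraud cases stand against more than 284,000 normal transactions in the full dataset.

```python
# The positive class is extremely rare compared to the negative class.
print(df_credit['group'].value_counts())
```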
Next, we create a hiplot chart in Jupyter.
```python
import json
import hiplot as hip
import pandas as pd

# Keep all fraud cases, plus a sample of 5000 rows for context.
samples = [credit_train.loc[lambda d: d['group'] == True], credit_train.sample(5000)]
json_data = pd.concat(samples).to_json(orient='records')
hip.Experiment.from_iterable(json.loads(json_data)).display()
```
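If you're not working in a notebook, hiplot can also write the same chart to a standalone HTML file that you can open in a browser. A small sketch; the filename is just an example.

```python
# Render the experiment to a self-contained HTML page.
exp = hip.Experiment.from_iterable(json.loads(json_data))
exp.to_html("parallel_coordinates.html")
```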
Now that we have our model, we can generate a classification report.
```python
from sklearn.metrics import classification_report

# Note that `fit` is a no-op here.
preds = clf.fit(credit_train, credit_train['group']).predict(credit_test)
print(classification_report(credit_test['group'], preds))
```
When we ran the benchmark locally, we got the following classification report.
```
              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142165
        True       0.70      0.73      0.71       239

    accuracy                           1.00    142404
   macro avg       0.85      0.86      0.86    142404
weighted avg       1.00      1.00      1.00    142404
```
Deep Learning
It's not a perfect benchmark, but we can compare this result to the one demonstrated on the keras blog. The trained model there reports 86.67% precision but only 23.9% recall. Depending on how you weigh false positives against false negatives, you could argue that our model outperforms the deep learning model.
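As a quick sketch of that trade-off, you can pull the two numbers for our rule-based model straight from the predictions made above:

```python
from sklearn.metrics import precision_score, recall_score

# Our rule-based numbers, to set against the keras blog's
# reported 86.67% precision and 23.9% recall.
print("precision:", precision_score(credit_test['group'], preds))
print("recall:   ", recall_score(credit_test['group'], preds))
```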
It's not a 100% fair comparison. The keras blogpost is written to explain keras; the author likely didn't attempt to make a state-of-the-art model. But what this demo does show is the merit of turning an exploratory data analysis into a model. You can end up with a very interpretable model, you might learn something about your data along the way, and the model may still perform well.
Parallel Coordinates
If you hover over the group name and right-click, you'll be able to set it for coloring and repeat the exercise from the video. By doing that it becomes quite easy to eyeball how to separate the two classes. The V17 column especially seems powerful here. In real life we might ask why this column is so distinctive, but for now we'll just play around until we find a sensible model.
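If you'd like to confirm what the chart suggests without the interactive tooling, a quick pandas summary works too. This is just a sanity-check sketch; the `-3` cutoff is the one we eyeballed above.

```python
# Compare the distribution of V17 across the two classes; the fraud
# cases should sit noticeably lower, which is why `V17 < -3` separates them.
print(credit_train.groupby('group')['V17'].describe())
```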