Mixture Methods¶
Gaussian Mixture Models (GMMs) are flexible building blocks for other machine learning algorithms.
This is partly because they can approximate a wide range of probability distributions, but also because they remain somewhat interpretable even when the dataset gets very complex.
This package makes use of GMMs to construct other algorithms. In addition to the GMMClassifier and GMMOutlierDetector, this library also features a BayesianGMMClassifier and a BayesianGMMOutlierDetector. These methods offer pretty much the same API, but have internal machinery to infer an appropriate number of components. They tend to take significantly more time to train, so alternatively you may also run a proper grid search to find the best number of components for your use-case.
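For example, a grid search over the number of components might look like the following. This is a minimal sketch: the dataset, the candidate values and the default scoring are illustrative choices, not recommendations.

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklego.mixture import GMMClassifier

X, y = make_moons(1000, noise=0.12)

# try a handful of candidate component counts with cross-validation
grid = GridSearchCV(
    GMMClassifier(),
    param_grid={"n_components": [1, 2, 4, 8, 16]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)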
Classification¶
Below is some example code of how you might use a GMMClassifier from sklego to perform classification.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklego.mixture import GMMClassifier
n = 1000
X, y = make_moons(n)
X = X + np.random.normal(0, 0.12, (n, 2))  # jitter the moons with gaussian noise
X = StandardScaler().fit_transform(X)
U = np.random.uniform(-2, 2, (10000, 2))
mod = GMMClassifier(n_components=4).fit(X, y)
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=mod.predict(X), s=8)
plt.title("classes of points");
plt.subplot(122)
plt.scatter(U[:, 0], U[:, 1], c=mod.predict_proba(U)[:, 1], s=8)
plt.title("classifier boundary");
Outlier Detection¶
Below is some example code of how you might use a GMM from sklego to do outlier detection.
Note that the GMMOutlierDetector generates prediction values that are either -1 (outlier) or +1 (normal).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklego.mixture import GMMOutlierDetector
n = 1000
X = make_moons(n)[0] + np.random.normal(0, 0.12, (n, 2))  # moons plus gaussian noise
X = StandardScaler().fit_transform(X)
U = np.random.uniform(-2, 2, (10000, 2))
mod = GMMOutlierDetector(n_components=16, threshold=0.95).fit(X)
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=mod.score_samples(X), s=8)
plt.title("likelihood of points given mixture of 16 gaussians");
plt.subplot(122)
plt.scatter(U[:, 0], U[:, 1], c=mod.predict(U), s=8)
plt.title("outlier selection");
Remark that with a GMM there are multiple ways to select outliers. Instead of selecting points that fall beyond a likelihood quantile threshold, you can also select points whose likelihood is a given number of standard deviations below the most likely score.
Figure: different thresholds.
Detection Details¶
The outlier detection methods that we use are based on the likelihoods that come out of the estimated Gaussian Mixture.
Depending on the method you choose, there is a different way of determining whether a point falls inside or outside the threshold.
- If the "quantile" method is used, we take all the likelihood scores that the GMM assigns to the training dataset to determine where to set the threshold. The threshold value must be between 0 and 1 here.
- If the "stddev" method is used, the threshold value is interpreted as the number of standard deviations below the mean. We only calculate the standard deviation on the lower scores because there's usually more variance there.

!!! note
    The "stddev" method allows you to be much more picky in selecting outliers, since it can be more exclusive than the "quantile" method with a threshold equal to one.
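To make the difference concrete, here is a minimal sketch that fits both variants on the same data, reusing the scaled X from the outlier detection example above. The method and threshold arguments are part of the GMMOutlierDetector API, but the specific numbers are only illustrative.

from sklego.mixture import GMMOutlierDetector

# "quantile": threshold is a quantile of the training likelihood scores (between 0 and 1)
quantile_mod = GMMOutlierDetector(n_components=16, threshold=0.99, method="quantile").fit(X)

# "stddev": threshold is a number of standard deviations below the mean of the lower scores
stddev_mod = GMMOutlierDetector(n_components=16, threshold=2.0, method="stddev").fit(X)

print((quantile_mod.predict(X) == -1).sum(), "outliers via quantile")
print((stddev_mod.predict(X) == -1).sum(), "outliers via stddev")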
As a side note: the image above was generated with some dummy data, but the code that produced it can be found below:
Code for plot generation
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
sns.set_theme()
score_samples = np.random.beta(220, 10, 3000)  # dummy likelihood scores
density = gaussian_kde(score_samples)  # smooth density estimate of the scores
likelihood_range = np.linspace(0.80, 1.0, 10000)
index_max_y = np.argmax(density(likelihood_range))
mean_likelihood = likelihood_range[index_max_y]  # likelihood value at the density peak
new_likelihoods = score_samples[score_samples < mean_likelihood]  # keep only the lower scores
new_likelihoods_std = np.sqrt(np.sum((new_likelihoods - mean_likelihood) ** 2) / (len(new_likelihoods) - 1))  # spread of the lower scores around the peak
plt.figure(figsize=(14, 3))
plt.subplot(121)
plt.plot(likelihood_range, density(likelihood_range), "k")
xs = np.linspace(0.8, 1.0, 2000)
plt.fill_between(xs, density(xs), alpha=0.8)
plt.title("log-lik values from with GMM, quantile is based on blue part")
plt.subplot(122)
plt.plot(likelihood_range, density(likelihood_range), "k")
plt.vlines(mean_likelihood, 0, density(mean_likelihood), linestyles="dashed")
xs = np.linspace(0.8, mean_likelihood, 2000)
plt.fill_between(xs, density(xs), alpha=0.8)
plt.title("log-lik values from with GMM, stddev is based on blue part")