Decomposition

sklego.decomposition.pca_reconstruction.PCAOutlierDetection

Bases: BaseEstimator, OutlierMixin

PCAOutlierDetection is an outlier detector based on the reconstruction error from PCA.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
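
The reconstruction error can be illustrated without the estimator itself. The sketch below is a plain NumPy approximation of the idea, not the library's implementation: it performs PCA through an SVD of the centred data (the SVD route and all variable names are assumptions made for illustration), reconstructs each row from the leading components, and sums the absolute per-feature error.

```python
import numpy as np

# Toy data taken from the example on this page.
X = np.array(
    [[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]],
    dtype=float,
)
n_components = 2

# PCA via SVD: centre the data, keep the leading right-singular vectors.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:n_components]  # shape (n_components, n_features)

# Project down, then map back to the original feature space.
reduced = (X - mean) @ components.T
reconstructed = reduced @ components + mean

# Row-wise absolute reconstruction error (the "absolute" variant).
abs_error = np.abs(reconstructed - X).sum(axis=1)
print(abs_error)
```

Rows that lie close to the plane spanned by the two retained components reconstruct almost perfectly, while rows far from that plane produce a larger error.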

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | int \| None | Number of components of the PCA model. | `None` |
| `threshold` | float \| None | The threshold used for the decision function. It must be set before calling `fit`. | `None` |
| `variant` | Literal["relative", "absolute"] | The variant used for the difference calculation. "absolute" uses the summed absolute difference between original and reconstructed data for each row; "relative" divides that value by the sum of the original row. | `"relative"` |
| `whiten` | bool | `whiten` parameter of the PCA model. | `False` |
| `svd_solver` | Literal["auto", "full", "arpack", "randomized"] | `svd_solver` parameter of the PCA model. | `"auto"` |
| `tol` | float | `tol` parameter of the PCA model. | `0.0` |
| `iterated_power` | int \| Literal["auto"] | `iterated_power` parameter of the PCA model. | `"auto"` |
| `random_state` | int \| None | `random_state` parameter of the PCA model. | `None` |

Attributes:

| Name | Type | Description |
|---|---|---|
| `pca_` | PCA | The underlying PCA model. |
| `offset_` | float | The offset used for the decision function, set to `-threshold` during `fit`. |

Examples:

import numpy as np
from sklego.decomposition import PCAOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])

pca_model = PCAOutlierDetection(n_components=2, threshold=0.05)
pca_model.fit(X)
pca_pred = pca_model.predict(X)
pca_pred
# [ 1  1  1 -1 -1  1]
Source code in sklego/decomposition/pca_reconstruction.py
class PCAOutlierDetection(BaseEstimator, OutlierMixin):
    """`PCAOutlierDetection` is an outlier detector based on the reconstruction error from PCA.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    n_components : int | None, default=None
        Number of components of the PCA model.
    threshold : float | None, default=None
        The threshold used for the decision function.
    variant : Literal["relative", "absolute"], default="relative"
        The variant used for the difference calculation. "relative" means that the difference between original and
        reconstructed data is divided by the sum of the original data.
    whiten : bool, default=False
        `whiten` parameter of PCA model.
    svd_solver : Literal["auto", "full", "arpack", "randomized"], default="auto"
        `svd_solver` parameter of PCA model.
    tol : float, default=0.0
        `tol` parameter of PCA model.
    iterated_power : int | Literal["auto"], default="auto"
        `iterated_power` parameter of PCA model.
    random_state : int | None, default=None
        `random_state` parameter of PCA model.

    Attributes
    ----------
    pca_ : PCA
        The underlying PCA model.
    offset_ : float
        The offset used for the decision function.

    Examples
    --------
    ```py
    import numpy as np
    from sklego.decomposition import PCAOutlierDetection

    X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])

    pca_model = PCAOutlierDetection(n_components=2, threshold=0.05)
    pca_model.fit(X)
    pca_pred = pca_model.predict(X)
    pca_pred
    # [ 1  1  1 -1 -1  1]
    ```
    """

    def __init__(
        self,
        n_components=None,
        threshold=None,
        variant="relative",
        whiten=False,
        svd_solver="auto",
        tol=0.0,
        iterated_power="auto",
        random_state=None,
    ):
        self.n_components = n_components
        self.threshold = threshold
        self.whiten = whiten
        self.variant = variant
        self.svd_solver = svd_solver
        self.tol = tol
        self.iterated_power = iterated_power
        self.random_state = random_state

    def fit(self, X, y=None):
        """Fit the `PCAOutlierDetection` model using `X` as training data by fitting the underlying PCA model, and
        checking the `threshold` value.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The training data.
        y : array-like of shape (n_samples,) or None, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : PCAOutlierDetection
            The fitted estimator.

        Raises
        ------
        ValueError
            If `threshold` is `None`.
        """
        X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
        if not self.threshold:
            raise ValueError("The `threshold` value cannot be `None`.")

        self.pca_ = PCA(
            n_components=self.n_components,
            whiten=self.whiten,
            svd_solver=self.svd_solver,
            tol=self.tol,
            iterated_power=self.iterated_power,
            random_state=self.random_state,
        )
        self.pca_.fit(X, y)
        self.offset_ = -self.threshold

        self.n_features_in_ = X.shape[1]
        return self

    def difference(self, X):
        """Return the calculated difference between original and reconstructed data. Row by row.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to calculate the difference for.

        Returns
        -------
        array-like of shape (n_samples,)
            The calculated difference.
        """
        check_is_fitted(self, ["pca_", "offset_"])
        reduced = self.pca_.transform(X)
        diff = np.sum(np.abs(self.pca_.inverse_transform(reduced) - X), axis=1)
        if self.variant == "relative":
            diff = diff / X.sum(axis=1)
        return diff

    def decision_function(self, X):
        """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
        (which is the difference between original data and reconstructed data)."""
        return self.threshold - self.difference(X)

    def score_samples(self, X):
        """Calculate the score for the samples"""
        return -self.difference(X)

    def predict(self, X):
        """Predict if a point is an outlier using fitted estimator.

        If the difference between original and reconstructed data is larger than the `threshold`, the point is
        considered an outlier.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to predict.

        Returns
        -------
        array-like of shape (n_samples,)
            The predicted data. 1 for inliers, -1 for outliers.
        """
        X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
        check_is_fitted(self, ["pca_", "offset_"])
        result = np.ones(X.shape[0])
        result[self.difference(X) > self.threshold] = -1
        return result.astype(int)

decision_function(X)

Calculate the decision function as threshold - difference(X), where difference(X) is the row-wise difference between the original and the reconstructed data. Negative values correspond to points that predict labels as outliers.

Source code in sklego/decomposition/pca_reconstruction.py
def decision_function(self, X):
    """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
    (which is the difference between original data and reconstructed data)."""
    return self.threshold - self.difference(X)
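
How `decision_function`, `score_samples` and `offset_` relate can be checked with a few made-up numbers. This is a minimal sketch: the `diff` values are hypothetical stand-ins for what `.difference(X)` might return, and only the arithmetic mirrors the source shown on this page.

```python
import numpy as np

threshold = 0.05
offset = -threshold  # mirrors `offset_ = -threshold` set in fit

# Hypothetical per-row differences, as `.difference(X)` might return them.
diff = np.array([0.01, 0.05, 0.20])

score_samples = -diff        # closer to 0 means better reconstructed
decision = threshold - diff  # equals score_samples - offset

# A negative decision value means the difference exceeded the threshold.
flagged = decision < 0
print(decision, flagged)
```

With these definitions the decision function equals `score_samples - offset_`, the usual convention for scikit-learn outlier detectors.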

difference(X)

Return the difference between the original and the reconstructed data, computed row by row.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | Data to calculate the difference for. | required |

Returns:

| Type | Description |
|---|---|
| array-like of shape (n_samples,) | The calculated difference. |
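
Written out, the source shown on this page computes the following per-row quantities. The symbols are introduced here for illustration only: $x_{ij}$ is feature $j$ of row $i$ and $\hat{x}_{ij}$ is its reconstruction.

```latex
d_i^{\text{abs}} = \sum_{j=1}^{n_{\text{features}}} \left| \hat{x}_{ij} - x_{ij} \right|,
\qquad
d_i^{\text{rel}} = \frac{d_i^{\text{abs}}}{\sum_{j=1}^{n_{\text{features}}} x_{ij}}
```

The denominator of the relative variant is the signed row sum, not a sum of absolute values.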

Source code in sklego/decomposition/pca_reconstruction.py
def difference(self, X):
    """Return the calculated difference between original and reconstructed data. Row by row.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Data to calculate the difference for.

    Returns
    -------
    array-like of shape (n_samples,)
        The calculated difference.
    """
    check_is_fitted(self, ["pca_", "offset_"])
    reduced = self.pca_.transform(X)
    diff = np.sum(np.abs(self.pca_.inverse_transform(reduced) - X), axis=1)
    if self.variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

fit(X, y=None)

Fit the PCAOutlierDetection model using X as training data by fitting the underlying PCA model, and checking the threshold value.

Parameters:

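| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | The training data. | required |
| `y` | array-like of shape (n_samples,) or None | Ignored, present for compatibility. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `self` | PCAOutlierDetection | The fitted estimator. |

Raises:

| Type | Description |
|---|---|
| ValueError | If `threshold` is `None`. Because the check in the source is `if not self.threshold`, any other falsy value such as `0` is rejected as well. |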
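
The guard in `fit` relies on truthiness rather than an explicit `None` comparison. The sketch below copies that condition into a hypothetical helper, `check_threshold` (not part of sklego), to show which values it rejects.

```python
def check_threshold(threshold):
    """Hypothetical helper mimicking the guard in fit: falsy thresholds are rejected."""
    if not threshold:
        raise ValueError("The `threshold` value cannot be `None`.")
    return threshold


rejected = []
for candidate in (None, 0, 0.0, 0.05):
    try:
        check_threshold(candidate)
    except ValueError:
        rejected.append(candidate)

print(rejected)  # [None, 0, 0.0]
```

A threshold of exactly zero therefore raises the same error as a missing one.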

Source code in sklego/decomposition/pca_reconstruction.py
def fit(self, X, y=None):
    """Fit the `PCAOutlierDetection` model using `X` as training data by fitting the underlying PCA model, and
    checking the `threshold` value.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training data.
    y : array-like of shape (n_samples,) or None, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : PCAOutlierDetection
        The fitted estimator.

    Raises
    ------
    ValueError
        If `threshold` is `None`.
    """
    X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
    if not self.threshold:
        raise ValueError("The `threshold` value cannot be `None`.")

    self.pca_ = PCA(
        n_components=self.n_components,
        whiten=self.whiten,
        svd_solver=self.svd_solver,
        tol=self.tol,
        iterated_power=self.iterated_power,
        random_state=self.random_state,
    )
    self.pca_.fit(X, y)
    self.offset_ = -self.threshold

    self.n_features_in_ = X.shape[1]
    return self

predict(X)

Predict whether each point is an outlier using the fitted estimator.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | The data to predict. | required |

Returns:

| Type | Description |
|---|---|
| array-like of shape (n_samples,) | The predicted labels: 1 for inliers, -1 for outliers. |
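
The labelling step itself is small enough to reproduce by hand. In this sketch the `diff` values are hypothetical; the three lines that build `result` follow the source of `predict` shown on this page.

```python
import numpy as np

threshold = 0.2
# Hypothetical differences; the middle one sits exactly on the threshold.
diff = np.array([0.10, 0.20, 0.35])

result = np.ones(diff.shape[0])  # start with every row marked as an inlier
result[diff > threshold] = -1    # strict comparison, as in predict
result = result.astype(int)
print(result)  # [ 1  1 -1]
```

Because the comparison is strict, a row whose difference equals the threshold is still labelled as an inlier.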

Source code in sklego/decomposition/pca_reconstruction.py
def predict(self, X):
    """Predict if a point is an outlier using fitted estimator.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to predict.

    Returns
    -------
    array-like of shape (n_samples,)
        The predicted data. 1 for inliers, -1 for outliers.
    """
    X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
    check_is_fitted(self, ["pca_", "offset_"])
    result = np.ones(X.shape[0])
    result[self.difference(X) > self.threshold] = -1
    return result.astype(int)

score_samples(X)

Calculate the score for the samples as the negated difference(X), so better reconstructed points receive higher scores.

Source code in sklego/decomposition/pca_reconstruction.py
def score_samples(self, X):
    """Calculate the score for the samples"""
    return -self.difference(X)

sklego.decomposition.umap_reconstruction.UMAPOutlierDetection

Bases: BaseEstimator, OutlierMixin

UMAPOutlierDetection is an outlier detector based on the reconstruction error from UMAP.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
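
Both detectors on this page follow the same recipe; only the model that provides `transform` and `inverse_transform` changes. The sketch below shows that shared logic with a hypothetical stand-in reducer, `KeepFirstColumns`, which is not part of sklego or UMAP and exists only to keep the example dependency-free.

```python
import numpy as np


class KeepFirstColumns:
    """Hypothetical stand-in reducer: keeps the first k features, zero-fills the rest."""

    def __init__(self, k):
        self.k = k

    def fit(self, X):
        self.n_features_ = X.shape[1]
        return self

    def transform(self, X):
        return X[:, : self.k]

    def inverse_transform(self, reduced):
        out = np.zeros((reduced.shape[0], self.n_features_))
        out[:, : self.k] = reduced
        return out


def reconstruction_difference(model, X):
    """Row-wise absolute difference between X and its round trip through `model`."""
    reduced = model.transform(X)
    return np.abs(model.inverse_transform(reduced) - X).sum(axis=1)


X = np.array([[1.0, 2.0, 0.0], [1.0, 2.0, 3.0]])
model = KeepFirstColumns(k=2).fit(X)
print(reconstruction_difference(model, X))  # [0. 3.]
```

Any fitted model exposing those two methods could be passed to `reconstruction_difference` in the same way.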

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | int | Number of components of the UMAP model. | `2` |
| `threshold` | float \| None | The threshold used for the decision function. It must be set before calling `fit`. | `None` |
| `variant` | Literal["relative", "absolute"] | The variant used for the difference calculation. "absolute" uses the summed absolute difference between original and reconstructed data for each row; "relative" divides that value by the sum of the original row. | `"relative"` |
| `n_neighbors` | int | `n_neighbors` parameter of the UMAP model. | `15` |
| `min_dist` | float | `min_dist` parameter of the UMAP model. | `0.1` |
| `metric` | str | `metric` parameter of the UMAP model (see the [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) for the full list of available metrics and more information). | `"euclidean"` |
| `random_state` | int \| None | `random_state` parameter of the UMAP model. | `None` |

Attributes:

| Name | Type | Description |
|---|---|---|
| `umap_` | UMAP | The underlying UMAP model. |
| `offset_` | float | The offset used for the decision function, set to `-threshold` during `fit`. |

Examples:

import numpy as np
from sklego.decomposition import UMAPOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [-1, -1, -1], [2, 1, 1], [3, 2, 3]])

umap_model = UMAPOutlierDetection(n_components=2, threshold=0.2, n_neighbors=5)
umap_model.fit(X)
umap_pred = umap_model.predict(X)
umap_pred
# [ 1  1 -1  1 -1 -1]
Source code in sklego/decomposition/umap_reconstruction.py
class UMAPOutlierDetection(BaseEstimator, OutlierMixin):
    """`UMAPOutlierDetection` is an outlier detector based on the reconstruction error from UMAP.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    n_components : int, default=2
        Number of components of the UMAP model.
    threshold : float | None, default=None
        The threshold used for the decision function.
    variant : Literal["relative", "absolute"], default="relative"
        The variant used for the difference calculation. "relative" means that the difference between original and
        reconstructed data is divided by the sum of the original data.
    n_neighbors : int, default=15
        `n_neighbors` parameter of UMAP model.
    min_dist : float, default=0.1
        `min_dist` parameter of UMAP model.
    metric : str, default="euclidean"
        `metric` parameter of UMAP model
        (see [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) for the full list
        of available metrics and more information).
    random_state : int | None, default=None
        `random_state` parameter of UMAP model.

    Attributes
    ----------
    umap_ : UMAP
        The underlying UMAP model.
    offset_ : float
        The offset used for the decision function.

    Examples
    --------
    ```py
    import numpy as np
    from sklego.decomposition import UMAPOutlierDetection

    X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [-1, -1, -1], [2, 1, 1], [3, 2, 3]])

    umap_model = UMAPOutlierDetection(n_components=2, threshold=0.2, n_neighbors=5)
    umap_model.fit(X)
    umap_pred = umap_model.predict(X)
    umap_pred
    # [ 1  1 -1  1 -1 -1]
    ```
    """

    def __init__(
        self,
        n_components=2,
        threshold=None,
        variant="relative",
        n_neighbors=15,
        min_dist=0.1,
        metric="euclidean",
        random_state=None,
    ):
        self.n_components = n_components
        self.threshold = threshold
        self.variant = variant
        self.n_neighbors = n_neighbors
        self.min_dist = min_dist
        self.metric = metric
        self.random_state = random_state

    def fit(self, X, y=None):
        """Fit the `UMAPOutlierDetection` model using `X` as training data by fitting the underlying UMAP model, and
        checking the `threshold` value.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The training data.
        y : array-like of shape (n_samples,) or None, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : UMAPOutlierDetection
            The fitted estimator.

        Raises
        ------
        ValueError
            - If `n_components` is less than 2.
            - If `threshold` is `None`.
        """
        X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
        if y is not None:
            y = check_array(y, estimator=self, ensure_2d=False)

        if not self.threshold:
            raise ValueError("The `threshold` value cannot be `None`.")

        self.umap_ = umap.UMAP(
            n_components=self.n_components,
            n_neighbors=self.n_neighbors,
            min_dist=self.min_dist,
            metric=self.metric,
            random_state=self.random_state,
        )
        self.umap_.fit(X, y)
        self.offset_ = -self.threshold
        self.n_features_in_ = X.shape[1]
        return self

    def difference(self, X):
        """Return the calculated difference between original and reconstructed data. Row by row.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to calculate the difference for.

        Returns
        -------
        array-like of shape (n_samples,)
            The calculated difference.
        """
        check_is_fitted(self, ["umap_", "offset_"])
        reduced = self.umap_.transform(X)
        diff = np.sum(np.abs(self.umap_.inverse_transform(reduced) - X), axis=1)
        if self.variant == "relative":
            diff = diff / X.sum(axis=1)
        return diff

    def predict(self, X):
        """Predict if a point is an outlier using fitted estimator.

        If the difference between original and reconstructed data is larger than the `threshold`, the point is
        considered an outlier.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to predict.

        Returns
        -------
        array-like of shape (n_samples,)
            The predicted data. 1 for inliers, -1 for outliers.
        """
        X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
        check_is_fitted(self, ["umap_", "offset_"])
        result = np.ones(X.shape[0])
        result[self.difference(X) > self.threshold] = -1
        return result.astype(int)

    def decision_function(self, X):
        """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
        (which is the difference between original data and reconstructed data)."""
        return self.threshold - self.difference(X)

    def score_samples(self, X):
        """Calculate the score for the samples"""
        return -self.difference(X)

    def _more_tags(self):
        return {"non_deterministic": True}

decision_function(X)

Calculate the decision function as threshold - difference(X), where difference(X) is the row-wise difference between the original and the reconstructed data. Negative values correspond to points that predict labels as outliers.

Source code in sklego/decomposition/umap_reconstruction.py
def decision_function(self, X):
    """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
    (which is the difference between original data and reconstructed data)."""
    return self.threshold - self.difference(X)

difference(X)

Return the difference between the original and the reconstructed data, computed row by row.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | Data to calculate the difference for. | required |

Returns:

| Type | Description |
|---|---|
| array-like of shape (n_samples,) | The calculated difference. |
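
The "relative" variant divides by the signed sum of each original row, which matters for rows whose features sum to zero or to a negative number. The following sketch uses made-up rows and a made-up reconstruction (`X_hat`) purely to illustrate that arithmetic; it does not call UMAP.

```python
import numpy as np

# Hypothetical original rows and reconstructions, for illustration only.
X = np.array([[1.0, 2.0, 1.0], [-1.0, -2.0, -1.0], [2.0, -1.0, -1.0]])
X_hat = X + 0.1  # pretend every feature is reconstructed 0.1 too high

abs_diff = np.abs(X_hat - X).sum(axis=1)
row_sum = X.sum(axis=1)  # 4, -4 and 0 for these rows

# Silence the division-by-zero warning for the row that sums to 0.
with np.errstate(divide="ignore"):
    rel_diff = abs_diff / row_sum

print(rel_diff)
```

Under these assumptions the second row gets a negative relative difference and can never exceed a positive threshold, while the third row divides by zero and yields an infinite value. Shifting the data to be strictly positive, or using `variant="absolute"`, avoids both effects.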

Source code in sklego/decomposition/umap_reconstruction.py
def difference(self, X):
    """Return the calculated difference between original and reconstructed data. Row by row.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Data to calculate the difference for.

    Returns
    -------
    array-like of shape (n_samples,)
        The calculated difference.
    """
    check_is_fitted(self, ["umap_", "offset_"])
    reduced = self.umap_.transform(X)
    diff = np.sum(np.abs(self.umap_.inverse_transform(reduced) - X), axis=1)
    if self.variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

fit(X, y=None)

Fit the UMAPOutlierDetection model using X as training data by fitting the underlying UMAP model, and checking the threshold value.

Parameters:

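| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | The training data. | required |
| `y` | array-like of shape (n_samples,) or None | Present for API compatibility. Note that the source shown on this page validates a non-`None` `y` and passes it on to the `fit` method of the underlying UMAP model. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `self` | UMAPOutlierDetection | The fitted estimator. |

Raises:

| Type | Description |
|---|---|
| ValueError | If `n_components` is less than 2. |
| ValueError | If `threshold` is `None`. Because the check in the source is `if not self.threshold`, any other falsy value such as `0` is rejected as well. |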
Source code in sklego/decomposition/umap_reconstruction.py
def fit(self, X, y=None):
    """Fit the `UMAPOutlierDetection` model using `X` as training data by fitting the underlying UMAP model, and
    checking the `threshold` value.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training data.
    y : array-like of shape (n_samples,) or None, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : UMAPOutlierDetection
        The fitted estimator.

    Raises
    ------
    ValueError
        - If `n_components` is less than 2.
        - If `threshold` is `None`.
    """
    X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
    if y is not None:
        y = check_array(y, estimator=self, ensure_2d=False)

    if not self.threshold:
        raise ValueError("The `threshold` value cannot be `None`.")

    self.umap_ = umap.UMAP(
        n_components=self.n_components,
        n_neighbors=self.n_neighbors,
        min_dist=self.min_dist,
        metric=self.metric,
        random_state=self.random_state,
    )
    self.umap_.fit(X, y)
    self.offset_ = -self.threshold
    self.n_features_in_ = X.shape[1]
    return self

predict(X)

Predict whether each point is an outlier using the fitted estimator.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | array-like of shape (n_samples, n_features) | The data to predict. | required |

Returns:

| Type | Description |
|---|---|
| array-like of shape (n_samples,) | The predicted labels: 1 for inliers, -1 for outliers. |

Source code in sklego/decomposition/umap_reconstruction.py
def predict(self, X):
    """Predict if a point is an outlier using fitted estimator.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to predict.

    Returns
    -------
    array-like of shape (n_samples,)
        The predicted data. 1 for inliers, -1 for outliers.
    """
    X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)
    check_is_fitted(self, ["umap_", "offset_"])
    result = np.ones(X.shape[0])
    result[self.difference(X) > self.threshold] = -1
    return result.astype(int)

score_samples(X)

Calculate the score for the samples as the negated difference(X), so better reconstructed points receive higher scores.

Source code in sklego/decomposition/umap_reconstruction.py
def score_samples(self, X):
    """Calculate the score for the samples"""
    return -self.difference(X)