Decomposition¶

`sklego.decomposition.pca_reconstruction.PCAOutlierDetection` ¶

Bases: OutlierMixin, BaseEstimator

PCAOutlierDetection is an outlier detector based on the reconstruction error from PCA.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

Name	Type	Description	Default
`n_components`	`int \| None`	Number of components of the PCA model.	`None`
`threshold`	`float \| None`	The threshold used for the decision function.	`None`
`variant`	`Literal["relative", "absolute"]`	How the difference is calculated. With "relative", each row's absolute reconstruction error is divided by the sum of that row's original values; with "absolute", the error is used as-is.	`"relative"`
`whiten`	`bool`	`whiten` parameter of PCA model.	`False`
`svd_solver`	`Literal["auto", "full", "arpack", "randomized"]`	`svd_solver` parameter of PCA model.	`"auto"`
`tol`	`float`	`tol` parameter of PCA model.	`0.0`
`iterated_power`	`int \| Literal["auto"]`	`iterated_power` parameter of PCA model.	`"auto"`
`random_state`	`int \| None`	`random_state` parameter of PCA model.	`None`
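
Because the "relative" variant divides by the plain row sum (`X.sum(axis=1)` in the source listing), rows whose features sum to a negative number produce a negative relative difference, which can never exceed a positive `threshold`. The sketch below illustrates this with two rows from the example data; the absolute errors are hypothetical values chosen for illustration, not output from a fitted model.

```python
import numpy as np

# Two rows taken from the example data, plus hypothetical absolute
# reconstruction errors (illustrative values, not from a fitted model).
X = np.array([[-1.0, -1.0, -1.0], [3.0, 2.0, 3.0]])
abs_diff = np.array([0.6, 0.4])

# "relative" divides each row's absolute error by that row's feature sum.
row_sums = X.sum(axis=1)        # [-3., 8.]
rel_diff = abs_diff / row_sums  # [-0.2, 0.05]

threshold = 0.1
flagged_relative = rel_diff > threshold  # [False, False]
flagged_absolute = abs_diff > threshold  # [True, True]
```

If the data contains rows with mixed signs or sums near zero, the "absolute" variant may be the safer choice.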

Attributes:

Name	Type	Description
`pca_`	`PCA`	The underlying PCA model.
`offset_`	`float`	The offset used for the decision function, set to `-threshold` during `fit`.

Examples:

import numpy as np
from sklego.decomposition import PCAOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])

pca_model = PCAOutlierDetection(n_components=2, threshold=0.05)
pca_model.fit(X)
pca_pred = pca_model.predict(X)
pca_pred
# [ 1  1  1 -1 -1  1]

Source code in sklego/decomposition/pca_reconstruction.py

class PCAOutlierDetection(OutlierMixin, BaseEstimator):
    """`PCAOutlierDetection` is an outlier detector based on the reconstruction error from PCA.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    n_components : int | None, default=None
        Number of components of the PCA model.
    threshold : float | None, default=None
        The threshold used for the decision function.
    variant : Literal["relative", "absolute"], default="relative"
        The variant used for the difference calculation. "relative" means that the difference between original and
        reconstructed data is divided by the sum of the original data.
    whiten : bool, default=False
        `whiten` parameter of PCA model.
    svd_solver : Literal["auto", "full", "arpack", "randomized"], default="auto"
        `svd_solver` parameter of PCA model.
    tol : float, default=0.0
        `tol` parameter of PCA model.
    iterated_power : int | Literal["auto"], default="auto"
        `iterated_power` parameter of PCA model.
    random_state : int | None, default=None
        `random_state` parameter of PCA model.

    Attributes
    ----------
    pca_ : PCA
        The underlying PCA model.
    offset_ : float
        The offset used for the decision function.

    Examples
    --------
    ```py
    import numpy as np
    from sklego.decomposition import PCAOutlierDetection

    X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])

    pca_model = PCAOutlierDetection(n_components=2, threshold=0.05)
    pca_model.fit(X)
    pca_pred = pca_model.predict(X)
    pca_pred
    # [ 1  1  1 -1 -1  1]
    ```
    """

    def __init__(
        self,
        n_components=None,
        threshold=None,
        variant="relative",
        whiten=False,
        svd_solver="auto",
        tol=0.0,
        iterated_power="auto",
        random_state=None,
    ):
        self.n_components = n_components
        self.threshold = threshold
        self.whiten = whiten
        self.variant = variant
        self.svd_solver = svd_solver
        self.tol = tol
        self.iterated_power = iterated_power
        self.random_state = random_state

    def fit(self, X, y=None):
        """Fit the `PCAOutlierDetection` model using `X` as training data by fitting the underlying PCA model, and
        checking the `threshold` value.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features )
            The training data.
        y : array-like of shape (n_samples,) or None, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : PCAOutlierDetection
            The fitted estimator.

        Raises
        ------
        ValueError
            If `threshold` is `None`.
        """
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=True)
        if not self.threshold:
            raise ValueError("The `threshold` value cannot be `None`.")

        self.pca_ = PCA(
            n_components=self.n_components,
            whiten=self.whiten,
            svd_solver=self.svd_solver,
            tol=self.tol,
            iterated_power=self.iterated_power,
            random_state=self.random_state,
        )
        self.pca_.fit(X, y)
        self.offset_ = -self.threshold
        return self

    def difference(self, X):
        """Return the calculated difference between original and reconstructed data. Row by row.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features )
            Data to calculate the difference for.

        Returns
        -------
        array-like of shape (n_samples,)
            The calculated difference.
        """
        check_is_fitted(self, ["pca_", "offset_"])
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)

        reduced = self.pca_.transform(X)
        diff = np.sum(np.abs(self.pca_.inverse_transform(reduced) - X), axis=1)
        if self.variant == "relative":
            diff = diff / X.sum(axis=1)
        return diff

    def decision_function(self, X):
        """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
        (which is the difference between original data and reconstructed data)."""
        return self.threshold - self.difference(X)

    def score_samples(self, X):
        """Calculate the score for the samples"""
        return -self.difference(X)

    def predict(self, X):
        """Predict if a point is an outlier using fitted estimator.

        If the difference between original and reconstructed data is larger than the `threshold`, the point is
        considered an outlier.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to predict.

        Returns
        -------
        array-like of shape (n_samples,)
            The predicted data. 1 for inliers, -1 for outliers.
        """
        check_is_fitted(self, ["pca_", "offset_"])
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)
        result = np.ones(X.shape[0])
        result[self.difference(X) > self.threshold] = -1
        return result.astype(int)

`decision_function(X)` ¶

Compute the decision function as `threshold - difference(X)`, where `difference(X)` is the row-wise difference between the original and the reconstructed data. Values below zero correspond to points that `predict` labels as outliers.
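
A minimal sketch of how the sign of the decision function lines up with the outlier rule. The `diff` values are hypothetical stand-ins for what `difference(X)` would return.

```python
import numpy as np

threshold = 0.05
# Hypothetical per-row differences, standing in for `difference(X)`.
diff = np.array([0.01, 0.05, 0.20])

decision = threshold - diff    # approximately [0.04, 0.0, -0.15]
is_outlier = diff > threshold  # the comparison `predict` performs

# Negative decision values coincide with rows flagged as outliers;
# a row sitting exactly on the threshold is still an inlier.
```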

Source code in sklego/decomposition/pca_reconstruction.py

def decision_function(self, X):
    """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
    (which is the difference between original data and reconstructed data)."""
    return self.threshold - self.difference(X)

`difference(X)` ¶

Return the row-wise difference between the original and the reconstructed data.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Data to calculate the difference for.	required

Returns:

Type	Description
`array-like of shape (n_samples,)`	The calculated difference.
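
The computation can be sketched with NumPy alone. This is an approximation under stated assumptions, not the library's implementation: it replaces scikit-learn's `PCA` with a plain SVD of the centered data (equivalent in spirit to `whiten=False`), and `pca_reconstruction_difference` is a hypothetical helper name.

```python
import numpy as np

def pca_reconstruction_difference(X, n_components, variant="relative"):
    """Hypothetical helper: row-wise L1 reconstruction error of a rank-k PCA."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Principal axes are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    reduced = (X - mean) @ components.T          # analogous to `transform`
    reconstructed = reduced @ components + mean  # analogous to `inverse_transform`
    diff = np.abs(reconstructed - X).sum(axis=1)
    if variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])
abs_2 = pca_reconstruction_difference(X, n_components=2, variant="absolute")
abs_3 = pca_reconstruction_difference(X, n_components=3, variant="absolute")
```

Keeping all three components reconstructs the data exactly, so `abs_3` is zero up to floating-point error, while dropping one component leaves a non-negative residual per row in `abs_2`.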

Source code in sklego/decomposition/pca_reconstruction.py

def difference(self, X):
    """Return the calculated difference between original and reconstructed data. Row by row.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features )
        Data to calculate the difference for.

    Returns
    -------
    array-like of shape (n_samples,)
        The calculated difference.
    """
    check_is_fitted(self, ["pca_", "offset_"])
    X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)

    reduced = self.pca_.transform(X)
    diff = np.sum(np.abs(self.pca_.inverse_transform(reduced) - X), axis=1)
    if self.variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

`fit(X, y=None)` ¶

Fit the PCAOutlierDetection model using X as training data by fitting the underlying PCA model, and checking the threshold value.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The training data.	required
`y`	`array-like of shape (n_samples,) or None`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`PCAOutlierDetection`	The fitted estimator.

Raises:

Type	Description
`ValueError`	If `threshold` is `None`. Because the check is `if not self.threshold`, a threshold of `0` raises the same error.
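
The check in the source listing is `if not self.threshold`, which is a truthiness test rather than an `is None` test. The sketch below mirrors that condition in a hypothetical stand-alone function (`check_threshold` is not part of the library) to show which values are rejected.

```python
def check_threshold(threshold):
    """Hypothetical stand-in mirroring the `if not self.threshold` check in `fit`."""
    if not threshold:
        raise ValueError("The `threshold` value cannot be `None`.")
    return threshold

def raises_value_error(value):
    try:
        check_threshold(value)
    except ValueError:
        return True
    return False

# None and both zero values are rejected; a positive threshold passes.
results = {repr(v): raises_value_error(v) for v in (None, 0, 0.0, 0.05)}
```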

Source code in sklego/decomposition/pca_reconstruction.py

def fit(self, X, y=None):
    """Fit the `PCAOutlierDetection` model using `X` as training data by fitting the underlying PCA model, and
    checking the `threshold` value.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features )
        The training data.
    y : array-like of shape (n_samples,) or None, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : PCAOutlierDetection
        The fitted estimator.

    Raises
    ------
    ValueError
        If `threshold` is `None`.
    """
    X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=True)
    if not self.threshold:
        raise ValueError("The `threshold` value cannot be `None`.")

    self.pca_ = PCA(
        n_components=self.n_components,
        whiten=self.whiten,
        svd_solver=self.svd_solver,
        tol=self.tol,
        iterated_power=self.iterated_power,
        random_state=self.random_state,
    )
    self.pca_.fit(X, y)
    self.offset_ = -self.threshold
    return self

`predict(X)` ¶

Predict whether each point is an outlier using the fitted estimator.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to predict.	required

Returns:

Type	Description
`array-like of shape (n_samples,)`	The predicted labels: 1 for inliers, -1 for outliers.
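
The 1 / -1 labels can be turned into a boolean mask to pull the flagged rows out of the data. A small sketch follows; the `labels` array is hypothetical and only imitates the format `predict` returns.

```python
import numpy as np

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0]])
# Hypothetical labels in the format `predict` returns (1 = inlier, -1 = outlier).
labels = np.array([1, 1, -1])

outlier_mask = labels == -1
outliers = X[outlier_mask]   # rows flagged as outliers
inliers = X[~outlier_mask]   # remaining rows
n_outliers = int(outlier_mask.sum())
```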

Source code in sklego/decomposition/pca_reconstruction.py

def predict(self, X):
    """Predict if a point is an outlier using fitted estimator.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to predict.

    Returns
    -------
    array-like of shape (n_samples,)
        The predicted data. 1 for inliers, -1 for outliers.
    """
    check_is_fitted(self, ["pca_", "offset_"])
    X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)
    result = np.ones(X.shape[0])
    result[self.difference(X) > self.threshold] = -1
    return result.astype(int)

`score_samples(X)` ¶

Calculate the score for the samples as the negated `difference(X)`, so that higher scores indicate smaller reconstruction differences.
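
With `offset_ = -threshold` set in `fit`, the three quantities relate as `decision_function = score_samples - offset_`. The sketch below checks this identity on hypothetical difference values.

```python
import numpy as np

threshold = 0.05
offset = -threshold                  # what `fit` stores as `offset_`
diff = np.array([0.01, 0.05, 0.20])  # hypothetical `difference(X)` values

score = -diff                # what `score_samples` returns
decision = threshold - diff  # what `decision_function` returns

# The decision function is the score shifted by the offset.
shifted = score - offset
```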

Source code in sklego/decomposition/pca_reconstruction.py

def score_samples(self, X):
    """Calculate the score for the samples"""
    return -self.difference(X)

`sklego.decomposition.umap_reconstruction.UMAPOutlierDetection` ¶

Bases: OutlierMixin, BaseEstimator

UMAPOutlierDetection is an outlier detector based on the reconstruction error from UMAP.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

Name	Type	Description	Default
`n_components`	`int`	Number of components of the UMAP model.	`2`
`threshold`	`float \| None`	The threshold used for the decision function.	`None`
`variant`	`Literal["relative", "absolute"]`	How the difference is calculated. With "relative", each row's absolute reconstruction error is divided by the sum of that row's original values; with "absolute", the error is used as-is.	`"relative"`
`n_neighbors`	`int`	`n_neighbors` parameter of UMAP model.	`15`
`min_dist`	`float`	`min_dist` parameter of UMAP model.	`0.1`
`metric`	`str`	`metric` parameter of UMAP model (see UMAP documentation for the full list of available metrics and more information).	`"euclidean"`
`random_state`	`int \| None`	`random_state` parameter of UMAP model.	`None`

Attributes:

Name	Type	Description
`umap_`	`UMAP`	The underlying UMAP model.
`offset_`	`float`	The offset used for the decision function, set to `-threshold` during `fit`.

Examples:

import numpy as np
from sklego.decomposition import UMAPOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [-1, -1, -1], [2, 1, 1], [3, 2, 3]])

umap_model = UMAPOutlierDetection(n_components=2, threshold=0.2, n_neighbors=5)
umap_model.fit(X)
umap_pred = umap_model.predict(X)
umap_pred
# [ 1  1 -1  1 -1 -1]

Source code in sklego/decomposition/umap_reconstruction.py

class UMAPOutlierDetection(OutlierMixin, BaseEstimator):
    """`UMAPOutlierDetection` is an outlier detector based on the reconstruction error from UMAP.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    n_components : int, default=2
        Number of components of the UMAP model.
    threshold : float | None, default=None
        The threshold used for the decision function.
    variant : Literal["relative", "absolute"], default="relative"
        The variant used for the difference calculation. "relative" means that the difference between original and
        reconstructed data is divided by the sum of the original data.
    n_neighbors : int, default=15
        `n_neighbors` parameter of UMAP model.
    min_dist : float, default=0.1
        `min_dist` parameter of UMAP model.
    metric : str, default="euclidean"
        `metric` parameter of UMAP model
        (see [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/parameters.html#metric) for the full list
        of available metrics and more information).
    random_state : int | None, default=None
        `random_state` parameter of UMAP model.

    Attributes
    ----------
    umap_ : UMAP
        The underlying UMAP model.
    offset_ : float
        The offset used for the decision function.

    Examples
    --------
    ```py
    import numpy as np
    from sklego.decomposition import UMAPOutlierDetection

    X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [-1, -1, -1], [2, 1, 1], [3, 2, 3]])

    umap_model = UMAPOutlierDetection(n_components=2, threshold=0.2, n_neighbors=5)
    umap_model.fit(X)
    umap_pred = umap_model.predict(X)
    umap_pred
    # [ 1  1 -1  1 -1 -1]
    ```
    """

    def __init__(
        self,
        n_components=2,
        threshold=None,
        variant="relative",
        n_neighbors=15,
        min_dist=0.1,
        metric="euclidean",
        random_state=None,
    ):
        self.n_components = n_components
        self.threshold = threshold
        self.variant = variant
        self.n_neighbors = n_neighbors
        self.min_dist = min_dist
        self.metric = metric
        self.random_state = random_state

    def fit(self, X, y=None):
        """Fit the `UMAPOutlierDetection` model using `X` as training data by fitting the underlying UMAP model, and
        checking the `threshold` value.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features )
            The training data.
        y : array-like of shape (n_samples,) or None, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : UMAPOutlierDetection
            The fitted estimator.

        Raises
        ------
        ValueError
            - If `n_components` is less than 2.
            - If `threshold` is `None`.
        """
        if y is not None:
            X, y = validate_data(self, X=X, y=y, dtype=FLOAT_DTYPES, reset=True)
        else:
            X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=True)

        if not self.threshold:
            raise ValueError("The `threshold` value cannot be `None`.")

        self.umap_ = umap.UMAP(
            n_components=self.n_components,
            n_neighbors=self.n_neighbors,
            min_dist=self.min_dist,
            metric=self.metric,
            random_state=self.random_state,
        )
        self.umap_.fit(X, y)
        self.offset_ = -self.threshold
        return self

    def difference(self, X):
        """Return the calculated difference between original and reconstructed data. Row by row.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features )
            Data to calculate the difference for.

        Returns
        -------
        array-like of shape (n_samples,)
            The calculated difference.
        """
        check_is_fitted(self, ["umap_", "offset_"])
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)

        reduced = self.umap_.transform(X)
        diff = np.sum(np.abs(self.umap_.inverse_transform(reduced) - X), axis=1)
        if self.variant == "relative":
            diff = diff / X.sum(axis=1)
        return diff

    def predict(self, X):
        """Predict if a point is an outlier using fitted estimator.

        If the difference between original and reconstructed data is larger than the `threshold`, the point is
        considered an outlier.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to predict.

        Returns
        -------
        array-like of shape (n_samples,)
            The predicted data. 1 for inliers, -1 for outliers.
        """
        check_is_fitted(self, ["umap_", "offset_"])
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)
        result = np.ones(X.shape[0])
        result[self.difference(X) > self.threshold] = -1
        return result.astype(int)

    def decision_function(self, X):
        """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
        (which is the difference between original data and reconstructed data)."""
        return self.threshold - self.difference(X)

    def score_samples(self, X):
        """Calculate the score for the samples"""
        return -self.difference(X)

    def _more_tags(self):
        return {"non_deterministic": True}

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.non_deterministic = True
        return tags

`decision_function(X)` ¶

Compute the decision function as `threshold - difference(X)`, where `difference(X)` is the row-wise difference between the original and the reconstructed data. Values below zero correspond to points that `predict` labels as outliers.

Source code in sklego/decomposition/umap_reconstruction.py

def decision_function(self, X):
    """Calculate the decision function for the data as the difference between `threshold` and the `.difference(X)`
    (which is the difference between original data and reconstructed data)."""
    return self.threshold - self.difference(X)

`difference(X)` ¶

Return the row-wise difference between the original and the reconstructed data.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Data to calculate the difference for.	required

Returns:

Type	Description
`array-like of shape (n_samples,)`	The calculated difference.
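
The UMAP variant follows the same `transform` / `inverse_transform` round trip as the PCA variant, so the calculation works with any object that exposes those two methods. The sketch below uses a toy reducer so that it runs without `umap-learn`; `TruncatingReducer` and `reconstruction_difference` are hypothetical names, and the toy reducer is not a stand-in for UMAP's actual behavior.

```python
import numpy as np

class TruncatingReducer:
    """Hypothetical toy reducer: keeps the first k features, zero-fills the rest."""

    def __init__(self, k, n_features):
        self.k = k
        self.n_features = n_features

    def transform(self, X):
        return X[:, : self.k]

    def inverse_transform(self, Z):
        out = np.zeros((Z.shape[0], self.n_features))
        out[:, : self.k] = Z
        return out

def reconstruction_difference(reducer, X, variant="relative"):
    """Row-wise L1 error after a transform / inverse_transform round trip."""
    X = np.asarray(X, dtype=float)
    reconstructed = reducer.inverse_transform(reducer.transform(X))
    diff = np.abs(reconstructed - X).sum(axis=1)
    if variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

X = np.array([[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]])
reducer = TruncatingReducer(k=1, n_features=3)
abs_diff = reconstruction_difference(reducer, X, variant="absolute")  # [5., 0.]
rel_diff = reconstruction_difference(reducer, X, variant="relative")  # [5/6, 0.]
```

The second row lies entirely in the retained feature, so it is reconstructed exactly and gets a difference of zero.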

Source code in sklego/decomposition/umap_reconstruction.py

def difference(self, X):
    """Return the calculated difference between original and reconstructed data. Row by row.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features )
        Data to calculate the difference for.

    Returns
    -------
    array-like of shape (n_samples,)
        The calculated difference.
    """
    check_is_fitted(self, ["umap_", "offset_"])
    X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)

    reduced = self.umap_.transform(X)
    diff = np.sum(np.abs(self.umap_.inverse_transform(reduced) - X), axis=1)
    if self.variant == "relative":
        diff = diff / X.sum(axis=1)
    return diff

`fit(X, y=None)` ¶

Fit the UMAPOutlierDetection model using X as training data by fitting the underlying UMAP model, and checking the threshold value.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The training data.	required
`y`	`array-like of shape (n_samples,) or None`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`UMAPOutlierDetection`	The fitted estimator.

Raises:

Type	Description
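`ValueError`	If `n_components` is less than 2, or if `threshold` is `None`. Because the threshold check is `if not self.threshold`, a threshold of `0` raises the same error.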

Source code in sklego/decomposition/umap_reconstruction.py

def fit(self, X, y=None):
    """Fit the `UMAPOutlierDetection` model using `X` as training data by fitting the underlying UMAP model, and
    checking the `threshold` value.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features )
        The training data.
    y : array-like of shape (n_samples,) or None, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : UMAPOutlierDetection
        The fitted estimator.

    Raises
    ------
    ValueError
        - If `n_components` is less than 2.
        - If `threshold` is `None`.
    """
    if y is not None:
        X, y = validate_data(self, X=X, y=y, dtype=FLOAT_DTYPES, reset=True)
    else:
        X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=True)

    if not self.threshold:
        raise ValueError("The `threshold` value cannot be `None`.")

    self.umap_ = umap.UMAP(
        n_components=self.n_components,
        n_neighbors=self.n_neighbors,
        min_dist=self.min_dist,
        metric=self.metric,
        random_state=self.random_state,
    )
    self.umap_.fit(X, y)
    self.offset_ = -self.threshold
    return self

`predict(X)` ¶

Predict whether each point is an outlier using the fitted estimator.

If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to predict.	required

Returns:

Type	Description
`array-like of shape (n_samples,)`	The predicted labels: 1 for inliers, -1 for outliers.

Source code in sklego/decomposition/umap_reconstruction.py

def predict(self, X):
    """Predict if a point is an outlier using fitted estimator.

    If the difference between original and reconstructed data is larger than the `threshold`, the point is
    considered an outlier.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to predict.

    Returns
    -------
    array-like of shape (n_samples,)
        The predicted data. 1 for inliers, -1 for outliers.
    """
    check_is_fitted(self, ["umap_", "offset_"])
    X = validate_data(self, X=X, dtype=FLOAT_DTYPES, reset=False)
    result = np.ones(X.shape[0])
    result[self.difference(X) > self.threshold] = -1
    return result.astype(int)

`score_samples(X)` ¶

Calculate the score for the samples as the negated `difference(X)`, so that higher scores indicate smaller reconstruction differences.

Source code in sklego/decomposition/umap_reconstruction.py

def score_samples(self, X):
    """Calculate the score for the samples"""
    return -self.difference(X)
