Decomposition
sklego.decomposition.pca_reconstruction.PCAOutlierDetection
Bases: BaseEstimator, OutlierMixin
PCAOutlierDetection is an outlier detector based on the reconstruction error from PCA.
If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
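
The reconstruction error described above can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: the helper `pca_reconstruct` is a hypothetical name, and using the row-wise sum of absolute deviations as the error is an assumption.

```python
import numpy as np


def pca_reconstruct(X_train, X, n_components):
    """Project X onto the top principal axes of X_train and map it back.

    Hypothetical helper for illustration only; the estimator itself wraps a PCA model.
    """
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal axes of the centered training data.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    components = vt[:n_components]
    reduced = (X - mean) @ components.T  # shape (n_samples, n_components)
    return reduced @ components + mean   # back in the original feature space


X = np.array(
    [[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]],
    dtype=float,
)
X_hat = pca_reconstruct(X, X, n_components=2)

# Assumption: the error is the row-wise sum of absolute deviations.
reconstruction_error = np.abs(X - X_hat).sum(axis=1)
```

Keeping all three components reproduces `X` exactly, so the error only becomes informative when `n_components` is smaller than the number of features.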
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | int \| None | Number of components of the PCA model. | None |
threshold | float \| None | The threshold used for the decision function. | None |
variant | Literal["relative", "absolute"] | The variant used for the difference calculation. "relative" means that the difference between original and reconstructed data is divided by the sum of the original data. | "relative" |
whiten | bool | | False |
svd_solver | Literal["auto", "full", "arpack", "randomized"] | | "auto" |
tol | float | | 0.0 |
iterated_power | int \| Literal["auto"] | | "auto" |
random_state | int \| None | | None |
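
To make the two `variant` options concrete, here is a small sketch. The function name `row_difference` is hypothetical, and treating the difference as a row-wise sum of absolute deviations is an assumption; only the division by the sum of the original data comes from the description above.

```python
import numpy as np


def row_difference(X, X_hat, variant="relative"):
    """Row-wise difference between original X and its reconstruction X_hat."""
    # Assumption: "difference" means the sum of absolute deviations per row.
    diff = np.abs(X_hat - X).sum(axis=1)
    if variant == "relative":
        # Divide by the sum of the original row, as the description states.
        return diff / X.sum(axis=1)
    if variant == "absolute":
        return diff
    raise ValueError(f"unknown variant: {variant!r}")


X = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 4.0]])
X_hat = np.array([[1.0, 2.0, 2.0], [2.0, 3.0, 4.0]])

absolute = row_difference(X, X_hat, variant="absolute")  # [1.0, 1.0]
relative = row_difference(X, X_hat, variant="relative")  # [1/6, 1/8]
```

Under this reading, the relative variant is sensitive to rows whose values sum to zero or to a negative number, so the absolute variant may be the safer choice for data that is centered around zero.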
Attributes:
Name | Type | Description |
---|---|---|
pca_ | PCA | The underlying PCA model. |
offset_ | float | The offset used for the decision function. |
Examples:
```python
import numpy as np
from sklego.decomposition import PCAOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [1, 1, 1], [2, 1, 1], [3, 2, 3]])

pca_model = PCAOutlierDetection(n_components=2, threshold=0.05)
pca_model.fit(X)
pca_pred = pca_model.predict(X)
pca_pred
# [ 1 1 1 -1 -1 1]
```
Source code in sklego/decomposition/pca_reconstruction.py
decision_function(X)
Calculate the decision function for the data as the difference between threshold and .difference(X) (which is the difference between original data and reconstructed data).
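
As a rough illustration of that relationship, the sketch below reads "the difference between threshold and .difference(X)" as `threshold - difference`. The sign convention and the sample values are assumptions made for the example.

```python
import numpy as np

threshold = 0.05
# Hypothetical per-row differences, standing in for the output of .difference(X).
difference = np.array([0.01, 0.05, 0.20])

# Assumed convention: positive scores lie below the threshold, negative scores above it.
decision = threshold - difference
```

With this convention, a score of exactly zero marks a row whose difference equals the threshold.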
difference(X)
Return the row-by-row difference between original and reconstructed data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | Data to calculate the difference for. | required |
Returns:
Type | Description |
---|---|
array-like of shape (n_samples,) | The calculated difference. |
fit(X, y=None)
Fit the PCAOutlierDetection model using X as training data by fitting the underlying PCA model, and checking the threshold value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The training data. | required |
y | array-like of shape (n_samples,) or None | Ignored, present for compatibility. | None |
Returns:
Name | Type | Description |
---|---|---|
self | PCAOutlierDetection | The fitted estimator. |
Raises:
Type | Description |
---|---|
ValueError | If |
predict(X)
Predict if a point is an outlier using fitted estimator.
If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The data to predict. | required |
Returns:
Type | Description |
---|---|
array-like of shape (n_samples,) | The predicted data. 1 for inliers, -1 for outliers. |
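
The labelling rule can be sketched in a few lines. The difference values below are hypothetical; the rule itself (strictly larger than the threshold means outlier, encoded as -1) follows the description above.

```python
import numpy as np

threshold = 0.05
# Hypothetical per-row differences, standing in for the output of .difference(X).
difference = np.array([0.01, 0.05, 0.20])

# Rows whose difference is strictly larger than the threshold are labelled -1 (outlier).
labels = np.where(difference > threshold, -1, 1)
```

A row whose difference equals the threshold is still treated as an inlier under this reading, because the comparison is strict.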
sklego.decomposition.umap_reconstruction.UMAPOutlierDetection
Bases: BaseEstimator, OutlierMixin
UMAPOutlierDetection is an outlier detector based on the reconstruction error from UMAP.
If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
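
The same reconstruction idea applies to any reducer that can map data down and back up. The sketch below assumes the underlying model exposes `transform` and `inverse_transform`; `ColumnDropReducer` is a deliberately trivial, hypothetical stand-in so the example runs without UMAP installed, and `reconstruction_difference` is likewise an illustrative name.

```python
import numpy as np


class ColumnDropReducer:
    """Hypothetical stand-in for a fitted reducer: keeps only the first k columns."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Remember the training mean of the dropped columns to fill them back in.
        self.fill_ = X[:, self.n_components:].mean(axis=0)
        return self

    def transform(self, X):
        return X[:, : self.n_components]

    def inverse_transform(self, Z):
        filler = np.tile(self.fill_, (Z.shape[0], 1))
        return np.hstack([Z, filler])


def reconstruction_difference(reducer, X):
    """Row-wise sum of absolute deviations after a reduce/reconstruct round trip."""
    X_hat = reducer.inverse_transform(reducer.transform(X))
    return np.abs(X_hat - X).sum(axis=1)


X = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [1.0, 1.0, 9.0]])
reducer = ColumnDropReducer(n_components=2).fit(X)
diff = reconstruction_difference(reducer, X)  # the third row reconstructs worst
```

Rows that the reduced representation cannot reproduce well receive a larger difference, which is what the threshold is then compared against.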
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | int | Number of components of the UMAP model. | 2 |
threshold | float \| None | The threshold used for the decision function. | None |
variant | Literal["relative", "absolute"] | The variant used for the difference calculation. "relative" means that the difference between original and reconstructed data is divided by the sum of the original data. | "relative" |
n_neighbors | int | | 15 |
min_dist | float | | 0.1 |
metric | str | | "euclidean" |
random_state | int \| None | | None |
Attributes:
Name | Type | Description |
---|---|---|
umap_ | UMAP | The underlying UMAP model. |
offset_ | float | The offset used for the decision function. |
Examples:
```python
import numpy as np
from sklego.decomposition import UMAPOutlierDetection

X = np.array([[-1, -1, -1], [-2, -1, -2], [5, -1, 0], [-1, -1, -1], [2, 1, 1], [3, 2, 3]])

umap_model = UMAPOutlierDetection(n_components=2, threshold=0.2, n_neighbors=5)
umap_model.fit(X)
umap_pred = umap_model.predict(X)
umap_pred
# [ 1 1 -1 1 -1 -1]
```
Source code in sklego/decomposition/umap_reconstruction.py
decision_function(X)
Calculate the decision function for the data as the difference between threshold and .difference(X) (which is the difference between original data and reconstructed data).
difference(X)
Return the row-by-row difference between original and reconstructed data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | Data to calculate the difference for. | required |
Returns:
Type | Description |
---|---|
array-like of shape (n_samples,) | The calculated difference. |
fit(X, y=None)
Fit the UMAPOutlierDetection model using X as training data by fitting the underlying UMAP model, and checking the threshold value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The training data. | required |
y | array-like of shape (n_samples,) or None | Ignored, present for compatibility. | None |
Returns:
Name | Type | Description |
---|---|---|
self | UMAPOutlierDetection | The fitted estimator. |
Raises:
Type | Description |
---|---|
ValueError | |
predict(X)
Predict if a point is an outlier using fitted estimator.
If the difference between original and reconstructed data is larger than the threshold, the point is considered an outlier.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The data to predict. | required |
Returns:
Type | Description |
---|---|
array-like of shape (n_samples,) | The predicted data. 1 for inliers, -1 for outliers. |