Preprocessing¶

`sklego.preprocessing.columncapper.ColumnCapper` ¶

Bases: TransformerMixin, BaseEstimator

The ColumnCapper transformer caps the values of columns according to the given quantile thresholds.

The capping is performed independently for each column of the input data. The quantile thresholds are computed during the fitting phase. The capping is performed during the transformation phase.

Parameters:

Name	Type	Description	Default
`quantile_range`	`Tuple[float, float] \| List[float]`	The quantile ranges to perform the capping. Their values must be in the interval [0; 100].	`(5.0, 95.0)`
`interpolation`	`Literal[linear, lower, higher, midpoint, nearest]`	The interpolation method to compute the quantiles when the desired quantile lies between two data points `i` and `j`. This value is passed to `numpy.nanquantile` function. The Available values are: `"linear"`: `i + (j - i) * fraction`, where `fraction` is the fractional part of the index surrounded by `i` and `j`. `"lower"`: `i`. `"higher"`: `j`. `"nearest"`: `i` or `j` whichever is nearest. `"midpoint"`: (`i` + `j`) / 2.	`"linear"`
`discard_infs`	`bool`	Whether to discard `-np.inf` and `np.inf` values or not. If False, such values will be capped. If True, they will be replaced by `np.nan`. Info Setting `discard_infs=True` is important if the `inf` values are results of divisions by 0, which are interpreted by `pandas` as `-np.inf` or `np.inf` depending on the sign of the numerator.	`False`
`copy`	`bool`	If False, try to avoid a copy and do inplace capping instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.	`True`

Attributes:

Name	Type	Description
`quantiles_`	`np.ndarray of shape (2, n_features)`	The computed quantiles for each column of the input data. The first row contains the lower quantile, the second row contains the upper quantile.
`n_features_in_`	`int`	Number of features seen during `fit`.
`n_columns_`	`int`	Deprecated, please use `n_features_in_` instead.

Examples:

import pandas as pd
import numpy as np
from sklego.preprocessing import ColumnCapper

df = pd.DataFrame({'a':[2, 4.5, 7, 9], 'b':[11, 12, np.inf, 14]})
df
'''
     a     b
0  2.0  11.0
1  4.5  12.0
2  7.0   inf
3  9.0  14.0
'''

capper = ColumnCapper()
capper.fit_transform(df)
'''
array([[ 2.375, 11.1  ],
       [ 4.5  , 12.   ],
       [ 7.   , 13.8  ],
       [ 8.7  , 13.8  ]])
'''

capper = ColumnCapper(discard_infs=True) # Discarding infs
df[['a', 'b']] = capper.fit_transform(df)
df
'''
       a     b
0  2.375  11.1
1  4.500  12.0
2  7.000   NaN
3  8.700  13.8
'''

Source code in sklego/preprocessing/columncapper.py

class ColumnCapper(TransformerMixin, BaseEstimator):
    """The `ColumnCapper` transformer caps the values of columns according to the given quantile thresholds.

    The capping is performed independently for each column of the input data. The quantile thresholds are computed
    during the fitting phase. The capping is performed during the transformation phase.

    Parameters
    ----------
    quantile_range : Tuple[float, float] | List[float], default=(5.0, 95.0)
        The quantile ranges to perform the capping. Their values must be in the interval [0; 100].
    interpolation : Literal["linear", "lower", "higher", "midpoint", "nearest"], default="linear"
        The interpolation method to compute the quantiles when the desired quantile lies between two data points `i`
        and `j`. This value is passed to
        [`numpy.nanquantile`](https://numpy.org/doc/stable/reference/generated/numpy.nanquantile.html) function.

        The Available values are:

        - `"linear"`: `i + (j - i) * fraction`, where `fraction` is the fractional part of the index surrounded by `i`
            and `j`.
        * `"lower"`: `i`.
        * `"higher"`: `j`.
        * `"nearest"`: `i` or `j` whichever is nearest.
        * `"midpoint"`: (`i` + `j`) / 2.
    discard_infs : bool, default=False
        Whether to discard `-np.inf` and `np.inf` values or not. If False, such values will be capped. If True,
        they will be replaced by `np.nan`.

        !!! info
            Setting `discard_infs=True` is important if the `inf` values are results of divisions by 0, which are
            interpreted by `pandas` as `-np.inf` or `np.inf` depending on the sign of the numerator.
    copy : bool, default=True
        If False, try to avoid a copy and do inplace capping instead. This is not guaranteed to always work inplace;
        e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

    Attributes
    ----------
    quantiles_ : np.ndarray of shape (2, n_features)
        The computed quantiles for each column of the input data. The first row contains the lower quantile, the second
        row contains the upper quantile.
    n_features_in_ : int
        Number of features seen during `fit`.
    n_columns_ : int
        Deprecated, please use `n_features_in_` instead.

    Examples
    --------
    ```py
    import pandas as pd
    import numpy as np
    from sklego.preprocessing import ColumnCapper

    df = pd.DataFrame({'a':[2, 4.5, 7, 9], 'b':[11, 12, np.inf, 14]})
    df
    '''
         a     b
    0  2.0  11.0
    1  4.5  12.0
    2  7.0   inf
    3  9.0  14.0
    '''

    capper = ColumnCapper()
    capper.fit_transform(df)
    '''
    array([[ 2.375, 11.1  ],
           [ 4.5  , 12.   ],
           [ 7.   , 13.8  ],
           [ 8.7  , 13.8  ]])
    '''

    capper = ColumnCapper(discard_infs=True) # Discarding infs
    df[['a', 'b']] = capper.fit_transform(df)
    df
    '''
           a     b
    0  2.375  11.1
    1  4.500  12.0
    2  7.000   NaN
    3  8.700  13.8
    '''
    ```
    """

    def __init__(
        self,
        quantile_range=(5.0, 95.0),
        interpolation="linear",
        discard_infs=False,
        copy=True,
    ):
        self._check_quantile_range(quantile_range)
        self._check_interpolation(interpolation)

        self.quantile_range = quantile_range
        self.interpolation = interpolation
        self.discard_infs = discard_infs
        self.copy = copy

    def fit(self, X, y=None):
        """Fit the `ColumnCapper` transformer by computing quantiles for each column of `X`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data used to compute the quantiles for capping.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : ColumnCapper
            The fitted transformer.

        Raises
        ------
        ValueError
            If `X` contains non-numeric columns.
        """
        X = check_array(X, copy=True, force_all_finite=False, dtype=FLOAT_DTYPES, estimator=self)

        # If X contains infs, we need to replace them by nans before computing quantiles
        np.putmask(X, (X == np.inf) | (X == -np.inf), np.nan)

        # There should be no column containing only nan cells at this point. If that's not the case,
        # it means that the user asked ColumnCapper to fit some column containing only nan or inf cells.
        nans_mask = np.isnan(X)
        invalid_columns_mask = nans_mask.sum(axis=0) == X.shape[0]  # Contains as many nans as rows
        if invalid_columns_mask.any():
            raise ValueError("ColumnCapper cannot fit columns containing only inf/nan values")

        q = [quantile_limit / 100 for quantile_limit in self.quantile_range]
        self.quantiles_ = np.nanquantile(a=X, q=q, axis=0, overwrite_input=True, method=self.interpolation)

        # Saving the number of columns to ensure coherence between fit and transform inputs
        self.n_features_in_ = X.shape[1]

        return self

    def transform(self, X):
        """Performs the capping on the column(s) of `X` according to the quantile thresholds computed during fitting.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data for which the capping limit(s) will be applied.

        Returns
        -------
        X : np.ndarray of shape (n_samples, n_features)
            `X` values with capped limits.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """
        check_is_fitted(self, "quantiles_")
        X = check_array(
            X,
            copy=self.copy,
            force_all_finite=False,
            dtype=FLOAT_DTYPES,
            estimator=self,
        )

        if X.shape[1] != self.n_features_in_:
            raise ValueError("X must have the same number of columns in fit and transform")

        if self.discard_infs:
            np.putmask(X, (X == np.inf) | (X == -np.inf), np.nan)

        # Actually capping
        X = np.minimum(X, self.quantiles_[1, :])
        X = np.maximum(X, self.quantiles_[0, :])

        return X

    @staticmethod
    def _check_quantile_range(quantile_range):
        """Checks for the validity of quantile_range.

        Parameters
        ----------
        quantile_range : Tuple[float, float] | List[float]
            The quantile ranges to perform the capping. Their values must be in the interval [0; 100].

        Raises
        ------
        TypeError
            If `quantile_range` is not a tuple or a list.
        ValueError
            - If `quantile_range` does not contain exactly 2 elements.
            - If `quantile_range` contains values outside of [0; 100].
            - If `quantile_range` contains values in the wrong order.
        """
        if not isinstance(quantile_range, tuple) and not isinstance(quantile_range, list):
            raise TypeError("quantile_range must be a tuple or a list")
        if len(quantile_range) != 2:
            raise ValueError("quantile_range must contain 2 elements: min_quantile and max_quantile")

        min_quantile, max_quantile = quantile_range

        for quantile in min_quantile, max_quantile:
            if not isinstance(quantile, float) and not isinstance(quantile, int):
                raise TypeError("min_quantile and max_quantile must be numbers")
            if quantile < 0 or 100 < quantile:
                raise ValueError("min_quantile and max_quantile must be in [0; 100]")

        if min_quantile > max_quantile:
            raise ValueError("min_quantile must be less than or equal to max_quantile")

    @staticmethod
    def _check_interpolation(interpolation):
        """Checks for the validity of interpolation.

        Parameters
        ----------
        interpolation : Literal["linear", "lower", "higher", "midpoint", "nearest"]
            Interpolation method to compute the quantiles

        Raises
        ------
        ValueError
            If `interpolation` is not one of the allowed values.
        """
        allowed_interpolations = ("linear", "lower", "higher", "midpoint", "nearest")
        if interpolation not in allowed_interpolations:
            raise ValueError("Available interpolation methods: {}".format(", ".join(allowed_interpolations)))

    @property
    def n_columns_(self):
        warn(
            "Please use `n_features_in_` instead of `n_columns_`, `n_columns_` will be deprecated in future versions",
            DeprecationWarning,
        )
        return self.n_features_in_

    def _more_tags(self):
        return {"allow_nan": True}

`fit(X, y=None)` ¶

Fit the ColumnCapper transformer by computing quantiles for each column of X.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data used to compute the quantiles for capping.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`ColumnCapper`	The fitted transformer.

Raises:

Type	Description
`ValueError`	If `X` contains non-numeric columns.

Source code in sklego/preprocessing/columncapper.py

def fit(self, X, y=None):
    """Fit the `ColumnCapper` transformer by computing quantiles for each column of `X`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data used to compute the quantiles for capping.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : ColumnCapper
        The fitted transformer.

    Raises
    ------
    ValueError
        If `X` contains non-numeric columns.
    """
    X = check_array(X, copy=True, force_all_finite=False, dtype=FLOAT_DTYPES, estimator=self)

    # If X contains infs, we need to replace them by nans before computing quantiles
    np.putmask(X, (X == np.inf) | (X == -np.inf), np.nan)

    # There should be no column containing only nan cells at this point. If that's not the case,
    # it means that the user asked ColumnCapper to fit some column containing only nan or inf cells.
    nans_mask = np.isnan(X)
    invalid_columns_mask = nans_mask.sum(axis=0) == X.shape[0]  # Contains as many nans as rows
    if invalid_columns_mask.any():
        raise ValueError("ColumnCapper cannot fit columns containing only inf/nan values")

    q = [quantile_limit / 100 for quantile_limit in self.quantile_range]
    self.quantiles_ = np.nanquantile(a=X, q=q, axis=0, overwrite_input=True, method=self.interpolation)

    # Saving the number of columns to ensure coherence between fit and transform inputs
    self.n_features_in_ = X.shape[1]

    return self

`transform(X)` ¶

Performs the capping on the column(s) of X according to the quantile thresholds computed during fitting.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data for which the capping limit(s) will be applied.	required

Returns:

Name	Type	Description
`X`	`np.ndarray of shape (n_samples, n_features)`	`X` values with capped limits.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/columncapper.py

def transform(self, X):
    """Performs the capping on the column(s) of `X` according to the quantile thresholds computed during fitting.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data for which the capping limit(s) will be applied.

    Returns
    -------
    X : np.ndarray of shape (n_samples, n_features)
        `X` values with capped limits.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """
    check_is_fitted(self, "quantiles_")
    X = check_array(
        X,
        copy=self.copy,
        force_all_finite=False,
        dtype=FLOAT_DTYPES,
        estimator=self,
    )

    if X.shape[1] != self.n_features_in_:
        raise ValueError("X must have the same number of columns in fit and transform")

    if self.discard_infs:
        np.putmask(X, (X == np.inf) | (X == -np.inf), np.nan)

    # Actually capping
    X = np.minimum(X, self.quantiles_[1, :])
    X = np.maximum(X, self.quantiles_[0, :])

    return X

`sklego.preprocessing.pandastransformers.ColumnDropper` ¶

Bases: BaseEstimator, TransformerMixin

The ColumnDropper transformer allows dropping specific columns from a DataFrame by name. Can be useful in a sklearn Pipeline.

Parameters:

Name	Type	Description	Default
`columns`	`str \| list[str]`	Column name(s) to be selected.	required

Attributes:

Name	Type	Description
`feature_names_`	`list[str]`	The names of the features to keep during transform.

Notes

Native cross-dataframe support is achieved using Narwhals.

Supported dataframes are:

pandas
Polars (eager or lazy)
Modin
cuDF

See Narwhals docs for an up-to-date list (and to learn how you can add your dataframe library to it!).

Examples:

# Selecting a single column from a pandas DataFrame
import pandas as pd
from sklego.preprocessing import ColumnDropper

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})
ColumnDropper(["name"]).fit_transform(df)
'''
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45
'''

# Dropping multiple columns from a pandas DataFrame
ColumnDropper(["length", "shoesize"]).fit_transform(df)
'''
     name
0    Swen
1  Victor
2    Alex
'''

# Dropping non-existent columns results in a KeyError
ColumnDropper(["weight"]).fit_transform(df)
# Traceback (most recent call last):
#     ...
# KeyError: "['weight'] column(s) not in DataFrame"

# How to use the ColumnSelector in a sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("select", ColumnDropper(["name", "shoesize"])),
    ("scale", StandardScaler()),
])
pipe.fit_transform(df)
# array([[-0.16222142],
#        [ 1.29777137],
#        [-1.13554995]])

Raises:

Type	Description
`TypeError`	If input provided is not a DataFrame.
`KeyError`	If columns provided are not in the input DataFrame.
`ValueError`	If dropping the specified columns would result in an empty output DataFrame.

Source code in sklego/preprocessing/pandastransformers.py

class ColumnDropper(BaseEstimator, TransformerMixin):
    """The `ColumnDropper` transformer allows dropping specific columns from a DataFrame by name.
    Can be useful in a sklearn Pipeline.

    Parameters
    ----------
    columns : str | list[str]
        Column name(s) to be selected.

    Attributes
    ----------
    feature_names_ : list[str]
        The names of the features to keep during transform.

    Notes
    -----
    Native cross-dataframe support is achieved using
    [Narwhals](https://narwhals-dev.github.io/narwhals/){:target="_blank"}.

    Supported dataframes are:

    - pandas
    - Polars (eager or lazy)
    - Modin
    - cuDF

    See [Narwhals docs](https://narwhals-dev.github.io/narwhals/extending/){:target="_blank"} for an up-to-date list
    (and to learn how you can add your dataframe library to it!).

    Examples
    --------
    ```py
    # Selecting a single column from a pandas DataFrame
    import pandas as pd
    from sklego.preprocessing import ColumnDropper

    df = pd.DataFrame({
        "name": ["Swen", "Victor", "Alex"],
        "length": [1.82, 1.85, 1.80],
        "shoesize": [42, 44, 45]
    })
    ColumnDropper(["name"]).fit_transform(df)
    '''
       length  shoesize
    0    1.82        42
    1    1.85        44
    2    1.80        45
    '''

    # Dropping multiple columns from a pandas DataFrame
    ColumnDropper(["length", "shoesize"]).fit_transform(df)
    '''
         name
    0    Swen
    1  Victor
    2    Alex
    '''

    # Dropping non-existent columns results in a KeyError
    ColumnDropper(["weight"]).fit_transform(df)
    # Traceback (most recent call last):
    #     ...
    # KeyError: "['weight'] column(s) not in DataFrame"

    # How to use the ColumnSelector in a sklearn Pipeline
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    pipe = Pipeline([
        ("select", ColumnDropper(["name", "shoesize"])),
        ("scale", StandardScaler()),
    ])
    pipe.fit_transform(df)
    # array([[-0.16222142],
    #        [ 1.29777137],
    #        [-1.13554995]])
    ```

    Raises
    ------
    TypeError
        If input provided is not a DataFrame.
    KeyError
        If columns provided are not in the input DataFrame.
    ValueError
        If dropping the specified columns would result in an empty output DataFrame.
    """

    def __init__(self, columns: list):
        self.columns = columns

    def fit(self, X, y=None):
        """Fit the transformer by storing the column names to keep during `.transform()` step.

        Checks:

        1. If input is a supported DataFrame
        2. If column names are in such DataFrame

        Parameters
        ----------
        X : DataFrame
            The data on which we apply the column selection.
        y : Series, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : ColumnDropper
            The fitted transformer.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame.
        KeyError
            If one or more of the columns provided doesn't exist in the input DataFrame.
        ValueError
            If dropping the specified columns would result in an empty output DataFrame.
        """
        self.columns_ = as_list(self.columns)
        X = nw.from_native(X)
        self._check_column_names(X)
        self.feature_names_ = [x for x in X.columns if x not in self.columns_]
        self._check_column_length()
        return self

    def transform(self, X):
        """Returns a DataFrame with only the specified columns.

        Parameters
        ----------
        X : DataFrame
            The data on which we apply the column selection.

        Returns
        -------
        DataFrame
            The data with the specified columns dropped.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame object.
        """
        check_is_fitted(self, ["feature_names_"])
        X = nw.from_native(X)
        if self.columns_:
            return nw.to_native(X.drop(self.columns_))
        return nw.to_native(X)

    def get_feature_names(self):
        """Alias for `.feature_names_` attribute"""
        return self.feature_names_

    def _check_column_length(self):
        """Check if all columns are dropped"""
        if len(self.feature_names_) == 0:
            raise ValueError(f"Dropping {self.columns_} would result in an empty output DataFrame")

    def _check_column_names(self, X):
        """Check if one or more of the columns provided doesn't exist in the input DataFrame"""
        non_existent_columns = set(self.columns_).difference(X.columns)
        if len(non_existent_columns) > 0:
            raise KeyError(f"{list(non_existent_columns)} column(s) not in DataFrame")

`fit(X, y=None)` ¶

Fit the transformer by storing the column names to keep during .transform() step.

Checks:

If input is a supported DataFrame
If column names are in such DataFrame

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data on which we apply the column selection.	required
`y`	`Series`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`ColumnDropper`	The fitted transformer.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame.
`KeyError`	If one or more of the columns provided doesn't exist in the input DataFrame.
`ValueError`	If dropping the specified columns would result in an empty output DataFrame.

Source code in sklego/preprocessing/pandastransformers.py

def fit(self, X, y=None):
    """Fit the transformer by storing the column names to keep during `.transform()` step.

    Checks:

    1. If input is a supported DataFrame
    2. If column names are in such DataFrame

    Parameters
    ----------
    X : DataFrame
        The data on which we apply the column selection.
    y : Series, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : ColumnDropper
        The fitted transformer.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame.
    KeyError
        If one or more of the columns provided doesn't exist in the input DataFrame.
    ValueError
        If dropping the specified columns would result in an empty output DataFrame.
    """
    self.columns_ = as_list(self.columns)
    X = nw.from_native(X)
    self._check_column_names(X)
    self.feature_names_ = [x for x in X.columns if x not in self.columns_]
    self._check_column_length()
    return self

`get_feature_names()` ¶

Alias for .feature_names_ attribute

Source code in sklego/preprocessing/pandastransformers.py

def get_feature_names(self):
    """Alias for `.feature_names_` attribute"""
    return self.feature_names_

`transform(X)` ¶

Returns a DataFrame with only the specified columns.

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data on which we apply the column selection.	required

Returns:

Type	Description
`DataFrame`	The data with the specified columns dropped.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame object.

Source code in sklego/preprocessing/pandastransformers.py

def transform(self, X):
    """Returns a DataFrame with only the specified columns.

    Parameters
    ----------
    X : DataFrame
        The data on which we apply the column selection.

    Returns
    -------
    DataFrame
        The data with the specified columns dropped.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame object.
    """
    check_is_fitted(self, ["feature_names_"])
    X = nw.from_native(X)
    if self.columns_:
        return nw.to_native(X.drop(self.columns_))
    return nw.to_native(X)

`sklego.preprocessing.pandastransformers.ColumnSelector` ¶

Bases: BaseEstimator, TransformerMixin

The ColumnSelector transformer allows selecting specific columns from a DataFrame by name. Can be useful in a sklearn Pipeline.

Parameters:

Name	Type	Description	Default
`columns`	`str \| list[str]`	Column name(s) to be selected.	required

Notes

Native cross-dataframe support is achieved using Narwhals.

Supported dataframes are:

pandas
Polars (eager or lazy)
Modin
cuDF

See Narwhals docs for an up-to-date list (and to learn how you can add your dataframe library to it!).

Attributes:

Name	Type	Description
`columns_`	`list[str]`	The names of the features to keep during transform.

Examples:

# Selecting a single column from a pandas DataFrame
import pandas as pd
from sklego.preprocessing import ColumnSelector

df_pd = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})

ColumnSelector(["length"]).fit_transform(df_pd)
'''
    length
0    1.82
1    1.85
2    1.80
'''

# Selecting multiple columns from a polars DataFrame
import polars as pl
from sklego.preprocessing import ColumnSelector

df_pl = pl.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})

ColumnSelector(["length", "shoesize"]).fit_transform(df_pl)
'''
shape: (3, 2)
┌────────┬──────────┐
│ length ┆ shoesize │
│ ---    ┆ ---      │
│ f64    ┆ i64      │
╞════════╪══════════╡
│ 1.82   ┆ 42       │
│ 1.85   ┆ 44       │
│ 1.8    ┆ 45       │
└────────┴──────────┘
'''


# Selecting non-existent columns results in a KeyError
ColumnSelector(["weight"]).fit_transform(df_pd)
# Traceback (most recent call last):
#     ...
# KeyError: "['weight'] column(s) not in DataFrame"

# How to use the ColumnSelector in a sklearn Pipeline
import polars as pl
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklego.preprocessing import ColumnSelector

pipe = Pipeline([
    ("select", ColumnSelector(["length"])),
    ("scale", StandardScaler()),
])
pipe.fit_transform(df_pl)
# array([[-0.16222142],
#        [ 1.29777137],
#        [-1.13554995]])

Raises:

Type	Description
`TypeError`	If input provided is not a supported DataFrame.
`KeyError`	If columns provided are not in the input DataFrame.
`ValueError`	If provided list of columns to select is empty and would result in an empty output DataFrame.

Source code in sklego/preprocessing/pandastransformers.py

class ColumnSelector(BaseEstimator, TransformerMixin):
    """The `ColumnSelector` transformer allows selecting specific columns from a DataFrame by name.
    Can be useful in a sklearn Pipeline.

    Parameters
    ----------
    columns : str | list[str]
        Column name(s) to be selected.

    Notes
    -----
    Native cross-dataframe support is achieved using
    [Narwhals](https://narwhals-dev.github.io/narwhals/){:target="_blank"}.

    Supported dataframes are:

    - pandas
    - Polars (eager or lazy)
    - Modin
    - cuDF

    See [Narwhals docs](https://narwhals-dev.github.io/narwhals/extending/){:target="_blank"} for an up-to-date list
    (and to learn how you can add your dataframe library to it!).

    Attributes
    ----------
    columns_ : list[str]
        The names of the features to keep during transform.

    Examples
    --------
    ```py
    # Selecting a single column from a pandas DataFrame
    import pandas as pd
    from sklego.preprocessing import ColumnSelector

    df_pd = pd.DataFrame({
        "name": ["Swen", "Victor", "Alex"],
        "length": [1.82, 1.85, 1.80],
        "shoesize": [42, 44, 45]
    })

    ColumnSelector(["length"]).fit_transform(df_pd)
    '''
        length
    0    1.82
    1    1.85
    2    1.80
    '''

    # Selecting multiple columns from a polars DataFrame
    import polars as pl
    from sklego.preprocessing import ColumnSelector

    df_pl = pl.DataFrame({
        "name": ["Swen", "Victor", "Alex"],
        "length": [1.82, 1.85, 1.80],
        "shoesize": [42, 44, 45]
    })

    ColumnSelector(["length", "shoesize"]).fit_transform(df_pl)
    '''
    shape: (3, 2)
    ┌────────┬──────────┐
    │ length ┆ shoesize │
    │ ---    ┆ ---      │
    │ f64    ┆ i64      │
    ╞════════╪══════════╡
    │ 1.82   ┆ 42       │
    │ 1.85   ┆ 44       │
    │ 1.8    ┆ 45       │
    └────────┴──────────┘
    '''


    # Selecting non-existent columns results in a KeyError
    ColumnSelector(["weight"]).fit_transform(df_pd)
    # Traceback (most recent call last):
    #     ...
    # KeyError: "['weight'] column(s) not in DataFrame"

    # How to use the ColumnSelector in a sklearn Pipeline
    import polars as pl
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklego.preprocessing import ColumnSelector

    pipe = Pipeline([
        ("select", ColumnSelector(["length"])),
        ("scale", StandardScaler()),
    ])
    pipe.fit_transform(df_pl)
    # array([[-0.16222142],
    #        [ 1.29777137],
    #        [-1.13554995]])
    ```

    Raises
    ------
    TypeError
        If input provided is not a supported DataFrame.
    KeyError
        If columns provided are not in the input DataFrame.
    ValueError
        If provided list of columns to select is empty and would result in an empty output DataFrame.
    """

    def __init__(self, columns: list):
        # if the columns parameter is not a list, make it into a list
        self.columns = columns

    def fit(self, X, y=None):
        """Fit the transformer by storing the column names to keep during transform.

        Checks:

        1. If input is a supported DataFrame
        2. If column names are in such DataFrame

        Parameters
        ----------
        X : DataFrame
            The data on which we apply the column selection.
        y : Series, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : ColumnSelector
            The fitted transformer.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame
        KeyError
            If one or more of the columns provided doesn't exist in the input DataFrame.
        ValueError
            If provided list of columns to select is empty and would result in an empty output DataFrame.
        """
        self.columns_ = as_list(self.columns)
        X = nw.from_native(X)
        self._check_column_names(X)
        self._check_column_length()
        return self

    def transform(self, X):
        """Returns a DataFrame with only the specified columns.

        Parameters
        ----------
        X : DataFrame
            The data on which we apply the column selection.

        Returns
        -------
        DataFrame
            The data with the specified columns dropped.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame.
        """
        X = nw.from_native(X)
        if self.columns:
            return nw.to_native(X.select(self.columns_))
        return nw.to_native(X)

    def get_feature_names(self):
        """Alias for `.columns_` attribute"""
        return self.columns_

    def _check_column_length(self):
        """Check if no column is selected"""
        if len(self.columns_) == 0:
            raise ValueError("Expected columns to be at least of length 1, found length of 0 instead")

    def _check_column_names(self, X):
        """Check if one or more of the columns provided doesn't exist in the input DataFrame"""
        non_existent_columns = set(self.columns_).difference(X.columns)
        if len(non_existent_columns) > 0:
            raise KeyError(f"{list(non_existent_columns)} column(s) not in DataFrame")

`fit(X, y=None)` ¶

Fit the transformer by storing the column names to keep during transform.

Checks:

If input is a supported DataFrame
If column names are in such DataFrame

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data on which we apply the column selection.	required
`y`	`Series`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`ColumnSelector`	The fitted transformer.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame
`KeyError`	If one or more of the columns provided doesn't exist in the input DataFrame.
`ValueError`	If provided list of columns to select is empty and would result in an empty output DataFrame.

Source code in sklego/preprocessing/pandastransformers.py

def fit(self, X, y=None):
    """Fit the transformer by storing the column names to keep during transform.

    Checks:

    1. If input is a supported DataFrame
    2. If column names are in such DataFrame

    Parameters
    ----------
    X : DataFrame
        The data on which we apply the column selection.
    y : Series, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : ColumnSelector
        The fitted transformer.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame
    KeyError
        If one or more of the columns provided doesn't exist in the input DataFrame.
    ValueError
        If provided list of columns to select is empty and would result in an empty output DataFrame.
    """
    self.columns_ = as_list(self.columns)
    X = nw.from_native(X)
    self._check_column_names(X)
    self._check_column_length()
    return self

`get_feature_names()` ¶

Alias for .columns_ attribute

Source code in sklego/preprocessing/pandastransformers.py

def get_feature_names(self):
    """Alias for `.columns_` attribute"""
    return self.columns_

`transform(X)` ¶

Returns a DataFrame with only the specified columns.

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data on which we apply the column selection.	required

Returns:

Type	Description
`DataFrame`	The data with the specified columns dropped.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame.

Source code in sklego/preprocessing/pandastransformers.py

def transform(self, X):
    """Returns a DataFrame with only the specified columns.

    Parameters
    ----------
    X : DataFrame
        The data on which we apply the column selection.

    Returns
    -------
    DataFrame
        The data with the specified columns dropped.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame.
    """
    X = nw.from_native(X)
    if self.columns:
        return nw.to_native(X.select(self.columns_))
    return nw.to_native(X)

`sklego.preprocessing.dictmapper.DictMapper` ¶

Bases: TransformerMixin, BaseEstimator

The DictMapper transformer maps the values of columns according to the input mapper dictionary, fall back to the default value if the key is not present in the dictionary.

Parameters:

Name	Type	Description	Default
`mapper`	`dict[..., int]`	The dictionary containing the mapping of the values.	required
`default`	`int`	The value to fall back to if the value is not in the mapper.	required

Attributes:

Name	Type	Description
`n_features_in_`	`int`	Number of features seen during `fit`.
`dim_`	`int`	Deprecated, please use `n_features_in_` instead.

Examples:

import pandas as pd
from sklego.preprocessing.dictmapper import DictMapper
from sklearn.compose import ColumnTransformer

X = pd.DataFrame({
    "city_pop": ["Amsterdam", "Leiden", "Utrecht", "None", "Haarlem"]
})

mapper = {
    "Amsterdam": 1_181_817,
    "Leiden": 130_181,
    "Utrecht": 367_984,
    "Haarlem": 165_396,
}

ct = ColumnTransformer([("dictmapper", DictMapper(mapper, 0), ["city_pop"])])
X_trans = ct.fit_transform(X)
X_trans
# array([[1181817],
#        [ 130181],
#        [ 367984],
#        [      0],
#        [ 165396]])

Source code in sklego/preprocessing/dictmapper.py

class DictMapper(TransformerMixin, BaseEstimator):
    """The `DictMapper` transformer maps the values of columns according to the input `mapper` dictionary, fall back to
    the `default` value if the key is not present in the dictionary.

    Parameters
    ----------
    mapper : dict[..., int]
        The dictionary containing the mapping of the values.
    default : int
        The value to fall back to if the value is not in the mapper.

    Attributes
    ----------
    n_features_in_ : int
        Number of features seen during `fit`.
    dim_ : int
        Deprecated, please use `n_features_in_` instead.

    Examples
    --------
    ```py
    import pandas as pd
    from sklego.preprocessing.dictmapper import DictMapper
    from sklearn.compose import ColumnTransformer

    X = pd.DataFrame({
        "city_pop": ["Amsterdam", "Leiden", "Utrecht", "None", "Haarlem"]
    })

    mapper = {
        "Amsterdam": 1_181_817,
        "Leiden": 130_181,
        "Utrecht": 367_984,
        "Haarlem": 165_396,
    }

    ct = ColumnTransformer([("dictmapper", DictMapper(mapper, 0), ["city_pop"])])
    X_trans = ct.fit_transform(X)
    X_trans
    # array([[1181817],
    #        [ 130181],
    #        [ 367984],
    #        [      0],
    #        [ 165396]])
    ```
    """

    _required_parameters = ["mapper", "default"]

    def __init__(self, mapper, default):
        self.mapper = mapper
        self.default = default

    def fit(self, X, y=None):
        """Checks the input data and records the number of features.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to fit.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : DictMapper
            The fitted transformer.
        """
        X = check_array(
            X,
            copy=True,
            estimator=self,
            force_all_finite=False,
            dtype=None,
            ensure_2d=True,
        )
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        """Performs the mapping on the column(s) of `X`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data for which the mapping will be applied.

        Returns
        -------
        np.ndarray of shape (n_samples, n_features)
            The data with the mapping applied.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """
        check_is_fitted(self, ["n_features_in_"])
        X = check_array(
            X,
            copy=True,
            estimator=self,
            force_all_finite=False,
            dtype=None,
            ensure_2d=True,
        )

        if X.shape[1] != self.n_features_in_:
            raise ValueError(f"number of columns {X.shape[1]} does not match fit size {self.n_features_in_}")
        return np.vectorize(self.mapper.get, otypes=[int])(X, self.default)

    @property
    def dim_(self):
        warn(
            "Please use `n_features_in_` instead of `dim_`, `dim_` will be deprecated in future versions",
            DeprecationWarning,
        )
        return self.n_features_in_

    def _more_tags(self):
        return {"preserves_dtype": None, "allow_nan": True, "no_validation": True}

`fit(X, y=None)` ¶

Checks the input data and records the number of features.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to fit.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`DictMapper`	The fitted transformer.

Source code in sklego/preprocessing/dictmapper.py

def fit(self, X, y=None):
    """Checks the input data and records the number of features.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to fit.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : DictMapper
        The fitted transformer.
    """
    X = check_array(
        X,
        copy=True,
        estimator=self,
        force_all_finite=False,
        dtype=None,
        ensure_2d=True,
    )
    self.n_features_in_ = X.shape[1]
    return self

`transform(X)` ¶

Performs the mapping on the column(s) of X.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data for which the mapping will be applied.	required

Returns:

Type	Description
`np.ndarray of shape (n_samples, n_features)`	The data with the mapping applied.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/dictmapper.py

def transform(self, X):
    """Performs the mapping on the column(s) of `X`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data for which the mapping will be applied.

    Returns
    -------
    np.ndarray of shape (n_samples, n_features)
        The data with the mapping applied.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """
    check_is_fitted(self, ["n_features_in_"])
    X = check_array(
        X,
        copy=True,
        estimator=self,
        force_all_finite=False,
        dtype=None,
        ensure_2d=True,
    )

    if X.shape[1] != self.n_features_in_:
        raise ValueError(f"number of columns {X.shape[1]} does not match fit size {self.n_features_in_}")
    return np.vectorize(self.mapper.get, otypes=[int])(X, self.default)

`sklego.preprocessing.identitytransformer.IdentityTransformer` ¶

Bases: BaseEstimator, TransformerMixin

The IdentityTransformer returns what it is fed. Does not apply any transformation.

The reason for having it is because you can build more expressive pipelines.

Parameters:

Name	Type	Description	Default
`check_X`	`bool`	Whether to validate `X` to be non-empty 2D array of finite values and attempt to cast `X` to float. If disabled, the model/pipeline is expected to handle e.g. missing, non-numeric, or non-finite values.	`False`

Attributes:

Name	Type	Description
`n_samples_`	`int`	The number of samples seen during `fit`.
`n_features_in_`	`int`	The number of features seen during `fit`.
`shape_`	`tuple[int, int]`	Deprecated, please use `n_samples_` and `n_features_in_` instead.

Examples:

import pandas as pd
from sklego.preprocessing import IdentityTransformer

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})

IdentityTransformer().fit_transform(df)
#   name    length  shoesize
# 0 Swen    1.82    42
# 1 Victor  1.85    44
# 2 Alex    1.80    45

#using check_X=True to validate `X` to be non-empty 2D array of finite values and attempt to cast `X` to float
IdentityTransformer(check_X=True).fit_transform(df.drop(columns="name"))
# array([[ 1.82, 42.  ],
#        [ 1.85, 44.  ],
#        [ 1.8 , 45.  ]])

Source code in sklego/preprocessing/identitytransformer.py

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """The `IdentityTransformer` returns what it is fed. Does not apply any transformation.

    The reason for having it is because you can build more expressive pipelines.

    Parameters
    ----------
    check_X : bool, default=False
        Whether to validate `X` to be non-empty 2D array of finite values and attempt to cast `X` to float.
        If disabled, the model/pipeline is expected to handle e.g. missing, non-numeric, or non-finite values.

    Attributes
    ----------
    n_samples_ : int
        The number of samples seen during `fit`.
    n_features_in_ : int
        The number of features seen during `fit`.
    shape_ : tuple[int, int]
        Deprecated, please use `n_samples_` and `n_features_in_` instead.

    Examples
    --------
    ```py
    import pandas as pd
    from sklego.preprocessing import IdentityTransformer

    df = pd.DataFrame({
        "name": ["Swen", "Victor", "Alex"],
        "length": [1.82, 1.85, 1.80],
        "shoesize": [42, 44, 45]
    })

    IdentityTransformer().fit_transform(df)
    #	name	length	shoesize
    # 0	Swen	1.82	42
    # 1	Victor	1.85	44
    # 2	Alex	1.80	45

    #using check_X=True to validate `X` to be non-empty 2D array of finite values and attempt to cast `X` to float
    IdentityTransformer(check_X=True).fit_transform(df.drop(columns="name"))
    # array([[ 1.82, 42.  ],
    #        [ 1.85, 44.  ],
    #        [ 1.8 , 45.  ]])
    ```
    """

    def __init__(self, check_X: bool = False):
        self.check_X = check_X

    def fit(self, X, y=None):
        """Check the input data if `check_X` is enabled and and records its shape.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to fit.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : IdentityTransformer
            The fitted transformer.
        """
        if self.check_X:
            X = check_array(X, copy=True, estimator=self)
        self.n_samples_, self.n_features_in_ = X.shape
        return self

    def transform(self, X):
        """Performs identity "transformation" on `X` - which is no transformation at all.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Input data.

        Returns
        -------
        array-like of shape (n_samples, n_features)
            Unchanged input data.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """
        if self.check_X:
            X = check_array(X, copy=True, estimator=self)
        check_is_fitted(self, "n_features_in_")
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"Wrong shape is passed to transform. Trained on {self.n_features_in_} cols got {X.shape[1]}"
            )
        return X

    @property
    def shape_(self):
        """Returns the shape of the estimator."""
        return (self.n_samples_, self.n_features_in_)

`shape_` `property` ¶

Returns the shape of the estimator.

`fit(X, y=None)` ¶

Check the input data if check_X is enabled and and records its shape.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to fit.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`IdentityTransformer`	The fitted transformer.

Source code in sklego/preprocessing/identitytransformer.py

def fit(self, X, y=None):
    """Check the input data if `check_X` is enabled and and records its shape.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to fit.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : IdentityTransformer
        The fitted transformer.
    """
    if self.check_X:
        X = check_array(X, copy=True, estimator=self)
    self.n_samples_, self.n_features_in_ = X.shape
    return self

`transform(X)` ¶

Performs identity "transformation" on X - which is no transformation at all.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input data.	required

Returns:

Type	Description
`array-like of shape (n_samples, n_features)`	Unchanged input data.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/identitytransformer.py

def transform(self, X):
    """Performs identity "transformation" on `X` - which is no transformation at all.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Input data.

    Returns
    -------
    array-like of shape (n_samples, n_features)
        Unchanged input data.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """
    if self.check_X:
        X = check_array(X, copy=True, estimator=self)
    check_is_fitted(self, "n_features_in_")
    if X.shape[1] != self.n_features_in_:
        raise ValueError(
            f"Wrong shape is passed to transform. Trained on {self.n_features_in_} cols got {X.shape[1]}"
        )
    return X

`sklego.preprocessing.projections.InformationFilter` ¶

Bases: BaseEstimator, TransformerMixin

The InformationFilter transformer uses a variant of the Gram-Schmidt process to filter information out of the dataset.

This can be useful if you want to filter information out of a dataset because of fairness.

To explain how it works: given a training matrix \(X\) that contains columns \(x_1, ..., x_k\). If we assume columns \(x_1\) and \(x_2\) to be the sensitive columns then the information-filter will remove information by applying these transformations:

\[\begin{split} v_1 & = x_1 \\ v_2 & = x_2 - \frac{x_2 v_1}{v_1 v_1} \\ v_3 & = x_3 - \frac{x_k v_1}{v_1 v_1} - \frac{x_2 v_2}{v_2 v_2} \\ & ... \\ v_k & = x_k - \frac{x_k v_1}{v_1 v_1} - \frac{x_2 v_2}{v_2 v_2} \end{split}\]

Concatenating our vectors (but removing the sensitive ones) gives us a new training matrix

\[X_{fair} = [v_3, ..., v_k]\]

Parameters:

Name	Type	Description	Default
`columns`	`int \| str \| Sequence[int] \| Sequence[str]`	The columns to filter out. This can be a sequence of either int (in the case of numpy) or string (in the case of pandas).	required
`alpha`	`float`	Parameter to control how much to filter: `alpha=1` we filter out all information. `alpha=0` we don't apply any filtering. Should be between 0 and 1.	`1.0`

Attributes:

Name	Type	Description
`projection_`	`array-like of shape (n_features, n_features)`	The projection matrix that can be used to filter information out of a dataset.
`col_ids_`	List[int] of length `len(columns)`	The list of column ids of the sensitive columns.

Examples:

import pandas as pd
from sklego.preprocessing import InformationFilter

df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "length": [1.82, 1.85, 1.80],
    "age": [21, 37, 45]
})

InformationFilter(columns=["length", "age"], alpha=0.5).fit_transform(df)
# array([[50.10152483,  3.87905643],
#        [50.26253897, 19.59684308],
#        [52.66084873, 28.06719867]])

Source code in sklego/preprocessing/projections.py

class InformationFilter(BaseEstimator, TransformerMixin):
    r"""The `InformationFilter` transformer uses a variant of the
    [Gram-Schmidt process](https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process) to filter information out of the
    dataset.

    This can be useful if you want to filter information out of a dataset because of fairness.

    To explain how it works: given a training matrix $X$ that contains columns $x_1, ..., x_k$.
    If we assume columns $x_1$ and $x_2$ to be the _sensitive_ columns then the information-filter will remove
    information by applying these transformations:

    $$\begin{split}
       v_1 & = x_1 \\
       v_2 & = x_2 - \frac{x_2 v_1}{v_1 v_1} \\
       v_3 & = x_3 - \frac{x_k v_1}{v_1 v_1} - \frac{x_2 v_2}{v_2 v_2} \\
           & ... \\
       v_k & = x_k - \frac{x_k v_1}{v_1 v_1} - \frac{x_2 v_2}{v_2 v_2}
       \end{split}$$

    Concatenating our vectors (but removing the sensitive ones) gives us a new training matrix

    $$X_{fair} = [v_3, ..., v_k]$$

    Parameters
    ----------
    columns : int | str | Sequence[int] | Sequence[str]
        The columns to filter out. This can be a sequence of either int (in the case of numpy) or string
        (in the case of pandas).
    alpha : float, default=1.0
        Parameter to control how much to filter:

        - `alpha=1` we filter out all information.
        - `alpha=0` we don't apply any filtering.

        Should be between 0 and 1.

    Attributes
    ----------
    projection_ : array-like of shape (n_features, n_features)
        The projection matrix that can be used to filter information out of a dataset.
    col_ids_ : List[int] of length `len(columns)`
        The list of column ids of the sensitive columns.

    Examples
    --------
    ```py
    import pandas as pd
    from sklego.preprocessing import InformationFilter

    df = pd.DataFrame({
        "user_id": [101, 102, 103],
        "length": [1.82, 1.85, 1.80],
        "age": [21, 37, 45]
    })

    InformationFilter(columns=["length", "age"], alpha=0.5).fit_transform(df)
    # array([[50.10152483,  3.87905643],
    #        [50.26253897, 19.59684308],
    #        [52.66084873, 28.06719867]])
    ```
    """

    _required_parameters = ["columns"]

    def __init__(self, columns, alpha=1):
        self.columns = columns
        self.alpha = alpha

    def _check_coltype(self, X):
        """Check if the `columns` type(s) are compatible with `X` type."""
        X_ = nw.from_native(X, strict=False, eager_only=True)
        for col in as_list(self.columns):
            if isinstance(col, str):
                if isinstance(X_, np.ndarray):
                    raise ValueError(f"column {col} is a string but datatype receive is numpy.")
                if isinstance(X_, nw.DataFrame):
                    if col not in X_.columns:
                        raise ValueError(f"column {col} is not in {X_.columns}")
            if isinstance(col, int):
                if col not in range(np.atleast_2d(np.array(X_)).shape[1]):
                    raise ValueError(f"column {col} is out of bounds for input shape {X_.shape}")

    def _col_idx(self, X, name):
        """Get the column index of a column name."""
        if isinstance(name, str):
            if isinstance(X, np.ndarray):
                raise ValueError("You cannot have a column of type string on a numpy input matrix.")
            return {name: i for i, name in enumerate(X.columns)}[name]
        return name

    def _make_v_vectors(self, X, col_ids):
        """Make the v vectors that we will use to filter out information."""
        vs = np.zeros((X.shape[0], len(col_ids)))
        for i, c in enumerate(col_ids):
            vs[:, i] = X[:, col_ids[i]]
            for j in range(0, i):
                vs[:, i] = vs[:, i] - vector_projection(vs[:, i], vs[:, j])
        return vs

    def fit(self, X, y=None):
        """Fit the transformer by learning the projection required to make the dataset orthogonal to sensitive
        columns.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to fit.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : InformationFilter
            The fitted transformer.

        Raises
        ------
        ValueError
            If `columns` type(s) are incompatible with input data `X` type.
        """
        self._check_coltype(X)
        self.col_ids_ = [v if isinstance(v, int) else self._col_idx(X, v) for v in as_list(self.columns)]
        X = check_array(X, estimator=self)
        X_fair = X.copy()
        v_vectors = self._make_v_vectors(X, self.col_ids_)
        # gram smidt process but only on sensitive attributes
        for i, col in enumerate(X_fair.T):
            for v in v_vectors.T:
                X_fair[:, i] = X_fair[:, i] - vector_projection(X_fair[:, i], v)
        # we want to learn matrix P: X P = X_fair
        # this means we first need to create X_fair in order to learn P
        self.projection_, resid, rank, s = np.linalg.lstsq(X, X_fair, rcond=None)
        self.n_features_in_ = X.shape[1]

        return self

    def transform(self, X):
        """Transforms `X` by applying the information filter.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to transform.

        Returns
        -------
        array-like of shape (n_samples, n_features)
            The transformed data.

        Raises
        ------
        ValueError
            If `columns` type(s) are incompatible with input data `X` type.
        """
        check_is_fitted(self, ["projection_", "col_ids_"])
        self._check_coltype(X)
        X = check_array(X, estimator=self)
        # apply the projection and remove the column we won't need
        X_fair = X @ self.projection_
        X_removed = np.delete(X_fair, self.col_ids_, axis=1)
        X_orig = np.delete(X, self.col_ids_, axis=1)
        return self.alpha * np.atleast_2d(X_removed) + (1 - self.alpha) * np.atleast_2d(X_orig)

`fit(X, y=None)` ¶

Fit the transformer by learning the projection required to make the dataset orthogonal to sensitive columns.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to fit.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`InformationFilter`	The fitted transformer.

Raises:

Type	Description
`ValueError`	If `columns` type(s) are incompatible with input data `X` type.

Source code in sklego/preprocessing/projections.py

def fit(self, X, y=None):
    """Fit the transformer by learning the projection required to make the dataset orthogonal to sensitive
    columns.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to fit.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : InformationFilter
        The fitted transformer.

    Raises
    ------
    ValueError
        If `columns` type(s) are incompatible with input data `X` type.
    """
    self._check_coltype(X)
    self.col_ids_ = [v if isinstance(v, int) else self._col_idx(X, v) for v in as_list(self.columns)]
    X = check_array(X, estimator=self)
    X_fair = X.copy()
    v_vectors = self._make_v_vectors(X, self.col_ids_)
    # gram smidt process but only on sensitive attributes
    for i, col in enumerate(X_fair.T):
        for v in v_vectors.T:
            X_fair[:, i] = X_fair[:, i] - vector_projection(X_fair[:, i], v)
    # we want to learn matrix P: X P = X_fair
    # this means we first need to create X_fair in order to learn P
    self.projection_, resid, rank, s = np.linalg.lstsq(X, X_fair, rcond=None)
    self.n_features_in_ = X.shape[1]

    return self

`transform(X)` ¶

Transforms X by applying the information filter.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to transform.	required

Returns:

Type	Description
`array-like of shape (n_samples, n_features)`	The transformed data.

Raises:

Type	Description
`ValueError`	If `columns` type(s) are incompatible with input data `X` type.

Source code in sklego/preprocessing/projections.py

def transform(self, X):
    """Transforms `X` by applying the information filter.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to transform.

    Returns
    -------
    array-like of shape (n_samples, n_features)
        The transformed data.

    Raises
    ------
    ValueError
        If `columns` type(s) are incompatible with input data `X` type.
    """
    check_is_fitted(self, ["projection_", "col_ids_"])
    self._check_coltype(X)
    X = check_array(X, estimator=self)
    # apply the projection and remove the column we won't need
    X_fair = X @ self.projection_
    X_removed = np.delete(X_fair, self.col_ids_, axis=1)
    X_orig = np.delete(X, self.col_ids_, axis=1)
    return self.alpha * np.atleast_2d(X_removed) + (1 - self.alpha) * np.atleast_2d(X_orig)

`sklego.preprocessing.intervalencoder.IntervalEncoder` ¶

Bases: TransformerMixin, BaseEstimator

The IntervalEncoder transformer bends features in X with regards toy.

We take each column in X separately and smooth it towards y using the strategy that is defined in method.

Note that this allows us to make certain features strictly monotonic in your machine learning model if you follow this with an appropriate model.

Parameters:

Name	Type	Description	Default
`n_chunks`	`int`	The number of cuts that makes the interval.	`10`
`span`	`float`	A hyperparameter for the interpolation method, if the method is `"normal"` it resembles the width of the radial basis function used to weigh the points. It is ignored if the method is `"increasing"` or `"decreasing"`.	`1.0`
`method`	`Literal[average, normal, increasing, decreasing]`	The interpolation method used, can be either `"average"`, `"normal"`, `"increasing"` or `"decreasing"`.	`"normal"`

Attributes:

Name	Type	Description
`quantiles_`	`np.ndarray of shape (n_features, n_chunks)`	The quantiles that are used to cut the interval.
`heights_`	`np.ndarray of shape (n_features, n_chunks)`	The heights of the quantiles that are used to cut the interval.
`n_features_in_`	`int`	Number of features seen during `fit`.
`num_cols_`	`int`	Deprecated, please use `n_features_in_` instead.

Source code in sklego/preprocessing/intervalencoder.py

class IntervalEncoder(TransformerMixin, BaseEstimator):
    """The `IntervalEncoder` transformer bends features in `X` with regards to`y`.

    We take each column in `X` separately and smooth it towards `y` using the strategy that is defined in `method`.

    Note that this allows us to make certain features strictly monotonic in your machine learning model if you follow
    this with an appropriate model.

    Parameters
    ----------
    n_chunks : int, default=10
        The number of cuts that makes the interval.
    span : float, default=1.0
        A hyperparameter for the interpolation method, if the method is `"normal"` it resembles the width of the radial
        basis function used to weigh the points. It is ignored if the method is `"increasing"` or `"decreasing"`.
    method : Literal["average", "normal", "increasing", "decreasing"], default="normal"
        The interpolation method used, can be either `"average"`, `"normal"`, `"increasing"` or `"decreasing"`.

    Attributes
    ----------
    quantiles_ : np.ndarray of shape (n_features, n_chunks)
        The quantiles that are used to cut the interval.
    heights_ : np.ndarray of shape (n_features, n_chunks)
        The heights of the quantiles that are used to cut the interval.
    n_features_in_ : int
        Number of features seen during `fit`.
    num_cols_ : int
        Deprecated, please use `n_features_in_` instead.
    """

    _ALLOWED_METHODS = ("average", "normal", "increasing", "decreasing")

    def __init__(self, n_chunks=10, span=1, method="normal"):
        self.span = span
        self.method = method
        self.n_chunks = n_chunks

    def fit(self, X, y):
        """Fit the `IntervalEncoder` transformer by computing interpolation quantiles for each column of `X`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : IntervalEncoder
            The fitted transformer.

        Raises
        ------
        ValueError
            - If `method` is not one of `"average"`, `"normal"`, `"increasing"` or `"decreasing"`.
            - If `n_chunks` is not a positive integer.
            - If `span` is not between 0 and 1.
        """

        if self.method not in self._ALLOWED_METHODS:
            raise ValueError(f"`method` must be in {self._ALLOWED_METHODS}, got `{self.method}`")
        if self.n_chunks <= 0:
            raise ValueError(f"`n_chunks` must be >= 1, received {self.n_chunks}")
        if self.span > 1.0:
            raise ValueError(f"Error, we expect 0 <= span <= 1, received span={self.span}")
        if self.span < 0.0:
            raise ValueError(f"Error, we expect 0 <= span <= 1, received span={self.span}")

        # these two matrices will have shape (columns, quantiles)
        # quantiles indicate where the interval split occurs
        X, y = check_X_y(X, y, estimator=self)
        self.quantiles_ = np.zeros((X.shape[1], self.n_chunks))
        # heights indicate what heights these intervals will have
        self.heights_ = np.zeros((X.shape[1], self.n_chunks))
        self.n_features_in_ = X.shape[1]

        average_func = _mk_average if self.method in ["average", "normal"] else _mk_monotonic_average

        for col in range(X.shape[1]):
            self.quantiles_[col, :] = np.quantile(X[:, col], q=np.linspace(0, 1, self.n_chunks))
            self.heights_[col, :] = average_func(
                X[:, col],
                y,
                self.quantiles_[col, :],
                span=self.span,
                method=self.method,
            )
        return self

    def transform(self, X):
        """Performs smoothing on the column(s) of `X` according to the quantile values computed during fitting.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data for which the smoothing will be applied.

        Returns
        -------
        X : np.ndarray of shape (n_samples, n_features)
            `X` values with smoothed values.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """
        check_is_fitted(self, ["quantiles_", "heights_", "n_features_in_"])
        X = check_array(X, estimator=self)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(f"fitted on {self.n_features_in_} features but received {X.shape[1]}")
        transformed = np.zeros(X.shape)
        for col in range(transformed.shape[1]):
            transformed[:, col] = np.interp(X[:, col], self.quantiles_[col, :], self.heights_[col, :])
        return transformed

    @property
    def num_cols_(self):
        warn(
            "Please use `n_features_in_` instead of `num_cols_`, `num_cols_` will be deprecated in future versions",
            DeprecationWarning,
        )
        return self.n_features_in_

    @property
    def allowed_methods(self):
        warn(
            "Please use `_ALLOWED_METHODS` instead of `allowed_methods`,"
            "`allowed_methods` will be deprecated in future versions",
            DeprecationWarning,
        )
        return self._ALLOWED_METHODS

`fit(X, y)` ¶

Fit the IntervalEncoder transformer by computing interpolation quantiles for each column of X.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Training data.	required
`y`	`array-like of shape (n_samples,)`	Target values.	required

Returns:

Name	Type	Description
`self`	`IntervalEncoder`	The fitted transformer.

Raises:

Type	Description
`ValueError`	If `method` is not one of `"average"`, `"normal"`, `"increasing"` or `"decreasing"`. If `n_chunks` is not a positive integer. If `span` is not between 0 and 1.

Source code in sklego/preprocessing/intervalencoder.py

def fit(self, X, y):
    """Fit the `IntervalEncoder` transformer by computing interpolation quantiles for each column of `X`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Training data.
    y : array-like of shape (n_samples,)
        Target values.

    Returns
    -------
    self : IntervalEncoder
        The fitted transformer.

    Raises
    ------
    ValueError
        - If `method` is not one of `"average"`, `"normal"`, `"increasing"` or `"decreasing"`.
        - If `n_chunks` is not a positive integer.
        - If `span` is not between 0 and 1.
    """

    if self.method not in self._ALLOWED_METHODS:
        raise ValueError(f"`method` must be in {self._ALLOWED_METHODS}, got `{self.method}`")
    if self.n_chunks <= 0:
        raise ValueError(f"`n_chunks` must be >= 1, received {self.n_chunks}")
    if self.span > 1.0:
        raise ValueError(f"Error, we expect 0 <= span <= 1, received span={self.span}")
    if self.span < 0.0:
        raise ValueError(f"Error, we expect 0 <= span <= 1, received span={self.span}")

    # these two matrices will have shape (columns, quantiles)
    # quantiles indicate where the interval split occurs
    X, y = check_X_y(X, y, estimator=self)
    self.quantiles_ = np.zeros((X.shape[1], self.n_chunks))
    # heights indicate what heights these intervals will have
    self.heights_ = np.zeros((X.shape[1], self.n_chunks))
    self.n_features_in_ = X.shape[1]

    average_func = _mk_average if self.method in ["average", "normal"] else _mk_monotonic_average

    for col in range(X.shape[1]):
        self.quantiles_[col, :] = np.quantile(X[:, col], q=np.linspace(0, 1, self.n_chunks))
        self.heights_[col, :] = average_func(
            X[:, col],
            y,
            self.quantiles_[col, :],
            span=self.span,
            method=self.method,
        )
    return self

`transform(X)` ¶

Performs smoothing on the column(s) of X according to the quantile values computed during fitting.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data for which the smoothing will be applied.	required

Returns:

Name	Type	Description
`X`	`np.ndarray of shape (n_samples, n_features)`	`X` values with smoothed values.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/intervalencoder.py

def transform(self, X):
    """Performs smoothing on the column(s) of `X` according to the quantile values computed during fitting.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data for which the smoothing will be applied.

    Returns
    -------
    X : np.ndarray of shape (n_samples, n_features)
        `X` values with smoothed values.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """
    check_is_fitted(self, ["quantiles_", "heights_", "n_features_in_"])
    X = check_array(X, estimator=self)
    if X.shape[1] != self.n_features_in_:
        raise ValueError(f"fitted on {self.n_features_in_} features but received {X.shape[1]}")
    transformed = np.zeros(X.shape)
    for col in range(transformed.shape[1]):
        transformed[:, col] = np.interp(X[:, col], self.quantiles_[col, :], self.heights_[col, :])
    return transformed

`sklego.preprocessing.formulaictransformer.FormulaicTransformer` ¶

Bases: TransformerMixin, BaseEstimator

The FormulaicTransformer offers a method to select the right columns from a dataframe as well as a DSL for transformations.

It is inspired from R formulas. This is can be useful as a first step in the pipeline.

Parameters:

Name	Type	Description	Default
`formula`	`str`	A formulaic-compatible formula. Refer to the formulaic documentation for more details.	required
`return_type`	`Literal[pandas, numpy, sparse]`	The type of the returned matrix. Refer to the formulaic documentation for more details.	`"numpy"`

Attributes:

Name	Type	Description
`formula_`	`Formula`	The parsed formula specification.
`model_spec_`	`ModelSpec`	The parsed model specification.
`n_features_in_`	`int`	Number of features seen during `fit`.

Examples:

import formulaic
import pandas as pd
import numpy as np
from sklego.preprocessing import FormulaicTransformer

df = pd.DataFrame({
    'a': ['A', 'B', 'C'],
    'b': [0.3, 0.1, 0.2],
})

#default type of returned matrix - numpy
FormulaicTransformer("a + b + a:b").fit_transform(df)
# array([[1. , 0. , 0. , 0.3, 0. , 0. ],
#        [1. , 1. , 0. , 0.1, 0.1, 0. ],
#        [1. , 0. , 1. , 0.2, 0. , 0.2]])

#pandas return type
FormulaicTransformer("a + b + a:b", "pandas").fit_transform(df)
#   Intercept       a[T.B]  a[T.C]  b           a[T.B]:b    a[T.C]:b
#0  1.0             0           0       0.3     0.0         0.0
#1  1.0             1           0       0.1     0.1         0.0
#2  1.0             0           1       0.2     0.0         0.2

Source code in sklego/preprocessing/formulaictransformer.py

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    """The `FormulaicTransformer` offers a method to select the right columns from a dataframe as well as a DSL for
    transformations.

    It is inspired from R formulas. This is can be useful as a first step in the pipeline.

    Parameters
    ----------
    formula : str
        A formulaic-compatible formula.
        Refer to the [formulaic documentation](https://matthewwardrop.github.io/formulaic/guides/grammar/) for more
            details.
    return_type : Literal["pandas", "numpy", "sparse"], default="numpy"
        The type of the returned matrix.
        Refer to the [formulaic documentation](https://matthewwardrop.github.io/formulaic/guides/model_specs/) for more
            details.

    Attributes
    ----------
    formula_ : formulaic.Formula
        The parsed formula specification.
    model_spec_ : formulaic.ModelSpec
        The parsed model specification.
    n_features_in_ : int
        Number of features seen during `fit`.

    Examples
    --------
    ```py
    import formulaic
    import pandas as pd
    import numpy as np
    from sklego.preprocessing import FormulaicTransformer

    df = pd.DataFrame({
        'a': ['A', 'B', 'C'],
        'b': [0.3, 0.1, 0.2],
    })

    #default type of returned matrix - numpy
    FormulaicTransformer("a + b + a:b").fit_transform(df)
    # array([[1. , 0. , 0. , 0.3, 0. , 0. ],
    #        [1. , 1. , 0. , 0.1, 0.1, 0. ],
    #        [1. , 0. , 1. , 0.2, 0. , 0.2]])

    #pandas return type
    FormulaicTransformer("a + b + a:b", "pandas").fit_transform(df)
    #	Intercept	a[T.B]	a[T.C]	b	    a[T.B]:b	a[T.C]:b
    #0	1.0	        0	    0	    0.3	    0.0	        0.0
    #1	1.0	        1	    0	    0.1	    0.1	        0.0
    #2	1.0	        0	    1	    0.2	    0.0	        0.2
    ```
    """

    _required_parameters = ["formula"]

    def __init__(self, formula, return_type="numpy"):
        self.formula = formula
        self.return_type = return_type

    def fit(self, X, y=None):
        """Fit the `FormulaicTransformer` to the data by compiling the formula specification into a model spec.

        Parameters
        ----------
        X : pd.DataFrame of (n_samples, n_features)
            The data used to compile model spec.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : FormulaicTransformer
            The fitted transformer.

        Raises
        ------
        ValueError
            If `formula` is not supported.
        """
        self.formula_ = formulaic.Formula.from_spec(self.formula)

        if self.formula_._has_structure:
            raise ValueError(
                f"Formula specification {repr(self.formula_)} results in a structured formula, which is not supported."
            )

        self.model_spec_ = self.formula_.get_model_matrix(X, output=self.return_type).model_spec
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X, y=None):
        """Transform `X` by generating a model matrix from it based on the fit model spec.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            The data for transformation will be applied.
        y: array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        X : array-like of shape (n_samples, n_features), and type `return_type`
            Transformed data.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """

        check_is_fitted(self, ["formula_", "model_spec_", "n_features_in_"])

        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                "`X` must have the same number of columns in fit and transform. "
                f"Expected {self.n_features_in_}, found {X.shape[1]}."
            )

        X_ = self.model_spec_.get_model_matrix(X)
        return X_

`fit(X, y=None)` ¶

Fit the FormulaicTransformer to the data by compiling the formula specification into a model spec.

Parameters:

Name	Type	Description	Default
`X`	`pd.DataFrame of (n_samples, n_features)`	The data used to compile model spec.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`FormulaicTransformer`	The fitted transformer.

Raises:

Type	Description
`ValueError`	If `formula` is not supported.

Source code in sklego/preprocessing/formulaictransformer.py

def fit(self, X, y=None):
    """Fit the `FormulaicTransformer` to the data by compiling the formula specification into a model spec.

    Parameters
    ----------
    X : pd.DataFrame of (n_samples, n_features)
        The data used to compile model spec.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : FormulaicTransformer
        The fitted transformer.

    Raises
    ------
    ValueError
        If `formula` is not supported.
    """
    self.formula_ = formulaic.Formula.from_spec(self.formula)

    if self.formula_._has_structure:
        raise ValueError(
            f"Formula specification {repr(self.formula_)} results in a structured formula, which is not supported."
        )

    self.model_spec_ = self.formula_.get_model_matrix(X, output=self.return_type).model_spec
    self.n_features_in_ = X.shape[1]
    return self

`transform(X, y=None)` ¶

Transform X by generating a model matrix from it based on the fit model spec.

Parameters:

Name	Type	Description	Default
`X`	`pd.DataFrame of shape (n_samples, n_features)`	The data for transformation will be applied.	required
`y`		Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`X`	array-like of shape (n_samples, n_features), and type `return_type`	Transformed data.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/formulaictransformer.py

def transform(self, X, y=None):
    """Transform `X` by generating a model matrix from it based on the fit model spec.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        The data for transformation will be applied.
    y: array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    X : array-like of shape (n_samples, n_features), and type `return_type`
        Transformed data.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """

    check_is_fitted(self, ["formula_", "model_spec_", "n_features_in_"])

    if X.shape[1] != self.n_features_in_:
        raise ValueError(
            "`X` must have the same number of columns in fit and transform. "
            f"Expected {self.n_features_in_}, found {X.shape[1]}."
        )

    X_ = self.model_spec_.get_model_matrix(X)
    return X_

`sklego.preprocessing.monotonicspline.MonotonicSplineTransformer` ¶

Bases: TransformerMixin, BaseEstimator

The MonotonicSplineTransformer integrates the output of the SplineTransformer in an attempt to make monotonic features.

This estimator is heavily inspired by this blogpost by Mate Kadlicsko.

Parameters:

Name	Type	Description	Default
`n_knots`	`int`	The number of knots to use in the spline transformation.	`3`
`degree`	`int`	The polynomial degree to use in the spline transformation	`3`
`knots`	`Literal[uniform, quantile]`	Knots argument of spline transformer	`"uniform"`

Attributes:

Name	Type	Description
`spline_transformer_`	`trained SplineTransformer`
`features_in_`	`int`	The number of features seen in the training data.

Source code in sklego/preprocessing/monotonicspline.py

class MonotonicSplineTransformer(TransformerMixin, BaseEstimator):
    """The `MonotonicSplineTransformer` integrates the output of the `SplineTransformer` in an attempt to make monotonic features.

    This estimator is heavily inspired by [this blogpost](https://matekadlicsko.github.io/posts/monotonic-splines/) by Mate Kadlicsko.

    Parameters
    ----------
    n_knots : int, default=3
        The number of knots to use in the spline transformation.
    degree : int, default=3
        The polynomial degree to use in the spline transformation
    knots : Literal['uniform', 'quantile'], default="uniform"
        Knots argument of spline transformer

    Attributes
    ----------
    spline_transformer_ : trained SplineTransformer
    features_in_ : int
        The number of features seen in the training data.

    """

    def __init__(self, n_knots=3, degree=3, knots="uniform"):
        self.n_knots = n_knots
        self.degree = degree
        self.knots = knots

    def fit(self, X, y=None):
        """Fit the `MonotonicSplineTransformer` transformer by computing the spline transformation of `X`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to transform.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : MonotonicSplineTransformer
            The fitted transformer.

        Raises
        ------
        ValueError
            If `X` contains non-numeric columns.
        """
        X = check_array(X, copy=True, force_all_finite=False, dtype=FLOAT_DTYPES, estimator=self)

        # If X contains infs, we need to replace them by nans before computing quantiles
        self.spline_transformer_ = {
            col: SplineTransformer(n_knots=self.n_knots, degree=self.degree, knots=self.knots).fit(
                X[:, col].reshape(-1, 1)
            )
            for col in range(X.shape[1])
        }
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        """Performs the Ispline transformation on `X`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)

        Returns
        -------
        X : np.ndarray of shape (n_samples, n_out)
            Transformed `X` values.

        Raises
        ------
        ValueError
            If the number of columns from `X` differs from the number of columns when fitting.
        """
        check_is_fitted(self, "spline_transformer_")
        X = check_array(
            X,
            force_all_finite=False,
            dtype=FLOAT_DTYPES,
            estimator=self,
        )
        if X.shape[1] != self.n_features_in_:
            raise ValueError("Number of features going into .transform() do not match number going into .fit().")

        out = []
        for col in range(X.shape[1]):
            out.append(
                np.cumsum(
                    self.spline_transformer_[col].transform(X[:, [col]])[:, ::-1],
                    axis=1,
                )
            )
        return np.concatenate(out, axis=1)

`fit(X, y=None)` ¶

Fit the MonotonicSplineTransformer transformer by computing the spline transformation of X.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to transform.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`MonotonicSplineTransformer`	The fitted transformer.

Raises:

Type	Description
`ValueError`	If `X` contains non-numeric columns.

Source code in sklego/preprocessing/monotonicspline.py

def fit(self, X, y=None):
    """Fit the `MonotonicSplineTransformer` transformer by computing the spline transformation of `X`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to transform.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : MonotonicSplineTransformer
        The fitted transformer.

    Raises
    ------
    ValueError
        If `X` contains non-numeric columns.
    """
    X = check_array(X, copy=True, force_all_finite=False, dtype=FLOAT_DTYPES, estimator=self)

    # If X contains infs, we need to replace them by nans before computing quantiles
    self.spline_transformer_ = {
        col: SplineTransformer(n_knots=self.n_knots, degree=self.degree, knots=self.knots).fit(
            X[:, col].reshape(-1, 1)
        )
        for col in range(X.shape[1])
    }
    self.n_features_in_ = X.shape[1]
    return self

`transform(X)` ¶

Performs the Ispline transformation on X.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`		required

Returns:

Name	Type	Description
`X`	`np.ndarray of shape (n_samples, n_out)`	Transformed `X` values.

Raises:

Type	Description
`ValueError`	If the number of columns from `X` differs from the number of columns when fitting.

Source code in sklego/preprocessing/monotonicspline.py

def transform(self, X):
    """Performs the Ispline transformation on `X`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)

    Returns
    -------
    X : np.ndarray of shape (n_samples, n_out)
        Transformed `X` values.

    Raises
    ------
    ValueError
        If the number of columns from `X` differs from the number of columns when fitting.
    """
    check_is_fitted(self, "spline_transformer_")
    X = check_array(
        X,
        force_all_finite=False,
        dtype=FLOAT_DTYPES,
        estimator=self,
    )
    if X.shape[1] != self.n_features_in_:
        raise ValueError("Number of features going into .transform() do not match number going into .fit().")

    out = []
    for col in range(X.shape[1]):
        out.append(
            np.cumsum(
                self.spline_transformer_[col].transform(X[:, [col]])[:, ::-1],
                axis=1,
            )
        )
    return np.concatenate(out, axis=1)

`sklego.preprocessing.projections.OrthogonalTransformer` ¶

Bases: BaseEstimator, TransformerMixin

The OrthogonalTransformer transforms the columns of a dataframe or numpy array to orthogonal (or orthonormal if normalize=True) matrix.

It learns matrices \(Q, R\) such that \(X = Q \cdot R\), with \(Q\) orthogonal, from which follows \(Q = X \cdot R^{-1}\)

Parameters:

Name	Type	Description	Default
`normalize`	`bool`	Whether or not orthogonal matrix should be orthonormal as well.	`False`

Attributes:

Name	Type	Description
`inv_R_`	`array-like of shape (n_features, n_features)`	The inverse of R of the QR decomposition of `X`.
`normalization_vector_`	`array-like of shape (n_features,)`	The normalization terms to make the orthogonal matrix orthonormal.

Examples:

from sklearn.datasets import make_regression
from sklego.preprocessing import OrthogonalTransformer

# Generate a synthetic dataset
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)

# Instantiate the transformer
transformer = OrthogonalTransformer(normalize=True)

# Fit the pipeline with the training data
transformer.fit(X)

# Transform the data using the fitted transformer
X_transformed = transformer.transform(X)

Source code in sklego/preprocessing/projections.py

class OrthogonalTransformer(BaseEstimator, TransformerMixin):
    r"""The `OrthogonalTransformer` transforms the columns of a dataframe or numpy array to orthogonal (or
    orthonormal if `normalize=True`) matrix.

    It learns matrices $Q, R$ such that $X = Q \cdot R$, with $Q$ orthogonal, from which follows $Q = X \cdot R^{-1}$

    Parameters
    ----------
    normalize : bool, default=False
        Whether or not orthogonal matrix should be orthonormal as well.

    Attributes
    ----------
    inv_R_ : array-like of shape (n_features, n_features)
        The inverse of R of the QR decomposition of `X`.
    normalization_vector_ : array-like of shape (n_features,)
        The normalization terms to make the orthogonal matrix orthonormal.

    Examples
    --------
    ```py
    from sklearn.datasets import make_regression
    from sklego.preprocessing import OrthogonalTransformer

    # Generate a synthetic dataset
    X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)

    # Instantiate the transformer
    transformer = OrthogonalTransformer(normalize=True)

    # Fit the pipeline with the training data
    transformer.fit(X)

    # Transform the data using the fitted transformer
    X_transformed = transformer.transform(X)
    ```
    """

    def __init__(self, normalize=False):
        self.normalize = normalize

    def fit(self, X, y=None):
        """Fit the transformer to the input data by calculating the inverse of R of the QR decomposition of `X`.
        This can be used to calculate the orthogonal projection of `X`.

        If normalization is required, also stores a vector with normalization terms.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to fit.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : OrthogonalTransformer
            The fitted transformer.
        """
        X = check_array(X, estimator=self)

        if not X.shape[0] > 1:
            raise ValueError("Orthogonal transformation not valid for one sample")

        # Q, R such that X = Q*R, with Q orthogonal, from which follows Q = X*inv(R)
        Q, R = np.linalg.qr(X)
        self.inv_R_ = np.linalg.inv(R)

        if self.normalize:
            self.normalization_vector_ = np.linalg.norm(Q, ord=2, axis=0)
        else:
            self.normalization_vector_ = np.ones((X.shape[1],))
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        """Transforms `X` using the fitted inverse of R. Normalizes the result if required.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to transform.

        Returns
        -------
        array-like of shape (n_samples, n_features)
            The transformed data.
        """
        if self.normalize:
            check_is_fitted(self, ["inv_R_", "normalization_vector_"])
        else:
            check_is_fitted(self, ["inv_R_"])

        X = check_array(X, estimator=self)

        return X @ self.inv_R_ / self.normalization_vector_

`fit(X, y=None)` ¶

Fit the transformer to the input data by calculating the inverse of R of the QR decomposition of X. This can be used to calculate the orthogonal projection of X.

If normalization is required, also stores a vector with normalization terms.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to fit.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`OrthogonalTransformer`	The fitted transformer.

Source code in sklego/preprocessing/projections.py

def fit(self, X, y=None):
    """Fit the transformer to the input data by calculating the inverse of R of the QR decomposition of `X`.
    This can be used to calculate the orthogonal projection of `X`.

    If normalization is required, also stores a vector with normalization terms.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to fit.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : OrthogonalTransformer
        The fitted transformer.
    """
    X = check_array(X, estimator=self)

    if not X.shape[0] > 1:
        raise ValueError("Orthogonal transformation not valid for one sample")

    # Q, R such that X = Q*R, with Q orthogonal, from which follows Q = X*inv(R)
    Q, R = np.linalg.qr(X)
    self.inv_R_ = np.linalg.inv(R)

    if self.normalize:
        self.normalization_vector_ = np.linalg.norm(Q, ord=2, axis=0)
    else:
        self.normalization_vector_ = np.ones((X.shape[1],))
    self.n_features_in_ = X.shape[1]
    return self

`transform(X)` ¶

Transforms X using the fitted inverse of R. Normalizes the result if required.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to transform.	required

Returns:

Type	Description
`array-like of shape (n_samples, n_features)`	The transformed data.

Source code in sklego/preprocessing/projections.py

def transform(self, X):
    """Transforms `X` using the fitted inverse of R. Normalizes the result if required.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to transform.

    Returns
    -------
    array-like of shape (n_samples, n_features)
        The transformed data.
    """
    if self.normalize:
        check_is_fitted(self, ["inv_R_", "normalization_vector_"])
    else:
        check_is_fitted(self, ["inv_R_"])

    X = check_array(X, estimator=self)

    return X @ self.inv_R_ / self.normalization_vector_

`sklego.preprocessing.outlier_remover.OutlierRemover` ¶

Bases: TrainOnlyTransformerMixin, BaseEstimator

The OutlierRemover transformer removes outliers (train-time only) using the supplied removal model. The removal model should implement .fit() and .predict() methods.

Parameters:

Name	Type	Description	Default
`outlier_detector`	`scikit-learn compatible estimator`	An outlier detector that implements `.fit()` and `.predict()` methods.	required
`refit`	`bool`	Whether or not to fit the underlying estimator during `OutlierRemover(...).fit()`.	`True`

Attributes:

Name	Type	Description
`estimator_`	`object`	The fitted outlier detector.

Examples:

import numpy as np

from sklearn.ensemble import IsolationForest
from sklego.preprocessing import OutlierRemover

np.random.seed(0)
X = np.random.randn(10000, 2)

isolation_forest = IsolationForest()
isolation_forest.fit(X)
detector_preds = isolation_forest.predict(X)

outlier_remover = OutlierRemover(isolation_forest, refit=True)
outlier_remover.fit(X)

X_trans = outlier_remover.transform_train(X)

Source code in sklego/preprocessing/outlier_remover.py

class OutlierRemover(TrainOnlyTransformerMixin, BaseEstimator):
    """The `OutlierRemover` transformer removes outliers (train-time only) using the supplied removal model. The
    removal model should implement `.fit()` and `.predict()` methods.

    Parameters
    ----------
    outlier_detector : scikit-learn compatible estimator
        An outlier detector that implements `.fit()` and `.predict()` methods.
    refit : bool, default=True
        Whether or not to fit the underlying estimator during `OutlierRemover(...).fit()`.

    Attributes
    ----------
    estimator_ : object
        The fitted outlier detector.

    Examples
    --------
    ```py
    import numpy as np

    from sklearn.ensemble import IsolationForest
    from sklego.preprocessing import OutlierRemover

    np.random.seed(0)
    X = np.random.randn(10000, 2)

    isolation_forest = IsolationForest()
    isolation_forest.fit(X)
    detector_preds = isolation_forest.predict(X)

    outlier_remover = OutlierRemover(isolation_forest, refit=True)
    outlier_remover.fit(X)

    X_trans = outlier_remover.transform_train(X)
    ```
    """

    _required_parameters = ["outlier_detector"]

    def __init__(self, outlier_detector, refit=True):
        self.outlier_detector = outlier_detector
        self.refit = refit

    def fit(self, X, y=None):
        """Fit the estimator on training data `X` and `y` by fitting the underlying outlier detector if `refit` is True.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,), default=None
            Target values.

        Returns
        -------
        self : OutlierRemover
            The fitted transformer.
        """
        self.estimator_ = clone(self.outlier_detector)
        if self.refit:
            super().fit(X, y)
            self.estimator_.fit(X, y)
        return self

    def transform_train(self, X):
        """Removes outliers from `X` using the fitted estimator.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data for which the outliers will be removed.

        Returns
        -------
        np.ndarray of shape (n_not_outliers, n_features)
            The data with the outliers removed, where `n_not_outliers = n_samples - n_outliers`.
        """
        check_is_fitted(self, "estimator_")
        predictions = self.estimator_.predict(X)
        check_array(predictions, estimator=self.outlier_detector, ensure_2d=False)
        return X[predictions != -1]

`fit(X, y=None)` ¶

Fit the estimator on training data X and y by fitting the underlying outlier detector if refit is True.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Training data.	required
`y`	`array-like of shape (n_samples,)`	Target values.	`None`

Returns:

Name	Type	Description
`self`	`OutlierRemover`	The fitted transformer.

Source code in sklego/preprocessing/outlier_remover.py

def fit(self, X, y=None):
    """Fit the estimator on training data `X` and `y` by fitting the underlying outlier detector if `refit` is True.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Training data.
    y : array-like of shape (n_samples,), default=None
        Target values.

    Returns
    -------
    self : OutlierRemover
        The fitted transformer.
    """
    self.estimator_ = clone(self.outlier_detector)
    if self.refit:
        super().fit(X, y)
        self.estimator_.fit(X, y)
    return self

`transform_train(X)` ¶

Removes outliers from X using the fitted estimator.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data for which the outliers will be removed.	required

Returns:

Type	Description
`np.ndarray of shape (n_not_outliers, n_features)`	The data with the outliers removed, where `n_not_outliers = n_samples - n_outliers`.

Source code in sklego/preprocessing/outlier_remover.py

def transform_train(self, X):
    """Removes outliers from `X` using the fitted estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data for which the outliers will be removed.

    Returns
    -------
    np.ndarray of shape (n_not_outliers, n_features)
        The data with the outliers removed, where `n_not_outliers = n_samples - n_outliers`.
    """
    check_is_fitted(self, "estimator_")
    predictions = self.estimator_.predict(X)
    check_array(predictions, estimator=self.outlier_detector, ensure_2d=False)
    return X[predictions != -1]

`sklego.preprocessing.pandastransformers.PandasTypeSelector` ¶

Bases: TypeSelector

Deprecated since version 0.9.0, please use TypeSelector instead

Source code in sklego/preprocessing/pandastransformers.py

class PandasTypeSelector(TypeSelector):
    """
    !!! warning "Deprecated since version 0.9.0, please use TypeSelector instead"
    """

    def __init__(self, include=None, exclude=None):
        warnings.warn(
            "PandasTypeSelector is deprecated and will be removed in a future version. "
            "Please use `from sklego.preprocessing import TypeSelector` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(include=include, exclude=exclude)

`sklego.preprocessing.randomadder.RandomAdder` ¶

Bases: TrainOnlyTransformerMixin, BaseEstimator

The RandomAdder transformer adds random noise to the input data.

This class is designed to be used during the training phase and not for transforming test data. Noise added is sampled from a normal distribution with mean 0 and standard deviation noise.

Parameters:

Name	Type	Description	Default
`noise`	`float`	The standard deviation of the normal distribution from which the noise is sampled.	`1.0`
`random_state`	`int \| None`	The seed used by the random number generator.	`None`

Attributes:

Name	Type	Description
`n_features_in_`	`int`	Number of features seen during `fit`.
`dim_`	`int`	Deprecated, please use `n_features_in_` instead.

Examples:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklego.preprocessing import RandomAdder

# Create a pipeline with the RandomAdder and a LinearRegression model
pipeline = Pipeline([
    ('random_adder', RandomAdder(noise=0.5, random_state=42)),
    ('linear_regression', LinearRegression())
])

# Fit the pipeline with training data
pipeline.fit(X_train, y_train)

# Use the fitted pipeline to make predictions
y_pred = pipeline.predict(X_test)

Source code in sklego/preprocessing/randomadder.py

class RandomAdder(TrainOnlyTransformerMixin, BaseEstimator):
    """The `RandomAdder` transformer adds random noise to the input data.

    This class is designed to be used during the training phase and not for transforming test data.
    Noise added is sampled from a normal distribution with mean 0 and standard deviation `noise`.

    Parameters
    ----------
    noise : float, default=1.0
        The standard deviation of the normal distribution from which the noise is sampled.
    random_state : int | None
        The seed used by the random number generator.

    Attributes
    ----------
    n_features_in_ : int
        Number of features seen during `fit`.
    dim_ : int
        Deprecated, please use `n_features_in_` instead.

    Examples
    --------
    ```py
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklego.preprocessing import RandomAdder

    # Create a pipeline with the RandomAdder and a LinearRegression model
    pipeline = Pipeline([
        ('random_adder', RandomAdder(noise=0.5, random_state=42)),
        ('linear_regression', LinearRegression())
    ])

    # Fit the pipeline with training data
    pipeline.fit(X_train, y_train)

    # Use the fitted pipeline to make predictions
    y_pred = pipeline.predict(X_test)
    ```
    """

    def __init__(self, noise=1, random_state=None):
        self.noise = noise
        self.random_state = random_state

    def fit(self, X, y):
        """Fit the transformer on training data `X` and `y` by checking the input data and record the number of
        input features.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : RandomAdder
            The fitted transformer.
        """
        super().fit(X, y)
        X, y = check_X_y(X, y, estimator=self, dtype=FLOAT_DTYPES)
        self.n_features_in_ = X.shape[1]

        return self

    def transform_train(self, X):
        r"""Transform training data by adding random noise sampled from $N(0, \text{noise})$.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data for which the noise will be added.

        Returns
        -------
        np.ndarray of shape (n_samples, n_features)
            The data with the noise added.
        """
        rs = check_random_state(self.random_state)
        check_is_fitted(self, ["n_features_in_"])

        X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)

        return X + rs.normal(0, self.noise, size=X.shape)

    @property
    def dim_(self):
        warn(
            "Please use `n_features_in_` instead of `dim_`, `dim_` will be deprecated in future versions",
            DeprecationWarning,
        )
        return self.n_features_in_

    def _more_tags(self):
        return {"non_deterministic": True}

`fit(X, y)` ¶

Fit the transformer on training data X and y by checking the input data and record the number of input features.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Training data.	required
`y`	`array-like of shape (n_samples,)`	Target values.	required

Returns:

Name	Type	Description
`self`	`RandomAdder`	The fitted transformer.

Source code in sklego/preprocessing/randomadder.py

def fit(self, X, y):
    """Fit the transformer on training data `X` and `y` by checking the input data and record the number of
    input features.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Training data.
    y : array-like of shape (n_samples,)
        Target values.

    Returns
    -------
    self : RandomAdder
        The fitted transformer.
    """
    super().fit(X, y)
    X, y = check_X_y(X, y, estimator=self, dtype=FLOAT_DTYPES)
    self.n_features_in_ = X.shape[1]

    return self

`transform_train(X)` ¶

Transform training data by adding random noise sampled from \(N(0, \text{noise})\).

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data for which the noise will be added.	required

Returns:

Type	Description
`np.ndarray of shape (n_samples, n_features)`	The data with the noise added.

Source code in sklego/preprocessing/randomadder.py

def transform_train(self, X):
    r"""Transform training data by adding random noise sampled from $N(0, \text{noise})$.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data for which the noise will be added.

    Returns
    -------
    np.ndarray of shape (n_samples, n_features)
        The data with the noise added.
    """
    rs = check_random_state(self.random_state)
    check_is_fitted(self, ["n_features_in_"])

    X = check_array(X, estimator=self, dtype=FLOAT_DTYPES)

    return X + rs.normal(0, self.noise, size=X.shape)

`sklego.preprocessing.repeatingbasis.RepeatingBasisFunction` ¶

Bases: TransformerMixin, BaseEstimator

The RepeatingBasisFunction transformer is designed to be used when the input data has a circular nature.

For example, for days of the week you might face the problem that, conceptually, day 7 is as close to day 6 as it is to day 1. While numerically their distance is different.

This transformer remedies that problem. The transformer selects a column and transforms it with a given number of repeating (radial) basis functions.

Radial basis functions are bell-curve shaped functions which take the original data as input. The basis functions are equally spaced over the input range. The key feature of repeating basis functions is that they are continuous when moving from the max to the min of the input range. As a result these repeating basis functions can capture how close each datapoint is to the center of each repeating basis function, even when the input data has a circular nature.

Parameters:

Name	Type	Description	Default
`column`	`int \| str`	Index or column name of the data to transform. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.	`0`
`remainder`	`Literal[drop, passthrough]`	By default, only the specified column is transformed, and the non-specified columns are dropped. By specifying `remainder="passthrough"`, all remaining columns will be automatically passed through. This subset of columns is concatenated with the output of the transformer.	`"drop"`
`n_periods`	`int`	Number of basis functions to create, i.e., the number of columns that will exit the transformer.	`12`
`input_range`	`Tuple[float, float] \| List[float] \| None`	The values at which the data repeats itself. For example, for days of the week this is (1,7). If `input_range=None` it is inferred from the training data.	`None`
`width`	`float`	Determines the width of the radial basis functions.	`1.0.`

Attributes:

Name	Type	Description
`pipeline_`	`ColumnTransformer`	Fitted `ColumnTransformer` object used to transform data with repeating basis functions.

Examples:

import pandas as pd
from sklego.preprocessing import RepeatingBasisFunction

df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "created_day": [5, 1, 7]
})
RepeatingBasisFunction(column="created_day", input_range=(1,7)).fit_transform(df)
# array([[0.06217652, 0.00432024, 0.16901332, 0.89483932, 0.64118039],
#        [1.        , 0.36787944, 0.01831564, 0.01831564, 0.36787944],
#        [1.        , 0.36787944, 0.01831564, 0.01831564, 0.36787944]])

Source code in sklego/preprocessing/repeatingbasis.py

class RepeatingBasisFunction(TransformerMixin, BaseEstimator):
    """The `RepeatingBasisFunction` transformer is designed to be used when the input data has a circular nature.

    For example, for days of the week you might face the problem that, conceptually, day 7 is as close to day 6 as it is
    to day 1. While numerically their distance is different.

    This transformer remedies that problem. The transformer selects a column and transforms it with a given number of
    repeating (radial) basis functions.

    Radial basis functions are bell-curve shaped functions which take the original data as input. The basis functions
    are equally spaced over the input range. The key feature of repeating basis functions is that they are continuous
    when moving from the max to the min of the input range. As a result these repeating basis functions can capture how
    close each datapoint is to the center of each repeating basis function, even when the input data has a circular
    nature.

    Parameters
    ----------
    column : int | str, default=0
        Index or column name of the data to transform. Integers are interpreted as positional columns, while
        strings can reference DataFrame columns by name.
    remainder : Literal["drop", "passthrough"], default="drop"
        By default, only the specified column is transformed, and the non-specified columns are dropped.
        By specifying `remainder="passthrough"`, all remaining columns will be automatically passed through.
        This subset of columns is concatenated with the output of the transformer.
    n_periods : int, default=12
        Number of basis functions to create, i.e., the number of columns that will exit the transformer.
    input_range : Tuple[float, float] | List[float] | None, default=None
        The values at which the data repeats itself. For example, for days of the week this is (1,7).
        If `input_range=None` it is inferred from the training data.
    width : float, default=1.0.
        Determines the width of the radial basis functions.

    Attributes
    ----------
    pipeline_ : ColumnTransformer
        Fitted `ColumnTransformer` object used to transform data with repeating basis functions.

    Examples
    --------
    ```py
    import pandas as pd
    from sklego.preprocessing import RepeatingBasisFunction

    df = pd.DataFrame({
        "user_id": [101, 102, 103],
        "created_day": [5, 1, 7]
    })
    RepeatingBasisFunction(column="created_day", input_range=(1,7)).fit_transform(df)
    # array([[0.06217652, 0.00432024, 0.16901332, 0.89483932, 0.64118039],
    #        [1.        , 0.36787944, 0.01831564, 0.01831564, 0.36787944],
    #        [1.        , 0.36787944, 0.01831564, 0.01831564, 0.36787944]])
    ```
    """

    def __init__(self, column=0, remainder="drop", n_periods=12, input_range=None, width=1.0):
        self.column = column
        self.remainder = remainder
        self.n_periods = n_periods
        self.input_range = input_range
        self.width = width

    def fit(self, X, y=None):
        """Fit `RepeatingBasisFunction` transformer on input data `X`.
        It uses `sklearn.compose.ColumnTransformer`.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data used to compute the quantiles for capping.
        y : array-like of shape (n_samples,), default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : RepeatingBasisFunction
            The fitted transformer.
        """
        self.pipeline_ = ColumnTransformer(
            [
                (
                    "repeatingbasis",
                    _RepeatingBasisFunction(
                        n_periods=self.n_periods,
                        input_range=self.input_range,
                        width=self.width,
                    ),
                    [self.column],
                )
            ],
            remainder=self.remainder,
        )

        self.pipeline_.fit(X, y)

        return self

    def transform(self, X):
        """Transform input data `X` with fitted `RepeatingBasisFunction` transformer.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to transform.

        Returns
        -------
        X_transformed : array-like of shape (n_samples, n_periods)
            Transformed data.
        """
        check_is_fitted(self, ["pipeline_"])
        return self.pipeline_.transform(X)

`fit(X, y=None)` ¶

Fit RepeatingBasisFunction transformer on input data X. It uses sklearn.compose.ColumnTransformer.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data used to compute the quantiles for capping.	required
`y`	`array-like of shape (n_samples,)`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`RepeatingBasisFunction`	The fitted transformer.

Source code in sklego/preprocessing/repeatingbasis.py

def fit(self, X, y=None):
    """Fit `RepeatingBasisFunction` transformer on input data `X`.
    It uses `sklearn.compose.ColumnTransformer`.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data used to compute the quantiles for capping.
    y : array-like of shape (n_samples,), default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : RepeatingBasisFunction
        The fitted transformer.
    """
    self.pipeline_ = ColumnTransformer(
        [
            (
                "repeatingbasis",
                _RepeatingBasisFunction(
                    n_periods=self.n_periods,
                    input_range=self.input_range,
                    width=self.width,
                ),
                [self.column],
            )
        ],
        remainder=self.remainder,
    )

    self.pipeline_.fit(X, y)

    return self

`transform(X)` ¶

Transform input data X with fitted RepeatingBasisFunction transformer.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	The data to transform.	required

Returns:

Name	Type	Description
`X_transformed`	`array-like of shape (n_samples, n_periods)`	Transformed data.

Source code in sklego/preprocessing/repeatingbasis.py

def transform(self, X):
    """Transform input data `X` with fitted `RepeatingBasisFunction` transformer.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data to transform.

    Returns
    -------
    X_transformed : array-like of shape (n_samples, n_periods)
        Transformed data.
    """
    check_is_fitted(self, ["pipeline_"])
    return self.pipeline_.transform(X)

`sklego.preprocessing.pandastransformers.TypeSelector` ¶

Bases: BaseEstimator, TransformerMixin

The TypeSelector transformer allows to select columns in a DataFrame based on their type. Can be useful in a sklearn Pipeline.

For pandas, it uses pandas.DataFrame.select_dtypes method.
For non-pandas dataframes (e.g. Polars), the following inputs are allowed:
- 'number'
- 'string'
- 'bool'
- 'category'

New in version 0.9.0

Notes

Native cross-dataframe support is achieved using Narwhals.

Supported dataframes are:

pandas
Polars (eager or lazy)
Modin
cuDF

See Narwhals docs for an up-to-date list (and to learn how you can add your dataframe library to it!).

Parameters:

Name	Type	Description	Default
`include`	`scalar or list - like`	Column type(s) to be selected	`None`
`exclude`	`scalar or list - like`	Column type(s) to be excluded from selection	`None`

Attributes:

Name	Type	Description
`feature_names_`	`list[str]`	The names of the features to keep during transform.
`X_dtypes_`	`Series \| dict[str, DType]`	The dtypes of the columns in the input DataFrame.
`!!! warning`		Raises a `TypeError` if input provided is not a DataFrame.

Examples:

import pandas as pd
from sklego.preprocessing import TypeSelector

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})

#Excluding single column
TypeSelector(exclude="int64").fit_transform(df)
#   name    length
#0  Swen    1.82
#1  Victor  1.85
#2  Alex    1.80

#Including multiple columns
TypeSelector(include=["int64", "object"]).fit_transform(df)
#   name    shoesize
#0  Swen    42
#1  Victor  44
#2  Alex    45

Source code in sklego/preprocessing/pandastransformers.py

class TypeSelector(BaseEstimator, TransformerMixin):
    """The `TypeSelector` transformer allows to select columns in a DataFrame based on their type.
    Can be useful in a sklearn Pipeline.

    - For pandas, it uses
      [pandas.DataFrame.select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)
      method.
    - For non-pandas dataframes (e.g. Polars), the following inputs are allowed:

        - 'number'
        - 'string'
        - 'bool'
        - 'category'

    !!! info "New in version 0.9.0"

    Notes
    -----
    Native cross-dataframe support is achieved using
    [Narwhals](https://narwhals-dev.github.io/narwhals/){:target="_blank"}.

    Supported dataframes are:

    - pandas
    - Polars (eager or lazy)
    - Modin
    - cuDF

    See [Narwhals docs](https://narwhals-dev.github.io/narwhals/extending/){:target="_blank"} for an up-to-date list
    (and to learn how you can add your dataframe library to it!).

    Parameters
    ----------
    include : scalar or list-like
        Column type(s) to be selected
    exclude : scalar or list-like
        Column type(s) to be excluded from selection

    Attributes
    ----------
    feature_names_ : list[str]
        The names of the features to keep during transform.
    X_dtypes_ : Series | dict[str, DType]
        The dtypes of the columns in the input DataFrame.

    !!! warning

        Raises a `TypeError` if input provided is not a DataFrame.

    Examples
    --------
    ```py
    import pandas as pd
    from sklego.preprocessing import TypeSelector

    df = pd.DataFrame({
        "name": ["Swen", "Victor", "Alex"],
        "length": [1.82, 1.85, 1.80],
        "shoesize": [42, 44, 45]
    })

    #Excluding single column
    TypeSelector(exclude="int64").fit_transform(df)
    #	name	length
    #0	Swen	1.82
    #1	Victor	1.85
    #2	Alex	1.80

    #Including multiple columns
    TypeSelector(include=["int64", "object"]).fit_transform(df)
    #	name	shoesize
    #0	Swen	42
    #1	Victor	44
    #2	Alex	45
    ```
    """

    def __init__(self, include=None, exclude=None):
        self.include = include
        self.exclude = exclude

    def fit(self, X, y=None):
        """Fit the transformer by saving the column names to keep during transform.

        Parameters
        ----------
        X : DataFrame
            The data on which we apply the column selection.
        y : Series, default=None
            Ignored, present for compatibility.

        Returns
        -------
        self : TypeSelector
            The fitted transformer.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame.
        ValueError
            If provided type(s) results in empty dataframe.
        """
        if (pd := get_pandas()) is not None and isinstance(X, pd.DataFrame):
            self.X_dtypes_ = X.dtypes
            self.feature_names_ = list(X.select_dtypes(include=self.include, exclude=self.exclude).columns)
        else:
            X = nw.from_native(X)
            self.X_dtypes_ = X.schema
            self.feature_names_ = _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=self.X_dtypes_)

        if len(self.feature_names_) == 0:
            raise ValueError("Provided type(s) results in empty dataframe")

        return self

    def get_feature_names(self, *args, **kwargs):
        """Alias for `.feature_names_` attribute"""
        return self.feature_names_

    def transform(self, X):
        """Returns a DataFrame with columns (de)selected based on their dtype.

        Parameters
        ----------
        X : DataFrame
            The data to select dtype for.

        Returns
        -------
        DataFrame
            The data with the specified columns selected.

        Raises
        ------
        TypeError
            If `X` is not a supported DataFrame.
        ValueError
            If column dtypes were not equal during fit and transform.
        """
        check_is_fitted(self, ["X_dtypes_", "feature_names_"])

        if (pd := get_pandas()) is not None and isinstance(X, pd.DataFrame):
            try:
                if (self.X_dtypes_ != X.dtypes).any():
                    raise ValueError(
                        f"Column dtypes were not equal during fit and transform. Fit types: \n"
                        f"{self.X_dtypes_}\n"
                        f"transform: \n"
                        f"{X.dtypes}"
                    )
            except ValueError as e:
                raise ValueError("Column dtypes were not equal during fit and transform") from e
            transformed_df = X.select_dtypes(include=self.include, exclude=self.exclude)
        else:
            X = nw.from_native(X)
            X_schema = X.schema
            if self.X_dtypes_ != X_schema:
                raise ValueError(
                    f"Column dtypes were not equal during fit and transform. Fit types: \n"
                    f"{self.X_dtypes_}\n"
                    f"transform: \n"
                    f"{X.schema}"
                )
            transformed_df = X.select(
                _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=X_schema)
            ).pipe(nw.to_native)

        return transformed_df

`fit(X, y=None)` ¶

Fit the transformer by saving the column names to keep during transform.

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data on which we apply the column selection.	required
`y`	`Series`	Ignored, present for compatibility.	`None`

Returns:

Name	Type	Description
`self`	`TypeSelector`	The fitted transformer.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame.
`ValueError`	If provided type(s) results in empty dataframe.

Source code in sklego/preprocessing/pandastransformers.py

def fit(self, X, y=None):
    """Fit the transformer by saving the column names to keep during transform.

    Parameters
    ----------
    X : DataFrame
        The data on which we apply the column selection.
    y : Series, default=None
        Ignored, present for compatibility.

    Returns
    -------
    self : TypeSelector
        The fitted transformer.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame.
    ValueError
        If provided type(s) results in empty dataframe.
    """
    if (pd := get_pandas()) is not None and isinstance(X, pd.DataFrame):
        self.X_dtypes_ = X.dtypes
        self.feature_names_ = list(X.select_dtypes(include=self.include, exclude=self.exclude).columns)
    else:
        X = nw.from_native(X)
        self.X_dtypes_ = X.schema
        self.feature_names_ = _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=self.X_dtypes_)

    if len(self.feature_names_) == 0:
        raise ValueError("Provided type(s) results in empty dataframe")

    return self

`get_feature_names(*args, **kwargs)` ¶

Alias for .feature_names_ attribute

Source code in sklego/preprocessing/pandastransformers.py

def get_feature_names(self, *args, **kwargs):
    """Alias for `.feature_names_` attribute"""
    return self.feature_names_

`transform(X)` ¶

Returns a DataFrame with columns (de)selected based on their dtype.

Parameters:

Name	Type	Description	Default
`X`	`DataFrame`	The data to select dtype for.	required

Returns:

Type	Description
`DataFrame`	The data with the specified columns selected.

Raises:

Type	Description
`TypeError`	If `X` is not a supported DataFrame.
`ValueError`	If column dtypes were not equal during fit and transform.

Source code in sklego/preprocessing/pandastransformers.py

def transform(self, X):
    """Returns a DataFrame with columns (de)selected based on their dtype.

    Parameters
    ----------
    X : DataFrame
        The data to select dtype for.

    Returns
    -------
    DataFrame
        The data with the specified columns selected.

    Raises
    ------
    TypeError
        If `X` is not a supported DataFrame.
    ValueError
        If column dtypes were not equal during fit and transform.
    """
    check_is_fitted(self, ["X_dtypes_", "feature_names_"])

    if (pd := get_pandas()) is not None and isinstance(X, pd.DataFrame):
        try:
            if (self.X_dtypes_ != X.dtypes).any():
                raise ValueError(
                    f"Column dtypes were not equal during fit and transform. Fit types: \n"
                    f"{self.X_dtypes_}\n"
                    f"transform: \n"
                    f"{X.dtypes}"
                )
        except ValueError as e:
            raise ValueError("Column dtypes were not equal during fit and transform") from e
        transformed_df = X.select_dtypes(include=self.include, exclude=self.exclude)
    else:
        X = nw.from_native(X)
        X_schema = X.schema
        if self.X_dtypes_ != X_schema:
            raise ValueError(
                f"Column dtypes were not equal during fit and transform. Fit types: \n"
                f"{self.X_dtypes_}\n"
                f"transform: \n"
                f"{X.schema}"
            )
        transformed_df = X.select(
            _nw_select_dtypes(include=self.include, exclude=self.exclude, schema=X_schema)
        ).pipe(nw.to_native)

    return transformed_df

Preprocessing¶

sklego.preprocessing.columncapper.ColumnCapper ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.pandastransformers.ColumnDropper ¶

fit(X, y=None) ¶

get_feature_names() ¶

transform(X) ¶

sklego.preprocessing.pandastransformers.ColumnSelector ¶

fit(X, y=None) ¶

get_feature_names() ¶

transform(X) ¶

sklego.preprocessing.dictmapper.DictMapper ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.identitytransformer.IdentityTransformer ¶

shape_ property ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.projections.InformationFilter ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.intervalencoder.IntervalEncoder ¶

fit(X, y) ¶

transform(X) ¶

sklego.preprocessing.formulaictransformer.FormulaicTransformer ¶

fit(X, y=None) ¶

transform(X, y=None) ¶

sklego.preprocessing.monotonicspline.MonotonicSplineTransformer ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.projections.OrthogonalTransformer ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.outlier_remover.OutlierRemover ¶

fit(X, y=None) ¶

transform_train(X) ¶

sklego.preprocessing.pandastransformers.PandasTypeSelector ¶

sklego.preprocessing.randomadder.RandomAdder ¶

fit(X, y) ¶

transform_train(X) ¶

sklego.preprocessing.repeatingbasis.RepeatingBasisFunction ¶

fit(X, y=None) ¶

transform(X) ¶

sklego.preprocessing.pandastransformers.TypeSelector ¶

fit(X, y=None) ¶

get_feature_names(*args, **kwargs) ¶

transform(X) ¶

`sklego.preprocessing.columncapper.ColumnCapper` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.pandastransformers.ColumnDropper` ¶

`fit(X, y=None)` ¶

`get_feature_names()` ¶

`transform(X)` ¶

`sklego.preprocessing.pandastransformers.ColumnSelector` ¶

`fit(X, y=None)` ¶

`get_feature_names()` ¶

`transform(X)` ¶

`sklego.preprocessing.dictmapper.DictMapper` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.identitytransformer.IdentityTransformer` ¶

`shape_` `property` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.projections.InformationFilter` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.intervalencoder.IntervalEncoder` ¶

`fit(X, y)` ¶

`transform(X)` ¶

`sklego.preprocessing.formulaictransformer.FormulaicTransformer` ¶

`fit(X, y=None)` ¶

`transform(X, y=None)` ¶

`sklego.preprocessing.monotonicspline.MonotonicSplineTransformer` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.projections.OrthogonalTransformer` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.outlier_remover.OutlierRemover` ¶

`fit(X, y=None)` ¶

`transform_train(X)` ¶

`sklego.preprocessing.pandastransformers.PandasTypeSelector` ¶

`sklego.preprocessing.randomadder.RandomAdder` ¶

`fit(X, y)` ¶

`transform_train(X)` ¶

`sklego.preprocessing.repeatingbasis.RepeatingBasisFunction` ¶

`fit(X, y=None)` ¶

`transform(X)` ¶

`sklego.preprocessing.pandastransformers.TypeSelector` ¶

`fit(X, y=None)` ¶

`get_feature_names(*args, **kwargs)` ¶

`transform(X)` ¶