Feature Selection
sklego.feature_selection.mrmr.MaximumRelevanceMinimumRedundancy
Bases: SelectorMixin, BaseEstimator
Maximum Relevance Minimum Redundancy (MRMR) is an iterative feature selection method commonly used in data science to select a subset of features from a larger feature set. The goal of MRMR is to choose features that have high relevance to the target variable while minimizing redundancy among the already selected features.
How MRMR works:
- Compute the relevance of each feature to the target variable: The relevance of a feature is typically measured using a metric such as mutual information, correlation coefficient, or another appropriate measure of dependence between the feature and the target variable.
- Compute the redundancy between each pair of features: Redundancy is the degree of similarity or overlap between features. It can be measured using metrics such as mutual information, correlation coefficient, or other similarity measures.
- Select features based on the maximum relevance and minimum redundancy criteria: MRMR aims to maximize the relevance of selected features to the target variable while minimizing redundancy among them.
- Construct the final subset of features: MRMR iteratively adds features to the selected subset until a predefined number of features is reached.
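The steps above can be sketched as a small greedy loop. This is a minimal illustration and not the library's implementation: the function name `mrmr_sketch` is hypothetical, absolute Pearson correlation stands in for both the relevance and the redundancy measure, and the score "relevance divided by mean redundancy" is an assumed combination rule.

```python
import numpy as np


def mrmr_sketch(X, y, k):
    """Greedy MRMR sketch (hypothetical helper, assumed scoring rule)."""
    n_features = X.shape[1]
    # Absolute Pearson correlations among all features and the target.
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    relevance = corr[:n_features, -1]             # feature-vs-target dependence
    redundancy = corr[:n_features, :n_features]   # feature-vs-feature overlap

    # Start from the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    left = [j for j in range(n_features) if j != selected[0]]

    while len(selected) < k and left:
        # Score each remaining feature: relevance / mean redundancy
        # with the features selected so far.
        mean_red = redundancy[np.ix_(left, selected)].mean(axis=1)
        scores = relevance[left] / np.maximum(mean_red, 1e-12)
        best = left[int(np.argmax(scores))]
        # Grow the subset until k features have been chosen.
        selected.append(best)
        left.remove(best)
    return selected


rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.01 * rng.normal(size=500)   # near-duplicate of x0
x2 = rng.normal(size=500)               # independent and informative
X = np.column_stack([x0, x1, x2])
y = x0 + x2 + 0.1 * rng.normal(size=500)

print(mrmr_sketch(X, y, k=2))
```

In this toy data, columns 0 and 1 carry almost the same information, so once one of them is selected the other is heavily penalized and the independent column 2 is preferred as the second pick.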
The implemented formula is:
Warning
If a custom relevance_func is provided, it must have this signature:
Callable[[np.ndarray, np.ndarray], np.ndarray]
It should accept X, y as arguments, compute a score for each feature of X, and return an array of shape (n_features_in_,).
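As an illustration of that signature, here is a minimal sketch of a custom relevance function. The name `abs_pearson_relevance` and the choice of absolute Pearson correlation are assumptions made for the example; any function that maps (X, y) to one score per feature would fit the same contract.

```python
import numpy as np


def abs_pearson_relevance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Hypothetical relevance function: |Pearson correlation| of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns (or a constant y) have zero variance; give them a score of 0.
    denom = np.where(denom == 0, np.inf, denom)
    return np.abs(Xc.T @ yc) / denom


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

scores = abs_pearson_relevance(X, y)
print(scores.shape)  # (5,)
```

A function like this would then be passed as `relevance_func=abs_pearson_relevance` when constructing the selector.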
Warning
If a custom redundancy_func is provided, it must have the same signature as the method _redundancy_pearson. The function must therefore take three parameters:
- X : array-like, shape=(n_samples, n_features). Training data used to compute the redundancy between features.
- selected : array-like. Indexes of the features already selected at the i-th iteration.
- left : array-like. Indexes of the features not yet selected at the i-th iteration. MRMR selects the next feature from this list.
and it must return:
- np.ndarray, shape=(len(left),). The redundancy score of each remaining feature, computed with the custom function.
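A custom redundancy function following this contract could look like the sketch below. The name `mean_abs_corr_redundancy` is hypothetical, and averaging the absolute Pearson correlation over the selected features is an assumption made for the example; it is not claimed to match what _redundancy_pearson does internally.

```python
import numpy as np


def mean_abs_corr_redundancy(X, selected, left):
    """Hypothetical redundancy function: for each candidate in `left`, the mean
    absolute Pearson correlation with the features already in `selected`."""
    X = np.asarray(X, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Rows: remaining candidates; columns: already selected features.
    return corr[np.ix_(left, selected)].mean(axis=1)


rng = np.random.default_rng(0)
base = rng.normal(size=300)
X = np.column_stack([
    base,
    base + 0.05 * rng.normal(size=300),  # nearly a copy of column 0
    rng.normal(size=300),                # unrelated column
])

red = mean_abs_corr_redundancy(X, selected=[0], left=[1, 2])
print(red.shape)  # (2,)
```

The returned array has one entry per index in `left`, in the same order, so a higher value marks a candidate that overlaps more with the already selected features.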
New in version 0.8.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
k | int | Number of features the model should select. | required |
relevance_func | str or Callable | The relevance function to use. The default maps to scikit-learn f_classif or f_regression for classification or regression, respectively. | "f" |
redundancy_func | str or Callable | The redundancy function to use. The default maps to the Pearson correlation computed for each remaining feature. | "p" |
kind | Literal["auto", "classification", "regression"] | One of 'classification', 'regression', or 'auto'. If 'auto', the model tries to infer the type of problem from the data type of y. | "auto" |
Attributes:
Name | Type | Description |
---|---|---|
_y_dtype | dtype | Data type of y. |
selected_features_ | array-like of shape (k,) | Indexes of the selected features. |
scores_ | array-like of shape (k,) | Scores of the selected features. |
Examples:
from sklearn.datasets import make_classification
from sklego.feature_selection import MaximumRelevanceMinimumRedundancy

mrmr = MaximumRelevanceMinimumRedundancy(
    k=4,
    kind='auto',
    redundancy_func='p',
    relevance_func='f',
)
X, y = make_classification(n_features=4)

# Fit mrmr model
mrmr = mrmr.fit(X, y)

# Indexes of the selected features
selected_features = mrmr.selected_features_

# Scores of the selected features
feature_scores = mrmr.scores_
Source code in sklego/feature_selection/mrmr.py
fit(X, y)
Fit the underlying feature selection algorithm on the training data X and y using the provided redundancy and relevance functions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | Training data. | required |
y | array-like of shape (n_samples,) | Target values. | required |
Returns:
Name | Type | Description |
---|---|---|
self | MaximumRelevanceMinimumRedundancy | The fitted estimator. |
Raises:
Type | Description |
---|---|
ValueError
|
if:
|