Version 1.5#
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.5.
Legend for changelogs
Major Feature something big that you couldn’t do before.
Feature something that you couldn’t do before.
Efficiency an existing feature now may not require as much computation or memory.
Enhancement a miscellaneous minor improvement.
Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.
API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.5.2#
September 2024
Changes impacting many modules#
Fix Fixed a performance regression in a few Cython modules in sklearn._loss, sklearn.manifold, sklearn.metrics and sklearn.utils, which were built without OpenMP support. #29694 by Loïc Estève.
Changelog#
sklearn.calibration#
Fix Raise an error when LeaveOneOut is used in cv, matching what would happen if KFold(n_splits=n_samples) was used. #29545 by Lucy Liu.
sklearn.compose#
Fix Fixed compose.TransformedTargetRegressor not to raise UserWarning if transform output is set to pandas or polars, since it isn’t a transformer. #29401 by Stefanie Senger.
sklearn.decomposition#
Fix Increase rank deficiency threshold in the whitening step of decomposition.FastICA with whiten_solver="eigh" to improve the platform-agnosticity of the estimator. #29612 by Olivier Grisel.
sklearn.metrics#
Fix Fix a regression in metrics.accuracy_score and in metrics.zero_one_loss causing an error for Array API dispatch with multilabel inputs. #29336 by Edoardo Abati.
sklearn.svm#
Fix Fixed a regression in svm.SVC and svm.SVR such that C=float("inf") is accepted again. #29780 by Guillaume Lemaitre.
Version 1.5.1#
July 2024
Changes impacting many modules#
Fix Fixed a regression in the validation of the input data of all estimators where an unexpected error was raised when passing a DataFrame backed by a read-only buffer. #29018 by Jérémie du Boisberranger.
Fix Fixed a regression causing a dead-lock at import time in some settings. #29235 by Jérémie du Boisberranger.
Changelog#
sklearn.compose#
Efficiency Fix a performance regression in compose.ColumnTransformer where the full input data was copied for each transformer when n_jobs > 1. #29330 by Jérémie du Boisberranger.
sklearn.metrics#
Fix Fix a regression in metrics.r2_score. Passing torch CPU tensors with array API dispatch disabled would complain about non-CPU devices instead of implicitly converting those inputs to regular NumPy arrays. #29119 by Olivier Grisel.
Fix Fix a regression in metrics.zero_one_loss causing an error for Array API dispatch with multilabel inputs. #29269 by Yaroslav Korobko.
sklearn.model_selection#
Fix Fix a regression in model_selection.GridSearchCV for parameter grids that have heterogeneous parameter values. #29078 by Loïc Estève.
Fix Fix a regression in model_selection.GridSearchCV for parameter grids that have estimators as parameter values. #29179 by Marco Gorelli.
Fix Fix a regression in model_selection.GridSearchCV for parameter grids that have arrays of different sizes as parameter values. #29314 by Marco Gorelli.
sklearn.tree#
Fix Fix an issue in tree.export_graphviz and tree.plot_tree that could potentially result in an exception or wrong results on 32-bit OSes. #29327 by Loïc Estève.
sklearn.utils#
API Change utils.validation.check_array has a new parameter, force_writeable, to control the writeability of the output array. If set to True, the output array will be guaranteed to be writeable and a copy will be made if the input array is read-only. If set to False, no guarantee is made about the writeability of the output array. #29018 by Jérémie du Boisberranger.
Version 1.5.0#
May 2024
Security#
Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute would hold too frequent (above max_df) but also too rare tokens (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer’s knowledge.
Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
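One possible way to clear the attribute on a previously trained instance is sketched below. The helper name clear_stop_words and the toy corpus are hypothetical; the guard with getattr is there so the sketch runs whether or not the installed version still sets stop_words_.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "cherry" and "secret" appear in one document only, so
# min_df=2 discards them from the vocabulary.
docs = ["apple banana", "apple cherry", "apple banana secret"]
vectorizer = CountVectorizer(min_df=2).fit(docs)
before = vectorizer.transform(docs).toarray()


def clear_stop_words(fitted_vectorizer):
    """Hypothetical helper: drop discarded tokens kept by older versions."""
    if getattr(fitted_vectorizer, "stop_words_", None) is not None:
        fitted_vectorizer.stop_words_ = None
    return fitted_vectorizer


clear_stop_words(vectorizer)
after = vectorizer.transform(docs).toarray()  # same counts as before
```

Because the attribute is only meant for inspection, the transformed output is the same before and after clearing it.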
Changed models#
Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays but the fitted quantiles and the results of transform may be slightly different than before (keeping the same statistical properties). #27344 by Xuefeng Xu.
Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change is needed to be able to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release.
Changes impacting many modules#
Fix Raise ValueError with an informative error message when passing 1D sparse arrays to methods that expect 2D sparse inputs. #28988 by Olivier Grisel.
API Change The name of the input of the inverse_transform method of estimators has been standardized to X. As a consequence, Xt is deprecated and will be removed in version 1.7 in the following estimators: cluster.FeatureAgglomeration, decomposition.MiniBatchNMF, decomposition.NMF, model_selection.GridSearchCV, model_selection.RandomizedSearchCV, pipeline.Pipeline and preprocessing.KBinsDiscretizer. #28756 by Will Dean.
Support for Array API#
Additional estimators and functions have been updated to include support for all Array API compliant inputs.
See Array API support (experimental) for more details.
Functions:
sklearn.metrics.r2_score now supports Array API compliant inputs. #27904 by Eric Lindgren, Franck Charras, Olivier Grisel and Tim Head.
Classes:
linear_model.Ridge now supports the Array API for the svd solver. See Array API support (experimental) for more details. #27800 by Franck Charras, Olivier Grisel and Tim Head.
Support for building with Meson#
From scikit-learn 1.5 onwards, Meson is the main supported way to build scikit-learn, see Building from source for more details.
Unless we discover a major blocker, setuptools support will be dropped in scikit-learn 1.6. The 1.5.x releases will support building scikit-learn with setuptools.
Meson support for building scikit-learn was added in #28040 by Loïc Estève.
Metadata Routing#
The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.
Feature impute.IterativeImputer now supports metadata routing in its fit method. #28187 by Stefanie Senger.
Feature ensemble.BaggingClassifier and ensemble.BaggingRegressor now support metadata routing. The fit methods now accept **fit_params which are passed to the underlying estimators via their fit methods. #28432 by Adam Li and Benjamin Bossan.
Feature linear_model.RidgeCV and linear_model.RidgeClassifierCV now support metadata routing in their fit method and route metadata to the underlying model_selection.GridSearchCV object or the underlying scorer. #27560 by Omar Salman.
Feature GraphicalLassoCV now supports metadata routing in its fit method and routes metadata to the CV splitter. #27566 by Omar Salman.
Feature linear_model.RANSACRegressor now supports metadata routing in its fit, score and predict methods and routes metadata to its underlying estimator’s fit, score and predict methods. #28261 by Stefanie Senger.
Feature ensemble.VotingClassifier and ensemble.VotingRegressor now support metadata routing and pass **fit_params to the underlying estimators via their fit methods. #27584 by Stefanie Senger.
Feature pipeline.FeatureUnion now supports metadata routing in its fit and fit_transform methods and routes metadata to the underlying transformers’ fit and fit_transform. #28205 by Stefanie Senger.
Fix Fix an issue when resolving default routing requests set via class attributes. #28435 by Adrin Jalali.
Fix Fix an issue when set_{method}_request methods are used as unbound methods, which can happen if one tries to decorate them. #28651 by Adrin Jalali.
Fix Prevent a RecursionError when estimators with the default scoring param (None) route metadata. #28712 by Stefanie Senger.
Changelog#
sklearn.calibration#
Fix Fixed a regression in calibration.CalibratedClassifierCV where an error was wrongly raised with string targets. #28843 by Jérémie du Boisberranger.
sklearn.cluster#
Fix The cluster.MeanShift class now properly converges for constant data. #28951 by Akihiro Kuno.
Fix Create copy of precomputed sparse matrix within the fit method of OPTICS to avoid in-place modification of the sparse matrix. #28491 by Thanh Lam Dang.
Fix cluster.HDBSCAN now supports all metrics supported by sklearn.metrics.pairwise_distances when algorithm="brute" or "auto". #28664 by Manideep Yenugula.
sklearn.compose#
Feature A fitted compose.ColumnTransformer now implements __getitem__ which returns the fitted transformers by name. #27990 by Thomas Fan.
Enhancement compose.TransformedTargetRegressor now raises an error in fit if only inverse_func is provided without func (that would default to identity) being explicitly set as well. #28483 by Stefanie Senger.
Enhancement compose.ColumnTransformer can now expose the “remainder” columns in the fitted transformers_ attribute as column names or boolean masks, rather than column indices. #27657 by Jérôme Dockès.
Fix Fixed a bug in compose.ColumnTransformer with n_jobs > 1, where the intermediate selected columns were passed to the transformers as read-only arrays. #28822 by Jérémie du Boisberranger.
sklearn.cross_decomposition#
Fix The coef_ fitted attribute of cross_decomposition.PLSRegression now takes into account both the scale of X and Y when scale=True. Note that the previous predicted values were not affected by this bug. #28612 by Guillaume Lemaitre.
API Change Deprecates Y in favor of y in the methods fit, transform and inverse_transform of: cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical, and cross_decomposition.CCA, and methods fit and transform of: cross_decomposition.PLSSVD. Y will be removed in version 1.7. #28604 by David Leon.
sklearn.datasets#
Enhancement Adds optional arguments n_retries and delay to functions datasets.fetch_20newsgroups, datasets.fetch_20newsgroups_vectorized, datasets.fetch_california_housing, datasets.fetch_covtype, datasets.fetch_kddcup99, datasets.fetch_lfw_pairs, datasets.fetch_lfw_people, datasets.fetch_olivetti_faces, datasets.fetch_rcv1, and datasets.fetch_species_distributions. By default, the functions will retry up to 3 times in case of network failures. #28160 by Zhehao Liu and Filip Karlo Došilović.
sklearn.decomposition#
Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel.
Enhancement PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto" instead of raising an error. #28498 by Thanh Lam Dang.
Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data. #27491 by Olivier Grisel.
Fix decomposition.PCA fit with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set, no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.
sklearn.dummy#
Enhancement dummy.DummyClassifier and dummy.DummyRegressor now have the n_features_in_ and feature_names_in_ attributes after fit. #27937 by Marco vd Boom.
sklearn.ensemble#
Efficiency Improves runtime of predict of ensemble.HistGradientBoostingClassifier by avoiding a call to predict_proba. #27844 by Christian Lorentzen.
Efficiency ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor are now a tiny bit faster by pre-sorting the data before finding the thresholds for binning. #28102 by Christian Lorentzen.
Fix Fixes a bug in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when monotonic_cst is specified for non-categorical features. #28925 by Xiao Yuan.
sklearn.feature_extraction#
Efficiency feature_extraction.text.TfidfTransformer is now faster and more memory-efficient by using a NumPy vector instead of a sparse matrix for storing the inverse document frequency. #18843 by Paolo Montesel.
Enhancement feature_extraction.text.TfidfTransformer now preserves the data type of the input matrix if it is np.float64 or np.float32. #28136 by Guillaume Lemaitre.
sklearn.feature_selection#
Enhancement feature_selection.mutual_info_regression and feature_selection.mutual_info_classif now support the n_jobs parameter. #28085 by Neto Menoci and Florin Andrei.
Enhancement The cv_results_ attribute of feature_selection.RFECV has a new key, n_features, containing an array with the number of features selected at each step. #28670 by Miguel Silva.
sklearn.impute#
Enhancement impute.SimpleImputer now supports custom strategies by passing a function in place of a strategy name. #28053 by Mark Elliot.
sklearn.inspection#
Fix inspection.DecisionBoundaryDisplay.from_estimator no longer warns about missing feature names when provided a polars.DataFrame. #28718 by Patrick Wang.
sklearn.linear_model#
Enhancement Solver "newton-cg" in linear_model.LogisticRegression and linear_model.LogisticRegressionCV now emits information when verbose is set to positive values. #27526 by Christian Lorentzen.
Fix linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV now explicitly don’t accept large sparse data formats. #27576 by Stefanie Senger.
Fix linear_model.RidgeCV and RidgeClassifierCV correctly pass sample_weight to the underlying scorer when cv is None. #27560 by Omar Salman.
Fix n_nonzero_coefs_ attribute in linear_model.OrthogonalMatchingPursuit will now always be None when tol is set, as n_nonzero_coefs is ignored in this case. #28557 by Lucy Liu.
API Change linear_model.RidgeCV and linear_model.RidgeClassifierCV will now allow alpha=0 when cv != None, which is consistent with linear_model.Ridge and linear_model.RidgeClassifier. #28425 by Lucy Liu.
API Change Passing average=0 to disable averaging is deprecated in linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor, linear_model.SGDClassifier, linear_model.SGDRegressor and linear_model.SGDOneClassSVM. Pass average=False instead. #28582 by Jérémie du Boisberranger.
API Change Parameter multi_class was deprecated in linear_model.LogisticRegression and linear_model.LogisticRegressionCV. multi_class will be removed in 1.8, and internally, for 3 and more classes, it will always use multinomial. If you still want to use the one-vs-rest scheme, you can use OneVsRestClassifier(LogisticRegression(..)). #28703 by Christian Lorentzen.
API Change Parameters store_cv_values and cv_values_ are deprecated in favor of store_cv_results and cv_results_ in linear_model.RidgeCV and linear_model.RidgeClassifierCV. #28915 by Lucy Liu.
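For the multi_class deprecation above, the one-vs-rest scheme can be kept by wrapping the estimator, as the entry suggests. The following is a minimal sketch on the bundled iris data; max_iter=1000 is an arbitrary choice to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary LogisticRegression is fitted per class instead of a
# single multinomial model.
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
n_binary_models = len(ovr_clf.estimators_)  # one per class in y
```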
sklearn.manifold#
API Change Deprecates n_iter in favor of max_iter in manifold.TSNE. n_iter will be removed in version 1.7. This makes manifold.TSNE consistent with the rest of the estimators. #28471 by Lucy Liu.
sklearn.metrics#
Feature metrics.pairwise_distances accepts calculating pairwise distances for non-numeric arrays as well. This is supported through custom metrics only. #27456 by Venkatachalam N, Kshitij Mathur and Julian Libiseller-Egger.
Feature sklearn.metrics.check_scoring now returns a multi-metric scorer when scoring is a dict, set, tuple, or list. #28360 by Thomas Fan.
Feature metrics.d2_log_loss_score has been added which calculates the D^2 score for the log loss. #28351 by Omar Salman.
Efficiency Improve efficiency of functions brier_score_loss, calibration_curve, det_curve, precision_recall_curve, roc_curve when pos_label argument is specified. Also improve efficiency of methods from_estimator and from_predictions in RocCurveDisplay, PrecisionRecallDisplay, DetCurveDisplay, CalibrationDisplay. #28051 by Pierre de Fréminville.
Fix metrics.classification_report now shows only accuracy and not micro-average when input is a subset of labels. #28399 by Vineet Joshi.
Fix Fix OpenBLAS 0.3.26 dead-lock on Windows in pairwise distances computation. This is likely to affect neighbor-based algorithms. #28692 by Loïc Estève.
API Change metrics.precision_recall_curve deprecated the keyword argument probas_pred in favor of y_score. probas_pred will be removed in version 1.7. #28092 by Adam Li.
API Change metrics.brier_score_loss deprecated the keyword argument y_prob in favor of y_proba. y_prob will be removed in version 1.7. #28092 by Adam Li.
API Change For classifiers and classification metrics, labels encoded as bytes are deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.
sklearn.mixture#
Fix The converged_ attribute of mixture.GaussianMixture and mixture.BayesianGaussianMixture now reflects the convergence status of the best fit whereas it was previously True if any of the fits converged. #26837 by Krsto Proroković.
sklearn.model_selection#
Major Feature model_selection.TunedThresholdClassifierCV finds the decision threshold of a binary classifier that maximizes a classification metric through cross-validation. model_selection.FixedThresholdClassifier is an alternative when one wants to use a fixed decision threshold without any tuning scheme. #26120 by Guillaume Lemaitre.
Enhancement CV splitters that ignore the group parameter now raise a warning when groups are passed in to split. #28210 by Thomas Fan.
Enhancement The HTML diagram representation of GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, and HalvingRandomSearchCV will show the best estimator when refit=True. #28722 by Yao Xiao and Thomas Fan.
Fix The cv_results_ attribute (of model_selection.GridSearchCV) now returns masked arrays of the appropriate NumPy dtype, as opposed to always returning dtype object. #28352 by Marco Gorelli.
Fix model_selection.train_test_split works with Array API inputs. Previously indexing was not handled correctly leading to exceptions when using strict implementations of the Array API like CuPy. #28407 by Tim Head.
sklearn.multioutput#
Enhancement chain_method parameter added to multioutput.ClassifierChain. #27700 by Lucy Liu.
sklearn.neighbors#
Fix Fixes neighbors.NeighborhoodComponentsAnalysis such that get_feature_names_out returns the correct number of feature names. #28306 by Brendan Lu.
sklearn.pipeline#
Feature pipeline.FeatureUnion can now use the verbose_feature_names_out attribute. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will error if feature names are not unique. #25991 by Jiawei Zhang.
sklearn.preprocessing#
Enhancement preprocessing.QuantileTransformer and preprocessing.quantile_transform now support disabling subsampling explicitly. #27636 by Ralph Urlus.
sklearn.tree#
Enhancement Plotting trees in matplotlib via tree.plot_tree now shows a “True/False” label to indicate the direction the samples traverse given the split condition. #28552 by Adam Li.
sklearn.utils#
Fix _safe_indexing now works correctly for polars DataFrame when axis=0 and supports indexing polars Series. #28521 by Yao Xiao.
API Change utils.IS_PYPY is deprecated and will be removed in version 1.7. #28768 by Jérémie du Boisberranger.
API Change utils.tosequence is deprecated and will be removed in version 1.7. #28763 by Jérémie du Boisberranger.
API Change utils.parallel_backend and utils.register_parallel_backend are deprecated and will be removed in version 1.7. Use joblib.parallel_backend and joblib.register_parallel_backend instead. #28847 by Jérémie du Boisberranger.
API Change Raise informative warning message in type_of_target when represented as bytes. For classifiers and classification metrics, labels encoded as bytes are deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.
API Change utils.estimator_checks.check_estimator_sparse_data was split into two functions: utils.estimator_checks.check_estimator_sparse_matrix and utils.estimator_checks.check_estimator_sparse_array. #27576 by Stefanie Senger.
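For the deprecation of utils.parallel_backend above, the joblib context manager can be used directly, as in this sketch. The estimator, the "threading" backend and n_jobs=2 are illustrative choices.

```python
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_jobs is left unset on the estimator so that the value given to the
# joblib context manager is the one that applies.
forest = RandomForestClassifier(n_estimators=20, random_state=0)

# Import parallel_backend from joblib rather than from sklearn.utils.
with parallel_backend("threading", n_jobs=2):
    forest.fit(X, y)

train_accuracy = forest.score(X, y)
```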
Code and documentation contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:
101AlexMartin, Abdulaziz Aloqeely, Adam J. Stewart, Adam Li, Adarsh Wase, Adeyemi Biola, Aditi Juneja, Adrin Jalali, Advik Sinha, Aisha, Akash Srivastava, Akihiro Kuno, Alan Guedes, Alberto Torres, Alexis IMBERT, alexqiao, Ana Paula Gomes, Anderson Nelson, Andrei Dzis, Arif Qodari, Arnaud Capitaine, Arturo Amor, Aswathavicky, Audrey Flanders, awwwyan, baggiponte, Bharat Raghunathan, bme-git, brdav, Brendan Lu, Brigitta Sipőcz, Bruno, Cailean Carter, Cemlyn, Christian Lorentzen, Christian Veenhuis, Cindy Liang, Claudio Salvatore Arcidiacono, Connor Boyle, Conrad Stevens, crispinlogan, David Matthew Cherney, Davide Chicco, davidleon123, dependabot[bot], DerWeh, dinga92, Dipan Banik, Drew Craeton, Duarte São José, DUONG, Eddie Bergman, Edoardo Abati, Egehan Gunduz, Emad Izadifar, EmilyXinyi, Erich Schubert, Evelyn, Filip Karlo Došilović, Franck Charras, Gael Varoquaux, Gönül Aycı, Guillaume Lemaitre, Gyeongjae Choi, Harmanan Kohli, Hong Xiang Yue, Ian Faust, Ilya Komarov, itsaphel, Ivan Wiryadi, Jack Bowyer, Javier Marin Tur, Jérémie du Boisberranger, Jérôme Dockès, Jiawei Zhang, João Morais, Joe Cainey, Joel Nothman, Johanna Bayer, John Cant, John Enblom, John Hopfensperger, jpcars, jpienaar-tuks, Julian Chan, Julian Libiseller-Egger, Julien Jerphanion, KanchiMoe, Kaushik Amar Das, keyber, Koustav Ghosh, kraktus, Krsto Proroković, Lars, ldwy4, LeoGrin, lihaitao, Linus Sommer, Loic Esteve, Lucy Liu, Lukas Geiger, m-maggi, manasimj, Manuel Labbé, Manuel Morales, Marco Edward Gorelli, Marco Wolsza, Maren Westermann, Marija Vlajic, Mark Elliot, Martin Helm, Mateusz Sokół, mathurinm, Mavs, Michael Dawson, Michael Higgins, Michael Mayer, miguelcsilva, Miki Watanabe, Mohammed Hamdy, myenugula, Nathan Goldbaum, Naziya Mahimkar, nbrown-ScottLogic, Neto, Nithish Bolleddula, notPlancha, Olivier Grisel, Omar Salman, ParsifalXu, Patrick Wang, Pierre de Fréminville, Piotr, Priyank Shroff, Priyansh Gupta, Priyash Shah, Puneeth K, Rahil Parikh, raisadz, Raj Pulapakura, Ralf Gommers, 
Ralph Urlus, Randolf Scholz, renaissance0ne, Reshama Shaikh, Richard Barnes, Robert Pollak, Roberto Rosati, Rodrigo Romero, rwelsch427, Saad Mahmood, Salim Dohri, Sandip Dutta, SarahRemus, scikit-learn-bot, Shaharyar Choudhry, Shubham, sperret6, Stefanie Senger, Steffen Schneider, Suha Siddiqui, Thanh Lam DANG, thebabush, Thomas, Thomas J. Fan, Thomas Lazarus, Tialo, Tim Head, Tuhin Sharma, Tushar Parimi, VarunChaduvula, Vineet Joshi, virchan, Waël Boukhobza, Weyb, Will Dean, Xavier Beltran, Xiao Yuan, Xuefeng Xu, Yao Xiao, yareyaredesuyo, Ziad Amerr, Štěpán Sršeň