SelectKBest
What is SelectKBest?
SelectKBest is one of the most commonly used feature selection methods in machine learning. It is a filter-based method, available in scikit-learn as sklearn.feature_selection.SelectKBest.
SelectKBest uses a statistical test such as the chi-squared test, the ANOVA F-test, or a mutual information score to score each feature individually and rank the features by their relationship with the output variable. It then keeps the K features with the highest scores as the final feature subset.
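The score, rank, and select steps described above can be sketched by hand and compared with SelectKBest itself. This is only an illustration: the iris dataset, the choice of f_classif, and k=2 are assumptions made for the example, not something the text prescribes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # assumed example data: 150 samples, 4 features
k = 2

# Step 1: score every feature against the target with the ANOVA F-test.
scores, p_values = f_classif(X, y)

# Step 2: rank the features by score and keep the indices of the k highest.
manual_top_k = np.sort(np.argsort(scores)[-k:])

# Step 3: SelectKBest performs the same score-rank-select steps internally.
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
selected = selector.get_support(indices=True)

print("scores:", scores.round(2))
print("manual top-k:", manual_top_k)
print("SelectKBest:", selected)
```

Both approaches should report the same column indices, which shows that the selector is a thin wrapper around the score function plus a top-k cut.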
Syntax
selector = SelectKBest(f_classif, k=3)
SelectKBest has two main parameters: the score function (score_func) and the number of features to keep (k).
Score function
The score function evaluates how strongly each feature is related to the target. Several score functions are available.
Some of the commonly used score_func options in SelectKBest:
f_regression: Used for regression problems; computes the F-value of a univariate linear regression between each feature and the target.
mutual_info_regression: Used for regression problems; estimates the mutual information between each feature and a continuous target.
f_classif: Used for classification problems; computes the ANOVA F-value between each feature and the target.
mutual_info_classif: Used for classification problems; estimates the mutual information between each feature and a discrete target.
chi2: Used for classification problems; computes the chi-squared statistic between each feature and the target. The features must be non-negative (for example counts or frequencies).
SelectPercentile is not a score function but a related selector: it keeps the highest-scoring X% of the features according to score_func, instead of a fixed number k.
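To see how the choice of score_func plays out in practice, the sketch below runs three classification score functions through SelectKBest and then uses SelectPercentile with the same score function. The iris dataset, k=2, percentile=50, and the fixed random_state are assumptions chosen for the example.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    SelectKBest,
    SelectPercentile,
    chi2,
    f_classif,
    mutual_info_classif,
)

X, y = load_iris(return_X_y=True)  # 4 non-negative features, so chi2 is applicable

score_funcs = {
    "f_classif": f_classif,
    "chi2": chi2,
    # Mutual information estimation involves randomness; fix the seed for repeatability.
    "mutual_info_classif": partial(mutual_info_classif, random_state=0),
}

selected = {}
for name, func in score_funcs.items():
    selector = SelectKBest(score_func=func, k=2).fit(X, y)
    selected[name] = selector.get_support(indices=True)
    print(name, selector.scores_.round(2), selected[name])

# SelectPercentile takes the same score_func but keeps a share of the features.
X_half = SelectPercentile(score_func=f_classif, percentile=50).fit_transform(X, y)
print(X_half.shape)
```

The scores are on different scales for each function, so only the ranking within one function is meaningful, not a comparison of raw values across functions.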
How to select the right score function?
For regression, the most commonly used score functions are f_regression and mutual_info_regression.
For classification, the most commonly used score functions are chi2, mutual_info_classif and f_classif.
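One practical constraint when choosing among the classification score functions is that chi2 only accepts non-negative features, so it cannot be applied directly to data that contains negative values (standardized data, for instance). The sketch below uses synthetic data invented for the example to show chi2 being rejected and f_classif used instead.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic, assumed data: normally distributed features, so some values are negative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

try:
    SelectKBest(chi2, k=2).fit(X, y)
    chi2_ok = True
except ValueError as err:
    # chi2 refuses input that contains negative values.
    chi2_ok = False
    print("chi2 failed:", err)

# f_classif has no sign restriction and works on the same data.
X_f = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X_f.shape)
```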
Commands
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
selector = SelectKBest(f_regression, k=2)
f_regression indicates that the target is continuous, i.e. that we are selecting features for a regression model.
selector = SelectKBest(f_regression, k=2)
k=2 means that we want to keep 2 features from the dataframe; the algorithm decides which ones based on their scores.
X_new = SelectKBest(f_regression, k=2).fit_transform(X_train, y_train)
.fit_transform(X_train, y_train) fits the selector on the training split (X_train, y_train) and returns X_train reduced to the 2 selected features, which is stored in X_new in this case.
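Put together, the commands above might look like the following end-to-end sketch. The synthetic dataset from make_regression and the train/test split settings are assumptions for illustration; X_train, y_train, and X_new follow the names used above. Keeping the fitted selector in its own variable makes it possible to apply the same column selection to the test split.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

# Assumed example data: 6 features, of which 2 actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the selector on the training split only, then reduce the training data.
selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X_train, y_train)

# Reuse the fitted selector so the test split keeps the same two columns.
X_test_new = selector.transform(X_test)

selected = selector.get_support(indices=True)
print("selected column indices:", selected)
print("train shape:", X_new.shape, "test shape:", X_test_new.shape)
```

Fitting on the training split alone avoids leaking information from the test data into the feature selection step.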