Feature selection is a crucial step in many machine learning pipelines. In practice, we often have a wide range of variables available as predictors for our models, but only a few of them are related to our target. Feature selection consists of finding a reduced set of these features, mainly for:
- Improved generalization — using a reduced number of features minimizes the risk of overfitting.
- Better inference — by removing redundant features (for example, two features that are highly correlated with each other), we can retain only one of them and better capture its effect.
- Efficient training — having fewer features means shorter training times.
- Better interpretability — reducing the number of features produces more parsimonious models, which are easier to understand.
There are many methods available to perform feature selection, each with varying complexity. In this article, I want to share a way of using a powerful open source optimization tool, Optuna, to perform the feature selection task in an innovative way. The main idea is to have a flexible tool that can handle feature selection for a wide range of tasks by efficiently testing different feature combinations (i.e., not trying them all one by one). Below, we'll go through a hands-on example implementing this approach, and also compare it to other common feature selection strategies. To experiment with the feature selection methods discussed, you can follow along with this Colab Notebook.
In this example, we'll focus on a classification task based on the Mobile Price Classification dataset from Kaggle. We have 20 features, including ‘battery_power’, ‘clock_speed’ and ‘ram’, to predict the ‘price_range’ target, which can belong to four different bands: 0, 1, 2 and 3.
We first split our dataset into train and test sets, and we also prepare a 5-fold validation split within the train set — this will be useful later on.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

SEED = 32

# Load data
filename = "train.csv" # train.csv from https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification
df = pd.read_csv(filename)

# Train - test split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.iloc[:,-1], random_state=SEED)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# The last column is the target variable
X_train = df_train.iloc[:,0:20]
y_train = df_train.iloc[:,-1]
X_test = df_test.iloc[:,0:20]
y_test = df_test.iloc[:,-1]

# Stratified k-fold over the train set for cross validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
splits = list(skf.split(X_train, y_train))
The model we'll use throughout the example is the Random Forest Classifier, with the scikit-learn implementation and default parameters. We first train the model using all features to set our benchmark. The metric we'll measure is the F1 score weighted across all four price ranges. After fitting the model on the train set, we evaluate it on the test set, obtaining an F1 score of around 0.87.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

model = RandomForestClassifier(random_state=SEED)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(classification_report(y_test, preds))
print(f"Global F1: {f1_score(y_test, preds, average='weighted')}")
The goal now is to improve these metrics by selecting a reduced feature set. We'll first outline how our Optuna-based approach works, and then test and compare it with other common feature selection strategies.
Optuna is an optimization framework primarily used for hyperparameter tuning. One of the key features of the framework is its use of Bayesian optimization methods to search the parameter space. The main idea is that Optuna tries different combinations of parameters and evaluates how the objective function changes with each configuration. From these trials, it builds a probabilistic model used to estimate which parameter values are likely to yield better results.
This approach is much more efficient than grid or random search. For example, if we had n features and tried every possible feature subset, we would have to perform 2^n trials. With 20 features, that would be more than a million trials. Instead, with Optuna, we can explore the search space with far fewer trials.
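To put that number in perspective, a quick sanity check in plain Python (this snippet is just an illustration and uses nothing from the article's pipeline):

```python
# Exhaustive search over feature subsets is exponential in n:
# each of the n features is either in or out of the subset.
n_features = 20
n_subsets = 2 ** n_features
print(n_subsets)  # → 1048576
```

Even at one second per trial, evaluating every subset would take roughly twelve days, which is why a guided search is attractive.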
Optuna offers various samplers to try. For our case, we'll use the default one, the TPESampler, based on the Tree-structured Parzen Estimator (TPE) algorithm. This sampler is the most commonly used, and it's recommended for searching categorical parameters, which is our case as we'll see below. According to the documentation, the algorithm “fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It chooses the parameter value x that maximizes the ratio l(x)/g(x).”
As mentioned earlier, Optuna is typically used for hyperparameter tuning. This is usually done by training the model repeatedly on the same data with a fixed set of features, testing in each trial a new set of hyperparameters determined by the sampler. The parameter set that minimizes the given objective function is then returned as the best trial.
In our case, however, we'll use a fixed model with predetermined parameters, and in each trial we'll allow Optuna to select which features to try. The process aims to find the set of features that minimizes the loss function. In our case, we'll guide the algorithm to maximize the F1 score (or, equivalently, minimize the negative of the F1). Additionally, we'll add a small penalty for each feature used, to encourage smaller feature sets (if two feature sets yield similar results, we'll prefer the one with fewer features).
The data we'll use is the train dataset, split into five folds. In each trial, we'll fit the classifier five times, using four of the five folds for training and the remaining fold for validation. We'll then average the validation metrics and add the penalty term to compute the trial's loss.
Below is the class implemented to perform the feature selection search:
import optuna


class FeatureSelectionOptuna:
    """
    This class implements feature selection using the Optuna optimization framework.

    Parameters:
    - model (object): The predictive model to evaluate; this should be any object that implements fit() and predict() methods.
    - loss_fn (function): The loss function to use for evaluating the model performance. This function should take the true labels and the
                          predictions as inputs and return a loss value.
    - features (list of str): A list containing the names of all possible features that can be selected for the model.
    - X (DataFrame): The complete set of feature data (pandas DataFrame) from which subsets will be selected for training the model.
    - y (Series): The target variable associated with the X data (pandas Series).
    - splits (list of tuples): A list of tuples where each tuple contains two elements, the train indices and the validation indices.
    - penalty (float, optional): A factor used to penalize the objective function based on the number of features used.
    """

    def __init__(self,
                 model,
                 loss_fn,
                 features,
                 X,
                 y,
                 splits,
                 penalty=0):
        self.model = model
        self.loss_fn = loss_fn
        self.features = features
        self.X = X
        self.y = y
        self.splits = splits
        self.penalty = penalty

    def __call__(self,
                 trial: optuna.trial.Trial):
        # Select True / False for each feature
        selected_features = [trial.suggest_categorical(name, [True, False]) for name in self.features]

        # List with names of selected features
        selected_feature_names = [name for name, selected in zip(self.features, selected_features) if selected]

        # Optional: adds a penalty for the number of features used
        n_used = len(selected_feature_names)
        total_penalty = n_used * self.penalty

        loss = 0

        for split in self.splits:
            train_idx = split[0]
            valid_idx = split[1]

            X_train = self.X.iloc[train_idx].copy()
            y_train = self.y.iloc[train_idx].copy()
            X_valid = self.X.iloc[valid_idx].copy()
            y_valid = self.y.iloc[valid_idx].copy()

            X_train_selected = X_train[selected_feature_names].copy()
            X_valid_selected = X_valid[selected_feature_names].copy()

            # Train model, get predictions and accumulate loss
            self.model.fit(X_train_selected, y_train)
            pred = self.model.predict(X_valid_selected)

            loss += self.loss_fn(y_valid, pred)

        # Take the average loss across all splits
        loss /= len(self.splits)

        # Add the penalty to the loss
        loss += total_penalty

        return loss
The key part is where we define which features to use. We treat each feature as one parameter, which can take the values True or False, indicating whether the feature should be included in the model. We use the suggest_categorical method so that Optuna selects one of the two possible values for each feature.
We now initialize our Optuna study and run the search for 100 trials. Notice that we enqueue a first trial using all features as a starting point for the search, allowing Optuna to compare subsequent trials against a fully-featured model:
from optuna.samplers import TPESampler


def loss_fn(y_true, y_pred):
    """
    Returns the negative F1 score, to be treated as a loss function.
    """
    res = -f1_score(y_true, y_pred, average='weighted')
    return res


features = list(X_train.columns)
model = RandomForestClassifier(random_state=SEED)

sampler = TPESampler(seed=SEED)
study = optuna.create_study(direction="minimize", sampler=sampler)

# We first try the model using all features
default_features = {ft: True for ft in features}
study.enqueue_trial(default_features)

study.optimize(FeatureSelectionOptuna(
    model=model,
    loss_fn=loss_fn,
    features=features,
    X=X_train,
    y=y_train,
    splits=splits,
    penalty=1e-4,
), n_trials=100)
After completing the 100 trials, we retrieve the best trial from the study and the features used in it. These are the following:
[‘battery_power’, ‘blue’, ‘dual_sim’, ‘fc’, ‘mobile_wt’, ‘px_height’, ‘px_width’, ‘ram’, ‘sc_w’]
Notice that from the original 20 features, the search concluded with only 9 of them, which is a significant reduction. These features yielded a minimal validation loss of around -0.9117, which means they achieved an average F1 score of around 0.9108 across all folds (after adjusting for the penalty term).
The next step is to train the model on the full train set using these selected features and evaluate it on the test set. This results in an F1 score of around 0.882.
By selecting the right features, we were able to reduce our feature set by more than half, while still achieving a higher F1 score than with the full set. Below we discuss some pros and cons of using Optuna for feature selection:
Pros:
- Searches across feature sets efficiently, taking into account which feature combinations are most likely to produce good results.
- Adaptable to many scenarios: as long as there is a model and a loss function, we can use it for any feature selection task.
- Sees the whole picture: unlike methods that evaluate features individually, Optuna takes into account which features tend to go well with one another, and which don't.
- Dynamically determines the number of features as part of the optimization process. This can be tuned with the penalty term.
Cons:
- It's not as straightforward as simpler methods, and for smaller and simpler datasets it might not be worth it.
- Although it requires far fewer trials than other methods (like exhaustive search), it still typically requires around 100 to 1,000 trials. Depending on the model and dataset, this can be time-consuming and computationally expensive.
Next, we'll compare our approach to other common feature selection strategies.
Filter Methods — Chi-Squared
One of the simplest alternatives is to evaluate each feature individually using a statistical test and retain the top k features based on their scores. Notice that this approach doesn't require any machine learning model. For example, for the classification task, we can choose the chi-squared test, which determines whether there is a statistically significant association between each feature and the target variable. We'll use the SelectKBest class from scikit-learn, which applies the score function (chi-squared) to each feature and returns the top k scoring variables. Unlike the Optuna method, the number of features isn't determined within the selection process, but must be set beforehand. In this case, we'll set this number to ten. These approaches fall within the filter methods category. They tend to be the easiest and fastest to compute, since they don't require any model behind them.
from sklearn.feature_selection import SelectKBest, chi2

skb = SelectKBest(score_func=chi2, k=10)
skb.fit(X_train, y_train)

scores = pd.DataFrame(skb.scores_)
cols = pd.DataFrame(X_train.columns)
featureScores = pd.concat([cols, scores], axis=1)
featureScores.columns = ['feature', 'score']
featureScores.nlargest(10, 'score')
In our case, ram scored the highest by far in the chi-squared test, followed by px_height and battery_power. Notice that these features were also selected by our Optuna method above, along with px_width, mobile_wt and sc_w. However, there are some new additions like int_memory and talk_time — these weren't picked by the Optuna study. After training the random forest with these 10 features and evaluating it on the test set, we achieved an F1 score slightly higher than our previous best, at roughly 0.888.
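The "train with the top k" step can be sketched as follows. This is a self-contained toy on synthetic non-negative data (the chi-squared score requires non-negative inputs), not the mobile price dataset; with the article's data you would subset `X_train` / `X_test` with the top-10 column names instead:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic count-like data (chi2 needs non-negative feature values)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 50, size=(400, 12)),
                 columns=[f"f{i}" for i in range(12)])
y = (X["f0"] + X["f1"] > 50).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Keep the k best-scoring features, then train and evaluate on that subset
skb = SelectKBest(score_func=chi2, k=5).fit(X_tr, y_tr)
top_features = list(X_tr.columns[skb.get_support()])

model = RandomForestClassifier(random_state=0).fit(X_tr[top_features], y_tr)
test_f1 = f1_score(y_te, model.predict(X_te[top_features]), average="weighted")
print(top_features, round(test_f1, 3))
```

Note the use of `get_support()` to recover the selected column names directly, as an alternative to building the scores DataFrame by hand.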
Pros:
- Model agnostic: doesn't require a machine learning model.
- Easy and fast to implement and run.
Cons:
- It needs to be adapted for each task. For instance, some score functions are only applicable to classification tasks, and others only to regression tasks.
- Greedy: depending on the alternative used, it usually looks at features one by one, without taking into account which ones are already included in the set.
- Requires the number of features to select to be set beforehand.
Wrapper Methods — Forward Search
Wrapper methods are another category of feature selection strategies. These are iterative methods: they involve training the model with a set of features, evaluating its performance, and then deciding whether to add or remove features. Our Optuna method falls within this category. However, the most common examples include forward selection and backward selection. With forward selection, we begin with no features and, at each step, greedily add the feature that provides the greatest performance gain, until a stopping criterion is met (number of features or performance decline). Conversely, backward selection starts with all features and iteratively removes the least significant one at each step.
Below, we try the SequentialFeatureSelector class from scikit-learn, performing a forward selection until we find the top 10 features. This approach will also make use of the 5-fold split we prepared above, averaging performance across the validation splits at each step.
from sklearn.feature_selection import SequentialFeatureSelector

model = RandomForestClassifier(random_state=SEED)
sfs = SequentialFeatureSelector(model, n_features_to_select=10, cv=splits)
sfs.fit(X_train, y_train)

selected_features = list(X_train.columns[sfs.get_support()])
print(selected_features)
This approach ends up selecting the following features:
[‘battery_power’, ‘blue’, ‘fc’, ‘mobile_wt’, ‘px_height’, ‘px_width’, ‘ram’, ‘talk_time’, ‘three_g’, ‘touch_screen’]
Again, some are common to the previous methods, and some are new (e.g., three_g and touch_screen). Using these features, the Random Forest achieves a lower test F1 score, slightly below 0.88.
Pros:
- Easy to implement in just a few lines of code.
- It can also be used to determine the number of features to use (via the tolerance parameter).
Cons:
- Time consuming: starting with zero features, it trains the model once per candidate variable and keeps the best one. For the next step, it again tries out all remaining features (now together with the previously chosen one), and again selects the best one. This is repeated until the desired number of features is reached.
- Greedy: once a feature is included, it stays. This may lead to suboptimal results, since the feature providing the greatest individual gain in early rounds might not be the best choice in the context of other feature interactions.
Feature Importance
Finally, we'll explore another simple selection strategy, which involves using the feature importances the model learns (if available). Certain models, like Random Forests, provide a measure of which features are most important for prediction. We can use these rankings to filter out those features that, according to the model, have the least importance. In this case, we train the model on the full train dataset and retain the ten most important features:
model = RandomForestClassifier(random_state=SEED)
model.fit(X_train, y_train)

importance = pd.DataFrame({'feature': X_train.columns, 'importance': model.feature_importances_})
importance.nlargest(10, 'importance')
Notice how, once again, ram is ranked highest, far above the second most important feature. Training with these 10 features, we obtain a test F1 score of almost 0.883, similar to the ones we've been seeing. Also, note how the features selected via feature importance are the same as those selected using the chi-squared test, although they are ranked differently. This difference in ranking leads to a slightly different outcome.
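The "retain the top ten and retrain" step can be sketched as below. This is a self-contained toy on synthetic data, with an illustrative cutoff of 6 features; with the article's objects, you would take the index of `importance.nlargest(10, 'importance')` and subset `X_train` / `X_test` with those column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's dataset
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(12)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Rank features by the forest's impurity-based importances
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
importance = pd.DataFrame({"feature": X_tr.columns,
                           "importance": model.feature_importances_})
top_features = list(importance.nlargest(6, "importance")["feature"])

# Retrain using only the top-ranked features and evaluate on the test split
model = RandomForestClassifier(random_state=0).fit(X_tr[top_features], y_tr)
test_f1 = f1_score(y_te, model.predict(X_te[top_features]), average="weighted")
print(top_features, round(test_f1, 3))
```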
Pros:
- Easy and fast to implement: it requires a single training of the model and directly uses the derived feature importances.
- It can be adapted into a recursive version, in which at each step the least important feature is removed and the model is then trained again (see Recursive Feature Elimination).
- Contained within the model: if the model we're using provides feature importances, we already have a feature selection alternative available at no extra cost.
Cons:
- Feature importance might not be aligned with our end goal. For instance, a feature might appear unimportant on its own but could be crucial due to its interaction with other features. Also, an important feature might be counterproductive overall, by affecting the performance of other useful predictors.
- Not all models offer feature importance estimation.
- Requires the number of features to select to be predefined.
To conclude, we've seen how to use Optuna, a powerful optimization tool, for the feature selection task. By efficiently navigating the search space, it is able to find good feature subsets with relatively few trials. Not only that, but it is also flexible and can be adapted to many scenarios, as long as we have a model and a loss function defined.
Throughout our examples, we saw that all methods yielded similar feature sets and results. This is mainly because the dataset we used is rather simple. In these cases, simpler methods already produce a good feature selection, so it wouldn't make much sense to go with the Optuna approach. However, for more complex datasets, with more features and intricate relationships between them, using Optuna might be a good idea. So, all in all, given its relative ease of implementation and ability to deliver good results, using Optuna for feature selection is a worthwhile addition to the data scientist's toolkit.
Thanks for reading!