Chapter 8: Feature Selection Using Metaheuristic Algorithms
Each feature in a dataset can have a
main effect and an interaction effect, so different combinations of features
yield different model performance. This makes feature selection an inherently
combinatorial problem: we need to find the combination of features that gives
the best model performance. As the number of features grows, the number of
possible combinations grows with it, and so does the computational cost of
trying them all. Metaheuristic algorithms help by evaluating only a limited
number of candidate solutions, searching for better solutions iteratively. They
typically start from randomly generated solutions and try to improve them at
each iteration. Metaheuristic algorithms are procedures that can find a good
solution to optimization problems that are otherwise difficult or impractical
to solve by hand. These partial search algorithms may provide a good enough
solution, if not a perfect one. They are very useful for feature selection, as
they can help us find better feature sets than we could find by manually trying
different combinations.
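To make the combinatorial growth concrete, with n candidate features there are 2^n - 1 non-empty feature subsets, so exhaustive search quickly becomes impractical. A quick Python sketch of how fast this count grows:

# Number of non-empty feature subsets grows exponentially with the
# number of features n, which is why trying every combination quickly
# becomes infeasible.
for n in (3, 10, 20, 30):
    print(n, 2 ** n - 1)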
We will discuss four metaheuristic
algorithms in this chapter: genetic algorithm, simulated annealing, ant colony
optimization, and particle swarm optimization. We have developed a companion
Python library, MetaheuristicsFS, which implements feature selection with all
four metaheuristics. Its module FeatureSelection helps us perform the desired
feature selection.
Some parameters are common across
all metaheuristic algorithms in this library, for example, the cross-validation
data, the validation dataset, and the names of all input features. Imagine a
scenario where you want to try multiple metaheuristic algorithms: you would
need to enter these common parameters repeatedly, once for each algorithm. To
avoid this, the MetaheuristicsFS library uses a singleton approach. In the
first step, we create a feature selection object by providing the common input
parameters. In the second step, we can initialize any of the four listed
metaheuristic algorithms from that object. We will discuss this in the
subsequent sections.
First, we will import the FeatureSelection module from the
MetaheuristicsFS Python library using the syntax below.
from MetaheuristicsFS import FeatureSelection
Let's understand all the required
input fields for the FeatureSelection module.
columns_list: A Python list containing the names of all
the features as strings. These feature names must be present
in the training, test, validation, and external validation datasets. For
example, if there are 3 features x1, x2, and x3, it will be represented as
columns_list = ['x1', 'x2', 'x3']. Based on this input list of
features, the search
algorithms create different combinations to find the best possible feature
combination.
data_dict: A Python dictionary containing training and
test data for multiple cross-validations.
Each key represents one
cross-validation fold; for example, the keys 0, 1, 2, 3, and 4 represent 5
separate cross-validations. If a user wants to perform 5-fold cross-validation,
data_dict should have 5 key-value pairs with 0, 1, 2, 3, and 4 as keys. Each
pair is created by shuffling the dataframe and splitting it into train and test
sets.
The value against each cross-validation key
is a nested dictionary containing the features and dependent variable as
dataframe objects. The keys 'x_train' and 'x_test'
hold the feature dataframes for the training and test
data, respectively. Similarly, the keys 'y_train' and 'y_test' hold the
dependent variable as dataframe objects for the training and test data,
respectively.
Below is what the dictionary
structure looks like for 2-fold cross-validation (the dataframe names here are
placeholders).

data_dict = {
    0: {'x_train': x_train_fold0, 'y_train': y_train_fold0,
        'x_test': x_test_fold0, 'y_test': y_test_fold0},
    1: {'x_train': x_train_fold1, 'y_train': y_train_fold1,
        'x_test': x_test_fold1, 'y_test': y_test_fold1}
}
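One way to construct such a dictionary is with Sklearn's KFold, sketched below; the dataframe df and the target column name 'y' are assumptions for illustration.

from sklearn.model_selection import KFold

# Assumed: df is a pandas dataframe with features x1, x2, x3 and target y.
columns_list = ['x1', 'x2', 'x3']

data_dict = {}
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
    data_dict[fold] = {
        'x_train': df.iloc[train_idx][columns_list],
        'y_train': df.iloc[train_idx][['y']],
        'x_test': df.iloc[test_idx][columns_list],
        'y_test': df.iloc[test_idx][['y']],
    }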
x_validation_dataframe: The feature dataframe for the validation dataset.
y_validation_dataframe: The dependent variable for the validation dataset,
stored as a dataframe.
model: The initialized model, stored as an object. For
example, for linear regression in the Sklearn Python library, the
model object can be initialized as model
= LinearRegression().
This model object will then be used to train on the training data and to
predict on the test and validation data. It should have a .fit method for
training and a .predict method for predicting. Sklearn models, as well as
XGBoost and other major modeling libraries, provide these two methods. Deep
learning models, however, are not supported.
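As a brief illustration, any estimator object exposing .fit and .predict can be passed; the XGBoost line below is optional and assumes the xgboost package is installed.

from sklearn.linear_model import LinearRegression

# Any estimator object with .fit and .predict methods can be used.
model = LinearRegression()

# An XGBoost model would work the same way (requires the xgboost package):
# from xgboost import XGBRegressor
# model = XGBRegressor(n_estimators=100)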
cost_function_improvement: This parameter takes one of two string values,
'increase' or 'decrease', depending on the goal of the optimization.
Setting the value to 'increase'
makes the feature selection algorithm look for solutions where the model
metric increases across iterations. One example is the F1 score for a
classification model: we would like to obtain the model that gives us the
highest F1 score.
Setting the value to 'decrease' makes
the feature selection algorithm search for solutions where the cost is
lowest. For example, for regression models, RMSE is a commonly used cost
function, and it is desirable to obtain a model with the lowest RMSE.
Setting 'decrease' for this parameter makes the algorithm search for a
feature set that gives the lowest RMSE.
cost_function: The cost function computes the cost between actual and
predicted values. For a regression problem, some examples are root mean square
error and mean absolute error; for a classification problem, some examples
are F1 score, precision, and recall. The cost function should have two input
parameters, 'actual' and 'predicted', as arrays, and should return the cost
between the actual and predicted values. All the cost functions available in
Sklearn are supported.
Custom-made cost functions are also supported, as long as the function takes
two parameters, the actual values followed by the predicted values, and
returns a cost value.
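For instance, a minimal sketch of a custom RMSE cost function that follows this signature could look like the following.

import numpy as np

# Custom cost function: actual values first, predicted values second,
# returning a single cost number (here, root mean square error).
def rmse(actual, predicted):
    actual = np.asarray(actual).ravel()
    predicted = np.asarray(predicted).ravel()
    return np.sqrt(np.mean((actual - predicted) ** 2))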
average: For multi-class classification problems, cost
functions such as precision, recall, and F1 score in Sklearn have a parameter
called 'average', which specifies the type of averaging to be used when the
dependent variable has multiple classes. The 'average' value to pass to
Sklearn cost functions for multi-class classification can be assigned through
this parameter.
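As a small illustration of what this averaging does in Sklearn itself (the toy labels below are made up):

from sklearn.metrics import f1_score

actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 0]

# Multi-class metrics in Sklearn need an averaging scheme such as
# 'macro', 'micro', or 'weighted'; the same value is passed via 'average'.
print(f1_score(actual, predicted, average='weighted'))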
Now let's initialize a feature
selection object for a regression problem, where we will use a linear
regression model with 3 features and mean squared error as the cost function.
from sklearn.metrics import mean_squared_error
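The block below is a sketch of this initialization, using the parameter names described above; the data_dict, x_validation_dataframe, and y_validation_dataframe objects are assumed to have been prepared as discussed earlier.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from MetaheuristicsFS import FeatureSelection

# Common inputs shared by all four metaheuristic algorithms.
feature_selection = FeatureSelection(
    columns_list=['x1', 'x2', 'x3'],           # the 3 candidate features
    data_dict=data_dict,                        # cross-validation folds built earlier
    x_validation_dataframe=x_validation_dataframe,
    y_validation_dataframe=y_validation_dataframe,
    model=LinearRegression(),                   # any estimator with .fit/.predict
    cost_function=mean_squared_error,           # cost between actual and predicted
    cost_function_improvement='decrease'        # lower MSE is better
)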
After the feature selection object
has been initialized, it can be used to execute a specified metaheuristic
algorithm. Before we get into the metaheuristic algorithms, let's understand
why we need them in the first place by going through the first section of
the chapter, 'Exhaustive Feature Selection'.