4.1: Engineering Categorical Features
A categorical, or nominal, feature has multiple unique categories with no order of importance among them. An example is the different departments in a company, such as finance, human resources, sales, and marketing. Categorical features cannot be fed to machine learning models as they are. Instead, they must be preprocessed and converted into a format that the model can understand. These techniques are called encodings. Many different types of encodings are available; we will now look at the most common ones for categorical features.
4.1.1 Dummy Encoding or One-Hot Encoding
Linear algorithms cannot learn from categorical features as they are. Instead, these features need to be converted to a one-hot encoded format for the model to learn from them. This is also useful when treating auto-correlation in time series, where we can create seasonality-related dummy variables.
Let's take the example of the feature 'weather' from the coupon recommendation dataset. It has three categories, namely 'Sunny', 'Rainy', and 'Snowy'. We can use the pandas function get_dummies to create dummy-encoded variables. Below is what the output will look like.
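As a minimal sketch with pandas (the sample rows are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# illustrative sample of the 'weather' feature
df = pd.DataFrame({'weather': ['Sunny', 'Rainy', 'Snowy', 'Sunny']})

# one binary column per category, marking each row's category
dummies = pd.get_dummies(df['weather'], prefix='weather')
print(dummies)
```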
While creating dummy encodings, we should be careful about the 'dummy variable trap'. This is a scenario in which the dummy variables are perfectly correlated with each other, since any one dummy column can be derived from the remaining ones. For linear models this is problematic and leads to multicollinearity. To mitigate it, we can drop one of the columns. In pandas, we do this by changing the parameter in pd.get_dummies from drop_first=False to drop_first=True.
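Continuing the sketch above, dropping the first category removes the redundant column:

```python
# 'weather_Rainy' (first alphabetically) is dropped; it is implied
# whenever the two remaining dummy columns are both 0
dummies = pd.get_dummies(df['weather'], prefix='weather', drop_first=True)
```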
4.1.2 Label Encoding
Tree-based algorithms can learn from categorical features without dummy variables. Tree models instead require categorical features to be represented as labels. In this method, a unique number is assigned to each category. These numbers serve only to distinguish one category from another; they do not represent rank, order of importance, or usefulness.
We can use the LabelEncoder class from the scikit-learn library to convert categorical features into label-encoded features. Below is how the time feature will look before and after label encoding.
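A sketch with scikit-learn, using a few illustrative values for the time feature (the actual dataset values may differ):

```python
from sklearn.preprocessing import LabelEncoder

# illustrative values for the 'time' feature
times = ['7AM', '10AM', '2PM', '6PM', '10PM', '2PM']

le = LabelEncoder()
encoded = le.fit_transform(times)

# mapping of each category to its assigned integer label
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(encoded)
```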
4.1.3 Count and Percentage Encoding
In count encoding, we replace each category with its count of occurrence; in percentage encoding, we replace it with its percentage of occurrence. These encodings can be used for both linear and tree-based algorithms. If a specific category is present far more often than the others, its count replacement could become an outlier; in this situation, we can apply a log transformation, which will smooth the effect of the outlier. Also, count and percentage encodings should be computed from the counts in the training data only, not from the entire dataset, as otherwise they will lead to data leakage and overfitting. Below is the count and percentage encoding for the passanger feature in the coupon recommendation dataset.
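A minimal sketch, assuming train and test are the training and test splits, with a few illustrative category values ('passanger' is the dataset's own column spelling):

```python
import numpy as np
import pandas as pd

# illustrative splits; 'passanger' keeps the dataset's own spelling
train = pd.DataFrame({'passanger': ['Alone', 'Alone', 'Friend(s)',
                                    'Partner', 'Alone', 'Kid(s)']})
test = pd.DataFrame({'passanger': ['Alone', 'Partner']})

# compute encodings on the training split only, to avoid leakage
counts = train['passanger'].value_counts()
percents = train['passanger'].value_counts(normalize=True) * 100

train['passanger_count'] = train['passanger'].map(counts)
train['passanger_pct'] = train['passanger'].map(percents)

# apply the same training-derived mappings to the test split
test['passanger_count'] = test['passanger'].map(counts)
test['passanger_pct'] = test['passanger'].map(percents)

# if one category dominates, a log transform smooths the outlier counts
train['passanger_logcount'] = np.log1p(train['passanger_count'])
```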
4.1.4 Encoding by Rank of Counts
One of the problems with count encoding is that if a specific category is present the majority of the time, it can make the counts look skewed. One way to fix this is a log transformation. Another is to rank categories by their counts and replace each category with its rank. We first take the count of each category in the feature and sort the categories by count in ascending order. The category with the lowest count is given the value 1, and each subsequent category in ascending order receives a rank incremented by 1. As before, these encodings should be derived from training data only.
An additional advantage of rank-of-counts encoding is that it is useful for both linear and non-linear models.
For the passanger feature in the coupon recommendation dataset, below is what the rank of counts will look like.
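Continuing with the train and test splits from the previous sketch:

```python
# rank categories by training-set count; the least frequent gets rank 1,
# and each subsequent category in ascending order gets the next integer
counts = train['passanger'].value_counts()
ranks = counts.rank(method='dense', ascending=True)

train['passanger_rankcount'] = train['passanger'].map(ranks)
test['passanger_rankcount'] = test['passanger'].map(ranks)
```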
4.1.5 Target Encoding
We can take summary statistics of the dependent variable for each category and use them to replace the categories. The most commonly used summary statistic is the mean of the dependent variable: we calculate the mean of the dependent variable for each category and replace the actual categories with these mean values. Mean encodings can be applied to both regression and classification problems.
For regression problems, we can also use quantiles, such as the 25th percentile, median, and 75th percentile, instead of the mean, or replace each category with the standard deviation of the dependent variable for that category. If the dependent variable has outliers, we can first convert it to a log scale and then calculate the desired summary statistics for each category. Below is an example of mean encoding of the occupation feature in the coupon recommendation dataset.
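A minimal sketch, again assuming train and test splits, and assuming 'Y' is the name of the 0/1 coupon-acceptance target column (both names are assumptions for illustration):

```python
# mean of the target per category, computed on the training split only;
# 'Y' is assumed here to be the 0/1 coupon-acceptance target column
means = train.groupby('occupation')['Y'].mean()

train['occupation_te'] = train['occupation'].map(means)
test['occupation_te'] = test['occupation'].map(means)

# for regression targets, other summary statistics work the same way, e.g.:
# medians = train.groupby('occupation')['Y'].median()
```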
This encoding is useful when the categorical feature has too many categories. We should, however, calculate these encodings only on the training data and then apply them to the validation and test data; this prevents overfitting.
For categorical features with high cardinality, with the last three encodings discussed, it is possible that a category is present only in the test or validation data and absent from the training data. In such a case, the encoding will not be representative of all the data, and we can simply drop the encoded feature. Although not ideal, as a workaround we can also fill in the encoding for the missing category from the training data of another cross-validation sample. This ensures that we have an encoding value for the missing category while keeping data leakage minimal, since we take the encoding only for the missing category and only from the training data of another cross-validation sample. Computing the encoding from the entire dataset should always be avoided, as it will lead to data leakage and overfitting. In the rare case where a specific category is present only in the test data and not in the training data of any cross-validation sample, we may obtain its encoding from the test data. If a feature has too many categories that are not present in the training data, we should avoid using count, percentage, rank-of-counts, and target encoding for it.
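To see which categories are affected in practice, note that a training-derived mapping leaves unseen categories as NaN. A short sketch, reusing the means mapping from the target-encoding example above:

```python
# categories present only in the test split map to NaN
test['occupation_te'] = test['occupation'].map(means)

# list the unseen categories, candidates for the cross-validation workaround
unseen = test.loc[test['occupation_te'].isna(), 'occupation'].unique()
print(unseen)
```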