5.1: Interaction Plot

The interaction plot is a visual companion to two-way ANOVA, which tests whether two factors, and the interaction between them, affect the dependent variable. If the p-value of the interaction term is below 0.05, we treat the interaction effect between the two factors as statistically significant. Before going further, let's review what one-way and two-way ANOVA are.

One-way ANOVA tests whether the group means of a continuous dependent variable differ across the levels of a single categorical feature. Two-way ANOVA does the same while considering the levels of two categorical features at once. In addition to the effect of each feature, it tests the interaction between the two categorical features on the continuous dependent variable.
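The difference between the two tests can be sketched in code. The example below is a minimal illustration, not the analysis behind this chapter's results: it assumes the statsmodels library, uses synthetic prices rather than the real dataset, and borrows the column names fuel, sellertype, and sellingprice from the car sales example discussed later.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 2 fuel types x 2 seller types,
# 50 cars per cell. Only Diesel cars sold by a Dealer get a price boost,
# which builds an interaction into the data on purpose.
cells = [(f, s) for f in ["Diesel", "Petrol"] for s in ["Dealer", "Individual"]]
df = pd.DataFrame(cells * 50, columns=["fuel", "sellertype"])
boost = ((df["fuel"] == "Diesel") & (df["sellertype"] == "Dealer")) * 3.0
df["sellingprice"] = 5.0 + boost + rng.normal(0, 1, len(df))

# One-way ANOVA: a single categorical feature.
one_way = anova_lm(ols("sellingprice ~ C(fuel)", data=df).fit(), typ=2)

# Two-way ANOVA: '*' expands to both main effects plus their interaction.
two_way = anova_lm(
    ols("sellingprice ~ C(fuel) * C(sellertype)", data=df).fit(), typ=2
)

# The row named 'C(fuel):C(sellertype)' holds the interaction test.
interaction_p = two_way.loc["C(fuel):C(sellertype)", "PR(>F)"]
print(one_way)
print(two_way)
print("interaction p-value:", interaction_p)
```

The one-way table has a single row for fuel, while the two-way table adds a row for sellertype and a row for the interaction term. It is the p-value in that last row that we compare against 0.05.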

There are two steps for concluding whether two categorical features interact. In the first step, we perform two-way ANOVA and check the p-value of the interaction effect. If the interaction effect is significant, we move to the second step and plot the relationship with an interaction plot, which helps us understand the relationship visually. If the lines in the plot are parallel, we can conclude that there is no interaction. If the lines intersect each other, there is an interaction between the features. If the lines neither intersect nor run parallel, there is some degree of interaction.
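An interaction plot draws the mean of the dependent variable for every combination of the two features, with one line per level of the second feature. A minimal sketch with pandas and matplotlib follows. The eight rows of data are made up for illustration, the column names follow the car sales example used later, and the Agg backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed in a notebook
import pandas as pd

# Made-up toy data, not taken from the real dataset.
df = pd.DataFrame({
    "fuel": ["Diesel"] * 4 + ["Petrol"] * 4,
    "sellertype": ["Dealer", "Dealer", "Individual", "Individual"] * 2,
    "sellingprice": [9, 11, 5, 7, 6, 8, 4, 6],
})

# Cell means: rows are fuel levels, columns are sellertype levels.
means = (
    df.groupby(["fuel", "sellertype"])["sellingprice"].mean().unstack("sellertype")
)

# One line per sellertype across the fuel levels.
ax = means.plot(marker="o")
ax.set_xticks(range(len(means.index)))
ax.set_xticklabels(means.index)
ax.set_xlabel("fuel")
ax.set_ylabel("mean sellingprice")
ax.set_title("Interaction plot of fuel and sellertype")

# With parallel lines, the gap between the lines is the same at every fuel level.
gap = means["Dealer"] - means["Individual"]
print(means)
print(gap)
```

In this toy data the gap between dealer and individual prices is 4 for diesel and 2 for petrol. The lines therefore neither cross nor run parallel, which corresponds to the "some degree of interaction" case described above.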

The final decision to accept or reject the presence of an interaction effect should be made after checking the p-value of ANOVA. To avoid overfitting, we perform ANOVA on the training data of every cross-validation fold. If the result is significant across all folds, we consider the interaction effect statistically significant and valid.
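The cross-validation check can be sketched as follows. This is one possible implementation under stated assumptions, not the exact procedure used for the results in this chapter: the helper interaction_pvalues is a hypothetical name, the folds are drawn with a simple NumPy shuffle, statsmodels performs the ANOVA, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm


def interaction_pvalues(train_df, n_folds=5, seed=0):
    """Return the interaction p-value from each fold's training split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(train_df)), n_folds)
    pvalues = []
    for held_out in folds:
        # Fit only on the rows that are NOT held out in this fold.
        fold_train = train_df.drop(train_df.index[held_out])
        model = ols("sellingprice ~ C(fuel) * C(owner)", data=fold_train).fit()
        table = anova_lm(model, typ=2)
        pvalues.append(table.loc["C(fuel):C(owner)", "PR(>F)"])
    return pvalues


# Synthetic training data (an assumption) with a built-in interaction:
# only Diesel cars sold by a First owner get a price boost.
rng = np.random.default_rng(42)
cells = [(f, o) for f in ["Diesel", "Petrol"] for o in ["First", "Second"]]
train = pd.DataFrame(cells * 60, columns=["fuel", "owner"])
boost = ((train["fuel"] == "Diesel") & (train["owner"] == "First")) * 3.0
train["sellingprice"] = 5.0 + boost + rng.normal(0, 1, len(train))

pvalues = interaction_pvalues(train)
significant_everywhere = all(p < 0.05 for p in pvalues)
print(pvalues, significant_everywhere)
```

Requiring significance in every fold is a deliberately strict rule. An interaction that shows up in only some folds is treated as not validated.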

For the hotel room booking dataset, all the categorical features are either derived from the check-in date or are quartile features built from numerical features such as booking trend and revenue. Hence, it is not meaningful to perform ANOVA between the dependent variable and categorical variables derived from the dependent variable. Nor is it meaningful to perform ANOVA with only categorical variables derived from the seasonality of the date feature.

Let's go through the car sales regression dataset. We will use a few categorical features to explain the interaction plot. We will start with the fuel and sellertype categorical features against the dependent variable sellingprice. Figure 5.1 shows the interaction plot between the features and the dependent variable.

Figure 5.1: Interaction plot of fuel and sellertype against sellingprice

From the graph we can see that dealer cars have the highest price, regardless of the fuel type. These are followed by petrol cars sold by Trustmark dealers. LPG cars, followed by CNG cars, have the lowest prices when sold by individual sellers. Although we can infer these relationships from the graph, they were not validated by the p-value of ANOVA on the training data across all cross-validation samples. Also, none of the lines in the plot cross each other. Hence, we conclude that there is no interaction effect between fuel and sellertype.

Now let's look at the relationship between fuel and owner on car selling price in Figure 5.2.

Figure 5.2: Interaction plot of fuel and owner against sellingprice

From Figure 5.2 we can see that petrol test drive cars have the highest price. We can infer this from the position of the blue dot in the top right section of the plot. These are followed by diesel cars sold by first owners, which have the second highest price. Petrol cars sold by first owners fetch a lower price than diesel cars sold by first owners.

The lines for second, third, and fourth owners intersect at LPG, which means LPG cars sell at similar prices regardless of the number of times they have been sold in the past. The only exception is the first owner, for whom the selling price is relatively higher.

We also tested the relationship with ANOVA on the training data across 5 cross-validation samples. In all the samples, the relationship was significant. We can therefore create an interaction effect feature from these two categorical features.
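One simple way to build such a feature is to concatenate the two categorical values into a single combined category and then encode it. The sketch below assumes pandas and uses a handful of made-up rows. The column name fuel_x_owner and the one-hot encoding are illustrative choices, not the only option.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the text.
df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "LPG", "Diesel"],
    "owner": ["First", "First", "Second", "Third"],
})

# Each distinct (fuel, owner) pair becomes its own category.
df["fuel_x_owner"] = df["fuel"] + "_" + df["owner"]

# One-hot encode the combined category so a model can use it.
encoded = pd.get_dummies(df["fuel_x_owner"], prefix="fuel_x_owner")
print(df)
print(encoded.columns.tolist())
```

Because every pair gets its own column, the model can assign a separate effect to, for example, diesel cars sold by first owners, which is exactly the kind of pattern the interaction plot reveals.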

The best way to find an interaction effect with an interaction plot is to first perform ANOVA across the different training samples in cross-validation. If the result is significant in all the cross-validation samples, we can then use an interaction plot to get an intuitive explanation of the type of relationship that exists. Finally, we can use it to create interaction effect features.