3.2: Car Sales

The dependent variable for this dataset is numerical, so the modeling problem is a regression. Let's explore the numerical features through the pair plot in figure 3.2.1 and ask the 3 questions.

Question 1: What do the patterns in this visualization say?

It appears that there is a strong relationship between the selling price of the car and the year from which the car has been in use. The same holds for the number of kilometers the car has been driven. Interestingly, the number of seats has a much weaker relationship with the car price.

Max_power and engine have a strong relationship with the selling price. Mileage also has some noticeable patterns.

Figure 3.2.1: Pair plot of the dependent variable with numerical features

Question 2: So, what does this pattern say about my problem statement, and how can it affect my problem statement?

Let's confirm what we saw in the first plot by quantifying the extent of the relationship by checking the correlation heat map in figure 3.2.2.

Figure 3.2.2: Correlation heatmap of numerical features with the selling price.

We can see that max_power has the strongest correlation with the selling price, followed by engine. Mileage, on the other hand, has a mild negative correlation with the selling price of used cars.

The selling price has a positive correlation with the year. In other words, the newer the car, the higher the price it can be sold for. On the other hand, car price has a negative relationship with the number of kilometers driven. A car that has been driven for long distances has been through more wear and tear, so it is likely to fetch a lower price. Finally, the number of seats has a weak positive correlation with the selling price. Buyers seem to value a higher number of seats, but only slightly.
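The numbers behind a heatmap like this can be computed directly. The following is a minimal sketch using pandas, not the exact code behind figure 3.2.2. The rows are made-up toy values, and the column name km_driven is an assumption for the kilometers-driven feature.

```python
import pandas as pd

# Toy rows invented for illustration; they are NOT the book's dataset.
# Column names follow the text, except km_driven, which is assumed.
cars = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000, 130000, 900000],
    "year":          [2014, 2014, 2006, 2010, 2007, 2018],
    "km_driven":     [145500, 120000, 140000, 127000, 120000, 30000],
    "max_power":     [74.0, 103.5, 78.0, 90.0, 88.2, 120.0],
    "seats":         [5, 5, 5, 5, 5, 7],
})

# Pearson correlation of every numerical feature with the target,
# sorted from most positive to most negative.
corr_with_price = (
    cars.corr()["selling_price"]
    .drop("selling_price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Sorting the single selling_price column of the correlation matrix gives the same ranking we read off the heatmap, without having to compare colors by eye.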

Question 3: Now what should I do to incorporate the patterns discovered during EDA? Should I include this information as a new feature or should I perform data cleaning?

For the features that have a high correlation, we can check whether the original features give better performance or whether higher order features derived from them do. This is even more applicable for seats, where the linear correlation is weak. We will need to check whether any higher order feature of seats helps us get better performance.

If we go by common knowledge about cars, sports cars are sold at the highest prices, although they have the lowest number of seats. They are followed by SUVs, which have a higher number of seats and also fetch high prices. However, cars that have 4-5 seats are used by middle class people and have a relatively lower selling price. Hence, there is a non-linear relationship between the number of seats and car price. This could also be applicable for used cars. Higher order feature engineering might be able to uncover this non-linear relationship. We can also treat the number of seats as ordinal and try higher order ordinal features to see if they work better. Now, let's also explore categorical features.
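To make the idea of a non-linear seats relationship concrete, here is a small sketch. The data is a made-up toy sample shaped like the story above (2-seaters expensive, 4-5 seaters cheap, 7-8 seaters in between), so the numbers only illustrate the technique. The helper r_squared is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy sample invented for illustration; NOT the book's dataset.
cars = pd.DataFrame({
    "seats": [2, 2, 4, 5, 5, 5, 7, 7, 8],
    "selling_price": [2500000, 3000000, 300000, 400000, 350000,
                      500000, 1200000, 1500000, 1100000],
})

# View 1: the average price per seat count exposes the U shape.
mean_price_by_seats = cars.groupby("seats")["selling_price"].mean()
print(mean_price_by_seats)

# View 2: a second order feature that a linear model could use.
cars["seats_sq"] = cars["seats"] ** 2


def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return 1 - residuals.var() / y.var()


x = cars["seats"].to_numpy(dtype=float)
y = cars["selling_price"].to_numpy(dtype=float)
r2_linear = r_squared(x, y, degree=1)     # seats only
r2_quadratic = r_squared(x, y, degree=2)  # seats and seats squared
print(f"linear R^2: {r2_linear:.3f}, quadratic R^2: {r2_quadratic:.3f}")
```

On training data, the quadratic fit can never score lower than the linear one, so the real check is whether the gain survives on held-out data.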

Let's start with the owner feature in figure 3.2.3. This feature has values representing how many people have owned the car. First Owner means that the car is still with its first buyer, whereas Second Owner means the car has been owned by 2 owners, including the current owner. In our dataset, first-owner and second-owner cars make up 65.9% and 25.5% of the rows respectively. Together they constitute the majority group.
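Shares like these come from a normalized value count. The sketch below uses a made-up toy column, so its percentages are not the 65.9% and 25.5% quoted above. The label Third Owner is an assumption about how the category is spelled.

```python
import pandas as pd

# Toy column invented for illustration; NOT the book's dataset.
owner = pd.Series(
    ["First Owner"] * 6 + ["Second Owner"] * 3 + ["Third Owner"],
    name="owner",
)

# Share of each category in percent, largest first.
owner_share = owner.value_counts(normalize=True).mul(100).round(1)
print(owner_share)
```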

 

Figure 3.2.3: Boxplot of selling_price for each type of owner

Question 1: What do the patterns in this visualization say?

There is a huge degree of variation in the prices at which first-owner and second-owner cars are sold, as evident from the number of outliers in the boxplot for these 2 categories. This is also true for third-owner cars. Test drive cars and cars that have had four or more owners have relatively stable prices, as they do not have any outliers.
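The outliers drawn in a boxplot usually follow the 1.5 x IQR rule, so we can count them per category instead of reading them off the plot. This is a minimal sketch on made-up toy prices (in thousands). The function name count_outliers is hypothetical, and the category labels are assumptions about how the owner values are spelled.

```python
import pandas as pd

# Toy prices (in thousands) invented for illustration; NOT the book's dataset.
cars = pd.DataFrame({
    "owner": ["First Owner"] * 6 + ["Test Drive Car"] * 3,
    "selling_price": [300, 350, 400, 450, 500, 5000, 4000, 4200, 4400],
})


def count_outliers(prices):
    """Count values outside the boxplot whiskers (1.5 x IQR rule)."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(outside.sum())


outliers_per_owner = cars.groupby("owner")["selling_price"].apply(count_outliers)
print(outliers_per_owner)
```

In this toy sample, the single very expensive first-owner car is flagged, while the tightly grouped test drive cars produce no outliers.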

Cars that have had four or more owners fetch the lowest prices, as evidenced by the average selling price in figure 3.2.4.

Figure 3.2.4: Average selling price by type of owner.

Question 2: So, what does this pattern say about my problem statement, and how can it affect my problem statement?

It seems that first, second, and third-owner cars have lower average selling prices than test drive cars. However, these categories contain many outliers with very high prices, and we need to account for them.
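One simple way to see how much the outliers matter is to put the mean next to the median for each owner type, since the median is far less sensitive to a few very expensive cars. This is a sketch on made-up toy prices (in thousands), and the category labels are assumptions about how the owner values are spelled.

```python
import pandas as pd

# Toy prices (in thousands) invented for illustration; NOT the book's dataset.
cars = pd.DataFrame({
    "owner": (["First Owner"] * 4 + ["Test Drive Car"] * 3
              + ["Fourth & Above Owner"] * 3),
    "selling_price": [300, 350, 400, 5000, 4000, 4200, 4400, 150, 200, 250],
})

# Mean and median selling price per owner type, lowest mean first.
price_summary = (
    cars.groupby("owner")["selling_price"]
    .agg(["mean", "median"])
    .sort_values("mean")
)
print(price_summary)
```

A large gap between the mean and the median of a category is a hint that a few extreme prices are pulling the average up.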

Question 3: Now what should I do to incorporate the patterns discovered during EDA? Should I include this information as a new feature or should I perform data cleaning?

From common knowledge, we know that sports cars and SUVs have higher prices than other cars. Let's verify this with a word cloud of the car brand names from the brand column, restricted to cars whose selling_price is above the 90th percentile. Figure 3.2.5 shows this word cloud.

Figure 3.2.5: Brand names of cars with selling_price above the 90th percentile.
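A word cloud is essentially a picture of brand frequencies, so the same information can be checked as a table. The sketch below filters cars above the 90th percentile of selling_price and counts their brands. The prices (in thousands) and brand names are made up for illustration and do not reproduce figure 3.2.5.

```python
import pandas as pd

# Toy data invented for illustration; NOT the book's dataset.
cars = pd.DataFrame({
    "brand": ["Maruti"] * 8 + ["Hyundai"] * 8 + ["BMW"] * 2 + ["Audi"] * 2,
    "selling_price": list(range(200, 1000, 50)) + [3000, 5000, 3500, 4500],
})

# Price above which a car belongs to the top 10% of the sample.
threshold = cars["selling_price"].quantile(0.90)

# Brand frequencies among those cars; a word cloud would draw these counts.
top_brands = cars.loc[cars["selling_price"] > threshold, "brand"].value_counts()
print(threshold)
print(top_brands)
```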

Most of the cars whose selling_price is an outlier are considered premium or luxury cars in the Indian market. If we could distinguish these cars from the others while modeling, it could give us comparatively better performance. We can represent this information as a feature in 2 different ways. Firstly, we can create an additional binary (1 or 0) feature that flags these car brands. Secondly, we can create a higher order feature that holds the average selling price for each car model or brand.
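Both options can be sketched in a few lines of pandas. The data, the set premium_brands, and the new column names is_premium_brand and brand_mean_price are all hypothetical choices made for this illustration. Also note that the brand average is computed from the target itself, so in a real project it should be computed on the training split only, to avoid leaking the target into validation data.

```python
import pandas as pd

# Toy data (prices in thousands) invented for illustration; NOT the book's dataset.
cars = pd.DataFrame({
    "brand": ["Maruti", "Maruti", "Hyundai", "BMW", "BMW", "Audi"],
    "selling_price": [300, 500, 400, 3000, 5000, 4000],
})

# Option 1: binary flag for brands treated as premium.
# The brand list here is a hypothetical example, not taken from the book.
premium_brands = {"BMW", "Audi"}
cars["is_premium_brand"] = cars["brand"].isin(premium_brands).astype(int)

# Option 2: average selling price of the brand, repeated on every row
# of that brand (a simple form of target encoding).
cars["brand_mean_price"] = (
    cars.groupby("brand")["selling_price"].transform("mean")
)
print(cars)
```

The binary flag needs a manually curated brand list, whereas the brand average is learned from the data and also separates brands within the premium and non-premium groups.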