1.5: Datasets Used
We will use four datasets throughout this book for regression and
classification problems, to benchmark model performance and to compare
different techniques. These datasets are explained in sections 1.5.1 to 1.5.3
(section 1.5.1 covers two of them, one per hotel). We will also use two
datasets for signal processing in chapter 11. These are explained in sections
1.5.4 and 1.5.5.
1.5.1 Hotel Booking Demand Datasets
This dataset [2] has demand data for two hotels. Hotel H1 is a resort hotel
that attracts customers who want to stay for recreation. Hotel H2 is a city
hotel that people visit for business purposes.
We will use data from hotel H1 to develop classification models that predict
the likelihood that a reservation will be canceled. The objective of the
classification model is to minimize losses from cancellations and increase
profitability. Many customers cancel their reservations, which can mean a loss
of revenue. If we know in advance which reservations are going to be canceled,
we can try to resell those rooms even though they have not been canceled yet.
However, if the prediction is wrong, both guests may arrive at the hotel and
demand the room they have reserved. To avoid this situation, our modeling
objective is a model that predicts cancellations with very few false alarms.
In other words, 'precision' will be the metric we optimize. As a result, we
might end up with a model that has low recall but high precision. We are fine
with such a model, as long as it predicts cancellations with a low margin of
error. Imagine a hypothetical model that has 95 percent precision and 30
percent recall. This means that, of every 100 reservations the model predicts
as cancellations, about 95 actually are cancellations. Although the precision
is not the ideal 100 percent, 95 percent is a good score for a statistician. A
recall of 30 percent means that, out of all the cancellations that happened,
the model identified only 30 percent. For a statistician or machine learning
engineer, a recall of 30 percent is less than ideal. For business owners,
however, it is acceptable. For a hotel revenue manager, a recall of 30 percent
means that the hotel can reduce its cancellation losses by roughly 30 percent,
which is still better than saving nothing without the model. In many cases, if
not all, businesses want a workable model. A perfect model that meets every
criterion of theoretical statistics and machine learning is not always what
real-world businesses need.
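To make the trade-off concrete, the sketch below computes precision and recall from confusion-matrix counts. The counts (285 true positives, 15 false positives, 665 false negatives) are hypothetical, chosen only so that the result matches the 95 percent precision and 30 percent recall of the example; the function name is illustrative as well.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for the positive class, here 'canceled'."""
    # Precision: of all reservations predicted as canceled, the share that
    # really were canceled.
    precision = true_pos / (true_pos + false_pos)
    # Recall: of all reservations that really were canceled, the share the
    # model managed to flag.
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, not taken from the hotel dataset.
precision, recall = precision_recall(true_pos=285, false_pos=15, false_neg=665)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

With these counts the model flags 300 reservations and is wrong about 15 of them, while 665 real cancellations go undetected.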
Data for hotel H2 will be used to develop a regression model that predicts
total room occupancy for future check-in dates. The dataset was not in a
format that could be used directly for regression. The author has used his
domain knowledge of the hotel industry to preprocess and clean the data and
bring it into a usable format. Hotels have different types of customers. Two
major groups are 'transient' and 'group'. Transient customers typically seek
short hotel stays, such as people who travel to different cities for business
purposes. We would like to develop a model that predicts total transient
bookings for each check-in date, at each booking window from 0 to 100 days
before check-in, for the city hotel H2.
The booking date is the day on which a customer reserves the room in advance
for the check-in date. For example, if you want to visit Hawaii on Christmas
of 2022 and you reserve the room on the 1st of October, 2022, then October
1st, 2022 is the booking date and December 25th, 2022, Christmas Day, is the
check-in date. The number of days between these two dates is called the lead
time.
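The lead time for the Hawaii example can be computed with plain date arithmetic. This is a minimal sketch; the function name `lead_time_days` is illustrative and not part of the dataset.

```python
from datetime import date


def lead_time_days(booking_date, check_in_date):
    """Number of days between the booking date and the check-in date."""
    return (check_in_date - booking_date).days


booking = date(2022, 10, 1)    # room reserved on October 1st, 2022
check_in = date(2022, 12, 25)  # check-in on Christmas Day, 2022
print(lead_time_days(booking, check_in))  # 85
```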
Sometimes customers book rooms but do not show up. We will remove these
customers from our modeling problem, as their number is negligible. ADR is the
average daily rate. For any check-in date, it is the total revenue from
bookings divided by the total number of rooms sold. We will remove records
whose ADR appears to be an outlier, as we could not obtain an explanation from
the original authors about the nature of these outliers.
We will keep records whose ADR lies between the 5th percentile and the 99.99th
percentile. There were also transactions for which no value was specified for
the number of adults and children as guests. These will be removed as well.
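The percentile filter on ADR can be sketched as follows. This version uses only the standard library so that it is self-contained; in practice a library such as NumPy or pandas would do the same job. The linear-interpolation convention for percentiles, the function names, and the toy ADR values are all assumptions made for illustration, not the author's preprocessing code.

```python
def percentile(values, q):
    """Percentile q (0 to 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def filter_adr(adr_values, low_q=5.0, high_q=99.99):
    """Keep only ADR values between the low_q and high_q percentiles."""
    low = percentile(adr_values, low_q)
    high = percentile(adr_values, high_q)
    return [adr for adr in adr_values if low <= adr <= high]


# Toy ADR values 1.0, 2.0, ..., 100.0 (hypothetical, not from the dataset).
adr_values = [float(v) for v in range(1, 101)]
kept = filter_adr(adr_values)
print(len(kept), min(kept), max(kept))
```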
Some customers book rooms for several consecutive nights. If a customer
reserves on the 1st of January to stay at the H2 hotel on the 7th and 8th of
January, the lead time for the 7th of January is 6 days and the lead time for
the 8th of January is 7 days.
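A multi-night stay therefore contributes one record per night, each with its own lead time. The sketch below expands the January example; the year 2022 and the function name are assumptions made only so that the code runs.

```python
from datetime import date, timedelta


def lead_times_for_stay(booking_date, first_night, nights):
    """Map each night of a stay to its lead time in days."""
    result = {}
    for offset in range(nights):
        night = first_night + timedelta(days=offset)
        result[night] = (night - booking_date).days
    return result


# Reserved on January 1st for the nights of January 7th and 8th.
stay = lead_times_for_stay(date(2022, 1, 1), date(2022, 1, 7), nights=2)
for night, lead in stay.items():
    print(night.isoformat(), lead)
```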
We assume that one room can accommodate 2 adults and 2 kids. If a transaction
has 2 adults and 2 kids, it is counted as 1 room. A transaction with 1 adult
and 2 kids, 2 adults and 1 kid, or 1 adult and 1 kid is also counted as 1
room.
If there are more than 2 adults, the number of rooms is calculated as the
number of adults divided by 2. If the result is not a whole number, it is
rounded up. For example, if there are 3 adults, 3 divided by 2 is 1.5, which
we round up to 2. Hence 3 adults need 2 rooms. Children are taken into account
in the same way when calculating the number of rooms.
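The room-count rule can be sketched as a small function. The text does not spell out exactly how children enter the calculation, so this sketch assumes the same ceiling rule is applied to children (2 per room) and that the larger of the two results is used. That reading, the constants, and the function name are assumptions.

```python
import math

ADULTS_PER_ROOM = 2    # stated in the text
CHILDREN_PER_ROOM = 2  # stated in the text


def rooms_needed(adults, children):
    """Estimate the number of rooms for one transaction."""
    rooms_for_adults = math.ceil(adults / ADULTS_PER_ROOM)
    # Assumption: children are rounded up the same way as adults.
    rooms_for_children = math.ceil(children / CHILDREN_PER_ROOM)
    # Every remaining transaction occupies at least one room.
    return max(1, rooms_for_adults, rooms_for_children)


print(rooms_needed(adults=2, children=2))  # 1
print(rooms_needed(adults=3, children=0))  # 2
```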
1.5.2 Car Sales
This dataset is sourced from Kaggle [3] and has information about the
attributes of used cars in India and their prices. It will be used for
developing regression models. We will use the 'Car details v3.csv' file from
version 3 of the dataset. The original file has details of 8128 cars. We have
removed records with inadequate or missing data: records where mileage is
0.0kmpl or blank; where engine, torque, or seats is blank; where max_power is
0, 'bhp', or blank; and where km_driven is 1. After removing these
observations, 7905 observations remain. Mileage, engine, and max_power were
converted to numeric features after removing their string suffixes.
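The cleaning rules above can be sketched on plain dictionaries, one per row of the CSV file. This is an illustration under stated assumptions, not the author's preprocessing code: the helper names and the sample rows are hypothetical, and the suffix handling assumes values such as '23.4 kmpl', '1248 CC', and '74 bhp'.

```python
import re

_NUMBER = re.compile(r"^\s*([0-9]*\.?[0-9]+)")


def leading_number(value):
    """Return the number at the start of a string like '23.4 kmpl', else None."""
    if value is None:
        return None
    match = _NUMBER.match(str(value))
    return float(match.group(1)) if match else None


def is_blank(value):
    return value is None or str(value).strip() == ""


def keep_record(row):
    """Apply the removal rules described in the text to one row."""
    mileage = leading_number(row.get("mileage"))
    max_power = leading_number(row.get("max_power"))
    if mileage is None or mileage == 0.0:        # 0.0kmpl or blank
        return False
    if any(is_blank(row.get(col)) for col in ("engine", "torque", "seats")):
        return False
    if max_power is None or max_power == 0.0:    # 0, 'bhp', or blank
        return False
    if leading_number(row.get("km_driven")) == 1:
        return False
    return True


def to_numeric(row):
    """Copy a row with mileage, engine, and max_power converted to floats."""
    cleaned = dict(row)
    for col in ("mileage", "engine", "max_power"):
        cleaned[col] = leading_number(row[col])
    return cleaned


# Hypothetical sample rows; only the first one passes every rule.
good = {"km_driven": 145500, "mileage": "23.4 kmpl", "engine": "1248 CC",
        "max_power": "74 bhp", "torque": "190Nm@ 2000rpm", "seats": 5}
rows = [
    good,
    dict(good, mileage="0.0 kmpl"),
    dict(good, max_power=" bhp"),
    dict(good, km_driven=1),
    dict(good, seats=""),
]
cleaned = [to_numeric(row) for row in rows if keep_record(row)]
print(len(cleaned), cleaned[0]["mileage"], cleaned[0]["engine"])
```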
1.5.3 Coupon Recommendation
The In-vehicle Coupon Recommendation Data Set [4] studies the behavior of
individuals and whether or not they will accept a coupon. This dataset was
generated through a survey. The survey describes different driving scenarios
and asks the respondent whether they would accept the coupon.
1.5.4 Raman Spectroscopy of Skimmed Milk Samples
This dataset [5] contains a matrix for quantitative whole spectrum analysis:
45 spectra of skimmed milk samples by 21451 points. We will use it to discuss
Raman spectra in chapter 11.
1.5.5 Beaver Body Temperatures
This dataset [6] has the body temperatures of two beavers, measured every 10
minutes by telemetry. It covers less than a day for each of the two beavers.
We will use it in chapter 11 to discuss filtering methods.