Before we start
This book aims to be a second course for beginner machine learning engineers who use Python. It answers the "What do I do now?" question faced by machine learning engineers and researchers who are stuck below their desired model performance and are looking for ways to improve it. The ideal audience for the book includes those who have completed a course on data science or machine learning, researchers exploring traditional machine learning algorithms for their thesis, people working in their first job as a data scientist, practitioners with one to three years of experience in data science, and leaders who have transitioned to leading a data science team without a technical background and want to speak with their technical team with confidence. It also serves experienced data scientists and researchers who want to try new feature engineering, feature selection, and signal processing techniques, or to look at existing techniques from a fresh perspective. Supplemental code is provided on the accompanying GitHub page for the book to show how all the analysis was done. This book is heavy on neither mathematical notation nor code. We have instead tried to explain the theoretical nuances, in some cases aiding the explanation with plots and analysis results, to give the feel of hand-held projects from beginning to completion.
The objective of the book is to make readers self-sufficient in the fine art of developing highly accurate machine learning models that make it to production. We will cover the different stages of a machine learning project, from development to the explanation of model predictions. Learning the fundamental building blocks of machine learning models will make us better-educated machine learning engineers and researchers, capable consultants, and confident leaders. This book will help beginners shorten the learning curve in their journey to becoming capable machine learning engineers. For experienced consultants, we cover some less widely known and used methods and techniques, including some that were previously unavailable in Python. For leaders in data science, it will provide perspective on how to advocate for your models by using model explainability.
We assume that readers know what regression and classification are; are familiar with commonly used machine learning techniques such as linear regression, logistic regression, and random forests; and know common evaluation metrics such as root mean square error (RMSE), F1 score, precision, and recall. We also assume the reader has tried their hand at developing models and needs help in improving model performance from its existing level to a higher one.
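As a concrete baseline for these prerequisites, here is a minimal sketch of computing the metrics mentioned above with scikit-learn. The predictions and labels are made up purely for illustration; they do not come from any dataset in this book.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, f1_score, precision_score, recall_score
)

# Regression: root mean square error (RMSE) on hypothetical predictions
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.8, 5.4, 2.9, 6.6])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Classification: precision, recall, and F1 on hypothetical binary labels
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 0, 1]
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)

print(f"RMSE: {rmse:.3f}, precision: {precision:.2f}, "
      f"recall: {recall:.2f}, F1: {f1:.2f}")
```

If these functions and the ideas behind them feel familiar, you have the background this book assumes.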
Although the concepts covered in the book are applicable across programming languages, this book is written for Python users. For the same reason, several algorithms mentioned in this book were developed from scratch and open-sourced by the author as four separate Python libraries. It is assumed that the reader has a basic understanding of libraries such as scikit-learn for machine learning, Pandas and NumPy for data manipulation, and Matplotlib and Seaborn for data visualization.
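To illustrate the level of familiarity assumed, here is a toy end-to-end sketch on synthetic data. The feature names, target construction, and model choice are hypothetical and chosen only to show the typical scikit-learn/Pandas/NumPy workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Build a small synthetic tabular dataset (hypothetical feature names)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.uniform(size=200),
})
df["target"] = 2 * df["feature_a"] + df["feature_b"] + rng.normal(scale=0.1, size=200)

# Hold out a test split, fit a model, and score it
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["target"], test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out split
print(f"Held-out R^2: {score:.2f}")
```

If this workflow reads naturally to you, the code accompanying the book should pose no difficulty.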
This book is organized to follow how a project is executed in practice, from feature engineering to model explanation. Methods beyond feature engineering and feature selection that can help improve model performance, as well as denoising techniques for signal processing data, are explained in Chapter 11.
Although we discuss multiple approaches for improving model performance, there is no silver bullet in machine learning. Even after performing feature engineering, feature extraction, feature selection, ensembling, and so on, we might not get the desired model performance. In fact, there are machine learning projects that fail due to data inadequacy and data quality issues, both of which are beyond the scope of this book. We will use four example datasets, of which two are failure examples and two are machine learning successes, to illustrate this.
Deep learning removes the need for feature engineering and the subsequent processes; however, deep learning is mainly suited to unstructured data such as images, text, audio, and video. For structured tabular data, knowing how to do feature engineering, how to enrich constructed features by creating higher-level features from the original ones, and which features to keep in your model can decide the outcome of your project. Hence, for problems where you plan to use traditional machine learning, this book makes a sincere attempt to explain the different steps involved in training a machine learning model, in the realm of feature engineering and feature selection.
This book discusses model development after data cleaning has been performed and we have a clean dataset. It does not teach you how to clean data, that is, how to perform outlier treatment or how to deal with missing data. There are many valuable resources available on those topics, which are outside the scope of this book.