Before we start

This book is intended as a second course for beginner machine learning engineers who use Python. It aims to answer the "What do I do now?" question faced by machine learning engineers and researchers when model performance falls short of what they need and they are looking for ways to improve it. The ideal audience includes those who have completed a course on data science or machine learning, researchers exploring traditional machine learning algorithms for their thesis, people working in their first job as a data scientist, practitioners with 1-3 years of experience in data science, and leaders who have transitioned to leading a data science team without the technical background and want to speak with their technical team with confidence. It will also serve experienced, seasoned data scientists and researchers who want to try new feature engineering, feature selection, and signal processing techniques, or to look at existing techniques with a fresh perspective. Supplemental code showing how all the analysis was done is provided on the book's accompanying GitHub page. This book is neither heavy on mathematical notation nor heavy on code; instead, we explain the theoretical nuances and, where helpful, support the explanation with plots and analysis results. The aim is to give the feel of hand-held projects from beginning to completion.

The objective of the book is to make readers self-sufficient in the fine art of developing highly accurate machine learning models that make it to production. We cover the stages of a machine learning project from development through to the explanation of model predictions. Learning the fundamental building blocks of machine learning models makes us better-educated machine learning engineers, researchers, capable consultants, and confident leaders. This book will help beginners shorten the learning curve on their journey to becoming capable machine learning engineers. For experienced consultants, we cover methods and techniques that are less often heard of and used, including techniques that were not previously available in Python. For leaders in data science, it offers perspective on how to advocate for your models by using model explainability.

We assume that readers know what regression and classification are, and are familiar with the machine learning techniques generally used, such as linear regression, random forest, and logistic regression. We also assume familiarity with common evaluation metrics, such as root mean square error (RMSE), F1 score, precision, and recall. The reader has tried their hand at developing models and needs help in raising model performance from its existing level to a higher one.
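As a quick refresher on the metrics named above, the following is a minimal sketch of computing them with scikit-learn. The toy arrays are invented for illustration; they are not from the book's datasets.

```python
# Illustrative sketch: common evaluation metrics with scikit-learn.
# All input values below are made-up toy data.
import numpy as np
from sklearn.metrics import (mean_squared_error, precision_score,
                             recall_score, f1_score)

# Regression: root mean square error (RMSE)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.6])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors

# Classification: precision, recall, and F1 (their harmonic mean)
labels = np.array([1, 0, 1, 1, 0, 1])
preds  = np.array([1, 0, 0, 1, 0, 1])
precision = precision_score(labels, preds)  # of predicted positives, how many are right
recall = recall_score(labels, preds)        # of actual positives, how many were found
f1 = f1_score(labels, preds)

print(f"RMSE={rmse:.3f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1:.3f}")
```

Here precision is 1.0 (every predicted positive is correct) while recall is 0.75 (one actual positive was missed), which is exactly the kind of trade-off F1 summarizes.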

Although the concepts covered in the book apply across programming languages, this book is written for Python users. For the same reason, several of the algorithms mentioned in this book were developed from scratch and open-sourced by the author as 4 separate Python libraries. It is assumed that the reader has a basic understanding of libraries such as scikit-learn for machine learning, pandas and NumPy for data manipulation, and Matplotlib and seaborn for data visualization.

This book is organized in the order in which a project should be executed in practice, from feature engineering to model explanation. Methods beyond feature engineering and feature selection that can help improve model performance, as well as denoising techniques for signal processing data, are explained in Chapter 11.

Although we discuss multiple approaches to improving model performance, there is no silver bullet in machine learning. Even after performing feature engineering, feature extraction, feature selection, ensembling, and so on, we might not reach the desired model performance. In fact, some machine learning projects fail due to data inadequacy and data quality issues, both of which are beyond the scope of this book. To illustrate this, we will use 4 example datasets, of which 2 are failure examples and 2 are machine learning successes.

Deep learning removes the need for feature engineering and the processes that follow it. However, deep learning is best suited to unstructured data such as images, text, audio, and video. For structured tabular data, knowing how to do feature engineering, how to enrich constructed features by creating higher-level features from the original ones, and which features to keep in your model can decide the outcome of your project. Hence, for problems where you plan to use traditional machine learning, this book makes a sincere attempt to explain the different steps involved in training a machine learning model, within the realm of feature engineering and feature selection.
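To make "creating higher-level features from original features" concrete, here is a minimal sketch in pandas. The DataFrame and its column names (`total_spend`, `num_orders`, `signup_year`) are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative sketch: deriving higher-level features from raw tabular columns.
# The data and column names are hypothetical, not from the book's datasets.
import pandas as pd

df = pd.DataFrame({
    "total_spend": [120.0, 80.0, 200.0],
    "num_orders": [4, 2, 5],
    "signup_year": [2019, 2021, 2018],
})

# A ratio feature: average spend per order, often more informative
# to a model than either raw column alone.
df["spend_per_order"] = df["total_spend"] / df["num_orders"]

# A derived feature: account age relative to a reference year.
REFERENCE_YEAR = 2023
df["account_age"] = REFERENCE_YEAR - df["signup_year"]

print(df[["spend_per_order", "account_age"]])
```

Ratios, differences, and aggregations like these are simple examples of the kind of constructed features that later chapters build on and then prune with feature selection.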

This book discusses model development after data cleaning has been performed and we have a clean dataset. It does not teach you how to clean data, i.e., how to perform outlier treatment or deal with missing data. There are many valuable resources on that topic, which is outside the scope of this book.