11.4: Signal Processing
Signal processing is the domain that deals with
analyzing, modifying, and synthesizing signals. A signal can be audio, video,
a radar measurement, etc. Signal processing converts and transforms data so
that we can see things that are not visible through direct observation. Its
most common applications include audio and video compression, speech
recognition, improving audio quality in phone calls, and oil exploration.
Signal processing can help us
extract important parts of a signal, which can then be used as features to
train a model. If the features are derived from signals, it can help to
clean them using signal processing techniques. We will discuss two such
methods that are useful for machine learning applications: filtering of
signals, and baseline removal for Raman spectra.
11.4.1 Filtering
In signal processing, filtering
denotes removing unwanted frequencies and frequency bands from the signal. It
has wide use in radio, music synthesis, image processing, etc. The
Savitzky-Golay filter is one of the commonly used methods for removing noise
from data. It helps in increasing the precision of the data without distorting
the signal. It is applied through a process known as convolution, by fitting
successive subsets of adjacent data points with a low-degree polynomial using
linear least squares. Let's discuss some practical machine-learning
applications that use filtering.
The Savitzky-Golay filter has been used [1]
in demand forecasting to eliminate outliers and noise in non-stationary
time series. This helps time series models learn better from the filtered
data and forecast more accurately. It can also be used with deep learning. For
example, it has been used [2] with one-dimensional CNN layers to
identify abnormal EEG signals, without using any explicit feature extraction
technique.
We will analyze the beaver body
temperature data discussed in chapter 1. There are two beavers; for the sake
of simplicity, we will only analyze the body temperature of beaver 1. Let's
look at beaver1's body temperature with and without filtering in figure
11.4.1. We can see that the filtered temperatures are less volatile and more
stable. The filter retains the information about patterns in temperature, and
at the same time suppresses possible noise and extreme values. We used a
polynomial order of 5 and a window length of 11 for filtering. If needed, we
can increase or decrease these values, based on observed patterns in the data,
to help the algorithm filter noise more accurately.
Figure 11.4.1: Beaver1 body
temperature with and without Savitzky-Golay filtering
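The filtering step described above can be sketched with SciPy's savgol_filter, using the same polynomial order (5) and window length (11). The beaver data is not loaded here; the synthetic temperature series, its length, and its noise level are assumptions made purely for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical stand-in for the beaver1 temperatures: a slow daily
# pattern plus random measurement noise (not the real dataset).
rng = np.random.default_rng(0)
hours = np.linspace(0.0, 24.0, 120)
temperature = 37.0 + 0.4 * np.sin(2 * np.pi * hours / 24.0) + rng.normal(0.0, 0.1, hours.size)

# Savitzky-Golay filtering with the settings quoted in the text.
smoothed = savgol_filter(temperature, window_length=11, polyorder=5)

# Point-to-point volatility before and after filtering.
print(np.std(np.diff(temperature)), np.std(np.diff(smoothed)))
```

A longer window or a lower polynomial order smooths more aggressively; a shorter window or a higher order follows the raw series more closely.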
11.4.2 Baseline Removal
Raman spectra are widely used in
different scientific fields that focus on studying macromolecules. Raman
spectroscopy allows both chemical and physical structural analysis of
materials, using a small sample and without damaging it. It is used by law
enforcement agencies to identify contraband items without physically
inspecting them. It can also be used for detecting diseases [3], without any
need for further medical diagnostics.
Despite its usefulness, a Raman
spectrum has one issue that needs to be taken care of before use: it carries a
background, otherwise known as the baseline. Unless treated and removed, the
baseline can negatively affect the qualitative and quantitative
analysis of spectra. Hence, Raman spectra are fitted and corrected to mitigate
this influence before being used. There are many methods of correcting
the baseline. We will discuss three of them with the help of the companion
Python library BaselineRemoval and see how each algorithm can help remove
the background from the spectra.
Modified multi-polynomial fit, also
known as ModPoly, uses thresholding to iteratively
fit a polynomial baseline to the data. It smooths the spectrum by
automatically eliminating Raman peaks and leaving behind the baseline
fluorescence, which can finally be subtracted from the raw spectrum. It works
by fitting a least-squares polynomial to the spectrum and generating data
points from the fitted curve. Wherever a generated point has a higher value
than the respective input value, it is reassigned to the original intensity.
This exercise is repeated for several iterations, between 25 and 200. The
number of repetitions depends on factors such as the relative amount of
fluorescence to Raman signal. One limitation is that the method is sensitive
to variability in data that has a low signal-to-noise ratio.
ModPoly has some major limitations,
as it depends on the spectral fitting
range and the polynomial order specified. It might not be an ideal solution for
high-noise situations, as noise is not appropriately dealt with in ModPoly. ModPoly tends to
introduce artificial peaks in the data in places where the original spectrum
was free of such peaks. Also, existing large peaks in the data tend to
contribute more to the polynomial fitting, which can in turn bias the
results.
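The fit-and-clip loop described for ModPoly can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the BaselineRemoval implementation: the function name modpoly_baseline, the stopping rule, and the synthetic spectrum are all assumptions.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def modpoly_baseline(intensity, degree=2, max_iter=100, tol=1e-3):
    """ModPoly-style sketch: repeatedly fit a polynomial and clip peaks to it."""
    y = np.asarray(intensity, dtype=float)
    x = np.linspace(-1.0, 1.0, y.size)  # rescaled axis keeps the fit well conditioned
    work = y.copy()
    for _ in range(max_iter):
        fit = P.polyval(x, P.polyfit(x, work, degree))
        # Where the fitted curve exceeds the input, keep the input intensity;
        # elsewhere (the peaks) the data are cut down to the fitted curve.
        clipped = np.minimum(work, fit)
        change = np.linalg.norm(clipped - work)
        work = clipped
        if change <= tol * np.linalg.norm(work):  # assumed stopping rule
            break
    return fit


# Hypothetical spectrum: a smooth quadratic background plus two peaks.
wavenumber = np.linspace(0.0, 1000.0, 501)
true_baseline = 10.0 + 0.01 * wavenumber + 1e-5 * wavenumber**2
peaks = (50.0 * np.exp(-((wavenumber - 400.0) / 10.0) ** 2)
         + 30.0 * np.exp(-((wavenumber - 700.0) / 15.0) ** 2))
spectrum = true_baseline + peaks

baseline = modpoly_baseline(spectrum, degree=2)
corrected = spectrum - baseline
```

Because every point contributes equally to each least-squares fit, large peaks pull the early fits upward, which is one way to see the bias mentioned above.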
Improved ModPoly,
also known as IModPoly, is an improvement on the ModPoly algorithm and is meant for noisy data. Identifying
and removing major peaks is limited to the first iteration only. This prevents
unnecessary data rejection. For each iteration of polynomial fitting, the
lower values at each wavenumber are selected and concatenated to construct a
modified spectrum, which is then fitted again. Despite
these improvements, IModPoly, just
like its predecessor, requires user intervention and prior information, such as
detected peaks.
A new method was proposed by Zhang [4]
that doesn't require any user intervention or prior information, such as
detected peaks. It is named adaptive iteratively reweighted penalized least
squares (airPLS). It is a fast and flexible method that helps in approximating
complex baselines. In each iteration, weights are obtained adaptively using
the sum of squared errors (SSE) between the previously fitted baseline and the
original signal. It uses a penalty approach to control the smoothness of the
fitted baseline, based on the sum of squared derivatives of the fitted
baseline. Lambda is a parameter controlled by the user: the larger the lambda,
the smoother the fitted baseline.
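The reweighting idea can be illustrated with a small dense NumPy sketch. It is written under stated assumptions rather than as the reference implementation: the function name airpls_baseline, the exponential weighting on negative residuals, the second-order difference penalty, the stopping rule, and the synthetic spectrum are all illustrative choices.

```python
import numpy as np


def airpls_baseline(intensity, lam=1e6, max_iter=15, tol=1e-3):
    """airPLS-style sketch: penalized least squares with adaptive reweighting."""
    y = np.asarray(intensity, dtype=float)
    m = y.size
    D = np.diff(np.eye(m), n=2, axis=0)  # second-difference operator, shape (m-2, m)
    penalty = lam * (D.T @ D)            # larger lam -> smoother baseline
    w = np.ones(m)
    for i in range(1, max_iter + 1):
        # Weighted, penalized fit: solve (W + lam * D'D) z = W y.
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        d = y - z
        neg = d < 0
        neg_sum = np.abs(d[neg].sum())
        if neg_sum <= tol * np.abs(y).sum() or i == max_iter:
            break
        # Points above the fit (likely peaks) get zero weight; points below
        # it get a weight that grows with the iteration count.
        w = np.zeros(m)
        w[neg] = np.exp(i * np.abs(d[neg]) / neg_sum)
        w[0] = w[-1] = np.exp(i * d[neg].max() / neg_sum)  # keep the ends anchored
    return z


# Hypothetical spectrum: a smooth background plus two peaks.
wavenumber = np.linspace(0.0, 1000.0, 501)
true_baseline = 10.0 + 0.01 * wavenumber + 1e-5 * wavenumber**2
peaks = (50.0 * np.exp(-((wavenumber - 400.0) / 10.0) ** 2)
         + 30.0 * np.exp(-((wavenumber - 700.0) / 15.0) ** 2))
spectrum = true_baseline + peaks

baseline = airpls_baseline(spectrum, lam=1e6)
corrected = spectrum - baseline
```

The dense solve keeps the sketch short but scales poorly; for long spectra a sparse solver would be the more practical choice.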
Let's now look at the data distribution
for the original spectra and the baseline-corrected spectra for the skimmed
milk samples discussed in chapter 1.
Figure 11.4.2.1: Original Vs. ModPoly corrected
spectra of skimmed milk samples
We can see in figure 11.4.2.1 that
ModPoly introduces artificial peaks for
observations from 8000 to 20000. In this region, the ModPoly-corrected
spectrum is higher than the original spectrum.
Figure 11.4.2.2: Original Vs. IModPoly corrected
spectra of skimmed milk samples
As seen in figure 11.4.2.2, IModPoly performs
better than ModPoly.
It removed noise from the spectra, as seen in the plot between 0 and 8000, and
again after 19000. Also, unlike ModPoly, it didn't
add an artificial peak.
Figure 11.4.2.3: Original Vs. airPLS corrected
spectra of skimmed milk samples
We can see in figure 11.4.2.3 that the airPLS
method removed noise from the spectra better than the previous algorithms,
especially for observations between 0 and 2500, where the denoised spectra
closely resemble the rest of the spectra.