Speech Enhancement Using Sliding Window Empirical Mode Decomposition and Hurst-based Technique

The most challenging problem in speech enhancement is tracking non-stationary noises over long speech segments and at low Signal-to-Noise Ratio (SNR). Different speech enhancement techniques have been proposed, but they track highly non-stationary noises inaccurately. The Empirical Mode Decomposition and Hurst-based (EMDH) approach was therefore proposed to enhance signals corrupted by non-stationary acoustic noises. Hurst exponent statistics were adopted for identifying and selecting the set of Intrinsic Mode Functions (IMF) most affected by the noise components, and the speech signal was reconstructed from the least corrupted IMFs. Although EMDH increases the SNR, its time and resource consumption are high, and it still requires significant improvement under non-stationary noise scenarios. Hence, in this article, the EMDH approach is enhanced with a Sliding Window (SW) technique. In the resulting SWEMDH approach, the EMD is computed over a small window that slides along the time axis, with the window size depending on the signal's frequency band. Possible discontinuities in the IMFs between windows are prevented by fixing a priori the total number of modes and the number of sifting iterations. For each mode, the number of sifting iterations is determined by decomposing many signal windows with the standard algorithm and averaging the number of sifting steps per mode. Based on this approach, the time complexity is reduced significantly while retaining a suitable quality of decomposition. Finally, the experimental results show considerable improvements in speech enhancement under non-stationary noise environments.


Introduction
In recent years, suppressing acoustic distortion in noisy speech signals has become essential for speech enhancement. Various speech enhancement techniques and algorithms have been proposed by many researchers to reduce the noise in speech signals (Vishari et al., 2016; Kulkarni et al., 2016). Typically, in real non-stationary environments, the major problem in speech enhancement is the precise estimation of the noise statistics. The conventional estimators are based on Voice Activity Detectors (VAD) (Kasap, Arslan, 2013; Zhang et al., 2014). After that, the power spectrum of the noise components is determined as a smoothed adaptation of its previous values obtained during speech pauses. These processes offer reasonable accuracy for stationary background noises, but they cannot accurately estimate time-varying spectra. The difficulty of tracking non-stationary noises becomes more obvious for long speech segments and low Signal-to-Noise Ratio (SNR) (Hawaldar, Dixit, 2011; Mai et al., 2015). Different power-spectrum-based methods have been proposed to deal with such situations (Zhao et al., 2014; Jin et al., 2017b).
In past research, Time-Frequency-based (TF-based) speech enhancement solutions (Soni et al., 2018) were proposed based on the Empirical Mode Decomposition (EMD) (Mai et al., 2015; Mert, Akan, 2014). Generally, EMD is a nonlinear, adaptive time-domain method that decomposes a signal into a series of oscillatory Intrinsic Mode Functions (IMF) and a residual (Mandic et al., 2013; Zeiler et al., 2010). It does not need a set of basis functions to analyze the target signal appropriately, and it is not restricted to stationary signals. To tackle the challenges of non-stationary noisy environments, a novel EMD-based speech enhancement technique (Zao et al., 2014) was proposed in which the noise components of each IMF were identified and selected by its Hurst exponent statistics. Here, the IMF selection and the speech reconstruction were performed on a frame-by-frame basis, considering both quality and intelligibility objective measures. However, this technique consumes a lot of time and computer resources, and a significant improvement under Babble noise scenarios was not achieved.
Hence, in this article, the Sliding Window EMDH (SWEMDH) approach is proposed to improve EMDH. This approach computes the EMD in a comparatively small window and slides this window along the time axis. The window size depends on the signal's frequency band. Possible discontinuities in the IMFs between windows are prevented by fixing a priori the total number of modes and the number of sifting iterations. The number of sifting steps should be tailored for each mode; this parameter depends on the sampling frequency and on the analyzed signal, its complexity, and its spectrum. The number of sifting iterations is determined by decomposing many signal windows with a standard algorithm and calculating the average number of sifting steps for each mode. Thus, the speech enhancement technique is improved efficiently.
The rest of the article is structured as follows: Sec. 2 presents the literature survey related to speech enhancement techniques, Sec. 3 describes the proposed speech enhancement technique, Sec. 4 shows the experimental results of the proposed technique, and Sec. 5 concludes the research work and presents future enhancements.

Literature survey
A noise reduction algorithm (Taal et al., 2011) was proposed for the intelligibility prediction of time-frequency weighted noisy speech. A Short-Time Objective Intelligibility (STOI) measure was proposed that has a strong monotonic relation with the intelligibility scores of various listening tests in which noisy speech was processed by some type of TF weighting. This model has a simple structure in the sense that it is based on only two free parameters. However, the performance was not effective.
A colored-noise-based multi-condition training technique (Zao, Coelho, 2011) was proposed for robust speaker identification in unknown noisy environments. In this technique, the colored noise samples were generated by filtering a white Gaussian sequence. Gaussian Mixture Models (GMM) were applied to obtain the speaker models from noisy speech signals with a single SNR. However, the identification accuracy was less precise.
A variational Bayesian algorithm (wa Maina, Walsh, 2011) was proposed for joint speech enhancement and speaker identification. This technique was built on the intuition that speaker-dependent priors may operate better than priors that attempt to capture global speech properties. An iterative algorithm was derived that exchanges information between the speech enhancement and speaker identification processes. However, the computational complexity of this algorithm was high.
A novel technique (Gerkmann, Hendriks, 2012) was proposed to estimate the noise power spectral density by means of an unbiased Minimum Mean-Square Error (MMSE) optimal estimator. In this technique, a VAD-based noise power estimator was used that avoids the bias compensation by means of a soft Speech Presence Probability (SPP) with fixed priors. By selecting fixed priors, the noise power estimator was decoupled from the estimation of the speech power and of the clean speech. However, the processing time was high.
EMD-based Filtering (EMDF) of low-frequency noise (Chatlani, Soraghan, 2012) was proposed for speech enhancement. In this technique, an adaptive method was developed for selecting the IMF index that separates the noise components from the speech, according to the second-order IMF statistics. Then, the low-frequency noise components were separated by a partial reconstruction from the IMFs. This technique suppressed the residual noise in speech signals that had been enhanced by the conventional optimally modified log-spectral amplitude approach, which utilizes a minimum-statistics-based noise estimate. However, further improvement was still required for the non-stationary Babble noise.
A speech enhancement strategy (Khaldi et al., 2014) was proposed based on time-adaptive thresholding of the IMFs extracted from the signal by EMD. The denoised signal was reconstructed by superposing its adaptively thresholded IMFs. The adaptive thresholds were estimated using the Teager-Kaiser Energy Operator (TKEO) of the signal's IMFs, which was used to identify the type of frame by expanding the differences between speech and non-speech frames in each IMF. However, the parameters that implement the compromise between noise removal and speech distortion required further optimization.
Enhancement of speech dynamics for VAD (Dwijayanti et al., 2018) was proposed using a Deep Neural Network (DNN). In this technique, the dynamics are highlighted by speech period candidates, which are computed based on heuristic rules applied to the patterns of the first and second derivatives of the input signals. Then, these candidates, combined with the log power spectra, are given as input to the DNN to obtain the VAD decisions. However, the performance of the VAD degraded when the F0 sub-bands and their neighbours were eliminated.
A fast and robust VAD (Ghahabi et al., 2018) was proposed for real-time Automatic Speech Recognition (ASR). The major objective of this method was to filter out the non-speech segments before the decoder processes the speech segments of the audio signal. The method is a hybrid supervised/unsupervised model based on the zero-order Baum-Welch statistics obtained from a Universal Background Model (UBM). During testing, the Baum-Welch statistics of an unknown audio segment are compared with the speech and non-speech VAD vectors, and the decision is made based on a robust threshold. However, the Equal Error Rate (EER) was high.

Proposed methodology
In this section, the proposed SWEMDH approach is explained in brief. The basic block diagram of the proposed approach is shown in Fig. 1. The enhancement of speech signals involves the following processes:
• Initially, the noisy speech signals are collected from the database and the extrema are extracted, i.e., the noisy speech signals are decomposed into a set of windowed IMFs by using the SWEMDH technique.
• Once all windowed IMFs are obtained, the Hurst exponent is applied to identify the windowed IMFs dominated by low-frequency noise components.
• Finally, the speech signals are reconstructed efficiently by using the selected windowed IMFs.

Sliding Window Empirical Mode Decomposition (SWEMD)
Initially, the extrema, i.e., the maxima and minima, are extracted from the original signal x(t). Then, the upper (e_max) and lower (e_min) envelopes are obtained by interpolating the local maxima and minima, respectively. The average between these envelopes is computed as

m(t) = (e_max(t) + e_min(t)) / 2.

The obtained average is subtracted from the original signal to obtain the first mode candidate:

imf_1(t) = x(t) − m(t).

Generally, this process is known as the sifting process. The computed imf_1(t) is used as the input for the next sifting iteration, which is applied to the residual. The sifting process is iterated until imf_1(t) satisfies the conditions of an IMF. When the sifting process is completed, the original signal is reduced by the first mode:

r_1(t) = x(t) − imf_1(t).

The residue r_1(t) is used as input for extracting the second IMF, and this process is looped to extract all IMFs:

r_i(t) = r_{i−1}(t) − imf_i(t),

where i refers to the index of the current mode. When the residue r_i(t) contains fewer than three extrema, or all its points are close to zero, the decomposition process is complete. The original signal is recovered as the sum of all IMF components and the residue:

x(t) = imf_1(t) + imf_2(t) + ... + imf_n(t) + r_n(t),

where n refers to the number of modes.
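The sifting loop above can be sketched in a few lines; this is a minimal illustration, not the authors' implementation: it uses linear-interpolation envelopes for brevity (cubic splines are more typical) and a fixed number of sifting iterations per mode, as the SWEMDH approach prescribes.

```python
import numpy as np

def sift_once(x):
    """One sifting step: subtract the mean of the upper and lower envelopes."""
    t = np.arange(len(x))
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] <= x[i + 1]]
    if len(maxima) < 2 or len(minima) < 2:
        return None                              # too few extrema to continue
    e_max = np.interp(t, maxima, x[maxima])      # upper envelope
    e_min = np.interp(t, minima, x[minima])      # lower envelope
    m = (e_max + e_min) / 2.0                    # envelope mean m(t)
    return x - m

def emd(x, n_modes=5, n_sift=10):
    """Decompose x into at most n_modes IMFs plus a residue."""
    imfs, residue = [], np.asarray(x, dtype=float).copy()
    for _ in range(n_modes):
        h = residue.copy()
        for _ in range(n_sift):                  # fixed sifting count per mode
            h_next = sift_once(h)
            if h_next is None:
                break
            h = h_next
        imfs.append(h)
        residue = residue - h                    # r_i = r_{i-1} - imf_i
        if np.all(np.abs(residue) < 1e-12):      # (almost) nothing left
            break
    return imfs, residue
```

By construction, the IMFs and the final residue always sum back to the input signal, which mirrors the reconstruction identity above.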

Termination criteria for sifting process
The following criteria are used to terminate the sifting process:
• The first one is that the number of extrema and the number of zero-crossings should differ at most by one.
• The second one is that the mean of the upper and lower envelopes should be equal to zero at each point of the IMF.
In this proposed approach, the second criterion is used for terminating the sifting process. In accordance with this criterion, the mean of an IMF's envelope is equal to zero at each of its points. Therefore, in each iteration, the ratio between the envelope mean of the iterated mode and the amplitude of this envelope is verified:

τ(t) = |m(t)| / a(t),

where m(t) = (e_max(t) + e_min(t))/2 is the envelope mean and a(t) = (e_max(t) − e_min(t))/2 is the envelope amplitude. Two thresholds ϑ_1 and ϑ_2 are used, where ϑ_1 ensures globally small fluctuations of the envelope mean around zero and ϑ_2 locally allows higher fluctuations. The sifting process is terminated if τ(t) < ϑ_1 holds for a (1−ε) part of the signal's points and τ(t) < ϑ_2 holds for the remaining points. Typical values of these parameters, as commonly used in the literature, are ϑ_1 = 0.05, ϑ_2 = 10ϑ_1 = 0.5, and ε = 0.05. These values of the parameters provide a compromise between the quality and the speed of the decomposition process.
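The envelope-mean stopping rule can be sketched as follows. This is a self-contained illustration under the threshold values quoted above; the function names, the linear-interpolation envelopes, and the iteration cap are this sketch's own choices.

```python
import numpy as np

def envelopes(x):
    """Linear-interpolation upper/lower envelopes (splines are more typical)."""
    t = np.arange(len(x))
    mx = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]
    mn = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] <= x[i + 1]]
    if len(mx) < 2 or len(mn) < 2:
        return None, None
    return np.interp(t, mx, x[mx]), np.interp(t, mn, x[mn])

def sift(x, theta1=0.05, theta2=0.5, eps=0.05, max_iter=50):
    """Sift x until the two-threshold envelope-mean criterion is satisfied."""
    h = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        e_max, e_min = envelopes(h)
        if e_max is None:
            break                                # too few extrema to continue
        m = (e_max + e_min) / 2.0                # envelope mean m(t)
        a = (e_max - e_min) / 2.0                # envelope amplitude a(t)
        tau = np.abs(m) / np.maximum(a, 1e-12)   # ratio tau(t)
        # Stop when tau < theta1 on a (1 - eps) fraction of the points
        # and tau < theta2 everywhere else.
        if np.mean(tau < theta1) >= 1.0 - eps and np.all(tau < theta2):
            break
        h = h - m                                # one sifting step
    return h
```

The loose threshold ϑ_2 tolerates isolated excursions of the envelope mean, so the loop is not forced to over-sift a mode because of a few boundary points.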

Hurst-based IMF selection
Once all IMFs are obtained, the Hurst exponent is applied to identify the IMFs most affected by the noise components, so that only the least corrupted IMFs are chosen for the speech signal reconstruction. Consider the speech signal x(t) with the normalized autocorrelation coefficient function

δ(k) = E[(x(t) − µ_x)(x(t + k) − µ_x)] / E[(x(t) − µ_x)^2],

where µ_x refers to the mean of x(t) and k refers to the time lag. For a fractional Gaussian noise, δ(k) is given as

δ(k) = (1/2) (|k + 1|^{2H} − 2|k|^{2H} + |k − 1|^{2H}),

where 0 ≤ H ≤ 1 refers to the Hurst exponent of x(t). The value of H is defined by the decay rate of the autocorrelation coefficient function, whose asymptotic characteristic is given by

δ(k) ≈ H(2H − 1) k^{2H−2}, k → ∞.

The Hurst exponent defines the time-dependence or scaling degree of x(t) and is associated with its spectral characteristics. Within the entire range [0, 1], the power spectral density S_x(f) is proportional to f^{1−2H} when f → 0 (Zhao et al., 2014). For H = 1/2, S_x(f) is constant over the entire frequency spectrum, whereas the low frequencies become dominant when H > 1/2 and H → 1. The Hurst exponent is estimated from non-overlapping frames of samples, and it serves as the identification criterion for selecting the IMF low-frequency noise components. Figure 2 illustrates the first five IMFs obtained by decomposing a sample input speech signal segment of 2500 ms collected from the NOISEX-92 database. It shows that the first IMF is composed of faster oscillations than the second one, which in turn has faster fluctuations than the third one, and so on. This implies that, at each time interval, the SWEMD applies a high-frequency versus low-frequency partition between the IMFs. Therefore, the first mode presents the high-frequency content of the signal. Also, the cut-off frequency between consecutive IMFs is time-varying and signal dependent.
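The Hurst exponent of a frame can be estimated in several ways. The sketch below uses the aggregated-variance method, which exploits the fact that the variance of block means of a self-similar process scales as m^{2H−2}; this is only one possible estimator and is not necessarily the one used in EMDH.

```python
import numpy as np

def hurst_aggvar(x, block_sizes=(4, 8, 16, 32, 64)):
    """Estimate H by the aggregated-variance method:
    Var(block means of size m) ~ m^(2H - 2), so the log-log slope is 2H - 2."""
    x = np.asarray(x, dtype=float)
    log_m, log_v = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        if n_blocks < 2:
            continue
        means = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        v = means.var()
        if v > 0:
            log_m.append(np.log(m))
            log_v.append(np.log(v))
    slope = np.polyfit(log_m, log_v, 1)[0]   # slope = 2H - 2
    return 1.0 + slope / 2.0
```

For white noise the block-mean variance decays as 1/m (slope −1), so the estimate is close to H = 1/2, consistent with the flat spectrum case described above.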

SWEMDH speech signal reconstruction
The speech signal reconstruction is performed to validate the decomposition. Normally, the speech signal reconstruction means the determination of the original speech signal from a sequence of equally spaced segments, i.e., from the windowed IMFs. It starts with the decomposition of the input noisy speech into n modes by using the SWEMD described above. After that, the windowed IMFs w_imf_{i,q}(t) are obtained by separating each mode into Q non-overlapping short-time frames, where q ∈ {0, ..., Q−1} refers to the frame index and T_d refers to the fixed time duration of the frames. Then, the Hurst exponent is estimated for all windowed IMFs w_imf_{i,q}(t) to select the IMF low-frequency noise components for each frame index q. In the next step, for each frame, the index N_q of the last windowed IMF whose value of H is below a given threshold is determined, i.e., H_q(N_q) < H_th. If x̂(t) is the enhanced speech signal, then each of its frames x̂_q(t) is reconstructed from the remaining modes as

x̂_q(t) = w_imf_{N_q+1,q}(t) + w_imf_{N_q+2,q}(t) + ... + w_imf_{n,q}(t).

Finally, x̂(t) is obtained by concatenating the Q reconstructed frames x̂_q(t) in their time order. Thus, with the proposed SWEMDH, sudden changes in the power spectrum of non-stationary noises are avoided, and the selection of IMFs for the entire speech signal is achieved efficiently.
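The frame-wise selection and reconstruction step can be sketched as follows, assuming the IMFs have already been computed. Here `hurst_fn`, `frame_len`, and `h_th` are placeholders for the Hurst estimator, the frame duration T_d in samples, and the threshold H_th; none of these names come from the original method.

```python
import numpy as np

def reconstruct(imfs, frame_len, hurst_fn, h_th=0.5):
    """Frame-wise Hurst-based mode selection and reconstruction (sketch).
    imfs: list of n equal-length mode arrays; hurst_fn: estimator of H.
    In each frame, modes up to the last one with H below h_th are dropped."""
    length = len(imfs[0])
    out = np.zeros(length)
    for start in range(0, length, frame_len):
        sl = slice(start, min(start + frame_len, length))
        # N_q: largest mode index whose windowed IMF has H below the threshold.
        n_q = 0
        for i, imf in enumerate(imfs, start=1):
            if hurst_fn(imf[sl]) < h_th:
                n_q = i
        # Keep only the modes above N_q for this frame.
        for imf in imfs[n_q:]:
            out[sl] += imf[sl]
    return out
```

Because N_q is recomputed per frame, a mode that is noise-dominated in one frame can still contribute to the reconstruction in another, which is what lets the method follow non-stationary noise.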

Results and discussions
In this section, the performance of the proposed SWEMDH approach is evaluated and compared with the existing EMDH approach by using MATLAB 2014a. In this experiment, a subset of 12 speakers, including 7 male and 5 female, is randomly chosen, providing a total of 420 speech data segments, 10 per speaker, with a sampling rate of 16 kHz and an average time duration of 2 seconds. Also, acoustic noises such as Airport, Babble, Car, Exhibition, Restaurant, Station, Street, and Train are used for corrupting the speech signals, considering different SNR values of 0 dB, 5 dB, 10 dB, and 15 dB. The noises are collected from the NOISEX-92 database. The following performance metrics are used to evaluate the effectiveness of the proposed technique:
• Signal-to-Noise Ratio (SNR): It is defined as the ratio of the speech signal power to the corrupting noise power:

SNR [dB] = 10 log10 (P_signal / P_noise),

where P_signal is the average power of the speech signal and P_noise is the average power of the noise. It can be rewritten as

SNR [dB] = 20 log10 (A_signal / A_noise),

where A_signal and A_noise are the Root Mean Square (RMS) amplitudes of the signal and the noise, respectively.
• Mean Square Error (MSE): It represents the cumulative squared error between the reconstructed and the original speech signal, averaged over the signal:

MSE = (1/l) Σ_t e(t)^2,

where l refers to the signal length and e(t) = x(t) − x̂(t) refers to the error between the original signal x(t) and the reconstructed signal x̂(t).
• Peak Signal-to-Noise Ratio (PSNR): It is defined as the ratio of the maximum possible signal power to the corrupting noise power. Generally, it is computed by using the MSE.
• Mean Absolute Error (MAE): It is defined as the mean absolute error between the reconstructed speech signal and the original signal:

MAE = (1/l) Σ_t |e(t)|.

• Perceptual Evaluation of Speech Quality (PESQ): It provides an end-to-end quality assessment characterizing the listening quality as perceived by users. It is computed as a linear combination of the average symmetric and asymmetric disturbances, with the coefficients α_0 = 0.1, α_1 = 0.1, and α_2 = 0.0309.
Table 1 and Fig. 3 give the comparison results of the MSE for both EMDH and SWEMDH using the different acoustic noises that corrupt the speech signal during transmission. Likewise, Table 2 and Fig. 4 give the corresponding comparison of the MAE, Table 3 and Fig. 5 of the SNR, Table 4 and Fig. 6 of the PSNR, and Table 5 and Fig. 7 of the PESQ.
From this analysis, it is observed that the SWEMDH approach achieves a higher performance than the existing EMDH-based speech enhancement. For example, consider the Babble noise environment at an SNR of 15 dB. In this case, the MSE of SWEMDH is 92.67% lower than that of the EMDH technique, and the MAE of SWEMDH is 63.64% less than that of EMDH. Similarly, the PSNR of the proposed SWEMDH technique is 68.63% higher than that of the existing technique, and the PESQ of the proposed technique is 0.74% higher than that of the existing EMDH technique. Thus, the proposed SWEMDH technique achieves a high PSNR, SNR, and PESQ with a lower MSE and MAE compared to the EMDH technique.

Conclusions
In this article, a Sliding Window-based EMDH approach (SWEMDH) is proposed to improve speech enhancement under non-stationary acoustic noise environments. In this approach, the EMD computation estimates the IMFs over a small window that slides along the time axis and whose size depends on the signal's frequency band. To compute consecutive IMFs for each frame, the number of sifting iterations is determined by decomposing many signal windows with a standard algorithm and calculating the average number of sifting steps. After that, the Hurst exponent is applied to all IMFs to select the components used to reconstruct the speech signal. Thus, the time complexity of speech enhancement is reduced with an appropriate decomposition quality. Finally, the experimental results show that the proposed SWEMDH approach performs better than the existing EMDH in speech enhancement under non-stationary noise scenarios.
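For reference, the waveform-level metrics used in the evaluation (SNR, MSE, MAE, PSNR) follow directly from their definitions; below is a minimal NumPy sketch. PESQ is omitted, since it requires the full ITU-T P.862 model rather than a few lines of arithmetic.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10*log10 of the average power ratio."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))

def mse(x, x_hat):
    """Mean square error between original and reconstructed signals."""
    return np.mean((x - x_hat) ** 2)

def mae(x, x_hat):
    """Mean absolute error between original and reconstructed signals."""
    return np.mean(np.abs(x - x_hat))

def psnr_db(x, x_hat):
    """PSNR in dB relative to the peak amplitude of the original signal."""
    return 10.0 * np.log10(np.max(np.abs(x)) ** 2 / mse(x, x_hat))
```

These functions assume `x` and `x_hat` are aligned, equal-length arrays; in practice the enhanced signal must be synchronized with the clean reference before scoring.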