Heart Rate Detection and Classification from Speech Spectral Features Using Machine Learning

Measurement of vital signs of the human body such as heart rate, blood pressure, body temperature and respiratory rate is an important part of diagnosing medical conditions, and these are usually measured using medical equipment. In this paper, we propose to estimate an important vital sign – heart rate – from speech signals using machine learning algorithms. Existing literature, observation and experience suggest the existence of a correlation between speech characteristics and physiological, psychological as well as emotional conditions. In this work, we estimate the heart rate of individuals by applying machine learning based regression algorithms to Mel frequency cepstrum coefficients, which represent speech features in the spectral domain as well as the temporal variation of spectral features. The estimated heart rate is compared with the actual measurement made using a conventional medical device at the time of recording speech. We obtain estimation accuracy close to 94% between the estimated and actual measured heart rate values. Binary classification of heart rate as 'normal' or 'abnormal' is also achieved with 100% accuracy. A comparison of machine learning algorithms in terms of heart rate estimation and classification accuracy is also presented. Heart rate measurement using speech has applications in remote monitoring of patients and professional athletes, and can facilitate telemedicine.


Introduction
Vital signs of the human body are conventionally measured using medical equipment. Such equipment can be complicated to use, expensive and inconvenient for the patient or individual. This is especially true for measuring the vital signs of athletes during their training, which involves intense physical activity. Connecting electrodes, sensors and other medical equipment to athletes while they are training is likely to be intrusive and affect their performance. Physiological as well as emotional changes in an individual result in variations in the speech produced (Trouvain, Truong, 2015; Science Encyclopedia, 2019; Borkovec et al., 1974; Ramig, 1983; Reynolds, Paivio, 1968). Ageing, health condition, stress level, exposure to pollution as well as physical exercise and activity are some factors which can cause physiological changes in the human body. While existing literature suggests that the speech production process is affected by physiological changes in individuals, the effect of such physiological changes on the actual speech parameters needs thorough investigation (Trouvain, Truong, 2015). The work presented in this paper is directed towards the estimation of heart rate from features extracted from speech signals using machine learning. While there is sufficient evidence from published literature linking physiological and emotional conditions to speech production, research on the accurate estimation of physiological parameters from speech is still at a nascent stage. If the prediction is indeed accurate, it would substantiate clinical findings that there exists a correlation between the physiological condition and speech characteristics of individuals, and pave the way for non-invasive, non-contact, remote medical monitoring. It should be noted, however, that such speech based medical monitoring shall complement existing medical devices rather than replace them.
The topic of this research has the potential for rapid development leading to a plethora of application scenarios, if estimation accuracy is improved. The results presented in this article will have valuable impact and can lead to interdisciplinary research involving electronics, signal processing and medicine. By making it possible to measure vital parameters of the human body without complex and expensive medical equipment, it can reduce the cost of medical diagnosis and treatment. Furthermore, it can enable medical practitioners and professional sport trainers to monitor patients or athletes from a remote location by collecting their speech samples over the telephone or internet.

Related work
The existence of a relationship between human speech and physiological parameters is evident from published literature. In (Schuller et al., 2013), measurement of heart rate and skin conductance, as well as classification of pulse rate as 'high' or 'low', has been done using audio recordings of breath and sustained vowel sounds with nominal accuracy. Extraction of electrocardiogram (ECG) features from the two dimensional spectrum of vowel speech is demonstrated in (Skopin, Baglikov, 2009; Mesleh et al., 2012), in which the vowel sound 'i' as in the word 'email' is shown to yield better accuracy compared to the other vowel sounds. Heart rate extraction using statistical analysis of speech is presented in (Kaur, Kaur, 2014), but there is no mention of the accuracy of the technique. A data mining approach is used in (Sakai, 2015b) to establish a correlation between heart rate and vocal frequency, from which heart rate is estimated using multiple speech recordings from only two users. A comparison of different classifiers to detect emotions based on Mel frequency cepstrum coefficients (MFCC) is presented in (James, 2015). Blood pressure (BP) detection from speech using support vector machines (SVM) is suggested to be feasible in (Sakai, 2015a), with high correlation between estimated and actually measured values of BP. It is also shown in (Orlikoff, Baken, 1989) that the heartbeat has an influence on the vocal fundamental frequency, causing it to fluctuate. In (Schuller et al., 2014), measurement of heart rate and skin conductance from various speech features, using machine learning algorithms such as support vector regression (SVR), SVM, artificial neural networks (ANN) as well as linear regression, has been presented with moderate accuracy, and the authors conclude that MFCC features are particularly relevant for the task of measuring heart rate from speech.
Heart rate is affected by the physical activity performed by an individual and, based on observation, speech is affected when performing physical activity such as exercise or sport. It has been shown in (Usman, 2017) that the accuracy of an MFCC-based speaker recognition system is reduced when speech is recorded immediately after intense physical activity, suggesting that speech features are altered. Heart rate variation depends on the level of physical activity as well as the fitness of the individual, in addition to other factors (James, 2015). Furthermore, the physiological response to activity depends on the intensity, duration and regularity of performing the activity (Burton et al., 2004). These observations strongly suggest that there exists some correlation between speech and heart rate, which provides a basis and motivation for conducting this research. Accurate prediction of heart rate and other physiological parameters based on speech signals has the potential to revolutionize medical care by monitoring patients remotely and providing timely medical intervention. With the advent of telemedicine and the wide availability of portable medical devices, this could be a game changer, as the ubiquitous and humble smartphone could extend its functionality as a medical device without the need for additional sensors.
Non-contact based measurement of physiological parameters such as heart rate, heart rate variability, respiratory rate and blood volume pulse, by applying independent component analysis (ICA) to facial images and video, has been proposed in (Poh et al., 2011). Extraction of heart rate, heart rate variability, blood oxygen saturation and breathing rate using video of the finger tip has been presented in (Scully et al., 2012). While there is significant evidence from the literature suggesting the existence of a correlation between speech and certain physiological parameters, the focus of this work is to measure heart rate from speech and compare it with the actual heart rate measured concurrently, at the time of recording speech, using a conventional medical device.

Speech samples and heart rate measurement
Speech recordings have been made for 42 individuals, all male, in the age group of 20-45 years, using a Logitech H540 headphone set, which is equipped with a noise-cancelling microphone to minimize background noise, in a quiet office environment. The sentence uttered is 'A quick brown fox jumped over the lazy dogs', chosen in order to capture the sounds of all letters in the English alphabet. The duration of each audio recording is 5 seconds in stereo format and the sampling rate is f_s = 16 000 samples per second, which is a standard value used in speech processing since it corresponds to a wideband (8 kHz) representation of speech that faithfully restores all frequency components of the speech signal (Usman et al., 2018). As most of the salient features of speech lie within the 8 kHz bandwidth, increasing the sampling rate beyond 16 000 samples per second yields diminishing returns while increasing the length of the data and hence the computational complexity. A lower sampling rate can cause aliasing of some high frequency components, and hence 16 000 samples per second is considered a reasonable choice that avoids aliasing effects as well as unnecessary increases in complexity. Higher sampling rates are required for non-speech sounds such as breathing sounds, cough sounds etc. The focus of this article is on speech sounds and therefore 16 000 samples per second is an appropriate sampling rate. The recordings are stored on a PC in uncompressed WAV file format that uses a quantization depth of 16 bits per sample (Kabal, 2017), resulting in an audio bit rate of 256 kbps. Heart rate measurements of each individual are taken using a pulse oximeter (CONTEC Model No. CMS50DL) concurrently during the speech recording. These measurements are used for comparison with the predicted heart rate values to obtain the accuracy of the machine learning methods used to predict heart rate from speech. A pulse oximeter is a medical device that is attached to the finger tip to measure pulse rate and blood oxygen saturation. "Pulse rate is exactly equal to the heart rate as the contraction of the heart leads to a noticeable pulse" (MacGill, 2017).

Speech pre-processing
The speech recordings are preprocessed to remove unwanted components as well as silence intervals in speech that may have been introduced during the recording process. PC audio cards introduce a small DC offset (Partila et al., 2012), which is removed using a DC removal filter, a first order infinite impulse response (IIR) filter. Silence intervals in the recorded sentence, which do not contain voice activity, are removed using a voice activity detection (VAD) algorithm (Tan, Lindberg, 2010). The VAD algorithm identifies speech frames containing voice activity by assigning a higher frame rate to consonant sounds, a lower frame rate to vowel sounds and no frames to silence intervals. The effect of noise is also mitigated by the VAD algorithm using a posteriori signal to noise ratio (SNR) weighting to emphasize reliable segments of speech even under low SNR. The identified frames are then concatenated, resulting in the uttered sentence without silence intervals and with improved SNR.
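The preprocessing chain can be sketched as follows. This is a minimal illustration rather than the exact implementation used in the paper: the DC removal step is a standard first-order IIR high-pass filter, while the silence-removal step is a crude energy-threshold stand-in for the VAD of (Tan, Lindberg, 2010); the filter coefficient `alpha` and the `threshold_ratio` value are illustrative assumptions.

```python
import numpy as np

def remove_dc(x, alpha=0.995):
    """First-order IIR DC-removal filter: y[n] = x[n] - x[n-1] + alpha * y[n-1]."""
    y = np.zeros(len(x), dtype=float)
    prev_x, prev_y = 0.0, 0.0
    for n, sample in enumerate(x):
        prev_y = sample - prev_x + alpha * prev_y
        prev_x = sample
        y[n] = prev_y
    return y

def drop_silence(x, frame_len=256, threshold_ratio=0.05):
    """Crude energy-based VAD stand-in: keep frames whose energy exceeds a
    fraction of the maximum frame energy, then concatenate the kept frames."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > threshold_ratio * energy.max()
    return frames[keep].reshape(-1)
```

A constant (pure DC) input decays towards zero at the filter output, while silent frames are dropped entirely by the energy gate.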

Feature extraction
Feature extraction is a term derived from the discipline of pattern recognition and refers to characterizing a signal in a manner that allows some algorithm to recognize a pattern (Wolf, 1980). We extend this definition to "characterizing a signal in a manner that allows some algorithm to recognize a pattern or some 'intrinsic' parameter associated with that pattern". We conjecture that such an intrinsic parameter, obtained from patterns in speech features, is a representation of a physiological parameter such as the heart rate of the individual who uttered that speech. This is based on the fact that the speech production process involves the movement of air from the lungs and through the vocal tract. As the lungs interact with the heart for oxygenation of blood, it is suggested in (Reilly, Moore, 2003) that cardiovascular responses are affected by cognitive activity such as reading, which involves speech production. As breathing is utilized for speech production, the inhalation and exhalation rates are governed by the speech production mechanism, thus altering the breathing pattern during speech production (Von Euler, 1982). Heart rate variability due to changes in respiratory pattern, termed respiratory sinus arrhythmia (RSA), is discussed in (Yasuma, Hayano, 2004). These strongly suggest that speech signals contain information about heart rate, and perhaps other physiological parameters, which may be determined by extracting appropriate speech features and processing those features using machine learning algorithms. Results presented in this article indeed validate this idea, as heart rate values are obtained from speech features with a high degree of accuracy. A variety of speech feature extraction techniques are available in the literature for speech processing applications such as speech recognition, speaker recognition and speech enhancement. Some of the well-known techniques are linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and feature extraction based on principal component analysis (PCA) and wavelets (Magre et al., 2013).
LPC, which represents speech parameters by an all-pole filter using auto-regressive modeling, and LPCC, which are cepstral coefficients computed from a smoothed auto-regressive power spectrum, were widely used in automatic speech recognition until the introduction of MFCC (Huang et al., 2001). Since their introduction in 1980, MFCCs have been widely used in several speech processing applications and are considered the most popular feature extraction method. Discriminating features in speech are better represented in the spectral domain, and the temporal variation of spectral components also has a significant effect in characterizing speech. MFCC captures the spectral domain details along with their temporal variations elegantly with a low dimensional feature vector. A detailed discussion of MFCC is available in (Davis, Mermelstein, 1980). PLP is a spectral warping technique used to model and obtain an estimate of the human auditory spectrum (Hermansky, 1990) and is more suited to speech recognition applications. PCA is used to reduce the dimensionality of feature vectors by transforming them to a lower dimension (Huang et al., 2001). Wavelets have also been proposed in the literature to obtain a modified version of MFCC in which the discrete wavelet transform (DWT) is applied instead of the discrete cosine transform (DCT) in the MFCC computation process, resulting in what is termed Mel frequency discrete wavelet coefficients (MFDWC) (Tufekci, Gowdy, 2000). DWT has the advantage of providing better time-frequency resolution, but there is not enough evidence from the literature to purport a broad range of applications for DWT based speech features. Relative spectra (RASTA) is a technique which focuses on mitigating channel effects to improve speaker recognition systems. It is suggested that RASTA makes short term spectrum based techniques such as PLP more robust to linear spectral distortions (Hermansky, Morgan, 1994). In this work, MFCCs have been used as features applied to machine learning algorithms in order to estimate heart rate from speech signals. MFCC features have been chosen due to their ability to capture spectral details along with their temporal variations. Some existing results in the literature to estimate heart rate from speech are also based on MFCC.

MFCC computation
Implementation of MFCC computation is available in (Davis, Mermelstein, 1980). The specifics of MFCC computation in the context of this work are described here for the sake of completeness. The preprocessed speech signal, in which silence intervals are removed, is framed using a Hamming window of length 256 samples with 50% overlap between adjacent windows. At a sampling rate of f_s = 16 000 samples per second, this corresponds to each frame having a length of 16 ms, which is within the 20-25 ms stationarity duration of speech signals, and an overlap duration of 8 ms. Each frame 'i' is denoted as x_i(n), where n = 1, 2, ..., 256. An N-point Fast Fourier Transform (FFT) with N = 256 is computed for each 16 ms speech frame to obtain the spectrum of that segment. The combined process of windowing and FFT is represented as

X_i(k) = FFT{x_i(n) w(n)}, k = 1, 2, ..., K,

over the entire range of i, i.e. the total number of frames, where w(n) is the Hamming window, k denotes the discrete Fourier transform (DFT) coefficients computed using the FFT and K = 256. The energy spectral estimate of each frame is then computed as

E_i(k) = |X_i(k)|^2.

The energy spectral estimate is used rather than the power spectral estimate because the length of each speech recording is short (less than 5 seconds) and the frame duration of 16 ms is not considered infinitesimally small relative to the length of each recording. The power spectral estimate is used when the signal duration is long enough to be considered infinite relative to the frame duration over which power is computed (Oppenheim, Verghese, 2015). A Mel filterbank comprising 20 triangular filters with 50% overlap between adjacent filters is then applied to each frame. Since the sampling rate is 16 000 samples per second, the frequency range for each frame extends from zero Hz to 8000 Hz. The corresponding minimum and maximum Mel frequencies are zero Mels and 2834.99 Mels respectively, obtained using the mapping (Lyons, 2012)

m = 1125 ln(1 + f/700).

To generate a filterbank with 20 filters, 22 linearly spaced points are generated between zero and 2834.99 Mels. The first Mel window extends from zero to 270 Mels, the second Mel window from 135 to 405 Mels, and so on. The conversion from 'Mels' to 'Hz' is performed using the inverse mapping

f = 700 (e^{m/1125} - 1),

resulting in f = {0, 89.2, 189.9, 303.3, ...} Hz. Each frequency point f is then mapped to an FFT bin index as

bin(f) = ⌊(N + 1) f / f_s⌋,

where ⌊.⌋ is the floor operator, N is the number of FFT points used, and f_s is the sampling rate. For these chosen values, the FFT bin corresponding to 8000 Hz is bin 128. Thus, 20 Mel filter windows are produced, each having a length of 256, which is chosen to be the same as the number of FFT points computed for each frame. Each of the 20 Mel filters is multiplied with the energy spectrum E_i(k) and the coefficients are summed to obtain the energy within each band. For each 16 ms speech frame, this results in a vector of length 20 where each element represents the signal energy within a Mel filter band. The log-energy is computed by taking the logarithm of the 20 filter-bank energies. The log filter-bank energies so obtained have a high degree of correlation due to the overlapping filters in the filterbank.
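The paper's numerical values (2834.99 Mels for 8000 Hz, edge frequencies of 89.2, 189.9 and 303.3 Hz, and FFT bin 128) are consistent with the natural-logarithm form of the Mel mapping, m = 1125 ln(1 + f/700). The following sketch reproduces those values; the choice of this exact Mel formula is our inference from the quoted numbers, not an explicit statement in the paper.

```python
import math

def hz2mel(f):
    """Mel mapping consistent with the paper's numbers: m = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel2hz(m):
    """Inverse mapping: f = 700 (e^{m/1125} - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def freq_to_fft_bin(f, n_fft=256, fs=16000):
    """FFT bin index for a frequency point: floor((N + 1) * f / f_s)."""
    return math.floor((n_fft + 1) * f / fs)
```

With 16 000 samples per second, `hz2mel(8000)` gives approximately 2834.99 Mels, `mel2hz(135)` and `mel2hz(270)` give the 89.2 Hz and 189.9 Hz edge frequencies, and `freq_to_fft_bin(8000)` gives bin 128.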
A decorrelation transform is applied to decorrelate the Mel-spectral vector. The discrete cosine transform (DCT) is shown to be a near optimal decorrelation transform for log spectra of speech (Merhav, Lee, 1993; Logan, 2000). The DCT is therefore applied to the 20 log filter-bank energies, resulting in 20 coefficients for each 16 ms speech frame, which are called the Mel frequency cepstral coefficients. MFCCs are computed for all the frames that comprise the preprocessed speech signal, resulting in a matrix of size 20 × I, where I is the total number of frames. This matrix of MFCC coefficients is analyzed using machine learning algorithms to estimate the heart rate of individuals from their speech signals and also to classify heart rate as 'normal' or 'abnormal'.
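The complete MFCC pipeline described above (framing, windowing, FFT, energy spectrum, Mel filterbank, log, DCT) can be sketched from scratch as follows, under the paper's parameters (16 000 samples per second, 256-point frames with 50% overlap, 20 Mel filters). Helper names such as `mel_filterbank` and `dct_ii` are our own, and a production system would typically rely on an established library implementation.

```python
import numpy as np

FS, N_FFT, N_MELS = 16000, 256, 20

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, fs=FS):
    """Triangular Mel filterbank with 50% overlap, using m = 1125 ln(1 + f/700)."""
    top_mel = 1125.0 * np.log(1.0 + (fs / 2) / 700.0)
    edges_hz = 700.0 * (np.exp(np.linspace(0.0, top_mel, n_mels + 2) / 1125.0) - 1.0)
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, mid, hi = bins[m], bins[m + 1], bins[m + 2]
        for k in range(lo, mid):                       # rising slope
            fbank[m, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):                       # falling slope
            fbank[m, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def dct_ii(v):
    """Type-II DCT used to decorrelate the log Mel spectrum."""
    n = len(v)
    k = np.arange(n)
    return np.array([np.sum(v * np.cos(np.pi * (2 * k + 1) * j / (2 * n)))
                     for j in range(n)])

def mfcc(signal, n_fft=N_FFT):
    """Return the MFCC matrix (20 x I): frame, window, FFT, energy spectrum,
    Mel filterbank, log, DCT."""
    hop = n_fft // 2                                   # 50% overlap
    window = np.hamming(n_fft)
    fbank = mel_filterbank()
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        x = signal[start:start + n_fft] * window
        energy = np.abs(np.fft.rfft(x, n_fft)) ** 2    # E_i(k) = |X_i(k)|^2
        log_mel = np.log(fbank @ energy + 1e-12)       # 20 log filterbank energies
        frames.append(dct_ii(log_mel))                 # decorrelate with DCT
    return np.array(frames).T
```

For a 1 second recording at 16 000 samples per second, this produces a 20 × 124 coefficient matrix (124 half-overlapped 256-sample frames).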

Predictive analytics for heart rate estimation
In this study, we have utilized the Microsoft Azure Machine Learning Studio (MAMLS) cloud platform, which is accessible through a web interface. MAMLS allows for high volume secure data storage and transmission, computational analytics and remote visualization. Machine learning algorithms available in MAMLS have been tuned and configured to maximize resting heart rate (HR) regression and classification accuracy. Machine learning techniques learn the statistical relationship between input data (e.g. MFCC coefficients extracted from speech signals) and output data (e.g. HR) by fitting a flexible model to the data. The model hyper-parameters are optimized to minimize the regression/classification error on an independent test dataset, thereby creating a generalized model that performs well not only on the training dataset, which alone would give rise to an over-fitted solution, but also on the test dataset. For comparative analysis, six state-of-the-art machine learning algorithms available in MAMLS have been considered for regression (numerical estimation of the HR of an individual) and binary classification analysis. They are briefly summarized below. Linear Regression (LiR) is a very common statistical method utilized in machine learning for fitting a line to the input features and measuring the error. LiR tends to work well on high dimensional data sets that lack complexity (Kutner et al., 2004). Boosted Decision Tree (BDT) is an ensemble learning technique in which each consecutive tree corrects for the errors of the previous tree, thereby minimizing classification error. Class and value predictions are based on the entire ensemble of trees (Bühlmann, Yu, 2003). Decision Forest (DF) is another ensemble learning technique, in which each generated tree votes for the most popular class (Criminisi et al., 2011).
Neural Networks (NNs) are a set of interconnected layers. A typical NN comprises neurons arranged in three types of layers: the input feature set forms the first layer and is linked to the output layer via several interconnected hidden layers in the middle of the network. Each neuron processes its input variables and passes the computed values to the neurons in the subsequent layer (Zhang et al., 1998). Logistic Regression (LoR) is another statistical technique for analyzing data in which one or more independent variables determine an outcome. The outcome is normally measured with a dichotomous variable (i.e. having only two possible outcomes) (Dreiseitl, Ohno-Machado, 2002).
Support Vector Machines (SVM) work on the basic principle of recognizing patterns in a multi-dimensional hyper-plane, estimating a maximum margin between samples of the binary classes in a multi-dimensional input feature space (Nasrabadi, 2007). All of the algorithms mentioned above have been successfully used in various application domains due to their relatively fast training, excellent performance and robustness to over-fitting. The performance of the various proposed regression and binary classification models is evaluated based on the metrics listed in Table 1. Depending on the task, the listed evaluation metrics (Roychowdhury, Bihis, 2016) for regression or binary classification are used.

Notations: n - total number of samples in the dataset, n_T - number of samples in the training dataset, n_t - number of samples in the test dataset, n_k - number of classes for binary classification, tp - total number of true positive samples, tn - total number of true negative samples, fp - total number of false positive samples, fn - total number of false negative samples, a - actual value, ā - mean of actual values, p - predicted value, RoC - receiver operating characteristic curve.
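Using this notation, the core evaluation metrics can be sketched as follows. The exact contents of Table 1 are not reproduced here; the formulas below are the standard definitions of RMSE, MAE, the coefficient of determination (R²), accuracy, precision, recall and F1-score, which we assume match those in the table.

```python
import math

def regression_metrics(actual, predicted):
    """RMSE, MAE and R^2 in the Table 1 notation:
    a = actual value, a_bar = mean of actual values, p = predicted value."""
    n = len(actual)
    a_bar = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - a_bar) ** 2 for a in actual)
    return {
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "R2": 1.0 - sse / sst,
    }

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from the confusion counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PRE": pre,
        "REC": rec,
        "F1": 2 * pre * rec / (pre + rec),
    }
```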

Data preprocessing
The raw dataset comprises the measured HR of 42 individuals and their corresponding MFCC frames. For each individual, we have a matrix of size 20 rows (coefficients in each frame) × 385 columns (frames), resulting in n = 323,400 coefficients for all 42 individuals. It is understood that the HR of an individual remains unchanged during short time intervals, such as the duration of the speech segments in our dataset. Here, we have utilized the measured HR-MFCC dataset to develop a numeric HR prediction (regression) model and a binary classification model using machine learning statistical techniques.
Initially, feature ranking was performed to determine which MFCC frames are statistically significant for the regression-classification study. We utilized the Filter Based Feature Selection (FBFS) module in MAMLS to score all 385 MFCC frames in our dataset using Pearson's correlation coefficient (Lin, 1989). Based on this score, more than 95% of the MFCC frames are found to be statistically significant. Hence, for our regression and classification study, we utilized all 385 MFCC frames.
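The FBFS scoring step can be approximated outside MAMLS with a plain Pearson correlation computation. This sketch is our stand-in for the MAMLS module, and the significance threshold of 0.05 is an illustrative value, not one taken from the paper.

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation between each feature column of X
    (n_samples x n_features, e.g. one column per MFCC frame) and the
    target y (e.g. measured HR)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

def select_features(X, y, threshold=0.05):
    """Indices of features whose score clears the (illustrative) threshold."""
    return np.where(pearson_scores(X, y) >= threshold)[0]
```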
The dataset was also checked for missing values in the extracted MFCC coefficients and was then normalized using a MinMax normalizer to scale the MFCC coefficients to the [0, 1] interval. Normalization is performed by subtracting the minimal value from each MFCC coefficient (denoted as x) so that the minimal value is 0, and dividing by the new maximal coefficient value, as follows:

x' = (x − min(x)) / (max(x) − min(x)).

Rows which had missing MFCC coefficients were discarded and not used in the analysis. Of the 42 × 20 = 840 rows of MFCC coefficients, 60 rows were discarded due to missing MFCC coefficients. This results in a total of 780 rows of MFCC coefficients with their corresponding HR values. A histogram of the HR values corresponding to these MFCC coefficients is shown in Fig. 1. Since the dataset has more MFCC frames than individual samples, in this study we have subjected the data to a (n_T = 80%, n_t = 20%) split to ensure more samples are available for training and learning.
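The normalization and split steps can be sketched as follows. The random seed and the shuffling strategy are our assumptions, since the paper does not specify how the 80/20 split was drawn.

```python
import numpy as np

def minmax_normalize(X):
    """Scale each coefficient column to [0, 1]:
    x' = (x - min(x)) / (max(x) - min(x))."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0              # guard against constant columns
    return (X - lo) / span

def train_test_split(X, y, train_frac=0.8, seed=0):
    """Shuffle and split the rows into n_T = 80% training / n_t = 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```

Applied to the 780 retained rows, this yields 624 training rows and 156 test rows.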

Regression analysis
The goal of this study is to apply the regression machine learning algorithms mentioned in Subsec. 3.5 to the aforementioned dataset for predicting the HR of an individual from the MFCC coefficients extracted from speech signals. The schematic of the processing performed after feature extraction is depicted in Fig. 2.
We trained two models using the four optimally parameterized regression algorithms, one without the predefined class data and the second with the predefined class data. It was observed that the HR prediction accuracy of the trained model with the inclusion of predefined class data is significantly higher for all the ML regression algorithms. The estimated heart rate obtained with and without HR class information, along with the actual measured HR, is shown in Figs 3-6 for the BDT, NN, LiR and DF algorithms respectively. The measured HR of the 42 individuals was divided into 5 classes as shown in Table 2.
All of the regression algorithms were optimally parameterized to achieve the best performance (the one with the lowest RMSE and highest R² values). For each of the aforementioned regression algorithms, the mean of the five evaluation metrics was computed after 10-fold cross validation, as shown in Table 3, in which the standard deviation between folds for each of the performance metrics is listed inside round brackets. The coefficient of determination (R²) metric is widely used for exemplifying the predictive power of a regression model as a value between 0 and 1, with 1 being a perfect fit. We plot in Fig. 7 the four regression models as a function of CoD (plotted as %) to predict the HR from the speech MFCC coefficient dataset. It is observed from Table 3 and Fig. 7 that the best performance (RMSE = 2.95, CoD = 0.94) is achieved by the BDT algorithm. The 4 trained models were also compared on a test dataset consisting of 20 measured heart rate samples to predict the HR from the test data MFCC frames. A comparison of estimated HR values obtained using the BDT, NN, LiR, and DF algorithms with the actual measured HR values is shown in Fig. 8.
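As an illustration of the 10-fold cross-validation protocol, the following sketch evaluates a plain least-squares linear regressor, the simplest of the compared algorithms, and reports the mean RMSE and R² across folds. The MAMLS implementations of BDT, NN and DF are not reproduced here; this is a minimal stand-in for the evaluation procedure only.

```python
import numpy as np

def kfold_cv_linear(X, y, k=10, seed=0):
    """10-fold cross-validated least-squares linear regression, returning
    the mean RMSE and coefficient of determination (R^2) across folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    rmses, r2s = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.c_[X[train], np.ones(len(train))]       # add intercept column
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[X[test], np.ones(len(test))] @ w
        err = y[test] - pred
        rmses.append(np.sqrt(np.mean(err ** 2)))
        r2s.append(1.0 - np.sum(err ** 2) / np.sum((y[test] - y[test].mean()) ** 2))
    return float(np.mean(rmses)), float(np.mean(r2s))
```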

Binary classification analysis
As shown in Fig. 9, for the binary classification study, the measured resting HR of the 42 individuals in the dataset was divided into two binary classes: Class 1, Normal (i.e. 60-100 bpm), while anything below 60 bpm or above 100 bpm was classed as Class 0, Abnormal (Laskowski, 2018). The preprocessed HR-MFCC dataset utilized for regression was also applied to each of the aforementioned binary classification algorithms. We then compute the performance evaluation metrics, i.e. PRE, REC, ACC, F1-score, and the area under the curve (AUC) computed from the receiver operating characteristic (RoC) plot, after 10-fold cross validation.
Optimal parameterization of each algorithm was performed to achieve the best classification performance (i.e. highest ACC and F1-score) on the test dataset. The best accuracy (ACC = 100%) and F1-score = 1 was again achieved using the BDT binary classification algorithm, as observed in Table 4. Finally, we tested the models trained using the binary classification algorithms on a test dataset of 20 samples. It can be observed from Table 5 that the BDT trained model is able to accurately classify all 20 test samples.
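The class labeling and the classification metrics can be sketched as follows. The thresholds implement the 60-100 bpm Normal range from (Laskowski, 2018); the helper names are our own.

```python
def hr_class(bpm):
    """Class 1 (Normal) for resting HR in 60-100 bpm, else Class 0 (Abnormal)."""
    return 1 if 60 <= bpm <= 100 else 0

def confusion_counts(actual, predicted, positive=1):
    """Confusion-matrix counts (tp, tn, fp, fn) for a binary classifier."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """F1-score from confusion counts, guarding the zero-division cases."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * pre * rec / (pre + rec) if pre + rec else 0.0
```

A model that classifies every test sample correctly, as BDT does here, yields F1-score = 1.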

Conclusions
Speech signals contain intrinsic information regarding the physiological, psychological as well as emotional conditions of the speaker. Accurate measurement of physiological parameters using speech signals can facilitate remote monitoring of patients and early diagnosis of medical conditions. The focus of this work is on estimating heart rate, a vital sign, from the speech signals of individuals. Heart rate estimation with high accuracy is achieved using speech spectral domain features (MFCC) as input to machine learning algorithms such as LiR, BDT, DF, and NN. HR estimation accuracy is highest for the BDT algorithm. In addition to estimating the heart rate, a binary classification scheme is also implemented to classify an individual's heart rate as 'normal' or 'abnormal'. Five techniques, BDT, DF, NN, LoR, and SVM, have been evaluated to address the classification problem. High accuracy is achieved for all five techniques, with DF having an accuracy close to 90% and BDT achieving 100% classification accuracy. Due to the unbalanced nature of the dataset used in this work, the F1-score is a more indicative performance metric. Based on the F1-score as well, the BDT algorithm has the best classification performance, followed by the DF algorithm. Such high accuracies have been obtained by labeling the samples with predefined class information.
The proposed method has the following advantages over other methods available in the literature. In (Schuller et al., 2013), a classification accuracy of 82.7% and a minimum MAE for HR estimation equal to 8.1 is reported. In comparison, the classification accuracy in this work is 100% using the BDT algorithm and the MAE is less than 5 for all four algorithms used. While an accuracy greater than 95% is reported in (Mesleh et al., 2012), it is restricted to only vowel sounds having a duration of at least 6 seconds and involves a lengthy procedure for each measurement. In contrast, the results in this work are not restricted to vowel sounds and, once the AI algorithms are trained, the testing phase is relatively simple in terms of implementation complexity. It is indicated in (Kaur, Kaur, 2014) that the accuracy of HR estimation from speech depends on various factors, without actually quantifying it. Furthermore, voice recordings of 60 s duration are used there, as compared to segments of less than 5 s in this work. The work in (Sakai, 2015b) is based on speech signals from only two individuals and the accuracy achieved is not specified. The classification of emotions based on speech MFCC features reported in (James, 2015) exhibits large variation in accuracy across individuals. Compared to results available in the literature, our results indicate better accuracy with fewer constraints. A limitation of this work is the small sample size (42) and the lack of female speech samples, which will be addressed going forward. It is intended to collect data from more individuals representing a much broader segment of the population, which would further generalize the findings reported in this article.
Future work aims to achieve high accuracy without predefined class labeling and to detect atrial fibrillation for early detection of stroke. Measuring other physiological parameters such as blood pressure, as well as monitoring psychological and emotional conditions based on speech signals, shall also be investigated in future. It is also intended to investigate the feasibility of using novel speech features, instead of MFCCs, to measure physiological parameters from speech. The use of deep learning on raw speech signals, rather than features extracted from speech signals, shall also be investigated in future. The effect of varying the acoustic devices used for recording, as well as varying the parameters of the recording devices, is also a part of future work. Developing and training algorithms which are agnostic to the recording device will make the application of this work more useful and involves collecting data from individuals using multiple acoustic devices.

Fig. 8. Comparison of the measured (actual) and the predicted HR from the 4 trained models using BDT, NN, LiR, and DF regression algorithms on 20 test samples.

Table 1. List of performance evaluation metrics.

Table 2. Heart rate classes for regression analysis.

Table 4. Performance metrics computed for binary classification.