This study sought to evaluate the effect of speech intensity on performance of the Callsign Acquisition Test (CAT) and Modified Rhyme Test (MRT) presented in noise. Fourteen normal-hearing listeners performed both tests in 65 dB(A) white background noise. Speech intensity varied while background noise remained constant, forming speech-to-noise ratios (SNRs) of -18, -15, -12, -9, and -6 dB. Results showed that CAT recognition scores were significantly higher than MRT scores at the same SNRs; however, the scores from both tests were highly correlated, and their relationship over the SNRs tested can be expressed by a simple linear function. The CAT concept can be readily ported to other languages for testing speech communication under adverse listening conditions.
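The fixed-noise, variable-speech setup described above amounts to scaling the speech signal so that its level sits a target number of dB relative to the noise. A minimal sketch, assuming an RMS-based level definition; the helper names are illustrative, not the authors' procedure:

```python
import math

def rms(x):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def scale_speech_to_snr(speech, noise, target_snr_db):
    """Return a scaled copy of `speech` whose RMS level lies
    `target_snr_db` dB relative to the RMS of the fixed `noise`."""
    current_snr_db = 20.0 * math.log10(rms(speech) / rms(noise))
    gain = 10.0 ** ((target_snr_db - current_snr_db) / 20.0)
    return [s * gain for s in speech]
```

Calling this with `target_snr_db=-12.0`, for example, would reproduce one of the five conditions while leaving the 65 dB(A) noise untouched.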
A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition: it enables the use of more accurate acoustic models, since the lengths of phonemes provide additional information for parametrization. Our method relies on the values of the power envelopes and their first derivatives in six frequency subbands. Scenarios typical of phoneme boundaries are searched for; discrete times at which such events occur are noted and graded using a distribution-like event function, which represents the change of the energy distribution in the frequency domain. The exact definition of this function is given in the paper. The final decision on the localization of boundaries is taken by analysing the event function, so boundaries are extracted using information from all subbands. The method was developed on a small set of hand-segmented Polish words and tested on a larger corpus containing 16 425 utterances. A recall and precision measure specifically designed to assess the quality of speech segmentation was adapted using fuzzy sets, yielding an F-score of 72.49%.
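For the boundary-evaluation step, a crisp (non-fuzzy) precision/recall matcher conveys the idea; the tolerance value and function name here are illustrative assumptions, and the fuzzy-set weighting used in the paper is not reproduced:

```python
def segmentation_f_score(reference, detected, tol=0.02):
    """Crisp F-score for boundary detection: a detected boundary
    is a hit if it lies within `tol` seconds of a still-unmatched
    reference boundary (each reference matched at most once)."""
    unmatched = list(reference)
    hits = 0
    for b in detected:
        for r in unmatched:
            if abs(b - r) <= tol:
                unmatched.remove(r)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The fuzzy-set adaptation reported in the paper replaces this hard tolerance with a graded membership, but the precision/recall skeleton is the same.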
The performance of binaural processing may be disturbed in the presence of hearing loss, especially of the sensorineural type. To assess the impact of hearing loss on speech perception in noise with respect to binaural processing, a series of speech recognition measurements was carried out under controlled laboratory conditions. The spatial conditions were simulated using dummy-head recordings played back over headphones. The Intelligibility Level Difference (ILD) was determined by measuring the change in speech reception threshold (SRT) between two configurations of the masking source (N) and the speech source (S): the spatially separated S0N90 condition (where the numbers stand for azimuth angles in the horizontal plane) and the co-located condition (S0N0). To disentangle the head-shadow (better-ear) effect from binaural processing in the brain, the difference between the binaural and monaural S0N90 conditions, the so-called Binaural Intelligibility Level Difference (BILD), was calculated.
Measurements were performed with a control group of normal-hearing listeners and a group of sensorineural hearing-impaired subjects. In all conditions the performance of the hearing-impaired listeners was significantly poorer than that of the normal-hearing listeners, resulting in higher SRT values (a 3 dB difference in the S0N0 configuration, 7.6 dB in S0N90, and 5 dB in monaural S0N90). The SRT improvement due to the spatial separation of the target and masking signals (ILD) was also larger in the control group (8.1 dB) than in the hearing-impaired listeners (3.5 dB). Moreover, a significant deterioration of binaural processing, as described by the BILD, was found in people with sensorineural deficits. For normal-hearing listeners this parameter ranged from 3 to 6 dB (4.6 dB on average), whereas in the hearing-impaired group it fell by more than half, to 1.9 dB on average (with a deviation of 1.4 dB). These findings could not be explained by the individual average hearing threshold (the standard measure in audiological diagnostics) alone. The outcomes indicate a contribution of suprathreshold deficits, and it may be useful to supplement pure-tone audiometry with binaural SRT measurements in noise, leading to better diagnostics and hearing aid fitting.
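The two quantities above reduce to simple SRT differences. A minimal sketch; the SRT values in the usage assertions are hypothetical, chosen only so their differences match the group means reported in the abstract (which gives the differences, not the underlying SRTs):

```python
def ild(srt_colocated, srt_separated_binaural):
    """Intelligibility Level Difference: the SRT gain obtained by
    spatially separating speech (S0) from noise (N90)."""
    return srt_colocated - srt_separated_binaural

def bild(srt_separated_monaural, srt_separated_binaural):
    """Binaural Intelligibility Level Difference: the part of the
    spatial gain attributable to binaural processing, with the
    head-shadow (better-ear) effect factored out."""
    return srt_separated_monaural - srt_separated_binaural
```

With hypothetical control-group SRTs of -6.0 dB (S0N0), -14.1 dB (binaural S0N90), and -9.5 dB (monaural S0N90), these give ILD = 8.1 dB and BILD = 4.6 dB, matching the reported averages.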
The main goal of the research was to obtain data on speech-in-noise recognition ability using a Polish word test (New Articulation Lists, NAL-93) with two different masking signals. An attempt was also made to standardise the background noise for Polish speech tests by creating a babble noise for NAL-93. Two types of background noise were used with the Polish word test: babble noise and speech noise. The short method was chosen for the study, as it provides results similar to the method of constant stimuli while using less word material. Both maskers were presented to 10 listeners with normal hearing.
The mean SRT values for NAL-93 were −3.4 dB SNR for speech noise and 3.0 dB SNR for babble noise; in this respect, babble noise was the more effective masker. However, the SRT for speech noise was closer to the values obtained for other Polish speech tests. The measurement of speech recognition using the Polish word test is thus possible with both types of masking signal presented in the study. Which type of noise would be better in the practice of hearing aid fitting remains an open question.
This study examined whether differences in reverberation time (RT) between typical sound field test rooms used in audiology clinics have an effect on speech recognition in multi-talker environments. Separate groups of participants listened to target speech sentences presented simultaneously with 0 to 3 competing sentences through four spatially separated loudspeakers in two sound field test rooms having RT = 0.6 s (Site 1: N = 16) and RT = 0.4 s (Site 2: N = 12). Speech recognition scores (SRSs) for the Synchronized Sentence Set (S3) test and subjective estimates of perceived task difficulty were recorded. The results indicate that the change in room RT from 0.4 to 0.6 s did not significantly influence SRSs in quiet or in the presence of one competing sentence. However, this small change in RT affected SRSs when 2 and 3 competing sentences were present, yielding mean SRSs about 8-10% better in the room with RT = 0.4 s. Perceived task difficulty ratings increased as the complexity of the task increased, with average ratings similar across test sites for each level of sentence competition. These results suggest that site-specific normative data must be collected for sound field rooms if clinicians wish to use two or more directional speech maskers during routine sound field testing.
Although emotions and emotion-based learning are individual-specific, their main features are consistent across people. Depending on a person's emotional state, various physical and physiological changes can be observed in pulse and breathing, blood flow velocity, hormonal balance, voice properties, facial expression, and hand movements. The diversity, magnitude, and degree of these changes are shaped by different emotional states. Acoustic analysis, an objective evaluation method, can be used to determine a person's emotional state from voice characteristics. In this study, the reflection of anxiety disorder in people's voices was investigated through acoustic parameters. The study is a cross-sectional case-control study. Voice recordings were obtained from healthy people and from patients, and 122 acoustic parameters were extracted from these recordings by acoustic analysis. The relation of these parameters to the anxious state was investigated statistically. According to the results obtained, 42 acoustic parameters vary in the anxious state: the subglottic pressure increases and the vocalization of the vowels decreases. The change in the MFCC parameter in the anxious state indicates that listeners can perceive this condition while listening to speech. It was also shown that text reading is effective in eliciting emotions. These findings show that the voice changes in the anxious state and that the acoustic parameters are influenced by it. Acoustic analysis can therefore be used as an expert decision-support system for the diagnosis of anxiety.
Laughter is one of the most important paralinguistic events, and it has specific roles in human conversation. The automatic detection of laughter occurrences in human speech can aid automatic speech recognition systems as well as paralinguistic tasks such as emotion detection. In this study we apply Deep Neural Networks (DNN) to laughter detection, as this technology is nowadays considered state-of-the-art in similar tasks such as phoneme identification. We carry out our experiments using two corpora containing spontaneous speech in two languages (Hungarian and English). Also, as we find it reasonable that not all frequency regions are required for efficient laughter detection, we perform feature selection to find a sufficient feature subset.
Although various speech enhancement techniques have been developed for different applications, existing methods remain limited in noisy environments with high ambient noise levels. Speech presence probability (SPP) estimation is a speech enhancement technique that reduces speech distortions, especially in low signal-to-noise ratio (SNR) scenarios. In this paper, we propose a new SPP estimator for speech enhancement in the time-frequency (T-F) domain, improved with two-dimensional (2D) Teager energy operators (TEOs). The wavelet packet transform (WPT), a multiband decomposition technique, is used to concentrate the energy distribution of the speech components. A minimum mean-square error (MMSE) estimator is derived from a generalized gamma distribution speech model in the WPT domain. In addition, speech samples corrupted by environmental and occupational noises (machine shop, factory, and station) at different input SNRs are used to validate the proposed algorithm. Results suggest that the proposed method achieves a significant improvement in perceptual quality compared with four conventional speech enhancement algorithms (MMSE-84, MMSE-04, Wiener-96, and BTW).
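The one-dimensional discrete Teager energy operator underlying the 2D version used above is standard: Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). A minimal sketch of this operator only; the 2D T-F extension and its integration into the SPP estimator are not reproduced:

```python
import math

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[n] = x[n]**2 - x[n-1] * x[n+1], for 1 <= n <= len(x) - 2.
    For a unit-amplitude sinusoid of digital frequency w, the output
    is the constant sin(w)**2, so the operator tracks amplitude and
    frequency jointly, which is what makes it useful for emphasizing
    speech energy in individual subbands."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1]
            for n in range(1, len(x) - 1)]
```

Applying this per WPT subband before SPP estimation is one plausible reading of the pipeline; the paper's exact 2D formulation should be consulted for details.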