Study on Chinese Speech Intelligibility Under Different Low-Frequency Characteristics of Reverberation Time Using a Hybrid Method

Reverberation time (RT) is an important indicator of room acoustics; however, most studies focus on the mid-high frequency RT and few on the low-frequency RT. In this paper, a hybrid approach based on geometric and wave methods is proposed to build a more accurate, wide-band room impulse response. The hybrid method uses finite-difference time-domain (FDTD) modeling at low frequencies and Odeon simulation at mid-high frequencies, and was applied to a university classroom. The influence of the low-frequency RT on speech intelligibility was explored. For the low-frequency part, different impedance boundary conditions were employed, and the effectiveness of the hybrid method was verified. The objective acoustical parameters and subjective listening experiments show that the smaller the low-frequency RT, the higher the Chinese speech intelligibility score. The syllables, consonants, vowels, and the syllable order also had significant effects on the intelligibility score.


Introduction
Concerning the reverberation time (RT) in a room, most studies pay attention only to the mid-high frequency RT, and few consider the low-frequency part. RT is the primary index in the acoustical design of all kinds of halls, but the requirement for RT in the low-frequency range is still controversial. Around the 1960s, Beranek defined the ratio of the low-frequency RT (125-250 Hz) to the mid-frequency RT (500-1000 Hz) as the bass ratio (BR) and put forward an ideal BR value of 1.1-1.5 (Beranek, 1962), namely the bass-rise reverberation characteristic. This bass-rise characteristic has been seen as desirable, or at least tolerated, in auditoria, especially in the USA (Barron, 2010). However, a flat RT curve has recently been favored in Europe, and even Beranek (2010) questioned his earlier recommendation after numerous and elaborate investigations. After measuring many performing venues with good sound quality, Fuchs and Steinke (2015) found that these buildings had a relatively flat frequency curve of RTs; hence they suggested that a BR close to 1 was more favorable for low-frequency sound. In addition, Adelman-Larsen (2015) emphasized the necessity of improving clarity by controlling the low-frequency RT in his analysis of large-scale venues.
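The bass ratio defined above can be computed directly from octave-band RTs. The following is a minimal sketch, assuming the usual form BR = (RT125 + RT250) / (RT500 + RT1000); the function name and the example RT values are illustrative and are not taken from the study.

```python
def bass_ratio(rt_by_band):
    """Return the bass ratio from octave-band RTs.

    rt_by_band maps octave-band centre frequency in Hz to RT in seconds.
    Assumed definition: (RT125 + RT250) / (RT500 + RT1000).
    """
    low = rt_by_band[125] + rt_by_band[250]
    mid = rt_by_band[500] + rt_by_band[1000]
    return low / mid


# Hypothetical RT curves (seconds), for illustration only.
bass_rise_room = {125: 1.2, 250: 1.0, 500: 0.8, 1000: 0.8}
flat_room = {125: 0.8, 250: 0.8, 500: 0.8, 1000: 0.8}

print(bass_ratio(bass_rise_room))  # inside the 1.1-1.5 bass-rise range
print(bass_ratio(flat_room))       # a flat RT curve gives a BR of 1
```

A BR inside 1.1-1.5 corresponds to the bass-rise characteristic, whereas a BR near 1 corresponds to the flat curve favored by Fuchs and Steinke (2015).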
Low-frequency sound in an auditorium can increase the feeling of warmth in the hall (Beranek, 1996); however, intelligibility is more important than warmth in a speech hall such as a classroom. Most Chinese acoustic standards are still based on the bass-rise RT characteristic (GB/T, 2005; GB, 2010; JGJ/T, 2012). Moreover, there are few acoustic standards for frequencies below 500 Hz. Chinese is a tonal language, unlike the intonation languages of Western countries. In Chinese, vowels are longer than consonants, and the low-frequency vowels easily mask the mid-high frequency consonants, which ultimately affects Chinese speech intelligibility in a room. Some studies have addressed such acoustic problems by applying low-frequency sound-absorbing structures and sound-absorbing materials in actual buildings. By adding low-frequency sound absorption structures in a classroom, Zha and Lyu (2020) reduced the low-frequency noise and flattened the low-frequency RT characteristics, thereby obtaining a satisfactory sound environment. Peng et al. (2020) compared the objective parameters and subjective speech perception in two classrooms with similar RTs and found that one classroom had better speech perception than the other. They suggested that this discrepancy might be due to a difference in low-frequency RT or in background noise level. In a follow-up study, Xu et al. (2021) used the Odeon software to calculate the RTs of these two classrooms and then carried out a Chinese speech intelligibility listening test over headphones, which eliminated the influence of background noise and confirmed that reducing the low-frequency RTs helps to improve Chinese speech intelligibility in the classroom.
Room acoustic simulation is an important part of the architectural design process because it is convenient and cost-saving. However, popular methods of predicting room acoustic characteristics are based on a geometric acoustic model of ray-like sound propagation, so such software is suitable only when wavelengths are small in comparison to the dimensions of the enclosure and internal objects. At larger wavelengths, the ray-like assumption no longer holds, and phenomena such as diffraction of low-frequency acoustic waves cannot be ignored (Southern et al., 2013). To solve the low-frequency sound problem, a method based on acoustic wave theory should be used. Moreover, the finite-difference time-domain (FDTD) method can simulate frequency-dependent boundaries with desired sound absorption characteristics through the digital impedance filter (Kowalczyk, van Walstijn, 2008). As the FDTD method has matured, it has shown its superiority in both simulation accuracy and calculation speed (Botteldooren, 1995; Oxnard, 2018). Even so, the FDTD method is very memory-intensive, especially when modeling large volumes or a wide frequency bandwidth such as the full range of hearing. Thus, this work combined FDTD with a geometric method to obtain a synthetic wide-band room impulse response (RIR) that includes the 63 Hz octave band.
This study aims to investigate the effect of different low-frequency RTs on speech intelligibility. To this end, it is preferable to hold the RTs at mid-high frequencies fixed through modeling and to change only the RTs at low frequencies. Using the combined method, four kinds of low-frequency reverberation characteristics were established, and the differences in Chinese speech intelligibility in a classroom before and after the improvement were compared.

Room
In universities, most classrooms are large and rectangular. For this study, a large university classroom with dimensions of 15.82 m × 8.22 m × 4.9 m (about 637 m³) was selected; it has two windows and two metal doors on the left side, four windows on the right side, and a blackboard on the front wall. All the walls and the ceiling are of plastered brick, the floor is covered with ceramic tiles, and the seats and desks are made of multi-layer plywood. The RIRs at six receiving positions in the classroom were recorded with a B&K 4189 microphone, using a B&K 4296 dodecahedral loudspeaker as an omnidirectional sound source. The sound source (S), at a height of 1.4 m above the floor, and the receiving positions (R1-R6), all at a height of 1.2 m, were arranged as shown in Fig. 1. During the measurement, the doors and windows were closed, and the RIRs were measured with the swept-frequency method in the unoccupied classroom. Afterwards, the RIRs were calculated with the Dirac 4.1 software, and objective acoustic parameters such as RT were obtained. The measured average RT of the six receiving positions was 3 s at 63 Hz, rising to 3.7 s at 125 Hz, and then decreasing gradually at higher frequencies. In the Odeon model, the sound absorption coefficients were set according to the material of each surface in the actual classroom and then slightly adjusted so that the average RT over the receiving positions was close to the measured average RT. The sound absorption coefficients of all surfaces in the Odeon model are listed in Table 1.

FDTD acoustic model
The FDTD scheme used in this study is a numerical solution of the wave equation that governs sound propagation in an ideal isotropic medium:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p, \tag{1}$$

where p denotes the sound pressure, c denotes the sound speed, which is set to 344 m/s, t is time in seconds, and ∇² is the 3D Laplacian operator. FDTD schemes for numerical simulation of the wave equation are derived by approximating the time and space derivatives with finite difference operators according to Kowalczyk and van Walstijn (2008). Assuming an equal distance between grid points in all directions, the 3D discretized wave equation takes the form of Eq. (2):

$$p^{n+1}_{l,m,i} = \kappa^2 \left( p^{n}_{l+1,m,i} + p^{n}_{l-1,m,i} + p^{n}_{l,m+1,i} + p^{n}_{l,m-1,i} + p^{n}_{l,m,i+1} + p^{n}_{l,m,i-1} \right) + 2\left(1 - 3\kappa^2\right) p^{n}_{l,m,i} - p^{n-1}_{l,m,i}, \tag{2}$$

where κ = cT/X denotes the Courant number, T is the time step, X is the grid spacing, l, m, and i denote the spatial indexes in the x, y, and z directions, and n is the time index. To ensure numerical stability, the stability condition κ ≤ 1/√3 must be satisfied. In addition, the grid spacing should not be too coarse; it is generally kept below one tenth of the shortest wavelength of interest. In accordance with the stability condition, the grid spacing was set to 0.06 m and the time step to 100.7 µs, and the derivative of a Gaussian function was chosen as the excitation source. Under these conditions, the spectrum of the excitation source is not flat over the frequency band, which would distort the listening material. To eliminate this effect, the RIRs calculated by the FDTD method were corrected by an inverse-filtering technique (Sakamoto et al., 2008).
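The parameter choices above can be checked numerically. The sketch below uses the values stated in the text (c = 344 m/s, X = 0.06 m, T = 100.7 µs) and assumes the standard leapfrog update for a rectilinear grid for the interior nodes; the function names are illustrative, and boundary nodes and the excitation source are omitted.

```python
import math

C = 344.0      # sound speed in m/s (from the text)
X = 0.06       # grid spacing in m (from the text)
T = 100.7e-6   # time step in s (from the text)


def courant_number(c, time_step, spacing):
    """Courant number kappa = c * T / X."""
    return c * time_step / spacing


def points_per_wavelength(c, freq_hz, spacing):
    """Number of grid points per wavelength at the given frequency."""
    return (c / freq_hz) / spacing


def interior_update(p_now, p_prev, l, m, i, kappa):
    """Leapfrog update of one interior node on a rectilinear grid.

    p_now and p_prev are nested lists indexed as p[l][m][i] and hold the
    pressure at time steps n and n-1. Boundary nodes are not handled.
    """
    neighbours = (
        p_now[l + 1][m][i] + p_now[l - 1][m][i]
        + p_now[l][m + 1][i] + p_now[l][m - 1][i]
        + p_now[l][m][i + 1] + p_now[l][m][i - 1]
    )
    centre = 2.0 * (1.0 - 3.0 * kappa ** 2) * p_now[l][m][i]
    return kappa ** 2 * neighbours + centre - p_prev[l][m][i]


kappa = courant_number(C, T, X)
print("kappa =", kappa, "stable:", kappa <= 1.0 / math.sqrt(3.0))
print("points per wavelength at 355 Hz:", points_per_wavelength(C, 355.0, X))
```

With these values the Courant number sits just below the 3D stability limit of 1/√3, and the grid resolves the 355 Hz crossover frequency with more than ten points per wavelength.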
In general, the reflected wave differs from the incident wave in phase and amplitude, and these changes vary with frequency. Because a digital filter also has a frequency-dependent amplitude and phase response, a frequency-dependent boundary can be incorporated in an FDTD model by means of a digital filter. Since the infinite impulse response (IIR) filter and the specific impedance of the boundary have a similar form, this study expressed the boundary in terms of an IIR filter. Kowalczyk and van Walstijn (2008) presented the FDTD formulation of the digital impedance filter (DIF) in a rectilinear grid, including the update formula for a boundary node (Eq. (3)), whose coefficients a, b, and g are defined in (Kowalczyk, van Walstijn, 2008) and are not explained here. In this FDTD model, DIFs combining Bessel high-pass and Bessel low-pass filters were used to obtain the sound absorption coefficient of each surface. From the filter parameters a and b, the corresponding specific acoustic impedance (ξ) can be obtained. According to the relationship among the specific acoustic impedance, the reflection coefficient, and the absorption coefficient (α), the absorption coefficient then follows from ξ (Eq. (4)).
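The link between specific acoustic impedance and absorption can be illustrated with the standard normal-incidence relations for a locally reacting surface, R = (ξ − 1)/(ξ + 1) and α = 1 − |R|². This is a sketch under the assumption of normal incidence; the exact form of Eq. (4) used in the study may differ, and the function names are illustrative.

```python
def reflection_coefficient(xi):
    """Normal-incidence pressure reflection coefficient for specific impedance xi."""
    xi = complex(xi)
    return (xi - 1.0) / (xi + 1.0)


def absorption_coefficient(xi):
    """Absorption coefficient alpha = 1 - |R|^2 (normal incidence assumed)."""
    return 1.0 - abs(reflection_coefficient(xi)) ** 2


# Illustrative impedance values, not taken from the study.
for xi in (1.0, 3.0, 50.0, 3.0 + 2.0j):
    print(xi, absorption_coefficient(xi))
```

A surface matched to the air impedance (ξ = 1) absorbs everything, while a very stiff surface (large ξ) reflects almost all incident energy, which is why hard plastered walls give long low-frequency RTs.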

Verification
The RIRs obtained by the FDTD method were low-pass filtered with a cut-off frequency of 355 Hz (the upper band edge of the 250 Hz octave band), while the RIRs obtained with the Odeon software were high-pass filtered with the same cut-off frequency of 355 Hz. These two sets of simulation results were then combined to obtain synthetic RIRs over the entire audible spectrum.
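The band splitting and summation described above can be sketched as follows. The text does not state the filter type, so this sketch assumes a linear-phase windowed-sinc FIR low-pass and its delay-complementary high-pass, and it assumes that both RIRs are already time-aligned and share one sampling rate. All function names are illustrative.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR with unity DC gain (odd length)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # normalised cut-off in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def complementary_highpass(lp_taps):
    """High-pass taps chosen so that low-pass + high-pass equals a pure delay."""
    mid = (len(lp_taps) - 1) // 2
    hp = [-t for t in lp_taps]
    hp[mid] += 1.0
    return hp


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def combine_rirs(rir_low, rir_high, fs_hz, crossover_hz=355.0, num_taps=101):
    """Low-pass the wave-based RIR, high-pass the geometric RIR, and sum them."""
    lp = lowpass_fir(crossover_hz, fs_hz, num_taps)
    hp = complementary_highpass(lp)
    n = max(len(rir_low), len(rir_high))
    low = list(rir_low) + [0.0] * (n - len(rir_low))
    high = list(rir_high) + [0.0] * (n - len(rir_high))
    return [a + b for a, b in zip(convolve(low, lp), convolve(high, hp))]
```

Because the two filters sum to a pure delay, feeding the same signal into both branches returns that signal shifted by (num_taps − 1)/2 samples, which is a convenient sanity check before combining two different RIRs.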
To verify the accuracy of the DIF boundary, a classroom model named Model A was established, whose boundary absorption coefficients were adjusted so that the average RTs of the six receiving positions were close to those measured in the classroom. Based on Model A, Model B with a relatively flat RT at low frequencies was established. According to the specification (GB, 2010), the RT in a classroom larger than 300 m³ should be lower than 0.8 s. Therefore, Model C and Model D, both with an average RT of 0.8 s at mid-high frequencies, were established; Model C has a rising RT and Model D a flat RT at low frequencies. The RTs of these four models are shown in Fig. 3, and the sound absorption coefficients of the FDTD models are listed in Table 2. The absorption coefficients of the FDTD models were obtained through Eq. (4). Most of the sound absorption treatment was applied to the plastered brick walls and

Speech intelligibility test
As the simulation results of Model A were consistent with the measured results, this study conducted listening tests only on the simulated models. In a quiet room where the background noise level was below 30 dBA, the subjective evaluation of speech intelligibility was conducted using the signals prepared for each receiving position. These signals were based on the Mandarin Chinese speech intelligibility test word lists specified by GB 15508-1995 (GB, 1995). Each receiving position used two different lists, one read by a male speaker and one by a female speaker. Each list has 75 syllables, which are randomly divided into 25 rows of three syllables with no coherent meaning. The lists are balanced in difficulty and phonemic characteristics, so that each consonant, vowel, or tone appears with the same frequency in each list. Each row is preceded by a carrier phrase, for example "the tenth row is ā, ér, jìng", where the carrier phrase indicates the row number and "ā, ér, jìng" stands for the three syllables. The words were recorded in an anechoic chamber at a rate of about 4 syllables per second. There is a pause of about 5 seconds between rows for the listener to write down the syllables. The test word lists were convolved with the simulated binaural RIRs using the Cool Edit Pro software.
Fourteen graduate students aged 22-30, who were trained and familiar with the Chinese phonetic alphabet, participated in this speech intelligibility test. All participants were native speakers of Mandarin Chinese and had absolute thresholds below 15 dB HL at octave frequencies between 125 Hz and 8000 Hz. For each test condition, up to four participants could take the test at the same time through an HP-S4 headphone amplifier, and each participant wore the same type of headphones (Sennheiser HD580) at a speech sound pressure level of 60-65 dBA. The listening material was played through the Cool Edit Pro software and controlled by the tester. Finally, the testers scored the results against the correct answers. As the tonal error rate was very low and some subjects accidentally mismarked tones, the tonal results are not discussed in this study. A syllable is counted as correct only if both its consonant and its vowel are correct. The percentage of correct syllables is calculated for each list, and the average over all participants is the Chinese speech intelligibility score of each receiving position. The consonant score depends only on the percentage of correct consonants, regardless of whether the accompanying vowel is correct. Similarly, the vowel score does not depend on whether the consonant is correct.
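The scoring rules above can be sketched as follows. The data layout is an assumption made for illustration: each syllable is a (consonant, vowel) pair with tones ignored, and a syllable without an initial consonant is written with an empty string, which is counted like any other consonant. How the study scored such syllables is not stated, and the function names are hypothetical.

```python
def score_list(responses, answers):
    """Return percentage scores for syllables, consonants, and vowels.

    responses and answers are equal-length lists of (consonant, vowel) pairs.
    A syllable is correct only if both parts match; consonants and vowels
    are scored independently of each other.
    """
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    n = len(answers)
    consonants = sum(r[0] == a[0] for r, a in zip(responses, answers))
    vowels = sum(r[1] == a[1] for r, a in zip(responses, answers))
    syllables = sum(r == a for r, a in zip(responses, answers))
    return {
        "syllable": 100.0 * syllables / n,
        "consonant": 100.0 * consonants / n,
        "vowel": 100.0 * vowels / n,
    }


def position_score(per_participant_scores, element="syllable"):
    """Average one element's score over all participants at a receiving position."""
    values = [s[element] for s in per_participant_scores]
    return sum(values) / len(values)


# Hypothetical four-syllable excerpt, for illustration only.
answers = [("j", "ing"), ("", "a"), ("", "er"), ("zh", "ang")]
responses = [("j", "in"), ("", "a"), ("", "er"), ("z", "ang")]
print(score_list(responses, answers))
```

In this excerpt one vowel and one consonant are wrong in different syllables, so the consonant and vowel scores are both higher than the syllable score, which mirrors the ordering reported in the results.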

Results
Figure 4 shows the Chinese speech intelligibility scores at each receiving position for the four models. The scores of Model B are higher than those of Model A at every receiving position, and the scores of Model D are higher than those of Model C at every receiving position, which indicates that a flat low-frequency RT yields a higher intelligibility score than a rising low-frequency RT. In addition, the standard deviations of Model A and Model B at each receiving position are generally larger than those of Model C and Model D. The scores of Model C and Model D are much higher than those of Model A and Model B, which indicates that the RT characteristic of the original classroom is unsatisfactory. A repeated-measures analysis of variance shows that the model (F(3, 234) = 400.927) and the receiving position (F(5, 78) = 49.377) have significant effects on the Chinese speech intelligibility scores (p < 0.001). As there is a significant interaction between the model and the receiving position (p < 0.001), paired comparisons of the simple effects are needed.
The paired comparisons indicate that the speech intelligibility scores of Model C and Model D are significantly higher than those of Model A and Model B (p < 0.05). For receiving positions R1, R2, R5, and R6, there is a significant difference between Model A and Model B (p < 0.05), whereas there is no significant difference between them for receiving positions R3 and R4 (p > 0.05). Between Model C and Model D, there is no significant difference for receiving positions R1, R3, R4, and R6, while there is a significant difference for receiving positions R2 and R5. Therefore, at most positions, flattening the low-frequency characteristics in the classroom with an RT (500-1000 Hz) of 2 s significantly improves speech intelligibility. Flattening the low-frequency characteristics in the classroom with an RT (500-1000 Hz) of 0.8 s also improves speech intelligibility, but at most positions not significantly. That is, it is more important to reduce the RT over the entire frequency range than to reduce the low-frequency RT only.

Discussion
To further explore the effect of low-frequency RT on speech intelligibility, the results are analyzed below in terms of syllables. Mandarin Chinese speech sounds range from very low frequencies (about 100-125 Hz) to very high frequencies (above 10 kHz or 12 kHz for some sounds). A Chinese syllable contains either an initial consonant and a vowel, or a vowel only. The vowels are low in frequency and high in sound energy, while the initial consonants are much higher in frequency and lower in sound energy (Wu, 1964). Gelfand (1998) wrote that low frequencies tend to be effective maskers over a very wide range of frequencies, while higher frequencies are not good maskers of low frequencies. When the low-frequency RT is much longer than the mid-frequency RT, low-frequency reverberant sounds can be emphasized by room modes and then mask speech sounds in a classroom (Wu et al., 2014). In this speech intelligibility test, each row contains three syllables and is preceded by a carrier phrase. Therefore, because of the differences in RTs, the position of a syllable within the row and the carrier phrase will affect the test results. The speech intelligibility scores of the syllable, the consonant, and the vowel (SCV) for each syllable order are shown in Fig. 5. The analyses of variance of the speech intelligibility scores averaged over the six receiving positions show that the model (F(3, 468) = 614.176), SCV (F(2, 468) = 419.248), and the syllable order (F(2, 468) = 14.972) have significant effects on the scores (p < 0.001). Figure 5 shows that the scores of Model B are higher than those of Model A, and those of Model D are higher than those of Model C, which confirms that the lower the low-frequency RT, the higher the speech intelligibility score. Because the carrier phrase before the first syllable is fixed and followed by a slight pause, the first syllable is less affected by preceding sounds. Hence, the scores of syllables, consonants, and vowels in the first position are significantly higher than those in the second and third positions (p < 0.001). The scores of vowels and consonants are significantly higher than those of syllables (p < 0.001), and the scores of vowels are significantly higher than those of consonants (p < 0.001).

There is a significant interaction between the model and SCV (p < 0.001), and the other interactions are not significant (p > 0.05). For the same SCV element, there is a significant difference between every two models (p < 0.05); however, the SCV elements do not all differ significantly from each other within every model. For Models A and B, there is a significant difference between every two SCV elements (p < 0.001). In Model A and Model B, the average score of consonants is lower than the average score of vowels in the same model (Model A: consonants 72%, vowels 76%; Model B: consonants 76%, vowels 81%). The average RTs of Model A and Model B at 500-1000 Hz are both 2 s, while the reading speed is about 4 syllables per second, so the RT is much longer than the time interval between two syllables, which causes serious interference. Since both low and higher frequencies have a masking effect on the high frequencies, the preceding syllables in Model A and Model B strongly mask the higher-frequency sound of the following syllables. Therefore, the scores of consonants are lower than those of vowels in Model A and Model B. For Model C and Model D, there is no significant difference between vowels and consonants (p > 0.05), but there is a significant difference between vowels and syllables (p < 0.001) and between consonants and syllables (p < 0.001). In Model C and Model D, the average scores of the initial consonants and the vowels in the same model are essentially the same (Model C: consonants 88%, vowels 88%; Model D: consonants 91%, vowels 91%). Because the average RTs of Model C and Model D at 500-1000 Hz are only 0.8 s, the RT is too short to cause an obvious masking effect. Therefore, the scores of consonants are essentially the same as those of vowels in Model C and Model D.
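The timing argument above can be made concrete with a rough estimate. The sketch assumes an ideal exponential decay, in which the level drops by 60 dB over one RT, and takes the syllable interval as 1/4 s from the stated reading rate; the function name is illustrative, and real decays in the classroom need not be this regular.

```python
def decay_within_interval_db(rt_s, interval_s):
    """Level drop in dB over interval_s for an ideal decay of 60 dB per RT."""
    return 60.0 * interval_s / rt_s


syllable_interval_s = 1.0 / 4.0  # about 4 syllables per second (from the text)

for label, rt_s in (("Models A and B, RT = 2 s", 2.0), ("Models C and D, RT = 0.8 s", 0.8)):
    drop = decay_within_interval_db(rt_s, syllable_interval_s)
    print(label, "->", round(drop, 2), "dB decay before the next syllable")
```

Under this assumption the reverberant tail of a syllable has dropped by only about 7.5 dB when the next syllable starts in Models A and B, compared with about 18.75 dB in Models C and D, which is consistent with the stronger masking of consonants observed in Models A and B.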
The above findings indicate that vowel sounds have a certain masking effect on consonant sounds, that is, low-frequency sound has a certain masking effect on mid-high frequency sound, which is consistent with the well-known "upward spread of masking" (Oxenham, Plack, 1998). The results of this study show that an excessively long reverberation time in the low-frequency band significantly degrades speech intelligibility. To improve the speech intelligibility of a classroom, the average RT over the entire frequency range should be reduced to diminish the masking effect, and the low-frequency RT should preferably not rise.

Conclusions
In this study, four models based on a large classroom were established by a hybrid method. The synthetic RIRs of the classroom models were obtained through FDTD modeling at low frequencies and Odeon simulation at higher frequencies. The Chinese speech intelligibility listening test was conducted after verification. The average scores at each receiving position show that the scores of Model B are higher than those of Model A, and those of Model D are higher than those of Model C. The scores of syllables, consonants, and vowels in the first position are significantly higher than those in the second and third positions. The scores of vowels and consonants are significantly higher than those of syllables. The scores of consonants are lower than those of vowels in Model A and Model B, while the scores of consonants are essentially the same as those of vowels in Model C and Model D. These results indicate that, to obtain better speech intelligibility, the RT over the entire frequency range should be reduced and the low-frequency RT should preferably be flat.

Fig. 1. Location of source and receiving points in the Odeon model.

Figure 2 shows that the simulated average RTs of the six receiving positions are basically consistent with the measured results, lying within ±5% of the measurements. This comparison verifies the effectiveness of the combined simulation method.

Fig. 5. Speech intelligibility scores of SCV. In the figure, 1S, 2S, and 3S denote the syllable in the first, second, and third order, respectively; 1C, 2C, and 3C denote the consonant in the first, second, and third order, respectively; 1V, 2V, and 3V denote the vowel in the first, second, and third order, respectively.

Table 1. Sound absorption coefficients of the classroom in the Odeon model.

Table 2. Sound absorption coefficients of the classroom in the FDTD models. "The rest models" refers to the remaining models, which share the same absorption coefficient.