Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

This work aims to compensate for the feature sparsity and insufficient discriminative acoustic features that affect existing short-duration speaker recognition. To address this issue, we propose the Bark-scaled Gauss and linear filter bank superposition cepstral coefficients (BGLCC) and the multi-dimensional central difference (MDCD) acoustic feature extraction method. The Bark-scaled Gauss filter bank focuses on low-frequency information, while the linear filter bank is uniformly distributed; their superposition therefore yields richer and more discriminative acoustic features from short-duration audio signals. In addition, the multi-dimensional central difference method captures the dynamic features of speakers more effectively, improving the performance of short-utterance speaker verification. Extensive experiments are conducted on short-duration text-independent speaker verification datasets generated from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of diverse lengths and different scenarios. The results demonstrate that the proposed method outperforms existing acoustic feature extraction approaches by at least 10% on the test sets. The ablation experiments further illustrate that the proposed approaches achieve substantial improvements over prior methods.


Introduction
Speaker recognition, one of the most popular biometric technologies (Wu et al., 2016), is widely used in fields such as access control, forensic evidence provision, security, and telephone banking user authentication (Vogt et al., 2010). The purpose of speaker recognition, which comprises speaker verification and speaker identification, is to recognize the claimed identity of a speaker (Campbell, 1997). One of its main tasks is to determine whether a test voice sample from a speaker should be accepted. After decades of development, speaker verification has been extensively studied, and recognition systems achieve relatively satisfactory performance, provided that the enrollment and test utterances are long enough and the signal-to-noise ratio (SNR) is high enough (Zinchenko et al., 2017; Greenberg et al., 2013; Kinnunen, Li, 2010). However, in some application scenarios it is not easy to collect suitable speech, and current speaker verification systems suffer a significant drop in recognition rate under short-utterance conditions (Nosratighods et al., 2010). A short-duration utterance contains insufficient acoustic characteristics. Obtaining enough speech data is difficult in many real-world applications: users are reluctant to provide long recordings, especially during the testing phase, for instance in phone banking, and in other cases, such as forensic and security applications, collecting enough data is simply infeasible. The performance degradation caused by insufficient data is called the short-duration issue.
Current speaker recognition systems have achieved great success and perform well when the enrollment and test data are sufficiently long. Traditional acoustic feature extraction methods are therefore designed for long-duration speech, and their filter arrangements mainly focus on the low-frequency domain. This makes high-frequency features even sparser for short-duration speech, although high-frequency information best represents timbre and detail (Huang, Pun, 2020). At the same time, traditional acoustic features include few dynamic characteristics of speakers, so fewer discriminative features are extracted for speaker recognition. Research on discriminative feature compensation for the more challenging short-duration text-independent speaker recognition task has been in growing demand lately, and it is the focus of this work.
Although traditional speaker models exhibit clear feature specificity, the number of features is too small, which makes them susceptible to noise interference and leads to poor recognition performance. Acoustic feature extraction should therefore be designed to extract highly discriminative embeddings more effectively from short-duration audio. How to improve the effectiveness of discriminative acoustic feature extraction under short-utterance conditions is thus an urgent problem.
To address these problems, we propose the following solution. In the Bark-scaled Gauss filter bank feature extraction method, the filter distribution emphasizes the low-frequency bands, which portray the low-frequency spectrum of speech in great detail. By contrast, the Bark-scaled Gauss filters place less emphasis on the high-frequency bands, so useful information in the high-frequency domain is easily lost, even though high-frequency details enhance the information about a speaker's timbre. To recover this valuable high-frequency information, the Bark-scaled Gauss and linear filter bank superposition cepstral coefficients (BGLCC) are proposed to portray high-frequency details more precisely. Whereas the filter bank of conventional acoustic feature extraction emphasizes the low-frequency band, the linear triangular filters are uniformly distributed, which remedies the sparse high-frequency information and insufficient feature extraction caused by the uneven distribution of a single filter bank. Integrating the advantages of both and constructing hybrid feature parameters is thus a way to alleviate the feature sparsity problem.
Moreover, to capture the dynamic features of speakers more effectively, we propose multi-dimensional central difference (MDCD) features computed on the BGLCC feature matrix, further improving the performance of short-utterance speaker recognition. The MDCD features are central differences taken along multiple dimensions of the time-frequency plane. Different speakers pronounce the same word or sentence in different ways. The proposed MDCD feature concatenates information about the speaker from four different dimensions, which explains why it performs significantly better than traditionally used speech features in speaker recognition tasks under various conditions. The MDCD features thus further compensate for the limited and sparse dynamic acoustic characteristics of short-duration audio signals.

Related works
To enhance the performance of short-duration speaker recognition algorithms, several approaches have been presented in previous research. In terms of front-end acoustic feature extraction, the vast majority of existing methods are based on some form of the short-term frequency spectrum, such as Mel-frequency cepstral coefficients (MFCCs) (Herrera-Camacho et al., 2019; Paseddula, Gangashetty, 2018), linear prediction cepstral coefficients (LPCCs) (Yang et al., 2019; Atal, 1974), and constant Q cepstral coefficients (CQCCs) (Todisco et al., 2017). For instance, MFCC and LPCC have been judiciously combined for short-duration speaker recognition (Chowdhury, Ross, 2020), under the hypothesis that MFCC and LPCC capture two different aspects of speech, namely speech perception and speech production. On the model side, GMM-UBM speaker recognition from MFCC features has been studied under limited enrollment and test data (Omar, Pelecanos, 2010). The i-vector approach and factor analysis subspace estimation were introduced by Kenny et al. (2005) and Dehak et al. (2010) to reduce the number of redundant model parameters, resulting in more accurate speaker models. Other approaches attempt to increase performance by selecting segments with better discriminability based on speaker features (Nosratighods et al., 2010), or by a hybrid GMM and CNN method (Liu et al., 2018) that performs an initial alignment of short-utterance features to improve short-utterance speaker recognition. In these works, the front-end feature extraction for model training, testing, and inference is based on Fourier-transform Mel-triangular filtering and linear prediction cepstral coefficients.
With further developments in deep learning, various methods for speaker recognition and short-utterance speaker recognition have been proposed. Povey et al. (2018) proposed the factorized time delay neural network (F-TDNN), which divides the parameter matrix of the TDNN into smaller matrices to increase training effectiveness, and the extended time delay neural network (E-TDNN) was proposed by Snyder et al. (2019); its broader and deeper network structure allows more information to be learned. Both improve speaker recognition performance significantly. In (Villalba et al., 2020), systems based on F-TDNN and E-TDNN obtained the best results in the SRE18 and SITW speaker evaluations. In addition, a method featuring aggregation information, channel attention, and propagation was proposed for TDNN-based speaker verification (ECAPA-TDNN) (Desplanques et al., 2020), further improving the robustness of speaker recognition. After years of development, the performance of short-utterance speaker recognition has improved considerably, but it is still unsatisfactory in some complex scenarios.
Most of the aforementioned methods benefit from model optimization, enhanced data characteristics, and more discriminative feature extraction. With 5-10 seconds of speech, they all improve speaker recognition performance, but they still face significant challenges as the audio becomes shorter.
Generally speaking, two types of features dominate, namely linear prediction cepstral coefficients (LPCCs) and Mel-frequency cepstral coefficients (MFCCs), but both suffer a drop in performance under short-duration conditions. There is as yet no reasonably good short-duration speaker verification model, no feature extraction method that obtains sufficient and discriminative speaker information from short-duration speech signals, and no better training method.

Contribution
To compensate for the difficulty of capturing discriminative short-utterance features and the insufficiency of discriminative acoustic features, we propose a filter superposition-based multi-dimensional central difference acoustic feature extraction method for feature compensation and enhancement in short-duration speaker recognition. The proposed method significantly improves the performance and accuracy of short-duration speaker recognition systems.
The contributions of this paper are as follows:
- we propose the Bark-scaled Gauss and linear filter bank superposition acoustic feature extraction method, which compensates for the sparse filters and sparse features in the high-frequency region of short utterances; by providing rich timbre information, it improves the performance of short-utterance speaker recognition;
- we propose the multi-dimensional central difference method for capturing the dynamic features of speakers, which simulates real speech and enhances the diversity of acoustic features under limited speech data.

Organization
This paper is organized as follows. Section 2 details the proposed filter superposition-based multi-dimensional central difference discriminative acoustic feature extraction method. Then we analyze the experiments and results of the proposed method in Sec. 3. Finally, the conclusion is given in Sec. 4.

Proposed method
In this section we elaborate on the proposed discriminative acoustic feature extraction technique, whose design is based on the Bark-scaled Gauss and linear filter bank superposition algorithm, followed by the multi-dimensional central difference dynamic feature extraction method applied to the BGLCC feature matrix. The effect of the introduced BGLCC and MDCD feature extraction is also analyzed mathematically.

BGLCC feature extraction method
The speech signal first passes through a high-pass pre-emphasis filter, which is equivalent to

y(n) = x(n) - a x(n-1),

where a is the pre-emphasis coefficient, chosen in the interval [0.95, 0.97]; it increases the energy of higher frequencies.
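As a brief illustration, the first-order pre-emphasis filter above can be sketched in a few lines of numpy; the default coefficient a = 0.97 is one common choice from the stated interval.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """Apply the first-order high-pass filter y(n) = x(n) - a*x(n-1).

    The first sample is passed through unchanged (no predecessor exists).
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])

y = pre_emphasis(np.array([1.0, 2.0, 3.0, 4.0]))
```

Because the filter subtracts a scaled copy of the previous sample, slowly varying (low-frequency) content largely cancels while rapid (high-frequency) variations are preserved.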
The framed speech signals are smoothed with the Hamming window

w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1,

where N is the frame length. In speech processing, the Bark-frequency cepstrum (BFC) describes the short-term power spectrum of speech transformed onto the Bark scale of frequency. In contrast to the well-known Mel-scaled triangular filter, the proposed Bark-scaled Gauss filter has a smoother response and enhances the correlation between adjacent sub-bands. The coefficients are derived from a cepstral representation of the speech clip. The frequency response of the Bark-scaled Gauss filter bank is a Gaussian

H_b^Bark(k) = exp(-(k - f(b))^2 / (2 sigma_b^2)),

where sigma_b is the standard deviation and f(b) is the b-th filter boundary point (Bark-scaled center frequency), defined with the shape parameter alpha equal to 2.0. The signal band covers 24 critical bands, whose centers are the Bark center frequencies; this is the Bark domain.
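A rough numpy sketch of such a Bark-scaled Gaussian filter bank is shown below. The Bark mapping uses Zwicker's well-known approximation; the uniform Bark spacing of the centers and the rule tying the width sigma to that spacing are assumptions of this sketch, not the paper's exact design.

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker-style approximation of the Bark scale
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_gauss_filterbank(num_filters=24, nfft=512, sr=16000):
    """Gaussian filters centred at uniformly spaced points on the Bark scale.

    Assumption: sigma is half the Bark spacing between adjacent centres.
    Returns an array of shape (num_filters, nfft // 2 + 1).
    """
    freqs = np.linspace(0.0, sr / 2.0, nfft // 2 + 1)
    bark_freqs = hz_to_bark(freqs)
    bark_max = bark_freqs[-1]
    centres = np.linspace(bark_max / (num_filters + 1),
                          bark_max * num_filters / (num_filters + 1),
                          num_filters)
    sigma = (centres[1] - centres[0]) / 2.0
    return np.exp(-0.5 * ((bark_freqs[None, :] - centres[:, None]) / sigma) ** 2)
```

Because the centers are equally spaced in Bark, the filters are dense (narrow in Hz) at low frequencies and progressively wider at high frequencies, matching the low-frequency emphasis described above.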
Next, the linear triangular filter bank is described. The power spectrum is processed along frequency by a uniformly distributed bank in which each filter is triangular:

H_l^Linear(k) = (k - f(l-1)) / (f(l) - f(l-1)) for f(l-1) <= k <= f(l),
H_l^Linear(k) = (f(l+1) - k) / (f(l+1) - f(l)) for f(l) <= k <= f(l+1),
and 0 otherwise,

where f(l) is the center frequency of the l-th filter, 0 <= l < L, and L = 24 is the number of filters.
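A minimal sketch of a uniformly spaced triangular filter bank, with centers evenly placed in Hz rather than on the Mel scale, could look as follows (the FFT size and sampling rate are illustrative defaults).

```python
import numpy as np

def linear_triangular_filterbank(num_filters=24, nfft=512, sr=16000):
    """Triangular filters with centres spaced uniformly in Hz.

    Returns an array of shape (num_filters, nfft // 2 + 1).
    """
    edges = np.linspace(0.0, sr / 2.0, num_filters + 2)       # L + 2 boundary points
    bins = np.floor((nfft + 1) * edges / sr).astype(int)      # Hz -> FFT bin index
    bank = np.zeros((num_filters, nfft // 2 + 1))
    for l in range(num_filters):
        left, centre, right = bins[l], bins[l + 1], bins[l + 2]
        for k in range(left, centre):                         # rising slope
            bank[l, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                        # falling slope
            bank[l, k] = (right - k) / max(right - centre, 1)
    return bank
```

Unlike a Mel or Bark bank, every filter here has the same bandwidth in Hz, which is what preserves resolution in the high-frequency region.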
We use more filter bands than usual because the resolution of the high-frequency domain is essential for timbre; this yields the linear filter features. The raw speech signal x(n) is preprocessed to obtain the windowed frames x_w(n). The fast Fourier transform then converts each framed speech signal from the time domain to the frequency domain:

X(i, k) = sum_{n=0}^{N-1} x_w(n) e^{-j 2 pi k n / N},

where x_w(n) is the windowed signal and i is the index of the speech frame. The power spectrum is calculated as

E(i, k) = |X(i, k)|^2 / N.

The Bark-scaled Gauss and linear filter bank superposition features are then extracted from this power spectrum. The BGLCC power is given by

S(i, t) = sum_k E(i, k) H_t(k),

where the superposed bank {H_t} stacks the u Bark-scaled Gauss filters H_b^Bark(k) and the v linear triangular filters H_l^Linear(k) into 48 channels in total; t denotes the t-th filter of the superposed bank, b the b-th Bark-scaled Gauss filter, and l the l-th linear triangular filter. S(i, t) is thus obtained by multiplying the power spectrum E(i, k) by the superposition of the Bark-scaled Gauss and linear triangular filters in the frequency domain.
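The framing, FFT, and filter-bank projection steps can be sketched as below; the 25 ms / 10 ms framing at 16 kHz matches the feature extraction settings reported later, and the random stand-in bank is only a placeholder for the stacked 48-channel Bark-Gauss plus linear bank.

```python
import numpy as np

def frame_signal(x, flen=400, hop=160):
    """Split a signal into overlapping frames (25 ms window, 10 ms hop at 16 kHz)."""
    n = 1 + (len(x) - flen) // hop
    return np.stack([x[i * hop:i * hop + flen] for i in range(n)])

def power_spectrum(frames, nfft=512):
    """E(i, k) = |FFT of the Hamming-windowed frame|^2 / nfft."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, n=nfft)) ** 2 / nfft

# The superposed 48-channel bank stacks 24 Bark-Gauss and 24 linear filters;
# S(i, t) = sum_k E(i, k) * H_t(k) is just a matrix product S = E @ bank.T.
x = np.random.default_rng(0).standard_normal(16000)   # one second of noise
E = power_spectrum(frame_signal(x))
bank = np.random.rand(48, 257)                        # stand-in for the real bank
S = E @ bank.T                                        # BGLCC power, (frames, 48)
```

Expressing the filtering as a single matrix product makes the per-frame cost of the superposed bank the same as that of a conventional single bank with the same number of channels.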
The BGLCC coefficients are finally obtained by a discrete cosine transform of the logarithm of the filter bank energies:

BGLCC(i, r) = sum_{t=1}^{T} log S(i, t) cos(pi r (2t - 1) / (2T)),

where S(i, t) is the BGLCC power, i denotes the i-th frame, r is the spectral line index after the discrete cosine transform, t denotes the t-th superposition filter, and T = 48 is the number of superposition filters. The processing of the Bark-scaled Gauss and linear filter bank superposition features (BGLCC) is shown in Fig. 1.
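The final log-compression and DCT step can be sketched with an orthonormal DCT-II implemented directly in numpy; keeping 16 coefficients per frame is an illustrative choice consistent with the feature dimensions described later.

```python
import numpy as np

def dct2(x, num_ceps=16):
    """Orthonormal DCT-II along the last axis, keeping the first num_ceps coefficients."""
    n = x.shape[-1]
    k = np.arange(num_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    scale = np.full((num_ceps, 1), np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return x @ (scale * basis).T

# hypothetical 48-channel filter bank outputs for 10 frames
log_energies = np.log(np.random.default_rng(1).random((10, 48)) + 1e-8)
cepstra = dct2(log_energies)        # (10, 16) BGLCC-style coefficients
```

The DCT both decorrelates the strongly overlapping filter outputs and concentrates the information into the leading coefficients, which is why only the first few are kept.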

MDCD dynamic feature extraction method
The proposed multi-dimensional central difference dynamic feature extraction method is applied along different dimensions of the BGLCC time-frequency matrix, where the horizontal dimension is the time axis and the vertical dimension is the frequency axis; it captures the time-domain correlation of speech and the correlation between high and low frequencies for a speaker. Similarly, the central difference of linear regression is applied along the principal and counter diagonals of the time-frequency matrix, so it can also capture the speaker's voiceprint.
The process of the proposed method is shown in Fig. 1: MDCD dynamic features are extracted along different dimensions of the BGLCC time-frequency matrix. First, a series of pre-processing steps converts each frame of the input signal from a time-domain to a frequency-domain representation. Next, the proposed Bark-scaled Gauss and linear filter bank superposition divides the spectrum into frequency bands and log compression is applied. Then, the multi-dimensional central difference produces four types of features from the BGLCC time-frequency matrix M, calculated as in Eqs. (11)-(14):

time domain: d_t(t, f) = sum_{k=1}^{h} k [M(t+k, f) - M(t-k, f)] / (2 sum_{k=1}^{h} k^2),

frequency domain: d_f(t, f) = sum_{k=1}^{h} k [M(t, f+k) - M(t, f-k)] / (2 sum_{k=1}^{h} k^2),

counter-diagonal domain: d_c(t, f) = sum_{k=1}^{h} k [M(t+k, f-k) - M(t-k, f+k)] / (2 sum_{k=1}^{h} k^2),

principal-diagonal domain: d_p(t, f) = sum_{k=1}^{h} k [M(t+k, f+k) - M(t-k, f-k)] / (2 sum_{k=1}^{h} k^2).

In these equations, h = 2, since the central difference of linear regression is applied. Here, t stands for the time axis, f for the frequency axis, and M is the matrix along whose different axes the differences are taken.
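The four-direction differencing can be sketched as follows; the regression-style weighting with h = 2 matches the standard delta-feature formula, while the edge padding at the matrix borders is an implementation assumption of this sketch.

```python
import numpy as np

def axis_delta(M, axis, h=2):
    """Regression-based central difference (delta) along one axis, edge-padded."""
    denom = 2.0 * sum(k * k for k in range(1, h + 1))
    pad = [(0, 0), (0, 0)]
    pad[axis] = (h, h)
    P = np.pad(M, pad, mode='edge')
    n = M.shape[axis]
    d = np.zeros_like(M, dtype=float)
    for k in range(1, h + 1):
        fwd = np.take(P, np.arange(h + k, h + k + n), axis=axis)
        bwd = np.take(P, np.arange(h - k, h - k + n), axis=axis)
        d += k * (fwd - bwd)
    return d / denom

def diag_delta(M, sign=1, h=2):
    """Central difference along the principal (sign=+1) or counter (sign=-1) diagonal."""
    denom = 2.0 * sum(k * k for k in range(1, h + 1))
    P = np.pad(M, h, mode='edge')
    T, F = M.shape
    d = np.zeros_like(M, dtype=float)
    for k in range(1, h + 1):
        fwd = P[h + k:h + k + T, h + sign * k:h + sign * k + F]
        bwd = P[h - k:h - k + T, h - sign * k:h - sign * k + F]
        d += k * (fwd - bwd)
    return d / denom

def mdcd(M, h=2):
    """Concatenate time, frequency, principal- and counter-diagonal differences."""
    return np.concatenate([axis_delta(M, 0, h), axis_delta(M, 1, h),
                           diag_delta(M, 1, h), diag_delta(M, -1, h)], axis=1)
```

On a matrix that grows linearly in time, the time-domain delta recovers the constant slope while the frequency-domain delta is zero, which is the behaviour one expects of direction-selective difference features.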
The central differences in the time and frequency domains better capture the contours of the speaker's formants. The central differences along the principal and counter diagonals of the matrix capture information about how each speaker utters the phonemes of the text. The central differences along different dimensions of the time-frequency spectrum can thus be regarded as multi-dimensional dynamic speaker information, which explains the excellent results of the proposed MDCD features. To reduce the dimensionality of the resulting MDCD features, we compress and decorrelate them with the DCT.
We perform short-duration speaker verification with the proposed BGLCC-MDCD acoustic features and a 34-layer ResNet as the backbone model. The detailed configuration is listed in Table 1.
Experiments and analysis

Experiments
The short-duration speaker verification experiments presented in this paper are conducted on three well-known speaker recognition corpora covering different scenarios: VoxCeleb, SITW, and NIST SRE. The short-duration text-independent datasets are generated from these corpora, respectively. After removing silence frames with an energy-based VAD, the speech utterances are chopped into short segments (ranging from 0.25 to 10 seconds). This illustrates the efficiency of our proposed method under short-duration audio conditions.
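As a rough sketch of the data preparation, an energy-based VAD followed by fixed-length chopping might look like the code below; the non-overlapping 25 ms frames and the -40 dB threshold relative to the loudest frame are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def energy_vad(signal, sr=16000, frame_ms=25, threshold_db=-40.0):
    """Drop non-overlapping frames whose energy is threshold_db below the peak frame."""
    flen = int(sr * frame_ms / 1000)
    n = len(signal) // flen
    frames = signal[:n * flen].reshape(n, flen)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    keep = energy_db > energy_db.max() + threshold_db
    return frames[keep].reshape(-1)

def chop(signal, sr=16000, seg_seconds=2.0):
    """Cut a silence-stripped signal into fixed-length short segments."""
    seg = int(sr * seg_seconds)
    return [signal[i:i + seg] for i in range(0, len(signal) - seg + 1, seg)]
```

Removing silence before chopping matters here: otherwise a nominally 2 s segment of a quiet recording could contain far less than 2 s of actual speech.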
The three scenarios of speech data, from the VoxCeleb, SITW, and NIST SRE corpora, are intended to evaluate the generalizability of the methods across a range of audio lengths and conditions. We conduct speaker verification trials on voice samples of different speech lengths to investigate the effect of test sample length and to validate the efficiency of the presented method. Note that in all of our tests we assume that each voice sample contains only one speaker and that no training or test speech contains overlapping voices from several speakers.

VoxCeleb and SITW corpus
VoxCeleb is a large open-source speaker recognition dataset with over a million utterances, 7000 speakers, and 2000 hours of audio. The average duration of utterances in VoxCeleb is 8 seconds, and the majority are shorter than 10 seconds. The audio sampling rate is 16 kHz. VoxCeleb includes two sub-datasets, VoxCeleb-1 and VoxCeleb-2. The SITW dataset contains open-source media recordings of 299 public celebrities and is also used to generate a short-duration text-independent dataset. SITW speech segments range in length from 6 to 180 seconds, and the majority are long utterances. As a result, the two datasets can be used to assess the performance of our proposed architectures on utterances of varying lengths, as well as the model's generalizability.
Each of the three datasets, VoxCeleb-1, VoxCeleb-2, and SITW, is divided into development and testing (evaluation) parts. The training set consists of 1 092 009 utterances from 5994 speakers in the VoxCeleb-2 development part (VoxCeleb2-Dev). The remaining data are treated as test sets, in two parts: the VoxCeleb-1 dataset, with 4706 utterances and 37 611 trials, and the SITW evaluation set (SITW-Eval), with 1202 utterances and 721 788 trials.

NIST SRE corpus
The NIST SRE corpus was also used to generate a short-duration text-independent dataset. SRE04-08, Switchboard II Phases 2 and 3, and Switchboard Cellular Parts 1 and 2 comprise the training set, which includes 4000 speakers with 40 short utterances each. The enrollment and test sets are derived from NIST SRE 2010. The enrollment data include 150 male and 150 female speakers, each enrolled with five utterances. The test data consist of 4500 utterances from the same 300 speakers. The generated trial list contains 392 660 trials; the trial list and the detailed segmentation files are available on GitHub.

Feature extraction
All experiments use 64-dimensional input features extracted from a 25 ms window with a 10 ms frameshift. The evaluated features are LPCC, MFCC, MFCC-LPCC, and the proposed BGCC, BGLCC, and BGLCC-MDCD. The 64-dimensional LPCC features comprise 32 coefficients from linear regression along the time axis and 32 along the frequency axis. The MFCCs are 64-dimensional, and the 64-dimensional MFCC-LPCC features contain 32-dimensional MFCC and 32-dimensional LPCC features. The delta 1/2 inputs are likewise 64-dimensional. For the proposed acoustic features BGCC and BGLCC, 64-dimensional feature vectors are extracted; BGCC-MDCD and BGLCC-MDCD contain 16 time-domain, 16 frequency-domain, 16 counter-diagonal-domain, and 16 principal-diagonal-domain features.

Loss function
The triplet loss was initially proposed to learn discriminative image embeddings (Schroff et al., 2015). For model training to be successful, the embeddings must keep each anchor closer to its positive sample than to its negative sample. The cosine triplet embedding loss (Zhang et al., 2018) used for training the model is

L = (1/N) sum_{i=1}^{N} max(0, cos(s_i^a, s_i^n) - cos(s_i^a, s_i^p) + alpha),

where tau is the batch of triplets (s_i^a, s_i^p, s_i^n) and N is the batch size. The anchor sample s_i^a and the positive sample s_i^p are speech samples from the same person a, while the negative sample s_i^n is a speech sample from another person b, with a ≠ b. The margin alpha is a user-tunable hyper-parameter, set to 0.25, that determines the minimum distance between negative and positive speech samples.
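A minimal numpy sketch of such a cosine triplet hinge loss over a batch of embeddings is shown below; it illustrates the mechanics of the loss rather than the paper's training pipeline.

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.25):
    """Hinge loss pushing cos(anchor, positive) above cos(anchor, negative) by margin.

    All inputs are (batch, dim) arrays of embeddings.
    """
    def cos(u, v):
        return np.sum(u * v, axis=-1) / (
            np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.mean(np.maximum(0.0, cos(anchor, negative) - cos(anchor, positive) + margin))
```

The loss is zero exactly when every anchor is closer (in cosine similarity) to its positive than to its negative by at least the margin, so minimizing it separates speakers in embedding space.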

Implementation and reproducibility
The proposed discriminative acoustic feature method is implemented with the PyTorch toolkit (Paszke et al., 2017) and trained with the triplet loss (Schroff et al., 2015). The initial learning rate is 0.001 and training lasts for 200 epochs. The experiments embed the cosine triplet loss with the margin hyper-parameter alpha set to 0.25, which gives the best trade-off. The network is optimized with the Adam optimizer, a minibatch size of 32, and softmax as the classifier. The fully connected layers after the statistics pooling layer have 512 nodes. Training was done on a single Nvidia A100 GPU.

Evaluation metrics
We use the following metrics to evaluate model performance: the equal error rate (EER, in %); the minimum detection cost function at the prior probability of the targeted speaker (Min-DCF*100), a standard setting (Nagrani et al., 2017); and the partial AUC (pAUC) with alpha = 0 and beta = 0.05. The pAUC is the partial area under the ROC curve and meets the evaluation requirements of real-world applications that operate on different parts of the ROC curve; it supplements the existing metrics. The pAUC is delimited by two false positive rate (FPR) parameters, alpha and beta; a detailed calculation is given in (Bai et al., 2020). The pAUC metric evaluates the similarity between two speaker features by the squared Mahalanobis distance.
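To make the metrics concrete, a simple sketch of EER and pAUC computed from raw trial scores is given below; the grid-based interpolation of the ROC curve is an assumption of this sketch, and the paper's Mahalanobis-based scoring and Min-DCF details are not reproduced here.

```python
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR pairs swept over score thresholds, starting from (0, 0)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    tpr = tp / max(tp[-1], 1)
    fpr = fp / max(fp[-1], 1)
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def eer(scores, labels):
    """Equal error rate: operating point where FNR and FPR coincide."""
    fpr, tpr = roc_points(scores, labels)
    fnr = 1.0 - tpr
    i = np.argmin(np.abs(fnr - fpr))
    return (fnr[i] + fpr[i]) / 2.0

def pauc(scores, labels, beta=0.05):
    """Partial AUC over FPR in [0, beta], normalised to [0, 1]."""
    fpr, tpr = roc_points(scores, labels)
    # keep the highest TPR at each distinct FPR before interpolation
    uniq_fpr, idx = np.unique(fpr[::-1], return_index=True)
    uniq_tpr = tpr[::-1][idx]
    grid = np.linspace(0.0, beta, 101)
    y = np.interp(grid, uniq_fpr, uniq_tpr)
    area = np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0
    return area / beta

scores = np.array([0.9, 0.8, 0.2, 0.1])   # toy trial scores
labels = np.array([1, 1, 0, 0])           # 1 = target trial, 0 = non-target
```

Restricting the area to FPR <= 0.05 is what makes pAUC sensitive to the low-false-alarm region that verification applications actually operate in, which a full AUC would average away.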

Overall performance
Performance comparison of different acoustic features. Table 2 and Fig. 2 show the performance of our proposed acoustic features and the compared features on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets. Table 2 lists the results in terms of EER, Min-DCF, and pAUC, while Fig. 2 plots the detection error trade-off (DET) curves of the different acoustic features for 10 s speech, comparing no dynamic features, delta 1/2 dynamic features, and MDCD dynamic features. At the acoustic feature level, the comparison for short-duration audio includes three conventional baseline features, MFCC, LPCC, and MFCC-LPCC, and our proposed BGCC and BGLCC features. The speech length ranges from 0.25 to 10 seconds, in three segments.
In the LPCC experiments in Table 2 on the VoxCeleb-1 dataset, the proposed BGLCC features improve EER by 15.0% over LPCC, and BGLCC-MDCD improves EER by 19.0% over LPCC-delta 1/2, at 2 s speech length. In the MFCC experiments on the same dataset, BGLCC improves EER by 10.6% over MFCC, and BGLCC-MDCD improves EER by 15.0% over MFCC-delta 1/2, at 2 s speech length.
On speech of the other lengths from the VoxCeleb-1, SITW, and NIST SRE 2010 datasets, the proposed BGLCC-MDCD acoustic features likewise achieve better short-duration speaker verification performance than the conventional MFCC, LPCC, and MFCC-LPCC fusion features. The comparison with the baselines is shown in Table 2.
To visualize the effectiveness of our proposed acoustic features on speech of different lengths, we plot detection error trade-off (DET) curves for all compared features, as illustrated in Fig. 2, from which the performance advantage of the proposed BGLCC and MDCD can also be seen. Experiments 1, 2, and 3 present the DET curves of the LPCC acoustic feature under three conditions (no dynamic features, delta 1/2 dynamic features, and our MDCD dynamic features) for 10 s speech on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets, respectively.
The experimental results also show that our proposed MDCD dynamic features achieve lower DET curves than both no dynamic features and delta 1/2 dynamic features on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets.
The proposed MDCD dynamic acoustic features achieve lower EER and Min-DCF and the highest pAUC compared with delta 1/2, demonstrating that the proposed multi-dimensional central difference dynamic features are more effective than single-dimensional dynamic features; the comparison is listed in Table 2. Likewise, in experiments comparing combinations of different attributes of source information for short-duration speaker recognition (Das et al., 2016), the proposed multi-source discriminative acoustic features achieve consistent performance benefits across the short-duration speech datasets.

Ablation experiments
To evaluate each component of the BGLCC-MDCD feature, we conducted several ablation experiments on VoxCeleb-1, SITW, and NIST SRE 2010 datasets, where the results are shown in Tables 2 and 3, and Figs. 2 and 3.
First, we evaluate the effectiveness of the proposed enhancement of discriminative acoustic features. Table 2 lists the EER, Min-DCF, and pAUC results of different features on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets. From Table 2, it can be observed that the proposed acoustic features vastly outperform the baseline features, and Fig. 2 shows that the DET curve with MDCD dynamic features is lower than those without dynamic features and with delta 1/2 dynamic features. The main reason for the improvement is the proposed BGLCC feature, whose Bark-scaled Gauss and linear filter bank superposition remedies the sparse high-frequency information and insufficient feature extraction by retaining more high-frequency information. Similarly, MDCD captures the dynamic features of voiceprints through differences along four dimensions, further compensating for the limited and sparse dynamic acoustic features of short-duration audio signals. The experimental results confirm this.
To verify that the central differences along different dimensions capture the dynamic features of the voiceprint, we conducted several ablation experiments, with results shown in Table 3 and Fig. 3. Compared with the diagonal domains, the time-frequency-domain central differences capture dynamic features better, and the full MDCD achieves the lowest EER and Min-DCF. Figure 3 visualizes the DET curve of each dimension branch for 10 s utterances. The time-frequency domains perform better than the diagonal domains because the signal is mainly analyzed in the time-frequency domain.
Hence, the proposed BGLCC-MDCD discriminative acoustic features are the key reason for the performance improvement in short-utterance speaker verification: (a) the BGLCC features successfully extract speaker-dependent characteristics, remedying the insufficiency of acoustic features caused by the conventional filter design's weak emphasis on high-frequency information; (b) the MDCD method then captures the dynamic features of voiceprints from short-duration audio signals.

Conclusion
In this paper, we propose the Bark-scaled Gauss and linear filter bank superposition acoustic feature extraction method to enhance the high-frequency information of short-duration audio and address the sparsity of high-frequency band features. Compared with traditional acoustic features such as MFCC and LPCC, the proposed BGLCC feature extraction method emphasizes both the low- and high-frequency bands of speech, which helps extract more discriminative acoustic features and compensates for the sparsity of effective information. Furthermore, a multi-dimensional central difference dynamic acoustic feature is proposed based on the BGLCC spectrum characteristics, aiming to capture more diverse dynamic information. The MDCD feature concatenates information about the speaker from four different dimensions, which explains why it performs significantly better than traditionally used speech features in short-utterance speaker verification tasks under various conditions.
The proposed methods are evaluated on the well-known VoxCeleb-1, SITW, and NIST SRE 2010 corpora. The experimental results show that the proposed method achieves consistent improvements over traditional acoustic features on all test sets, and the ablation experiments further indicate that the proposed approaches substantially enhance the discriminative features for speaker verification. Future work involves combining acoustic-feature-based and model-based compensation for short-duration speaker verification to improve the performance, accuracy, and richness of acoustic feature extraction from short-duration audio signals.

Fig. 1. Structure of the proposed acoustic feature extraction method.

Table 1. Detailed configuration of the backbone model, a 34-layer ResNet. The input size is T × 64.

Table 2. Comparison of different acoustic features and the proposed acoustic features under varying audio lengths, using the ResNet-34 network, on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets.

Table 3. Ablation study of different multi-dimensional dynamic features based on BGLCC under varying audio lengths, using the ResNet-34 network, on the VoxCeleb-1, SITW, and NIST SRE 2010 datasets.