Statistical Metrics for the Temporal Acoustics of Durationally Contrastive Vocalics: A Proposal Tested with Data from Arabic and Japanese

Introduction
Durationally contrastive vocalics (i.e., short and long vowels) in natural and synthetic speech have been investigated in research on both native vs. nonnative and normal vs. impaired production and perception in languages with durationally contrastive vocalics, such as Arabic. Most of the early research took a theoretical perspective and focused on the observation that, as in other Semitic languages such as Hebrew, vocalic duration is inherently phonemically contrastive in Arabic (e.g., Harris, 1942; Cantineau, 1956; Ferguson, 1957; Cowan, 1970). Subsequent studies employed experimental or observational methods. Most of the aforementioned studies have documented and characterized the temporal acoustics of Arabic vocalics and the short-to-long duration ratio thereof both interdialectally and cross-dialectally. Other studies sought to compare and contrast the Arabic native production of vocalic duration with that of other Semitic languages that exhibit vocalic durational contrast, such as Hebrew (e.g., Amir et al., 2012), and non-Semitic languages that either feature durational contrast, such as Japanese and Thai (Tsukada, 2009), or do not, such as English (Mitleb, 1984). Other studies examined non-native production and perception of vocalic duration by speakers of Arabic as a second language (L2) whose native language (L1) was Japanese (Tsukada, 2012a; 2012b), English (Flege, Port, 1981; Lababidi, Park, 2014), Korean (Hong, Sarmah, 2009), or Hebrew (Zaltz, Segal, 2021).
To this end, researchers have utilized both the duration difference and the duration ratio in dialect and language comparisons. For instance, some Arabic dialects, specifically Jordanian, have been reported to have a short-to-long duration ratio of 0.65 (Mitleb, 1984), while others have demonstrated a considerably smaller ratio, for example, 0.39 in Palestinian (Saadah, 2011). This discrepancy may not necessarily be due to the use of different stimuli or methods but rather due to actual interdialectal variations, as the duration ratio does not truly convey much about vocalic duration in one dialect or another, nor does it allow for a clear cross-dialect or cross-language comparison.
The duration ratio does not directly express vocalic duration in time units (e.g., ms); instead, it shows only how large or small a value is in relation to another value. That is, the duration ratio of 100 to 200 ms is exactly the same as that of 200 to 400 ms (0.5 in both), which makes this measure unhelpful when comparing two language varieties. The duration difference only shows the quantitative relationship between two given vocalics as short in duration and long in duration, rather than reflecting the actual duration acoustics of the segments under investigation. In addition, the short member of a pair is sometimes produced too long or the long member too short, which results in a negative duration difference value when calculating the difference for each minimal pair (e.g., 70 − 100 = −30 ms). For such an individual value (rather than the overall mean difference), the long-minus-short difference is expected to be nonnegative (including zero), and a negative duration difference value will be uninterpretable. There are a few potential solutions to this particular issue, but each has its own problems. For instance, we could transform and normalize the data to be at or above zero, but this would increase the overall mean duration difference.
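The effect of forcing negative pair-wise differences to zero can be sketched as follows. This is only a minimal illustration in Python: the three minimal pairs are hypothetical (the last one reproduces the 70 − 100 = −30 ms case mentioned above), and the helper names `pair_differences` and `floor_at_zero` are assumptions introduced here, not part of any published procedure.

```python
from statistics import mean


def pair_differences(pairs):
    """Return long - short (ms) for each (short_ms, long_ms) minimal pair."""
    return [long_ms - short_ms for short_ms, long_ms in pairs]


def floor_at_zero(values):
    """Replace negative differences with zero (one possible normalization)."""
    return [max(0.0, v) for v in values]


# Hypothetical minimal pairs as (short_ms, long_ms). In the last pair the
# "short" token (100 ms) is longer than the "long" token (70 ms).
pairs = [(80, 150), (90, 170), (100, 70)]

raw = pair_differences(pairs)      # [70, 80, -30]
floored = floor_at_zero(raw)       # negative value replaced by 0.0

raw_mean = mean(raw)               # 40
floored_mean = mean(floored)       # 50.0

print("raw differences:", raw, "mean:", raw_mean)
print("floored differences:", floored, "mean:", floored_mean)
```

In this toy case, flooring raises the mean difference from 40 ms to 50 ms, which is the inflation referred to above.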
Hence, this study proposes two statistical metrics to allow for direct comparison between different varieties in terms of vocalic duration. The first section provides background and describes the two metrics, the duration metric and the difference metric, that can be used instead of the duration difference or duration ratio values reported in previous studies. The two metrics are illustrated using available data from relevant literature. In the second section, a production experiment is conducted to test the two alternative metrics, using data from Arabic and Japanese as two languages that have been repeatedly compared and contrasted in the literature (e.g., Tsukada, 2009) because they share similar durationally contrastive vocalics (e.g., Aldholmi, 2022).

Traditional measures
As reported in some previous studies, the traditional method for obtaining a short-long duration ratio divides the mean duration of the short vowels by that of the long vowels. For instance, Mitleb (1984) reported a ratio of 0.65, calculated as in Eq. (1):

\[ \text{ratio} = \frac{\text{mean short vowels}}{\text{mean long vowels}}, \qquad 0.65 = \frac{83\ \text{ms}}{128\ \text{ms}}. \tag{1} \]

In some cases, the duration difference is reported instead of the duration ratio. The duration difference is simply the difference between the mean duration of the long vowels and that of the short vowels, as shown in Eq. (2). Mitleb (1984) reported a duration difference of 45 ms:

\[ \text{difference} = \text{mean long vowels} - \text{mean short vowels}, \qquad 45\ \text{ms} = 128\ \text{ms} - 83\ \text{ms}. \tag{2} \]

The duration ratio is sometimes reported in qualitative rather than numerical form. For instance, Tsukada (2011) stated that "long [Arabic] vowels are twice as long as their short counterparts" (p. 989), while "long Japanese vowels tend to be more than twice as long as their short counterparts" (p. 990). Regardless, both the duration ratio and the duration difference depend on the range of the two values, specifically the mean short vowel duration and the mean long vowel duration, which on their own are insufficient to precisely quantify the vocalic duration in a given dialect. For example, suppose that in one Arabic dialect the mean duration of two short vocalics (65 and 75 ms) is 70 ms while the mean duration of two long vocalics (165 and 175 ms) is 170 ms, and in another Arabic dialect the mean duration of two short vocalics (115 and 125 ms) is 120 ms while the mean duration of two long vocalics (285 and 295 ms) is 290 ms. In both scenarios, the ratio is approximately 0.41, but the difference is 100 ms in the first and 170 ms in the second. The duration ratio makes the two dialects seem similar, whereas the range of the values and the duration difference show that they are not.
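The two traditional measures can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names `duration_ratio` and `duration_difference` are assumptions introduced here, and the inputs are the Mitleb (1984) means and the two hypothetical dialects described above.

```python
def duration_ratio(mean_short_ms: float, mean_long_ms: float) -> float:
    """Eq. (1): mean short duration divided by mean long duration."""
    return mean_short_ms / mean_long_ms


def duration_difference(mean_short_ms: float, mean_long_ms: float) -> float:
    """Eq. (2): mean long duration minus mean short duration (ms)."""
    return mean_long_ms - mean_short_ms


# Means reported by Mitleb (1984), as cited in the text.
mitleb_ratio = round(duration_ratio(83.0, 128.0), 2)
mitleb_difference = duration_difference(83.0, 128.0)
print("Mitleb (1984):", mitleb_ratio, mitleb_difference)

# The two hypothetical dialects from the text: (mean short, mean long) in ms.
dialects = {"dialect 1": (70.0, 170.0), "dialect 2": (120.0, 290.0)}

summary = {
    name: (round(duration_ratio(s, l), 2), duration_difference(s, l))
    for name, (s, l) in dialects.items()
}
for name, (ratio, diff) in summary.items():
    print(f"{name}: ratio = {ratio:.2f}, difference = {diff:.0f} ms")
```

Both hypothetical dialects round to a ratio of 0.41, while their differences are 100 ms and 170 ms, so neither measure alone indicates how long the vocalics actually are.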

Proposed metrics
The duration difference is similar to the compact-diffuse (C-D) formant spacing measure used in some studies, in which the first formant (F1, a smaller value) is subtracted from the second formant (F2, a larger value) (e.g., Blomgren et al., 1998; Kent, Vorperian, 2018). Although computing the C-D value has a different purpose, namely, to evaluate tongue elevation (e.g., Jakobson et al., 1963), it reduces the two values into a single value that can be used for statistical description and inference. Another formant spacing value is the so-called grave-acute (G-A) measure (Kent, Vorperian, 2018), which describes tongue advancement (Jakobson et al., 1963; Blomgren et al., 1998). The G-A value has been computed according to Eq. (3), where X = each individual vocalic, and n = the total number of data points (vocalics):

\[ \text{G-A} = \frac{\sum X}{n}. \tag{3} \]

This method can form the basis of a new, alternative metric that can be used to describe the vocalic duration and the vowel difference in languages where the vocalic duration is contrastive. The proposed metric can be calculated by Eq. (4):

\[ \text{duration metric} = \frac{\sum X}{n}. \tag{4} \]

The output provides us with one value that lies between the original value of the short vowel and that of the long vowel, but it should better inform us about how short or long the two contrastive vocalics are in a given dialect or language. To illustrate this, consider the previous two scenarios, labeled (a) and (b) for convenience. Note that we treat the mean durations as single data points for two individual vocalics:

\[ \text{(a)}\quad \frac{70 + 170}{2} = 120\ \text{ms}, \qquad \text{(b)}\quad \frac{120 + 290}{2} = 205\ \text{ms}. \]

The two obtained values indicate that the first dialect has notably smaller short and long vocalic durations than the second dialect. In other words, the overall duration of vocalics in the first dialect is approximately 41% shorter than that in the second dialect (equivalently, the second is approximately 71% longer than the first). Neither the duration ratio, which is identical in both dialects (0.41), nor the duration difference, which always depends on the distance between the short and long vowel durations, will provide a unified metric that allows for a direct comparison between the two dialects or languages. Nevertheless, the proposed duration metric still does not show how far the duration value is from the original short and long durations. Hence, one further step is needed, which is to calculate the difference metric (Eqs. (5) and (6)):

\[ \text{difference metric} = \text{duration metric} \pm (\text{duration metric} - \text{short vocalic}) \tag{5} \]

or

\[ \text{difference metric} = \text{duration metric} \pm (\text{long vocalic} - \text{duration metric}). \tag{6} \]

Note that Eqs. (5) and (6) provide exactly the same value.
Consider the difference metric computed for the aforementioned scenarios: (a) 120 ms ± 50 ms and (b) 205 ms ± 85 ms. The ± value is the difference metric, which we can subtract from or add to the duration metric to obtain the duration of the short vocalic or of the long vocalic, respectively.
In the first scenario, 120 ms ± 50 ms = 70 ms or 170 ms, which yields the durations of the short and long vocalics, respectively. The same applies to the second scenario. The difference metric shows that the difference between the short and long vowels is smaller in the first dialect than in the second dialect. Thus, the duration metric provides us with one value that represents both short and long vocalics. This cannot be achieved via the traditional duration difference (where the short duration is subtracted from the long duration) because the short and long vowels can have large values (e.g., 200 and 250 ms, respectively), but the duration difference, which will be 50 ms in this case, cannot be used to calculate the exact duration of either vocalic. Similarly, two smaller values for short and long vowels (e.g., 50 and 110 ms) can have a larger duration difference, calculated here as 60 ms, but this value also indicates nothing about the duration of the short and long vocalics. The proposed duration metric does provide information about how long the short and long vocalics are. To illustrate this with a real-world example, we analyze data from Tsukada (2011).
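The two proposed metrics for the hypothetical scenarios can be sketched in Python as follows. The sketch assumes that the duration metric is the arithmetic mean of the vocalic durations, a reading that is consistent with the worked value of 120 ms ± 50 ms for the first scenario; the function names `duration_metric`, `difference_metric`, and `recover_pair` are introduced here purely for illustration.

```python
from statistics import mean


def duration_metric(durations_ms):
    """Arithmetic mean of the vocalic durations (assumed form of the metric)."""
    return mean(durations_ms)


def difference_metric(short_ms, long_ms):
    """Distance (ms) between the duration metric and either pair member."""
    metric = duration_metric([short_ms, long_ms])
    # For a single short-long pair, metric - short equals long - metric.
    return metric - short_ms


def recover_pair(metric_ms, diff_ms):
    """Return (short, long) as metric - diff and metric + diff."""
    return metric_ms - diff_ms, metric_ms + diff_ms


# Hypothetical scenarios from the text: (mean short, mean long) in ms.
scenarios = {"first dialect": (70.0, 170.0), "second dialect": (120.0, 290.0)}

results = {}
for name, (short_ms, long_ms) in scenarios.items():
    m = duration_metric([short_ms, long_ms])
    d = difference_metric(short_ms, long_ms)
    results[name] = (m, d, recover_pair(m, d))
    print(f"{name}: duration metric = {m:.0f} ms, difference metric = ±{d:.0f} ms")
```

Under this assumption the first dialect gives 120 ms ± 50 ms and the second 205 ms ± 85 ms, and `recover_pair` returns the original short and long durations in both cases.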

An example from Arabic and Japanese
The short vocalic /a/ in trial 1 has a relatively small duration ratio (0.37) compared to its long counterpart, which is below the lowest value reported in the literature on Arabic (0.39), while the short vocalic /u/ in trial 1 has a relatively larger duration ratio (0.51), which is above the frequently reported range (0.39-0.45) in the literature (e.g., Tsukada, 2011). Nevertheless, Table 2 shows a duration metric in Japanese of 123.15 ms and a difference metric of ±50.45 ms, suggesting that Japanese vocalics generally tend to be shorter than Arabic vocalics (123.15 vs. 158.08 ms, respectively) and that the difference (not the ratio) between short and long vocalics in Japanese is smaller than that in Arabic (50.45 vs. 58.58 ms, respectively). Inspection of the means for both Arabic and Japanese short and long vowels supports this conclusion.

Duration metric and difference metric tested: An experiment
The current experiment utilizes the proposed duration and difference metrics for statistical analysis and compares them with the traditional duration difference and the duration ratio measures in Arabic and Japanese.

Methodology
The stimuli for this study consist of 18 Modern Standard Arabic (MSA) CVCVC vs. CV:CVC words and 18 Japanese CVCV vs. CV:CV words. The Arabic items selected for this experiment were inspired by (but not taken from) Hassan (2002), while the Japanese items were selected from Tsukada (2012b). The target vocalic in the stimuli from both languages was the first rather than the second/final vocalic because the final vocalic is subject to certain phonological processes such as shortening and lengthening (see Aldholmi, 2022). Following the protocol of Aldholmi (2022), 22 male and 18 female native speakers of Arabic (n = 40) produced the items using an Arabic carrier sentence (/ʔanaa ʔaktubu ___ ʔaidˤan/ "I write ___ as well"). The Arabic participants spoke different Arabic dialects, including Najdi (Qassimi, n = 8), Hijazi (Jeddah and Madinah, n = 11), Southern (Faifa and Abha, n = 12), and Northern (Northern Borders, n = 9) dialects. The Arabic participants met face-to-face with the experimenter or other linguists who had volunteered to help the researcher collect the data at different Saudi institutions.
Twenty-four male and 16 female native speakers of Japanese (n = 40; the initial sample comprised 41 participants, but one was excluded for unclear speech; see footnote 2) produced the Japanese items using a Japanese carrier sentence adopted from Tsukada (2011, p. 991) (/tsugiwa ___ to iimasu/ "Next I say the word ___") and performed the task entirely online (using Phonic.ai, 2023). Approximately half of the Japanese participants (n = 19) came from Osaka, while the rest did not specify their origin. The target vocalics were isolated from the adjacent consonants by the experimenter, using both visual and auditory judgements for all items. Praat (Boersma, Weenink, 2021) was used for segmenting and marking the boundaries of segments for all items.

Results
Footnote 2. It would have been desirable to maintain gender balance for both Arabic and Japanese speakers, had the Japanese speakers been as accessible to the researcher as the Arabic speakers were. Nonetheless, an attempt was made to maintain a similar female-to-male ratio in both groups, although previous studies did not always have gender-balanced groups. For instance, Tsukada had 7 Arabic speakers (4 male and 3 female) in some studies (e.g., Tsukada, 2011) and 9 Arabic speakers (6 male and 3 female) in some other studies (e.g., Tsukada, 2012a). It should also be recalled that speakers maintain the duration distinction in languages that exhibit durationally contrastive vocalics, such as Hebrew, regardless of gender (e.g., Amir et al., 2012).

As shown in Figs. 1 and 2, [...] A repeated-measures ANOVA with vocalic length (short vs. long) as a within-subject factor and language as a between-subject factor was performed to test mean differences. As detailed in Table 4, the test provided evidence for a statistically significant difference between short and long vocalics with a very large effect size, F(1, 78) = 2047.16, p < 0.001, ω² = 0.86, and between Arabic and Japanese, also with a large effect size, F(1, 78) = 182.51, p < 0.001, ω² = 0.53. There was also a statistically significant interaction between the two factors with an intermediate effect, F(1, 78) = 27.89, p < 0.001, ω² = 0.07. Thus, we have strong evidence that Arabic and Japanese differ significantly in terms of duration for both short and long vocalics and that, within each language, short vocalics are shorter than their long counterparts.

Table 5 presents the duration differences and duration ratios for both languages. The duration difference for Arabic (128.16 ms) and for Japanese (101.82 ms) and the duration ratio for Arabic (0.48) and Japanese (0.37) are similar to those calculated from the data provided in Tsukada (2011). Hence, the duration difference may be misinterpreted as indicative of an overall similarity between the vocalic duration in Arabic and Japanese, which is not precisely the case.

Now consider both the proposed duration metric and the difference metric in Table 6. The duration metric for Arabic (173.03 ms) was substantially larger than that for Japanese (111.20 ms). Likewise, the difference metric for Arabic (64.08 ms) was considerably greater than that for Japanese (51.30 ms). Thus, based on the aforementioned data, we observe that the duration metric and the difference metric better represent the vocalic duration facts in both languages. The values are re-reported side by side in Table 7, which arguably illustrates how the substantial dissimilarity between Arabic and Japanese and between short and long vocalics is reflected more clearly in the duration metric and the difference metric than in the duration difference and the duration ratio.

To support this claim, an inverse regression was performed to test which of the four variables (duration difference, duration ratio, duration metric, or difference metric) would most accurately predict the language. We first compare the duration difference and the duration metric, as these two are similar; both inform us about the actual duration of the short vs. long vowels. Next, we compare the duration ratio and the difference metric, as these two are also similar; both inform us about the relationship between two values. Despite the similarity in purpose between the members of each pair, the difference metric and duration metric both have the added benefit of being able to inform us about the mean vocalic duration measures as well.
We fitted an inverse binary logistic regression model, first using the duration difference as a predictor variable and the language as a predicted variable. The results indicated a significant improvement in fit relative to an intercept-only model, χ²(1) = 79.58, p < .001, and that the duration difference was a statistically significant predictor of language, χ²(1) = 60.60, p < .001. Table 8 shows the −2 log-likelihood (−2LL) and the pseudo-R² values of the first model (model 1). As shown, in order from the largest pseudo-R² value to the smallest, the Nagelkerke R², Tjur R², Cox and Snell R², and McFadden R² exhibited relatively similar, low values. These values become important later when we compare with another predictor variable.
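For readers who wish to reproduce fit indices of this kind, the sketch below computes the likelihood-ratio χ² against an intercept-only model and the four pseudo-R² values from their standard textbook definitions. It is an illustration under stated assumptions: the text does not specify how the statistical software computed these indices, the log-likelihoods and sample size used here are hypothetical, and all function names are introduced only for this example.

```python
import math
from statistics import mean


def lr_chi2(ll_null, ll_model):
    """Likelihood-ratio chi-square: difference between the two -2LL values."""
    return 2.0 * (ll_model - ll_null)


def chi2_df1_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def mcfadden_r2(ll_null, ll_model):
    return 1.0 - ll_model / ll_null


def cox_snell_r2(ll_null, ll_model, n):
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)


def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R2 rescaled by its maximum attainable value."""
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cox_snell


def tjur_r2(probs, labels):
    """Mean predicted probability for class 1 minus that for class 0."""
    p1 = [p for p, y in zip(probs, labels) if y == 1]
    p0 = [p for p, y in zip(probs, labels) if y == 0]
    return mean(p1) - mean(p0)


# Hypothetical example: 200 observations, two equally frequent classes.
n = 200
ll_null = n * math.log(0.5)   # intercept-only log-likelihood
ll_model = -100.0             # hypothetical fitted-model log-likelihood

chi2 = lr_chi2(ll_null, ll_model)
print(f"chi2(1) = {chi2:.2f}, p = {chi2_df1_pvalue(chi2):.3g}")
print(f"McFadden = {mcfadden_r2(ll_null, ll_model):.3f}")
print(f"Cox and Snell = {cox_snell_r2(ll_null, ll_model, n):.3f}")
print(f"Nagelkerke = {nagelkerke_r2(ll_null, ll_model, n):.3f}")
print(f"Tjur = {tjur_r2([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]):.2f}")
```

Because the Nagelkerke value is the Cox and Snell value divided by a quantity smaller than one, it can never be the smaller of the two, which is a useful check when reading tables of pseudo-R² values.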
Table 9 shows that the sensitivity of the model was 78.60%, the specificity of the model was 66.70%, and the overall accuracy was 72.60%.
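The three rates follow directly from the cells of a confusion matrix, as the short Python sketch below illustrates. The cell counts are hypothetical and were chosen only so that the resulting rates resemble those reported for model 1; they are not the counts in Table 9, and which language is treated as the positive class is an assumption of the example.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix cells."""
    sensitivity = tp / (tp + fn)                 # correctly classified positives
    specificity = tn / (tn + fp)                 # correctly classified negatives
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # all correct classifications
    return sensitivity, specificity, accuracy


# Hypothetical counts for two equally sized classes of 360 observations each.
sens, spec, acc = classification_rates(tp=283, fn=77, tn=240, fp=120)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}, accuracy = {acc:.1%}")
```

With equally sized classes, the overall accuracy is simply the mean of the sensitivity and the specificity.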
The model was re-fitted using the duration metric value as a predictor variable. The results again showed a significant improvement in fit for the second model (model 2) relative to an intercept-only model, χ²(1) = 572.06, p < .001, and that the duration metric was a statistically significant predictor of the language, χ²(1) = 159.56, p < .001. When the duration metric was used as a predictor for the language, the −2LL value was lower while the pseudo-R² values (Table 10) were higher than those obtained in the previous model, indicating a better fit for model 2. As shown in Table 11, the sensitivity (85.30%), specificity (92.80%), and overall accuracy (89.00%) all improved in model 2.
Thus, all indicators demonstrated that the duration metric proposed in the current study is a better alternative to the duration difference used in previous studies. We next compare the other two indicators (the duration ratio vs. the difference metric), following the same steps used in comparing the duration difference and the duration metric.
An inverse binary logistic regression model (model 3) was fitted with the duration ratio as a predictor and the language as a predicted variable. The output showed that, compared to an intercept-only model, model 3 demonstrated a significant improvement in fit, χ²(1) = 101.92, p < .001, and that the duration ratio was a statistically significant predictor of the language, χ²(1) = 68.31, p < .001. The −2LL value (896.20) and the pseudo-R² values (Nagelkerke R² = 0.17, Cox and Snell R² = 0.13, Tjur R² = 0.12, and McFadden R² = 0.10) were very similar (Table 12) to those obtained when the duration difference was used as a predictor.
The sensitivity (59.7%), specificity (70.8%), and overall accuracy (65.3%) of the model, as shown in Table 13, indicated that this model exhibited poor sensitivity and relatively poor overall accuracy.
Running the model again with the difference metric as a predictor, model 4 showed a significant improvement in fit relative to the intercept-only model, χ²(1) = 537.29, p < .001. It also indicated the difference metric as a statistically significant predictor of the language, χ²(1) = 167.56, p < .001. The −2LL value (460.84) and the pseudo-R² values (Nagelkerke R² = 0.71, Tjur R² = 0.60, Cox and Snell R² = 0.53, and McFadden R² = 0.52) were highly similar (Table 14) to those obtained when using the duration metric as a predictor.
The sensitivity (87.2%), specificity (84.2%), and overall accuracy (85.7%) of the model, as shown in Table 15, were notably higher than those in the previous model and indicated good fit. Thus, running the model again with the difference metric as a predictor significantly improved the model's goodness of fit compared to using the duration ratio as a predictor.

Discussion and conclusion
The findings above agree with a large body of literature that has shown that Arabic and Japanese contrast short and long vowels (e.g., Tsukada, 2013), as well as with previous observations that Arabic short vowels are approximately 50% as long as their long counterparts while Japanese short vowels are less than 50% as long as their long counterparts (e.g., Tsukada, 2011). The duration difference and the duration ratio were, respectively, 128.16 ms and 0.48 for Arabic vocalics and 101.82 ms and 0.37 for Japanese vocalics. The duration differences (128.16 and 101.82 ms) do not reflect the short and long durations in Arabic or Japanese; Arabic short vowels are approximately 55% longer than Japanese short vowels, Arabic long vowels are approximately 65% longer than Japanese long vowels, and, overall, Arabic vocalics are approximately 60% longer than Japanese vocalics. Likewise, the duration ratio does not convey much information about vocalic duration within a language (e.g., Arabic or Japanese), between the two languages (Arabic and Japanese), or in comparison with other languages. Based on the data we obtained in this experiment, the duration ratios in Arabic and Japanese are relatively similar: 0.48 in Arabic and 0.37 in Japanese. That is, the durations of short and long vowels in Arabic are nearly double those in Japanese, but we cannot deduce this from the duration ratio.
In comparison, the duration metric (173.03 ms) and difference metric (±64.08 ms) in Arabic diverged from the duration metric (111.20 ms) and the difference metric (±51.30 ms) in Japanese. The duration metric shows the average length of both short and long vowels; we can see clearly that Arabic vocalics are considerably longer than Japanese vocalics. The difference metric shows the extent to which short vocalics and long vocalics are similar or different within and between Arabic and Japanese, and we can ascertain that the difference between short and long vocalics in Arabic is greater than that in Japanese and, moreover, that duration is more variable in Arabic than in Japanese. The two metrics together show that the duration metric of short and long vocalics in Arabic (173.03 ms) is very close to the duration of long vocalics in Japanese (111.20 + 51.30 = 162.50 ms) and that the duration metric of short and long vocalics in Japanese (111.20 ms) is also similar to the duration of short vowels in Arabic (173.03 − 64.08 = 108.95 ms).
Neither the duration difference nor the duration ratio is a factual duration unit. Unlike the duration metric, the duration difference does not provide the actual duration of vocalics in Arabic vs. Japanese. Likewise, the duration ratio is a completely different measurement unit that no longer expresses the duration in time units and cannot indicate the duration of short vowels relative to long ones in Arabic vs. Japanese. The duration ratio cannot be used to compare vocalic durations between dialects or languages, because two different languages that have two distinct duration measurements for short and long vocalics may still have similar or even identical duration ratios. For instance, the duration ratio for Palestinian vocalics is approximately 0.39 (Saadah, 2011) and for Japanese vocalics in the current experiment was 0.37. These two values are extremely similar, but overall, Palestinian short and long vocalics are both longer than their Japanese counterparts. Using the actual vocalic duration measurements in a statistical test to compare vocalics in Palestinian and Japanese should reveal a significant difference, while using the duration ratio is unlikely to reveal any differences. This is probably the reason why the duration and difference metrics were better predictors of the language.
To summarize, this paper shows how the duration difference and duration ratio measures used in previous studies are not optimal metrics for comparing vocalic duration within and across languages. We propose two alternative metrics: the duration metric and the difference metric. Using data from a previous study (Tsukada, 2011), we illustrate the difference between the duration ratio and the duration difference, on the one hand, and between the duration metric and difference metric, on the other hand. We then conduct an experiment to examine the new metrics. The findings show that short and long vocalic durations differ in both Arabic and Japanese and that Arabic and Japanese also differ in terms of short and long vowel durations. More importantly, the key finding is that the proposed metrics were better predictors of the language than the traditional measures. This finding invites researchers working on vocalic duration, whether phoneticians, language acquisitionists, or speech pathologists, to consider using (and testing) the proposed metrics. We also call for a revisiting of the findings established in previous literature, especially those studies that compared several languages or dialects (e.g., Alghamdi, 1998). Future research can survey languages and dialects that have shown similar or dissimilar duration ratios and examine whether the proposed metrics will reveal patterns that differ from those revealed by the traditional duration ratio and duration difference measures.
the actual durations of both short vocalics are 114 ms and 109 ms, respectively, which exhibit only a negligible difference (114 − 109 = 5 ms). The overall duration ratio (0.47) and the overall duration difference (117.17 ms) do not indicate the actual magnitude of the durations of short and long vocalics in Arabic. In contrast, the duration metric does show that Arabic short vocalics are generally shorter and Arabic long vocalics longer than 158.08 ms and that the distance between short or long vowels and this metric value is ±58.58 ms overall.

Table 3. Short and long vocalic means [ms], SDs, minimums, and maximums of durations in Arabic and Japanese.

Table 5. Duration differences and duration ratios in Arabic and Japanese.

Table 6. Duration metrics and difference metrics in Arabic and Japanese.

Table 7. Vocalic durations, duration differences, duration ratios, duration metrics, and difference metrics in Arabic and Japanese.

Table 8. The −2LL and pseudo-R² values for model 1 (Tjur R², Cox and Snell R², McFadden R²).

Table 9. Confusion matrix (sensitivity and specificity rates) and accuracy rate of model 1.

Table 10. The −2LL and pseudo-R² values for model 2 (Tjur R², McFadden R², Cox and Snell R²).

Table 11. Confusion matrix (sensitivity and specificity rates) and accuracy rate of model 2.

Table 13. Confusion matrix (sensitivity and specificity rates) and accuracy rate of model 3.

Table 14. The −2LL and pseudo-R² values for model 4 (Tjur R², Cox and Snell R², McFadden R²).

Table 15. Confusion matrix (sensitivity and specificity rates) and accuracy rate of model 4.