Acoustic Models for Pronunciation Assessment of Vowels of Indian English Study Notes

Acoustic Models for Pronunciation Assessment of Vowels of Indian English: Research Overview

Authors and Affiliation: Shrikant Joshi and Preeti Rao, Department of Electrical Engineering, Indian Institute of Technology Bombay, India.
Primary Objective: The research investigates the pronunciation assessment of Indian English (IE) vowels, specifically those uttered by speakers with Gujarati as their native language (L1).
Methodology: The study utilizes confidence measures obtained from automatic speech recognition (ASR), specifically the Goodness of Pronunciation (GOP) measure, derived from acoustic likelihood scores.
Challenge Addressed: The effectiveness of acoustic models in detecting errors in the target language (Indian English) is highly dependent on the training data. There is a notable absence of labeled speech databases for either the source language (Gujarati) or the target language (Indian English).
Proposed Solution: The researchers investigate combinations of acoustic models trained on available databases of American English (AE) and Hindi, followed by adaptation with a limited amount of Indian English speech.
Key Finding: Indian English speech is better represented by Hindi speech models for shared vowels than by American English models. Model adaptation significantly improves system performance.

General Indian English (GIE) and Linguistic Context

Linguistic Diversity in India: India features at least two major language groups: Indo-Aryan and Dravidian. English serves as a lingua franca for official and social communication among diverse regional groups.
Origins of Regional Accents: Since Indian languages differ phonologically from each other and from English, L1 interference leads to specific regional varieties of spoken English.
Standard for Learners: The desirable target for learners is General Indian English (GIE), which is a form of spoken English devoid of regional influences and intelligible both within and outside India.
Phonological Characteristics of GIE:
    - GIE deviates from British Received Pronunciation (R.P.) in phone quality and phonology.
    - It incorporates common segmental features across Indian languages to create a version that is distinctly Indian but lacks prominent regional coloration.
    - Prosody in GIE is generally expected to follow British R.P.
Teaching Context: Educated Indian English speakers usually align with British R.P. in grammar and vocabulary but often fall short in pronunciation due to mismatched L1 phonologies and the contrast between phonemic Indian orthography and unusual English spelling-to-sound rules.

Gujarati L1 Interference and Phonological Comparison

Gujarati Vowel System: Gujarati has 6 pure vowels. This is a reduced set compared to the larger Sanskrit phoneme set, a trait shared with other Indo-Aryan languages like Marathi, Oriya, Assamese, and Bengali.
GIE Vowel System: GIE consists of 11 pure vowels, corresponding to the 12 pure vowels plus one diphthong of British R.P.
Specific L1 Confusions: Gujarati speakers often merge phonemes into "intermediate" qualities, leading to ambiguities between long and short vowels:
    - Confusions between $i:$ and $ɪ$.
    - Confusions between $u:$ and $ʊ$.
    - Confusions within clusters like $/e:, ɛ, æ/$ and $/ɒ, o:/$.
Verbatim Examples of Ambiguity:
    - "Snack" vs. "Snake".
    - "Coat" vs. "Caught".
    - "Beat" vs. "Bit".
    - "Fool" vs. "Full".
Hindi Phonology as a Proxy: Unlike Gujarati, standard Hindi has a vowel system very similar to that of English, attributed to Perso-Arabic influences on the Sanskrit-origin language. This makes Hindi models a viable candidate for GIE assessment.

Vowel Mappings across Dialects and Languages

Comparison of British R.P., American English (AE), and GIE (Pure Vowels):
    - 1. British $i:$ / AE $i:$ / GIE $i:$
    - 2. British $ɪ$ / AE $ɪ$ / GIE $ɪ$
    - 3. British $eɪ$ (diphthong) / AE $eɪ$ / GIE $e:$
    - 4. British $e$ / AE $ɛ$ / GIE $ɛ$
    - 5. British $æ$ / AE $æ$ / GIE $æ$
    - 6. British $ə, ɜ:, ʌ$ / AE $ə, ɜ:, ʌ$ / GIE $ə, ʌ$
    - 7. British $a:$ / AE $a$ / GIE $a:$
    - 8. British $ɔ:$ / AE $ɔ:$ / GIE $o:$
    - 9. British $ɒ$ / AE $ɔ$ / GIE $ɒ$
    - 10. British $u:$ / AE $u:$ / GIE $u:$
    - 11. British $ʊ$ / AE $ʊ$ / GIE $ʊ$
Gujarati Phonetic Mapping (Indices relate to GIE list above):
    - Indices 1 and 2 ( $i:, ɪ$ ) collapse to $i$ in Gujarati.
    - Indices 10 and 11 ( $u:, ʊ$ ) collapse to $u$ in Gujarati.
    - Indices 3, 4, 5 ( $e:, ɛ, æ$ ) collapse to $e$ in Gujarati.
    - Indices 8 and 9 ( $o:, ɒ$ ) collapse to $ɔ$ in Gujarati.
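
The collapse described above can be written down as a small lookup table. The following is a minimal sketch; the names `GIE_TO_GUJARATI`, `confusion_groups`, and `confusable` are illustrative and not from the study, and `ʊ` stands for the short close back vowel of index 11. GIE vowels for which no Gujarati mapping is listed above are left out rather than guessed.

```python
# GIE vowel -> Gujarati vowel it collapses to, transcribed from the mapping above.
# Vowels with no mapping stated in the notes (indices 6 and 7) are omitted.
GIE_TO_GUJARATI = {
    "i:": "i", "ɪ": "i",
    "u:": "u", "ʊ": "u",
    "e:": "e", "ɛ": "e", "æ": "e",
    "o:": "ɔ", "ɒ": "ɔ",
}


def confusion_groups(mapping):
    """Group GIE vowels that share the same Gujarati target."""
    groups = {}
    for gie_vowel, guj_vowel in mapping.items():
        groups.setdefault(guj_vowel, []).append(gie_vowel)
    return groups


def confusable(v1, v2, mapping=GIE_TO_GUJARATI):
    """True if two distinct GIE vowels collapse to one Gujarati vowel."""
    return v1 != v2 and v1 in mapping and mapping.get(v1) == mapping.get(v2)


groups = confusion_groups(GIE_TO_GUJARATI)
```

Under this table, `confusable("i:", "ɪ")` is true while `confusable("i:", "u:")` is false, which matches the "Beat"/"Bit" kind of ambiguity listed earlier.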

Automatic Pronunciation Assessment and the GOP Algorithm

Core Technology: Pronunciation assessment is tied to Automatic Speech Recognition (ASR). However, while standard ASR relies on language models to overcome low phone recognition accuracy, pronunciation assessment must avoid language models to ensure errors aren't obscured.
The Goodness of Pronunciation (GOP): Developed by Witt and Young, GOP is an acoustic likelihood-based confidence measure calculated using Hidden Markov Models (HMM).
GOP Calculation:
    - It is based on the duration-normalized log posterior probability $p(p_j|O)$.
    - The formula for a specific phone $p_j$ is defined as:
    - $GOP(p_j) = \frac{1}{NF_O(O)} \times \big| \log \frac{p(O|p_j)}{\max_i p(O|p_i)} \big|$
Implementation Process:
    - Forced Alignment (Numerator): The system performs constrained decoding using the reference transcription to get the likelihood $p(O|p_j)$.
    - Phone Loop (Denominator): The system performs free decoding (unconstrained phone recognition) to determine the maximum likelihood $p(O|p_i)$.
    - Scoring: A correctly articulated phone results in a matching forced alignment and decoded output, yielding a GOP score near zero. Higher scores indicate poorer articulation relative to the underlying model.
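
The scoring step can be sketched in a few lines once the two decoding passes have produced log-likelihoods for a segment. This is an illustrative sketch, not the authors' implementation: the function name `gop_score` and its arguments are hypothetical, the inputs are assumed to be natural-log likelihoods from some HMM decoder, and the normalization term is assumed to be the number of frames in the segment.

```python
def gop_score(forced_loglik, loop_loglik, num_frames):
    """Duration-normalized GOP for one phone segment.

    forced_loglik: log p(O|p_j) from forced alignment (numerator)
    loop_loglik:   max_i log p(O|p_i) from the free phone loop (denominator)
    num_frames:    number of frames in the segment O (assumed normalizer)
    """
    if num_frames <= 0:
        raise ValueError("segment must contain at least one frame")
    # log(a / b) = log a - log b, then absolute value and duration normalization
    return abs(forced_loglik - loop_loglik) / num_frames


# Illustrative numbers only, not taken from the study.
match = gop_score(-420.0, -420.0, 12)     # phone loop agrees with the reference
mismatch = gop_score(-450.0, -420.0, 12)  # some other phone fits the audio better
```

When the phone loop picks the same phone as the reference, the two log-likelihoods coincide and the score is zero; the larger the gap, the higher the score.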

Signal Processing and System Architecture

Signal Pre-emphasis: A first-order pre-emphasis filter $(1 - 0.97 z^{-1})$ is applied to the signal.
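
In the time domain this filter is the difference equation y[n] = x[n] - 0.97 x[n-1]. A minimal sketch follows; the function name `pre_emphasize` is illustrative, and treating the sample before the first one as zero is an assumption.

```python
def pre_emphasize(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1], i.e. H(z) = 1 - coeff * z^-1."""
    out = []
    previous = 0.0  # assumption: the sample before the first one is zero
    for x in samples:
        out.append(x - coeff * previous)
        previous = x
    return out


emphasized = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```

A constant input is reduced to roughly 0.03 after the first sample, which shows how the filter suppresses low-frequency content relative to high-frequency content.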
Feature Extraction:
    - 12 Mel-Frequency Cepstral Coefficients (MFCC).
    - 1 Normalized log energy coefficient (computed by subtracting maximum log energy value from each frame and adding 1).
    - Delta and Acceleration (delta-delta) coefficients.
    - Total Feature Vector: 39 dimensions.
    - Windowing: 25 ms Hamming window with a 10 ms hop size.
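
The arithmetic behind these settings can be checked with a short sketch. It assumes the 16 kHz sampling rate stated for the databases; the helper names and the small energy floor are illustrative assumptions, and the MFCC computation itself is omitted.

```python
import math


def frame_params(sample_rate_hz, window_ms=25, hop_ms=10):
    """Window and hop lengths in samples."""
    window = sample_rate_hz * window_ms // 1000
    hop = sample_rate_hz * hop_ms // 1000
    return window, hop


def normalized_log_energy(frames):
    """Per-frame log energy, shifted so the loudest frame equals 1."""
    floor = 1e-10  # assumption: guards against log(0) on silent frames
    log_e = [math.log(max(sum(s * s for s in f), floor)) for f in frames]
    peak = max(log_e)
    # subtract the maximum log energy from each frame, then add 1
    return [e - peak + 1.0 for e in log_e]


def feature_dim(num_mfcc=12, num_energy=1):
    """Static coefficients plus their delta and acceleration copies."""
    static = num_mfcc + num_energy
    return static * 3


window, hop = frame_params(16000)
energies = normalized_log_energy([[0.1, 0.2], [0.5, 0.5], [0.0, 0.0]])
```

At 16 kHz the 25 ms window covers 400 samples and the 10 ms hop covers 160 samples, and 13 static coefficients with delta and acceleration give the 39-dimensional vector.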
Acoustic Model (HMM) Details:
    - Context-independent, left-to-right 5-state HMMs.
    - 12 Gaussian mixtures per state with diagonal covariance.
    - Non-emitting first and last states.
    - Silence model includes skip states (ergodic between states 2 and 4); other models do not.
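
The topology above can be pictured as a mask of allowed transitions. The sketch below is illustrative only: it encodes which transitions exist, not their probabilities, and it assumes that "ergodic between states 2 and 4" means extra transitions in both directions between those two states. The function name `left_to_right_mask` is hypothetical.

```python
def left_to_right_mask(num_states=5, skips=()):
    """Allowed-transition mask; states 0 and num_states-1 are non-emitting."""
    last = num_states - 1
    mask = [[False] * num_states for _ in range(num_states)]
    mask[0][1] = True              # entry state -> first emitting state
    for s in range(1, last):
        mask[s][s] = True          # self-loop on each emitting state
        mask[s][s + 1] = True      # advance to the next state
    for src, dst in skips:
        mask[src][dst] = True      # extra transitions (silence model only)
    return mask


phone_mask = left_to_right_mask()
# The notes number states 1..5, so "2 and 4" become indices 1 and 3 here.
silence_mask = left_to_right_mask(skips=[(1, 3), (3, 1)])
```

With five states, three are emitting, so each phone model carries three mixture distributions of 12 diagonal-covariance Gaussians each.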

Training and Adaptation Databases

American English (AE): TIMIT database was used.
    - 462 speakers across 8 dialect regions in the training set.
    - 10 sentences per speaker.
    - Mapped 61 TIMIT phones to 11 GIE vowels, 6 diphthongs, and 5 broad classes (nasal, obstruent, etc.).
Hindi (TIFR): Standard Hindi speech database.
    - 100 native Hindi speakers (76 used for training).
    - 10 sentences per speaker.
    - 8 GIE vowels mapped (excluding /æ/ and /ɒ/ due to low token counts).
GIE Adaptation Data:
    - 12 "model" IE speakers (6 Male, 6 Female) identified by the absence of regional L1 accents.
    - College students from Mumbai with various L1s (Marathi, Hindi, Kannada, Punjabi).
    - 42 short sentences (3-5 seconds) per speaker, manually labeled at the phone level.
Specifications: All databases are formatted at 16 kHz sampling, 16-bit word length.

Experimental Model Combinations

C1: AE-models only: The 11 GIE vowels are treated as a subset of the 12 AE monophthongs.
C2: Adapted AE models: C1 models adapted using the 12 model IE speakers' data through Maximum a posteriori (MAP) adaptation.
C3: Bilingual models: A hybrid set comprising 8 Hindi vowel models and 3 AE vowel models.
C4: Adapted Bilingual models: C3 models adapted using the model IE speakers' data (MAP adaptation).
Broad Classes: To reduce mismatch and improve alignment, non-vowel phones were grouped into broad classes (semivowel, nasal, obstruent, voice-bar, and silence).
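
To illustrate how MAP adaptation (used for C2 and C4) pulls a model toward a small amount of new data, here is a sketch of the standard MAP update for a single Gaussian mean. The notes do not say which parameters were adapted or what prior weight was used, so mean-only adaptation and the value of `tau` are assumptions, and the function name is hypothetical.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP update of one Gaussian mean vector.

    new_mean = (tau * prior_mean + sum_t gamma_t * o_t) / (tau + sum_t gamma_t)
    where gamma_t is the occupancy of this Gaussian for frame o_t.
    """
    total_occ = sum(occupancies)
    dim = len(prior_mean)
    weighted = [0.0] * dim
    for frame, gamma in zip(frames, occupancies):
        for d in range(dim):
            weighted[d] += gamma * frame[d]
    return [(tau * prior_mean[d] + weighted[d]) / (tau + total_occ)
            for d in range(dim)]


# Made-up data: ten adaptation frames, all fully assigned to this Gaussian.
prior = [0.0, 0.0]
frames = [[1.0, 2.0]] * 10
adapted = map_adapt_mean(prior, frames, [1.0] * 10, tau=10.0)
```

With `tau` equal to the total occupancy, the adapted mean lands halfway between the prior mean and the data mean. With little adaptation data the prior dominates, which is why MAP suits a limited adaptation set.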

Test Datasets and Evaluation

Model-IE Test Set: 20 model Indian English speakers (different from the adaptation set) reading word lists.
Guj-IE Test Set: 16 Gujarati-L1 speakers, mostly schooled in Gujarati medium, exhibiting perceptible L1 influence and varied proficiency.
Word Lists: Concentrated on 11 words (one for each GIE vowel), mostly monosyllabic, to minimize insertion/deletion errors and focus on vowel substitution.
Ground Truth Assessment: Gujarati-L1 data was manually annotated to identify actual perceived vowels. Confusions identified include:
    - /i:/ perceived as /ɪ/ (17 instances).
    - /ɪ/ perceived as /i:/ (41 instances).
    - /u:/ perceived as /ʊ/ (17 instances).
    - /ʊ/ perceived as /u:/ (43 instances).
    - Significant internal confusions within the /e:, ɛ, æ/ and /o:, ɒ/ groups.

Summary of Quantitative Performance

Training Token Counts (Examples):
    - i: (beat): TIMIT 4595 | Hindi 1353
    - ɪ (bit): TIMIT 11479 | Hindi 1331
    - e: (gate): TIMIT 2266 | Hindi 2380
    - ʊ (put): TIMIT 495 | Hindi 788
Baseline Accuracies: Correct phone recognition was 71% for AE (on TIMIT test) and 74% for Hindi (on Hindi test).
Precision-Recall Metrics:
    - Used to evaluate the detection of mispronunciations, which are counted as True Negatives (TN).
    - Precision and Recall are computed by varying the GOP score threshold (expressed as a scaled standard deviation of the model-IE score distribution).
    - Recall (TN) formula:
    - $\text{Recall} = \frac{\text{TN}}{\text{TN} + \text{FP}}$
    - Precision (TN) formula:
    - $\text{Precision} = \frac{\text{TN}}{\text{TN} + \text{FN}}$
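
The two formulas can be evaluated at several thresholds to trace a precision-recall curve. The sketch below assumes the labeling convention common in GOP work, which the notes do not spell out: TN is a mispronounced phone that is flagged, FP is a mispronounced phone that is accepted, and FN is a correct phone that is flagged. The function name, scores, and labels are made up for illustration.

```python
def detection_metrics(gop_scores, is_mispronounced, threshold):
    """Precision and recall for mispronunciation detection at one threshold.

    Assumed convention: a phone is flagged when its GOP exceeds the threshold.
    TN = mispronounced and flagged, FP = mispronounced but accepted,
    FN = correct but flagged.
    """
    tn = fp = fn = 0
    for score, bad in zip(gop_scores, is_mispronounced):
        flagged = score > threshold
        if bad and flagged:
            tn += 1
        elif bad and not flagged:
            fp += 1
        elif not bad and flagged:
            fn += 1
    recall = tn / (tn + fp) if tn + fp else 0.0
    precision = tn / (tn + fn) if tn + fn else 0.0
    return precision, recall


# Made-up scores and human labels, not data from the study.
scores = [0.2, 0.4, 1.5, 2.0, 3.1, 0.9]
labels = [False, False, False, True, True, True]
curve = [(t, *detection_metrics(scores, labels, t)) for t in (0.5, 1.0, 2.5)]
```

Raising the threshold in this toy example trades recall for precision: fewer correct phones are flagged, but more mispronounced ones slip through.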

Key Results and Discussion Points

Scatter Plot Observations:
    - Ideally, model speakers should produce GOP scores near zero with low dispersion.
    - Model set C3 (bilingual, predominantly Hindi vowel models) showed less scatter than C1 (AE), suggesting that Hindi models are a better primary representative for IE.
    - Model set C2 (Adapted AE) showed significant improvement over C1, suggesting that even limited adaptation data can correct large mismatches in AE models.
    - The least dispersion for vowels /i:, u:, ʊ/ was found in C2. For /e:/, Hindi models (C3) were the most stable.
Mispronunciation Detection:
    - Adapted models (C2 and C4) outperformed non-adapted models (C1 and C3).
    - Precision values were generally low (~10%) because only "gross" mispronunciations (clear substitutions) were labeled as incorrect by human judges, while the system detected many milder mis-articulations.
    - Long-short vowel pairs (i:/ɪ; u:/ʊ) are difficult to label because non-native speakers often produce "intermediate" durations.
    - Detection is more successful for /ɛ - æ/ and /o: - ɒ/ due to more prominent phonemic differences.

Conclusion and Future Directions

Summary of Findings: The data source used for acoustic model training influences the GOP measure significantly. Bilingual models (AE + Hindi) with targeted adaptation provide a robust framework for assessing Indian English even without large native IE databases.
Proposed Future Work:
    - Incorporating Gujarati phone models directly to improve decoding accuracy for GOP.
    - Investigating discriminative training of acoustic models to better distinguish known confusion groups (i:/ɪ, etc.).
    - Collecting larger, more balanced datasets for pronunciation error detection research.
Acknowledgements: The researchers thanked Prof. Peri Bhaskararao for discussions on language phonologies.