Vowels and Formants

The speech signal combines a sound source with the vocal tract, which functions as a frequency-selective filter.
The variations in air pressure that make up the signal are shaped by this filter.
Certain frequencies get through the filter, while others are suppressed.
The filtering effect varies continuously during speech, which makes the signal complex.

Airstream (typically pulmonic) passes through the vibrating vocal folds.
This results in a complex wave that can be broken down into individual frequencies.
The spectrum displays frequency (x-axis) and amplitude/loudness/intensity (y-axis).
The vibrating glottis generates a buzz rich in harmonics (the source).
The vocal tract filters this source signal, allowing some harmonics to pass while suppressing others (the filter).
The resulting output spectrum shows peaks corresponding to formants, which are peaks of resonance.
Vocal tract resonates at particular frequencies, which change with the configuration of vocal tract organs.
Spacing of harmonics equals the fundamental frequency.
The lowest harmonic corresponds to the fundamental frequency.

Harmonics: Generated by the vibrating glottis (the source).
Formants: Bundles of frequencies corresponding to the resonating points of the vocal tract.
Harmonics depend on the fundamental frequency.
Formants depend on the resonating properties of the vocal tract.
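
The dependence of harmonics on the fundamental frequency can be sketched in a few lines of Python. This is a minimal illustration, assuming an idealized, perfectly periodic source; the function name `harmonics` and the example fundamentals are hypothetical and not taken from the lecture.

```python
def harmonics(f0_hz, max_hz):
    """Return harmonic frequencies (integer multiples of f0) up to max_hz."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    count = int(max_hz // f0_hz)
    return [n * f0_hz for n in range(1, count + 1)]


# Illustrative fundamentals: a lower and a higher voice.
low_voice = harmonics(100, 1000)
high_voice = harmonics(250, 1000)
print(low_voice)   # harmonics spaced 100 Hz apart
print(high_voice)  # harmonics spaced 250 Hz apart
```

Changing the fundamental changes the spacing of the harmonics, but not the resonances of the vocal tract, which is why harmonics and formants have to be kept apart.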

Laryngeal source (frequency domain: frequencies on x-axis, amplitude on y-axis).
Laryngeal source (time domain: airflow/air pressure fluctuations from vibrating glottis).
Filter response has resonance peaks whose spacing depends on the vocal tract shape.
Harmonics that fall near these peaks are emphasized, producing the formant peaks in the output spectrum.
Different vocal tract postures or settings for different vowels produce varying filters.
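
The combination of source and filter can be sketched in the frequency domain: each harmonic of the source is scaled by the filter response at its frequency. The sketch below is an assumption-laden illustration, not the lecture's own method. It assumes a source spectrum falling by 12 dB per octave, models each resonance as a simple pole pair, and uses made-up formant frequencies and bandwidths.

```python
import math


def source_amplitude(n):
    """Amplitude of the n-th harmonic, assuming a -12 dB/octave source roll-off."""
    return 1.0 / (n * n)


def resonance_gain(f_hz, formant_hz, bandwidth_hz):
    """Magnitude response of one resonance (a pole pair), with gain 1 at 0 Hz."""
    half_bw = bandwidth_hz / 2.0
    numerator = formant_hz ** 2 + half_bw ** 2
    denominator = math.sqrt((f_hz - formant_hz) ** 2 + half_bw ** 2) * math.sqrt(
        (f_hz + formant_hz) ** 2 + half_bw ** 2
    )
    return numerator / denominator


def filter_gain(f_hz, formants):
    """Overall filter response: the product of the individual resonances."""
    gain = 1.0
    for formant_hz, bandwidth_hz in formants:
        gain *= resonance_gain(f_hz, formant_hz, bandwidth_hz)
    return gain


def output_spectrum(f0_hz, formants, max_hz):
    """Map each harmonic frequency to its amplitude after filtering."""
    spectrum = {}
    n = 1
    while n * f0_hz <= max_hz:
        f_hz = n * f0_hz
        spectrum[f_hz] = source_amplitude(n) * filter_gain(f_hz, formants)
        n += 1
    return spectrum


# Hypothetical filter setting: (formant frequency, bandwidth) pairs in Hz.
FORMANTS = [(500, 80), (1500, 80), (2500, 80)]
spectrum = output_spectrum(100, FORMANTS, 3000)
for f_hz in (400, 500, 600):
    print(f_hz, round(20 * math.log10(spectrum[f_hz]), 1), "dB")
```

With these settings the harmonic at 500 Hz comes out stronger than its neighbours at 400 Hz and 600 Hz, even though the source alone would make it weaker than the 400 Hz harmonic. A different list of formants stands for a different vowel posture.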

Artificial speech created by combining the source and the filter.
Variations illustrate the changing pitch of the voice, resulting from changing vibrations of the vocal folds.
Generated source component with variations illustrating pitch changes.
Filter response.
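
A rough time-domain version of such artificial speech can be sketched as follows. This is only an illustration under stated assumptions: the source is a bare pulse train with a gliding pitch, the filter is a cascade of two-pole digital resonators of the kind used in formant synthesizers, and the sample rate, pitch range, formant frequencies and bandwidths are all hypothetical values.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)


def pulse_train(f0_start, f0_end, duration_s):
    """Source: one unit pulse per glottal cycle, with f0 gliding linearly."""
    total = round(duration_s * SAMPLE_RATE)
    samples = [0.0] * total
    phase = 0.0
    for i in range(total):
        f0 = f0_start + (f0_end - f0_start) * i / total
        phase += f0 / SAMPLE_RATE
        if phase >= 1.0:  # a new cycle starts here
            phase -= 1.0
            samples[i] = 1.0
    return samples


def resonator_coefficients(formant_hz, bandwidth_hz):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    period = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * period) * math.cos(
        2.0 * math.pi * formant_hz * period
    )
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, formant_hz, bandwidth_hz):
    """Filter: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coefficients(formant_hz, bandwidth_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(formants, f0_start=120.0, f0_end=90.0, duration_s=0.3):
    """Pass the pulse train through one resonator per formant, in cascade."""
    wave = pulse_train(f0_start, f0_end, duration_s)
    for formant_hz, bandwidth_hz in formants:
        wave = resonate(wave, formant_hz, bandwidth_hz)
    return wave


wave = synthesize([(500, 80), (1500, 90), (2500, 120)])
print(len(wave), "samples")
```

The pitch glide lives entirely in `pulse_train`, and the vowel quality entirely in the list of formants, which mirrors the separation of source and filter.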

Interested in locating vowels in acoustic space relative to one another.
Hertz (Hz) is the basic unit for measuring frequency.
Vowel space is continuous; vowels can vary continuously.
No fixed $f_1$ and $f_2$ values due to variability.
Variability depends on individual vocal tract size, context, language, stress, preceding/following consonants.
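
Locating vowels relative to one another can be sketched as a distance computation in the $f_1$/$f_2$ plane. The reference values below are hypothetical ballpark figures chosen only for illustration, and the plain distance in Hz is an assumption; perceptually motivated scales are often preferred in practice.

```python
import math

# Hypothetical ballpark (F1, F2) values in Hz for one adult speaker.
# Real values vary with speaker, language, stress and context.
VOWELS = {
    "i": (280, 2250),
    "ɑ": (710, 1100),
    "u": (310, 870),
}


def acoustic_distance(a, b):
    """Euclidean distance in Hz between two (F1, F2) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_vowel(f1_hz, f2_hz, references=VOWELS):
    """Label of the reference vowel closest to the measured (F1, F2) point."""
    point = (f1_hz, f2_hz)
    return min(references, key=lambda v: acoustic_distance(point, references[v]))


print(nearest_vowel(300, 2100))
```

Because the vowel space is continuous, a measured token rarely lands exactly on a reference point; the sketch only reports which reference it is closest to.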

Resonating frequencies of air in the vocal tract.
Peaks of resonance (bundles of frequencies).
Location depends on vocal tract configuration (tongue body position, lip rounding, etc.).
Vowel qualities are associated with articulatory properties.
Vowel sound contains multiple pitches or frequencies simultaneously.
Quality depends on overtone structure.
The main formant differences between vowels occur in $f_1$ and $f_2$.

Vocal tract can be modeled as a series of tubes, closed at the glottis and open at the other end (mouth).
A: Front cavity, C: Rear cavity, B: Area of maximum constriction.
Vocal fold vibration sets air in motion.
Formants for different vowels result from different vocal tract shapes and constrictions.
Example: /i/ (ee) has a shorter front cavity than /ɑ/ (ah).
Formant 1 ($f_1$) is related to the lower portion of the pharyngeal cavity (C).
Formant 2 ($f_2$) is related to the length of the front cavity (A).
Short front cavity leads to a higher $f_2$.
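
The link between tube length and resonance can be illustrated with the simplest case, a single uniform tube closed at one end and open at the other. This is a sketch under simplifying assumptions: real vowels need several connected tubes, and the speed of sound and the tube lengths used here are assumed textbook-style values, not measurements from the lecture.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate value for warm, moist air


def tube_resonances(length_cm, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    Quarter-wavelength rule: F_n = (2n - 1) * c / (4 * L).
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


print(tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
print(tube_resonances(14.0))  # [625.0, 1875.0, 3125.0]
```

Shortening the tube raises every resonance, which is the same relationship as a shorter front cavity giving a higher $f_2$.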

Can hear formants by tapping the throat while silently articulating different vowels (without voicing).
Resonance is low for close vowels and higher for open vowels.
Whispering vowels can also reveal major resonating frequency of the vocal tract.
High resonating frequency for close front vowels; lower for open/back vowels.
$f_3$ is important for languages with many front/close vowels, helping to distinguish rounding.

Resonances/overtones/formants are numbered from low to high: $f_1$, $f_2$, $f_3$.
Do not confuse with harmonics.
Vowels are classified by their first two formants ($f_1$ and $f_2$).

Spectrogram displays a series of spectra lined up in time.
Formant peaks appear as dark bands of energy (collections of frequencies).
Darkness indicates amplitude.
Vowels stand out from surrounding signals.
Narrowing of the signal in a waveform indicates a consonantal articulation (greater constriction).
Clear bands of energy in a spectrogram indicate vowel articulation.
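
The idea of a spectrogram as a series of spectra lined up in time can be sketched without any external library. This is a deliberately slow, minimal illustration: the frame length, hop size, Hann window and test tone are all assumptions, and real analyses would use a fast transform.

```python
import cmath
import math


def hann(n_samples):
    """Hann window, used to taper each frame before the transform."""
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n_samples - 1))
        for i in range(n_samples)
    ]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. N/2 (slow, but dependency-free)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags


def spectrogram(signal, frame_len, hop):
    """List of spectra, one per frame: time runs along the outer list."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(magnitude_spectrum(frame))
    return frames


# Test signal: a 1000 Hz sine sampled at 8000 Hz (both values are arbitrary).
RATE = 8000
tone = [math.sin(2.0 * math.pi * 1000 * t / RATE) for t in range(800)]
spec = spectrogram(tone, frame_len=80, hop=40)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(peak_bins[0] * RATE / 80, "Hz")
```

Each inner list is one spectrum; plotting the magnitudes as darkness, frame after frame, gives the familiar picture in which a steady resonance shows up as a horizontal band.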

The textbook labels here indicate the main acoustic properties of these vowels, including $f_1$, $f_2$, and $f_3$.
A blue line indicates an approximate measurement point, typically around the midpoint (50% of the vowel's duration), to minimize coarticulation effects with neighboring speech sounds.
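
Taking the measurement at the midpoint can be sketched as picking the formant sample closest to the middle of the vowel interval. The function name, the tuple layout and the track values below are hypothetical and made up for illustration; they do not come from the recording discussed here.

```python
def value_at_midpoint(track, start_s, end_s):
    """Return the formant sample closest to the temporal midpoint of a vowel.

    track: list of (time_s, f1_hz, f2_hz) tuples, e.g. exported formant tracks.
    start_s, end_s: vowel boundaries in seconds.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be later than start_s")
    inside = [row for row in track if start_s <= row[0] <= end_s]
    if not inside:
        raise ValueError("no formant samples inside the vowel interval")
    midpoint = start_s + 0.5 * (end_s - start_s)
    return min(inside, key=lambda row: abs(row[0] - midpoint))


# Made-up track for illustration only: F2 rises into the vowel and falls out of it.
track = [
    (0.10, 420, 1500),
    (0.12, 400, 1900),
    (0.14, 390, 2050),
    (0.16, 400, 1950),
    (0.18, 430, 1600),
]
print(value_at_midpoint(track, 0.10, 0.18))
```

The samples near the edges of the interval are the ones most affected by the neighboring consonants, so the middle sample is taken as the most representative of the vowel itself.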

Speaker physiology: Larger vocal tracts produce lower frequencies; smaller vocal tracts produce higher frequencies.
Language/dialect (e.g., American English vs. Australian English vowels).
Number of contrasting vowels in a language.
Stress: Unstressed vowels show undershoot (formant undershoot), centralizing towards schwa.
Casual speech: Vowel spaces shrink.
Consonant environment: Vowel-consonant and consonant-vowel coarticulation.

Source (vocal folds): Thicker vocal folds produce lower pitch (lower harmonics), thinner vocal folds produce higher pitch (higher harmonics).
Filter (vocal tract length): Longer vocal tracts have lower peaks of resonance, resulting in lower formants.
Children have shorter vocal tract lengths and thinner vocal folds, resulting in higher pitch and higher frequency values for $f_1$ and $f_2$.

The circle encloses the children's vowels, giving a visual representation of their vowel space.
An oval indicates the same for the biological males, so the two groups can be compared.
Children's vowels show higher frequencies, with a compressed vowel space.

Compare biological males and biological females for German lax vowels.
The dimensions of the female vowel space differ quite a bit from the male vowel space. Frequencies are higher in general, and the formant dimension indicated shows quite a lot of variation.

Example utterance: "Have these good soft shoes."
The red lines are automatically derived formant tracks.
Praat does a lot of the work for you. It identifies, hopefully, where the vowels are.
The tracks show that the first formant of the first vowel is relatively high compared to the next vowel ("e"). Why is that? Because the first vowel is lower (more open), and lower vowels have a higher $f_1$.

Averaged measurements: a typical $f_1$ is 860 Hz, with a corresponding $f_2$. These are the values that will be shown on the screen.

The male speaker actually exaggerated these vowels but the results were quite nice.