Prosody outweighs statistics in 6-month-old German-learning infants' speech segmentation
Introduction
Infants exhibit remarkable abilities in segmenting continuous speech into individual words, relying on various pre-lexical acoustic and statistical cues.
Two primary types of cues are utilized:
Prosodic cues: These include features like lexical stress, rhythm, and intonation, which can mark word boundaries or highlight prominent syllables within words. For instance, in German, a language with a strong stress-trochaic bias, words often begin with a stressed syllable, providing a potential marker for word onset.
Statistical cues: These refer to the distributional regularities of syllables in fluent speech, specifically transitional probabilities (TPs). High transitional probabilities typically occur within words (e.g., go followed by bu in gobu), while low transitional probabilities are often found at word boundaries (e.g., bu followed by ta if go bu and ta de are separate words).
German-learning infants, due to their exposure to a stress-trochaic biased language, might process and weight prosodic information differently compared to infants learning languages like English, which have a more varied stress system.
The main aim of this research, conducted across three distinct experiments, was to systematically compare the weighting given to prosodic cues versus statistical cues by 6-month-old German-learning infants when encountering fluent speech.
A key hypothesis driving this research was that prosody would dominate over transitional probability (TP) cues in this specific infant population. Furthermore, it was predicted that the absence of clear prosodic cues would lead to a lack of observable segmentation evidence, suggesting that TP cues alone might not be sufficient for segmentation in these young German learners.
Key Concepts
Prosodic cues: In this study, prosodic cues primarily refer to lexical stress patterns. German is characterized by a strong trochaic bias, meaning a tendency for words to begin with a stressed syllable (strong-weak pattern). The experiments manipulated this by using a conflicting iambic stress pattern (weak-strong) to test infants' sensitivity to specific prosodic information.
Statistical cues: These cues are based on transitional probabilities (TPs). A high TP indicates that one syllable reliably follows another within a word, while a low TP suggests a word boundary. For example, in the sequence gobutade, if gobu and tade are words, the TP from go to bu would be high, and the TP from bu to ta would be low.
Test languages/words: The study utilized controlled disyllabic sequences to create different word types:
Statistical words (high TP): These were specific disyllabic units (e.g., gobu, tade, bido, puda) which had high internal transitional probabilities (e.g., the probability of bu following go was 1.0) and low transitional probabilities between these units in the familiarization stream.
Prosodic words: These were built from the same syllable inventory as the statistical words but were defined by the stress pattern of the familiarization stream rather than by its TPs (e.g., buta, dego, each of which spans a boundary between two statistical words).
Non-words: These were syllable pairs that did not co-occur in the familiarization string, serving as novel sound sequences against which infants' preferences for familiarized items could be measured.
Test paradigm: The Head-turn Preference Procedure (HPP) was employed. This common method in infant psycholinguistics measures infants' relative listening times to different sound stimuli, indexed by how long they orient toward the sound source. Longer listening times can indicate either a novelty preference (the item stands out as new) or a familiarity preference (the item is recognized and preferred). The interpretation therefore depends on the experimental context and design, particularly the nature of the familiarization and test cues.
Three-test-conditions design (Experiment 1): This experiment directly pitted statistical words against prosodic words and non-words to assess which cue type held more influence. Experiment 2 served as a crucial control by omitting familiarization to rule out inherent biases. Experiment 3 isolated the role of TP cues by presenting familiarization streams with only statistical regularities, intentionally removing any prosodic variation.
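The transitional-probability notion used in these definitions can be made concrete with a short sketch. It assumes the standard definition, TP(x -> y) = count of x immediately followed by y, divided by the count of x. The ten-word toy stream below is hypothetical and far shorter than any real familiarization string; only the four word forms come from the text.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Return {(x, y): TP(x -> y)}, where TP = count(x followed by y) / count(x)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # The final syllable has no successor, so it is left out of the denominator.
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

# Hypothetical toy ordering of the four words named in the text.
words = ["gobu", "tade", "bido", "gobu", "puda",
         "tade", "gobu", "bido", "tade", "puda"]
stream = [w[i:i + 2] for w in words for i in (0, 2)]  # split into CV syllables

tps = transitional_probabilities(stream)
print(tps[("go", "bu")])  # within-word transition: 1.0
print(tps[("bu", "ta")])  # across a word boundary: 1/3 in this toy stream
```

In a stream this short, some boundary transitions can still reach 1.0 by accident, which is one reason familiarization strings repeat each word many times.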
Experiment 1: Prosody versus Statistics
Participants: The study involved an initial sample of healthy, full-term infants, aged approximately 6 to 7 months (mean age of months and days, range to ). All infants were monolingual German learners from families where German was the primary language. Additional infants were tested but excluded for reasons including fussiness, crying, or failing to complete the required number of trials.
Stimuli:
Familiarization String: This continuous speech stream was constructed from four repeating disyllabic sequences: gobu, tade, bido, puda. The sequence was designed to create clear statistical regularities.
Transitional Probabilities (TPs): Within each disyllabic unit (e.g., go to bu), the within-word TP was 1.0. Conversely, the between-word TPs (e.g., bu to ta) were intentionally low, ranging from to . This setup theoretically allows for segmentation based purely on statistical learning.
Prosodic Cue (Iambic Pattern): Crucially, the familiarization string utilized a consistent second-syllable stress pattern (iambic) across all disyllabic units (e.g., goBU, taDE). This specific iambic pattern was chosen because it directly conflicts with the natural trochaic bias of German. The conflict was designed to test whether infants would segment the stream according to its stress pattern, even though that pattern was atypical of their native-language input, or according to its statistical cues.
Frequency Control: To control for exposure frequency, two of the statistical words (gobu, tade) appeared more frequently (90 times each) than the other two (bido, puda, 45 times each). The test items drew on both sets: the statistical test words were the low-frequency words, while the prosodic test words (buta, dego) were formed from syllables of the high-frequency words.
Test Items: Following familiarization, infants were presented with three types of test items:
Statistical words: These were the puda and bido (low frequency) sequences, which should ideally be segmented if infants rely on the high TP cues embedded in the familiarization stream.
Prosodic words: For example, buta and dego. These were formed by segmenting the familiarization string across the boundaries of the statistical words, at the stressed syllables (e.g., goBUtaDEgo can be parsed as BUta and DEgo if each stressed syllable is treated as a word onset). They represented a segmentation driven by the stress pattern of the stream rather than by its TPs.
Non-words: Such as dabi and bide. These served as a baseline of completely novel sequences, allowing researchers to infer whether infants' longer looking times reflected familiarity (a preference for items extracted from the stream) or novelty (a preference for items not extracted from it).
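The competing parses described in the stimulus list can be illustrated with a minimal sketch. The word order below is a hypothetical deterministic construction, not the study's actual randomization. It only reproduces the properties stated in the text: 90 tokens each of gobu and tade, 45 each of bido and puda, second-syllable stress, and no occurrence of the non-words. The helper `segment` and the block names are illustrative.

```python
from collections import Counter

# Hypothetical ordering (an assumption, not the published stream): two six-word
# blocks alternate, giving 90 x gobu/tade and 45 x bido/puda over 45 blocks.
BLOCK_A = ["gobu", "tade", "bido", "tade", "gobu", "puda"]
BLOCK_B = ["tade", "gobu", "bido", "gobu", "tade", "puda"]
words = [w for i in range(45) for w in (BLOCK_A if i % 2 == 0 else BLOCK_B)]

# Each word contributes an unstressed then a stressed syllable (iambic pattern).
stream = []
for w in words:
    stream.append((w[:2], False))
    stream.append((w[2:], True))

syls = [s for s, _ in stream]
pair_counts = Counter(zip(syls, syls[1:]))
first_counts = Counter(syls[:-1])

def tp(x, y):
    """Transitional probability of y given the preceding syllable x."""
    return pair_counts[(x, y)] / first_counts[x]

def segment(stream, is_boundary):
    """Cut the stream wherever is_boundary(i) says a new unit starts at index i."""
    units, current = [], stream[0][0]
    for i in range(1, len(stream)):
        if is_boundary(i):
            units.append(current)
            current = ""
        current += stream[i][0]
    units.append(current)
    return Counter(units)

# Statistical parse: a unit starts wherever the incoming TP drops below 1.0.
tp_units = segment(stream, lambda i: tp(syls[i - 1], syls[i]) < 1.0)
# Stress-based parse: a unit starts at every stressed syllable.
stress_units = segment(stream, lambda i: stream[i][1])

print(tp_units)                                      # gobu 90, tade 90, bido 45, puda 45
print(stress_units["buta"], stress_units["dego"])    # 45 45
print(pair_counts[("da", "bi")], pair_counts[("bi", "de")])  # 0 0 (non-words never occur)
```

In this particular ordering, buta and dego each occur 45 times, the same as the low-frequency statistical words. Whether the study's stream matched these frequencies exactly is not stated in the text above.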
Procedure: The study employed the standard Head-turn Preference Procedure setup. Infants were seated on a parent's lap, facing a central light. Familiarization began when the infant oriented to the central light, which then extinguished, and side lights illuminated, playing a continuous familiarization string. Test trials followed a similar structure: a central light attracted attention, then extinguished, and one of two side lights (playing test words) illuminated. Infants' listening times were measured. The experiment consisted of 3 blocks with the order of word types counterbalanced across participants to prevent order effects. Each trial involved repetition of a single word type. The total experimental session was relatively short, lasting approximately 3–5 minutes, to maintain infant engagement.
Data analysis: Nonparametric Wilcoxon Signed-Rank tests were used for statistical comparisons, given the nature of the looking time data. The effect size was reported using Cliff's delta ($\delta$), which quantifies the magnitude of the difference between two groups independent of sample size. P-values () determined statistical significance.
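The two statistics named here can be sketched in plain Python. The looking times below are hypothetical values for six infants, chosen only to make the arithmetic easy to follow. They are not the study's data, and the exact-enumeration p-value is one common way to compute the test for small samples, not necessarily the procedure the authors used.

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

def signed_ranks(xs, ys):
    """Rank the non-zero |differences| (average ranks for ties), keeping each sign."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ordered = sorted(abs(d) for d in diffs)

    def rank(v):
        first = ordered.index(v) + 1               # 1-based position of first tie
        return first + (ordered.count(v) - 1) / 2  # mean position across ties

    return [(rank(abs(d)), d > 0) for d in diffs]

def wilcoxon_exact(xs, ys):
    """Return (W, two-sided exact p), with W = min(W+, W-), by enumerating all sign patterns."""
    ranked = signed_ranks(xs, ys)
    ranks = [r for r, _ in ranked]
    total = sum(ranks)
    w_plus = sum(r for r, positive in ranked if positive)
    w = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            hits += 1
    return w, hits / 2 ** len(ranks)

# Hypothetical looking times in seconds (illustration only).
prosodic = [5.0, 6.0, 4.5, 7.0, 5.5, 6.5]
statistical = [6.0, 8.0, 4.0, 10.0, 7.0, 9.0]

print(cliffs_delta(statistical, prosodic))    # 0.5
print(wilcoxon_exact(statistical, prosodic))  # (1.0, 0.0625)
```

The enumeration visits 2^n sign patterns, so it is only practical for small samples. For larger samples, a normal approximation or a statistics library is the usual choice.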
Results:
Mean Looking Times: Infants showed varying mean looking times for the different word types: prosodic words elicited the shortest looking times (approximately ), followed by statistical words (approximately ), and non-words (approximately ).
Proportion of Longest Looking Times: A detailed look revealed that 11 infants exhibited the longest looking times for non-words, another 11 infants showed the longest looking times for statistical words, while only 2 infants showed the longest looking times for prosodic words. This pattern suggests that prosodic words were perceived as the most familiar or least novel.
Significant Differences: Statistical analysis confirmed significant differences:
A significant difference was found between statistical words vs. prosodic words ($\delta \approx 0.22$). A significant difference was also found between non-words vs. prosodic words ($p=0.024$, $\delta \approx 0.19$), further supporting the unique processing of prosodic items.
Crucially, there was no significant difference between non-words vs. statistical words ($\delta \approx -0.05$).
Experiment 3: Statistical Cues Alone
Participants: $n=31$ German-learning infants, aged approximately 6–7 months, participated in this experiment.
Stimuli: A key manipulation here was the use of synthesized syllables (generated using MBROLA). Crucially, the familiarization string in Experiment 3 was designed to have no prosodic variation. All syllables were produced with flat intonation and equal duration, effectively removing prosodic cues while maintaining the identical transitional probability (TP) structure as in Experiment 1. This allowed for an isolated examination of infants' ability to segment speech based purely on statistical regularities.
Test labels: Consistent with the elimination of prosody, the test items were relabeled to reflect the focus on TP cues. They included TP words (puda, bido), non-words (dabi, bide), and what were now called part-words (buta, dego), as their definition was no longer tied to a conflicting prosodic pattern but rather to specific statistical breaks.
Procedure: The same Head-turn Preference Procedure used in Experiment 1 was employed.
Results: In stark contrast to Experiment 1, the results showed no significant differences among any of the conditions. That is, infants' looking times for TP words, non-words, and part-words were statistically indistinguishable.
Conclusion: The absence of any significant preference indicates that German 6-month-olds did not successfully segment the continuous speech based on transitional probability cues alone when explicit prosodic information was absent. This directly supports the main claim that prosody is a dominant cue for this population, and that in its absence, statistical cues are not sufficient for a clear segmentation outcome at this age.
General Discussion
The cumulative findings from these three experiments consistently demonstrate that German-learning 6-month-olds assign greater weight to prosodic cues than to statistical transitional probability (TP) cues when these cue types are in conflict during word segmentation. This highlights a clear cue weighting hierarchy in this specific developmental stage and language environment.
A significant outcome is the lack of any evidence for TP-based segmentation when prosody was absent (Experiment 3). This observation stands in marked contrast to findings reported for English-learning infants, who often exhibit a TP advantage and can segment words based on statistical cues alone, sometimes even at earlier ages.
Possible explanations for these crucial cross-linguistic differences:
Language-specific prosodic properties: German's strong trochaic dominance (preference for initial stress in words) may intrinsically bias German-learning infants towards attending to and utilizing prosodic information from an early age. Even the iambic (weak-strong) pattern used in Experiment 1, while conflicting with native trochaic bias, provided clear and consistent prosodic boundaries, which infants readily exploited. This suggests that the presence of clear prosodic cues, even unconventional ones, is highly salient.
Stimuli differences: The use of natural prosody in Experiment 1 (even if synthesized for specific stress patterns) might have significantly elevated the salience of prosodic cues, making them perceptually stronger than the purely statistical cues derived from the same segments. In contrast, studies showing TP advantages often use highly simplified, monotone, or synthesized speech, which might inadvertently diminish prosodic salience, thereby enhancing the apparent influence of statistics. Natural, rich acoustic cues might inherently promote prosody-based segmentation.
Developmental trajectories: It is possible that German infants acquire prosody-based segmentation strategies earlier in their development, relying less on statistical cues during the initial stages of word learning. Alternatively, the efficacy of transitional probabilities might be less language-general than previously assumed, with its utility for segmentation being heavily dependent on the specific phonological and prosodic characteristics of the input language and the developmental stage of the learner.
Implications: These findings caution against assuming that TP-based word segmentation is a universal or primary bootstrapping mechanism for all infants and all languages. The study suggests that cue weighting for speech segmentation is language-specific and experience-dependent, shaped by the acoustic-phonetic properties of the native-language input from very early in development.
Additional notes: The methodological rigor of the study is enhanced by its three-test-conditions design, which clarified the direction and strength of cue weighting. The inclusion of non-word references was essential for correctly interpreting the meaning of longer or shorter listening preferences (i.e., whether they indicated novelty or familiarity). All data and materials used in the experiments are openly available, promoting transparency and reproducibility in developmental research.
Tables and Figures (referenced in text)
Table 1: Provides detailed acoustic properties, such as duration, intensity, and fundamental frequency contours, of stressed versus unstressed syllables utilized in the familiarization string across all experiments.
Table 2: Presents the specific acoustic properties of the syllables (e.g., phonetic features, average formant values) as they appeared within the various test words (statistical, prosodic, non-words).
Table 3: Illustrates the possible ways the continuous familiarization string could be segmented, explicitly contrasting the boundaries predicted by TP-based segmentation versus those predicted by prosodic-based segmentation.
Table 4: Summarizes the key properties of the test words, including their transitional probabilities (TPs), relative frequencies within the familiarization stream, and the specific stress patterns induced (or absent) for each word type.
Figure 1: Presents a bar graph depicting the mean looking times of infants in Experiment 1 for statistical words, prosodic words, and non-words, along with error bars and significance indicators.
Figure 2: Displays a similar bar graph for Experiment 2, illustrating mean looking times by word type in the absence of familiarization.
Figure 3: Shows the mean looking times for TP words, non-words, and part-words in Experiment 3, where only statistical cues were available during familiarization.