PSYC513/703: Language and Communication - Spoken Word Recognition

Variability in acoustic waveforms occurs due to:
- Speaker rate
- Intonation
- Noise and distortion
- Accent

Phones are a representation of speech production, not perception.
- Allophones: Different phones that are perceptually equivalent in a language.
  - Example: /p/ in "pin" and "spin", /t/ in "night rate" and "nitrate".
Phonemes: Set of phones that are cognitively equivalent.
- Phoneme = phone for me.
- Basic unit that distinguishes words.
- A change in phoneme will produce a different word or a nonsense word.
- Minimal pairs: /p/ and /b/ distinguish “pin” and “bin”.
- Require a perceptual distinction
- Language specific
- British English has 44 phonemes: 20 vowels, 24 consonants
- Abstract unit

Phoneme perception is not simply a passive bottom-up process.
A phone can be perceived as one or another phoneme depending upon context.
Ganong effect (Ganong, 1980):
- Categorical boundaries.
- Voice onset time.
- Modulated by context.
- Example: on a DA - TA continuum, an ambiguous onset tends to be heard as /d/ in DASH and as /t/ in TASK.
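The boundary shift described above can be sketched as a logistic identification function over voice onset time whose category boundary moves with lexical context. This is only an illustration: the function names, the 30 ms boundary, the slope and the 8 ms shift are all assumed values, not figures from Ganong (1980).

```python
import math

BASE_BOUNDARY_MS = 30.0  # assumed /d/-/t/ boundary, for illustration only
SLOPE = 0.3              # assumed steepness of the category boundary

def p_hear_d(vot_ms, lexical_shift_ms=0.0):
    """Probability of reporting /d/ (short VOT) rather than /t/ (long VOT).

    A positive lexical_shift_ms moves the boundary towards longer VOTs,
    so more of the continuum is heard as /d/ (e.g. before -ASH, DASH is a word).
    A negative shift favours /t/ (e.g. before -ASK, TASK is a word).
    """
    boundary = BASE_BOUNDARY_MS + lexical_shift_ms
    return 1.0 / (1.0 + math.exp(SLOPE * (vot_ms - boundary)))

ambiguous_vot = 30.0
neutral = p_hear_d(ambiguous_vot)             # no lexical bias
dash_context = p_hear_d(ambiguous_vot, +8.0)  # context favours /d/
task_context = p_hear_d(ambiguous_vot, -8.0)  # context favours /t/
```

The same acoustic token is labelled differently depending on context, while tokens at the ends of the continuum are barely affected.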

Top-down effects (McGurk & MacDonald, 1976):
- The effects of visual speech perception on the audio stream.
- Multimodal speech perception.
- Interference caused by seeing one phoneme and hearing another.

Top-down effects are evidence that speech perception is an active rather than a passive process.
Combination of top-down and bottom-up information.
Allows us to overcome imperfect speech input.

Continuous speech signal:
- Dynamic waveform caused by continuous movement of articulators.
- Boundaries not evident in the speech signal.
- Word boundaries.

Evidence of continuous attempts at word segmentation of the speech stream.
Cross-modal priming
- Hear sentences
- Respond to written words
- Lexical decision task
Primes:
- The scientist made a new discovery last year
- The scientist made a novel discovery last year
Target
- Nudist primed by: The scientist made a new discovery last year
Priming effect caused by temporary segmentation error
No report of the perception of the word ‘Nudist’ in the prime
Evidence of continuous segmentation attempts
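A rough sketch of why "new discovery" can briefly activate "nudist": if the listener tries to match word onsets at every point in the phoneme stream, a word can be partially matched across a word boundary. The toy transcriptions, the `min_overlap` threshold and the function names are assumptions made for illustration, not part of the original study.

```python
# Toy, simplified transcriptions (assumed for illustration only).
LEXICON = {
    "new": ("n", "j", "u"),
    "novel": ("n", "o", "v", "l"),
    "discovery": ("d", "i", "s", "k", "v", "r", "i"),
    "nudist": ("n", "j", "u", "d", "i", "s", "t"),
}

def transcribe(words):
    """Concatenate word transcriptions into one stream with no boundaries."""
    stream = []
    for w in words:
        stream.extend(LEXICON[w])
    return stream

def partial_matches(stream, min_overlap=5):
    """Map each word to the longest run of its onset phonemes found
    anywhere in the stream, keeping runs of at least min_overlap."""
    hits = {}
    for start in range(len(stream)):
        for word, phones in LEXICON.items():
            n = 0
            while (n < len(phones) and start + n < len(stream)
                   and stream[start + n] == phones[n]):
                n += 1
            if n >= min_overlap:
                hits[word] = max(hits.get(word, 0), n)
    return hits

new_hits = partial_matches(transcribe(["new", "discovery"]))
novel_hits = partial_matches(transcribe(["novel", "discovery"]))
```

With these toy transcriptions, "nudist" is partially matched only in the "new discovery" stream, which parallels the priming pattern described above.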

Simple Theory
- Match string of letters/phonemes/syllables to a word in the lexicon
- Search
- Organisation of dictionary
- [Figure: a lexical entry for "house" with acoustic, phonemic (/h/ /au/ /s/) and semantic information, shown among other lexicon entries: happy, jaw, sad, table, how]

ACCESS STAGE (perceptual representation used to activate lexical items, thus generating a candidate set of items – the cohort)
SELECTION STAGE (the most likely candidate is chosen from cohort)
INTEGRATION STAGE (in which the semantic and syntactic properties of the chosen words are utilized)
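The access and selection stages can be sketched as follows, assuming letters stand in for phonemes and that "most likely" means "most frequent". The lexicon, the frequency counts and the function names are hypothetical; the integration stage is not modelled.

```python
# Hypothetical lexicon: word -> made-up frequency count.
LEXICON = {
    "trespass": 5, "tress": 1, "trestle": 1,
    "treasure": 40, "tree": 120, "trap": 30, "table": 200,
}

def access(heard_so_far, lexicon):
    """Access stage: every word consistent with the input so far (the cohort)."""
    return {w for w in lexicon if w.startswith(heard_so_far)}

def select(cohort, lexicon):
    """Selection stage: pick the most likely candidate (here: most frequent)."""
    if not cohort:
        return None
    return max(sorted(cohort), key=lambda w: lexicon[w])

def narrow(word, lexicon):
    """Feed the word one segment at a time and record the shrinking cohort,
    stopping as soon as a single candidate remains."""
    steps = []
    for i in range(1, len(word) + 1):
        cohort = access(word[:i], lexicon)
        steps.append((word[:i], sorted(cohort)))
        if len(cohort) == 1:
            break
    return steps

steps = narrow("trespass", LEXICON)
best_guess_at_tre = select(access("tre", LEXICON), LEXICON)
```

In this toy lexicon the cohort for "trespass" shrinks to one candidate at "tresp", before the end of the word, and a frequency-based guess is already available earlier.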

Word recognition is fast
- Shadowing and word-monitoring tasks: latencies of 250-275 msec
- Intuitively immediate: words are recognized before the end of the word is reached, at the uniqueness point or even before
Evidence from gating (Grosjean, 1980)
- Listeners are presented with fragments of a word of gradually increasing duration: t - tr - tre - tres - tresp - trespa
- The point at which the listener correctly guesses the whole word is called the isolation point
Average recognition times
- Out of context: 300-350ms
- In context: 200ms
Top-down effects
- Ganong, Phonemic restoration, McGurk etc.
Speed and robustness depend on words being heard in context
- sentence --> word context effects
System actively seeks matches to input - does not wait for complete match

Press a button when a presented stimulus is a real word:
- Words vs non-words
- Example: spinach (word) vs splinger (non-word)
  - Fast response = easy access (e.g. 400 ms)
  - Slow response = hard access (e.g. 500 ms)

Word Length
Word frequency
- High frequency words = common words (“cat, mother, house”)
- Low frequency words = uncommon words (“accordion, compass”)
Uniqueness point
- early uniqueness point = strawberry (there are no other English words beginning with "strawb")
- late uniqueness point = blackberry (not unique at /b/ of berry; blackbird, blackbeetle,…)
Neighbourhood
- yacht (FAST) vs peach (SLOW)
  - Both are high-frequency words
  - peach has lots of high-frequency neighbours (e.g. reach, peace, beach, pea)
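Two of the factors above, uniqueness point and neighbourhood, are easy to compute over a toy lexicon. In this sketch a neighbour is assumed to be a word that differs by one phoneme substitution, addition or deletion. The word lists, simplified transcriptions, frequency counts and function names are all hypothetical, and letters stand in for phonemes in the uniqueness-point example.

```python
SPELLINGS = ["straw", "strap", "strawberry",
             "black", "blackbird", "blackbeetle", "blackberry"]

def uniqueness_point(word, lexicon):
    """1-based position at which no other lexicon entry shares the onset;
    None if the word never becomes unique (it is the onset of another word)."""
    for i in range(1, len(word) + 1):
        if not any(o.startswith(word[:i]) for o in lexicon if o != word):
            return i
    return None

# Toy transcriptions and made-up frequency counts.
PHONES = {
    "peach": ("p", "i", "ch"), "reach": ("r", "i", "ch"),
    "beach": ("b", "i", "ch"), "peace": ("p", "i", "s"),
    "pea": ("p", "i"), "yacht": ("y", "o", "t"), "hot": ("h", "o", "t"),
}
FREQ = {"peach": 20, "reach": 90, "beach": 60, "peace": 80,
        "pea": 10, "yacht": 20, "hot": 100}

def is_neighbour(a, b):
    """True if a and b differ by exactly one substitution, addition or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) == 1:
        short, long_ = (a, b) if len(a) < len(b) else (b, a)
        return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))
    return False

def neighbours(word):
    return sorted(w for w in PHONES if is_neighbour(PHONES[word], PHONES[w]))

def neighbour_frequency(word):
    """Summed frequency of all neighbours (one simple weighting scheme)."""
    return sum(FREQ[w] for w in neighbours(word))

up_strawberry = uniqueness_point("strawberry", SPELLINGS)
up_blackberry = uniqueness_point("blackberry", SPELLINGS)
```

In this toy lexicon "strawberry" becomes unique earlier than "blackberry", and "peach" has more neighbours, with more summed frequency, than "yacht".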

Not robust to distortion of initial phonemes
- e.g. “shigarette”
- Ganong effect for initial, as well as non-initial, phonemes
Lexical decision latencies are sensitive to word frequency and frequency-weighted neighbourhood, not merely to cohort size.
- Marslen-Wilson: auditory lexical decision task with word pairs with matched uniqueness points
- e.g. DIFFIC|ULT (high frequency, 250 ms) vs DIFFID|ENT (low frequency, 379 ms)
Requires segmentation (i.e., location of word onset) before word identification can begin
Not robust to segmentation errors
- e.g. "The sky is falling" vs "This guy is falling"

TRACE has three sets of interconnected detectors
- Feature detectors
- Phoneme detectors
- Word detectors
Within a set (or level) connections are inhibitory
- e.g. evidence that a certain stretch of the input is the word “tip” is evidence that it is NOT any other word
Between sets (or levels) connections are excitatory
- E.g. evidence that a certain stretch of the input is the sound /t/ is evidence that it might be the beginning of the word “tip”
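A heavily simplified sketch of this kind of interactive activation, loosely in the spirit of TRACE: phoneme units excite the words that contain them, word units inhibit each other, and words feed activation back to their phonemes. It is not the TRACE model itself: the feature level, time slices and phoneme-level inhibition are omitted, and every parameter value, name and input below is an assumption chosen for illustration.

```python
WORDS = {"lick": ("l", "i", "k"), "lip": ("l", "i", "p"), "lad": ("l", "a", "d")}
EXCITE, INHIBIT, DECAY, FEEDBACK = 0.1, 0.2, 0.1, 0.1  # assumed values

def clip(x):
    """Keep activations in the range 0..1."""
    return max(0.0, min(1.0, x))

def run(bottom_up, steps=30):
    """bottom_up: one dict per position mapping phoneme -> acoustic evidence."""
    phon = [dict(slot) for slot in bottom_up]
    word = {w: 0.0 for w in WORDS}
    for _ in range(steps):
        new_word = {}
        for w, phones in WORDS.items():
            # Between-level excitation: phonemes support words containing them.
            support = sum(phon[pos].get(ph, 0.0) for pos, ph in enumerate(phones))
            # Within-level inhibition: evidence for one word counts against the rest.
            rivals = sum(a for v, a in word.items() if v != w)
            new_word[w] = clip(word[w] + EXCITE * support
                               - INHIBIT * rivals - DECAY * word[w])
        word = new_word
        # Top-down feedback: active words boost their own phonemes.
        for pos, slot in enumerate(bottom_up):
            candidates = {phones[pos] for phones in WORDS.values()} | set(slot)
            for ph in candidates:
                top_down = sum(word[w] for w, phones in WORDS.items()
                               if phones[pos] == ph)
                phon[pos][ph] = clip(slot.get(ph, 0.0) + FEEDBACK * top_down)
    return word, phon

# Clear /l/ and /i/, then a final sound that is somewhat more /k/-like than /p/-like.
word_act, phon_act = run([{"l": 1.0}, {"i": 1.0}, {"k": 0.6, "p": 0.4}])
winner = max(word_act, key=word_act.get)
```

With these settings "lick" ends up most active, "lip" is partially suppressed, "lad" is driven to zero, and feedback raises the /k/ unit above its bottom-up evidence, which is the kind of lexical influence on phoneme identification that the model is meant to capture.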

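[Figure: TRACE network. The speech signal feeds feature detectors, phoneme detectors (/l/, /d/, /k/, /a/, /p/, /i/) and word detectors (lick, lad, lip, fat), linked by excitatory (+) and inhibitory (-) connections.]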

Stimulus: LICK LIP
Activation --> Competition --> Selection/Recognition (e.g. Luce et al. 1990, Norris 1994)

TRACE is broadly compatible with lexical effects on phoneme identification, explaining them in terms of feedback from the lexical level to the phonemic level
- Ganong effect
- Phonemic Restoration Effect
TRACE recognizes words even if the initial phoneme is distorted or ambiguous
Can find word boundaries
Problems…
- requires massive duplication of units and connections, copying over and over again the connection patterns that determine which features activate which phonemes and which phonemes activate which words