Lecture 13 - Speech Perception


Miller & Isard (1963): Examples of Normal, Semantically incorrect (syntax ok), and Semantically and syntactically incorrect sentences WITH and WITHOUT noise!

(1) *Normal*: "Gadgets simplify work around the house." 89% correct; 63% when presented in noise ~top-down knowledge plays a role in how well you can parse speech, so these are the easiest to shadow~ (2) *Semantically incorrect (syntax ok)*: "Hunters simplify motorists across the hive." 79% ~not bad~; 22% when presented in noise ~a big hit!~ (3) *Semantically and syntactically incorrect*: "On trains have elephants the simplify." 56% ~really hard, because removing all the top-down knowledge makes it hard to parse what's happening~; 3% when presented in noise ~basically incapacitates you from shadowing sentences at all~

We have all this variability to overcome; what are the solutions?

- 'Invariances' in speech perception can guide speech recognition despite all the variability - Instead of focusing on differences, focus on similarities... - The spectrograms look similar to some degree, and those similarities can be exploited ---> acoustic consistencies help you differentiate between the different possible phonemes; for instance, whether a child or a man is speaking, the 1st formant is lowest for a given sound and the higher formants sit above it

Different people saying same vowel sound

- Everyone has a unique 'fingerprint' associated with their vowel sounds... e.g., less energy at low frequencies and more at higher ones - Think of this in terms of timbre!

In English there are about ____-____ phonemes: how many vowels and how many consonants?

- In English there are about *40-47* phonemes: • 13 major vowel sounds • 24 major consonant sounds

Speech Perception and the Brain: Broca's aphasia

- Individuals have damage in Broca's area (in the frontal lobe) - Labored and stilted speech and short sentences, but they understand others - Affected people often omit small words such as "is," "and," and "the." - One of the first times neurologists found someone with a very specific deficit in speech production together with a very focal brain lesion in a very circumscribed region of the brain - "tono tono tono tono" example: the patient understands others! Not a problem with speech comprehension but rather with speech production

Formant transitions

- Rapid changes in frequency preceding or following consonants - This is what the discontinuities in consonants are called - Sharp, rapid transitions!

Meaning and Word Perception: Experiment by Miller and Isard

- Stimuli were three types of sentences: • Normal grammatical sentences • Anomalous sentences that were grammatical (but semantically meaningless) • Ungrammatical strings of words (violated rules of syntax and didn't have meaningful semantics) - Listeners were to *shadow* (repeat aloud) the sentences as they heard them through headphones

The Sound Spectrogram: Pure stimulus sweep

- A pure stimulus that sweeps from low frequency to mid-range frequency, and does so in a continuous manner through time ---> starts low at the beginning and sweeps up in frequency over time
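The sweep described above can be sketched in a few lines of code. This is a minimal illustration (assuming Python with numpy); the sample rate and the 200-2000 Hz range are arbitrary demo choices, not values from the lecture:

```python
# Illustrative sketch of a linear frequency sweep ('chirp') like the
# pure stimulus on the spectrogram slide. The sample rate and the
# 200-2000 Hz range are arbitrary choices for this demo.
import numpy as np

def linear_chirp(f_start, f_end, duration, sample_rate=16000):
    """Sine wave whose instantaneous frequency rises linearly from
    f_start to f_end over `duration` seconds."""
    t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)
    k = (f_end - f_start) / duration  # sweep rate in Hz per second
    # Phase = 2*pi * integral of instantaneous frequency f(t) = f_start + k*t
    phase = 2.0 * np.pi * (f_start * t + 0.5 * k * t ** 2)
    return np.sin(phase)

signal = linear_chirp(200.0, 2000.0, 1.0)  # 1-second low-to-mid sweep
```

A spectrogram of `signal` would show a single rising line, in contrast to the rich, multi-formant pattern of a human voice doing the same sweep.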

UCSD & Microsoft program decoding voicemails

- Tries to decipher voicemails so you can read them - Not so great either: even this cutting-edge program that UCSD is paying for still has a terrible time deciphering speech

Nasal cavities play a role in the _________ part of people's voices

ECHOEY

The variability problem 2) Variability across different speakers

Speakers differ in pitch, accent, speaking speed, and pronunciation - e.g., the prof vs. his neighbor saying "Ollie, come here"

Speech Perception is NOT what? It IS what?

Speech Perception is NOT inherently unimodal! it IS multimodal

Variability problem: different talkers

Men vs. children: full-grown speakers have lower frequencies - The separation between the 1st, 2nd, and 3rd formants also differs; there is less distance between the 1st and 2nd in a lot of these cases - Overall, there is a fair amount of variability in the spectral composition of the different speech sounds strictly as a function of age, because things like head size induce variability

Basic Units of Speech: Phoneme

The smallest unit of speech that, when changed, changes the meaning of a word

The Acoustic Signal

• Produced by air that is pushed up from the lungs through the vocal cords and into the vocal tract

Cognitive Dimensions of Speech Perception (another way to overcome variability)

• Top-down processing/prior experience, including knowledge a listener has about a language, affects perception of the incoming speech stimulus • Segmentation is affected by context and meaning - Analogy: think of the Charlie Chaplin hollow-mask illusion, or the pictures of Mars and the painting the prof showed in which you can immediately see faces

The Acoustic Signal: how are vowels produced

• Vowels are produced by vibration of the vocal cords and changes in the shape of the vocal tract • These changes in shape cause changes in the resonant frequency and produce peaks in air pressure at a number of frequencies called formants • You do this by manipulating where your tongue is in your mouth as air passes out

Demonstration: Why does it sound funny?

- Backward sounds contain sounds that aren't normal (English) phonemes. - We can't hear or produce these sounds properly. - Think about how this relates to trying to speak a foreign language.

How are consonants produced?

- Consonants are produced by a constriction of the vocal tract - Sudden onset with lots of intensity at the beginning of the word, then a blank space where there is no energy, then another light burst of energy - Instead of the continuous energy that is characteristic of vowel sounds, consonant sounds have a lot more discontinuities in them... e.g., "hit"

Categorical Perception: Why is this important?

- Despite the continuous variation of VOT, we basically only hear one phoneme or the other (experiments done using computerized speech)
- Demonstrates that the auditory system is simplifying input to filter out much of the complexity
- Think "da" and "ta": we can manipulate the onset time and slowly vary it, then ask people what they perceive. We find that there isn't a slow transition from hearing one thing to the other; instead, your brain is doing categorical perception, collapsing the continuous variability and saying it's either "ta" or "da", but never halfway in between
- Your brain is imposing order on ambiguous sounds! If voice onset time is short, you will always categorically perceive "da", and once you move beyond about 40 ms you will interpret anything above this as "ta", even though the underlying change is continuous!
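The collapse of a continuous VOT dimension onto two categories can be sketched as a toy function. This is an illustration only, assuming Python; the 40 ms boundary is the value mentioned in the notes, and `perceive` is a hypothetical helper, not a model from the lecture:

```python
# Toy sketch of categorical perception of voice onset time (VOT).
# The 40 ms boundary comes from the lecture notes; `perceive` is a
# hypothetical helper for illustration only.
VOT_BOUNDARY_MS = 40

def perceive(vot_ms):
    """Collapse a continuous VOT value into one of two phoneme categories."""
    return "da" if vot_ms <= VOT_BOUNDARY_MS else "ta"

# A continuum of stimuli from 0 ms to 90 ms in 10 ms steps:
continuum = [perceive(v) for v in range(0, 100, 10)]
# Perception flips abruptly at the boundary instead of blending
```

The key point the sketch captures: the input varies smoothly, but the output is always exactly one of two labels, never a mixture.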

The Sound Spectrogram: Human voice

- A human voice doing a 'frequency sweep' - The resonant frequencies, or 'formants', make it a lot richer: there is a lot more complexity to a human voice (or any real object creating sound) than to a pure tone ramping up in frequency

Meaning and Phoneme Perception: Experiment by Warren

- Listeners heard a sentence that had a phoneme covered by a cough
- The task was to state where in the sentence the cough occurred
- Listeners could not correctly identify the position, and they also did not notice that a phoneme was missing
- Implies that they were filling in the gap via top-down knowledge of likely sentence structure; this is *called the phonemic restoration effect* (it helps us parse different speech sounds)
- Subjects have very little trouble understanding the sentence even though one of the phonemes has been removed, but if someone asks where the cough was in the sentence, people are terrible at telling you. So not only are people interpolating across the gap, they almost completely ignore that the disruption happened. Lots of top-down knowledge about semantics!

Meaning and Phoneme Perception: Experiment by Turvey and VanGelder

- Short words (sin, bat, and leg) and short nonwords (jum, baf, and teg) were presented to listeners - The task was to press a button as quickly as possible upon hearing a target phoneme (e.g., if the "ah" sound were your target, you would be much faster at detecting it in "bat" than in "baf") - On average, listeners were faster with words (580 ms) than nonwords (631 ms): if you are dealing with familiar words and know their semantic meaning, you process them much faster than words you don't have experience with!

The Relationship between Phonemes and the Acoustic Signal (and why computers are pretty bad at this)

- The *variability problem* - there is no simple correspondence between the acoustic signal and individual phonemes ---> *Coarticulation* - overlap between articulation of neighboring phonemes causes variation ---> Variability comes from a phoneme's context, the speaker, etc

The segmentation problem

- There are no physical breaks in the continuous acoustic signal. - Must use top-down knowledge to disambiguate signals, e.g., "chew it": there isn't a distinct break between the "chew" and the "it", so you could never figure out what someone was saying based only on the structure of the spectrogram, because there's no way to draw a line and say here is the "chew" and here's the "it"

Formants

- Vowel sounds are caused by the resonant frequencies of the vocal tract and produce peaks in pressure at a number of frequencies called *formants* - The first formant has the lowest frequency, the second has the next highest, etc.
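As a rough illustration of the formant ordering, the snippet below tabulates approximate F1/F2 values for three vowels. These are illustrative textbook-style averages for adult male speakers, not numbers taken from this lecture:

```python
# Rough illustration of formant structure: each vowel has characteristic
# resonance peaks, with F1 lowest and F2 above it. The Hz values below
# are approximate averages for adult male speakers (illustrative only).
FORMANTS_HZ = {
    "i (heed)": (270, 2290),
    "a (hod)": (730, 1090),
    "u (hood)": (300, 870),
}

def first_formant_is_lowest(formants):
    """Check that F1 < F2 for every vowel, per the ordering rule above."""
    return all(f1 < f2 for f1, f2 in formants.values())
```

Notice that while the absolute frequencies differ across vowels (and would shift for children vs. men), the ordering F1 < F2 holds throughout, which is the kind of invariance discussed earlier.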

Vowel sounds & formants

- Vowel sounds tend to have very regular spacing of formants, and they roughly onset and offset in a steady/systematic manner - steady across time!

The variability problem: 1) Coarticulation 1) How long does it take to produce a syllable? 2) While we speak, what's moving and how rapidly? 3) How does the brain coordinate these movements? 4) This is known as? 5) And this is an important part of what, that enables us to?

1) It takes only about a fifth of a second to produce a syllable. 2) While we speak, we move the lips, tongue and jaw quite rapidly. 3) The brain coordinates these movements in a very ingenious way, such that movements needed for adjacent vowels and consonants are produced nearly simultaneously. 4) This is known as coarticulation, and ensures that speech is produced very smoothly. 5) Coarticulation is thus also a very important part of the speech code that enables us to communicate at about five syllables a second.

The variability problem: Different pronunciations have the same meaning, but _______ _______ ________

3) Different pronunciations have the same meaning, but *very different spectrograms*

Solution to Variability: Categorical Perception An Example?

An example of this comes from experiments on voice onset time (VOT) - the time delay between when a sound starts coming out of the mouth and when voicing begins/onsets - Stimuli are "do" (VOT of 17 ms) and "to" (VOT of 91 ms) [or sometimes they use "da" and "ta"] ----> The delay between when the sound begins and the onset of the vocal cords ----> Distinguishes between 'to' vs. 'do', etc. (voice onset time is the only characteristic that varies strongly - the delay between the onset of the first sound and the onset of voicing differs between the two phonemes)

Experience Dependent Plasticity: By adulthood, we are '________' to recognize and produce only a subset of possible sounds. Demonstration?

By adulthood, we are '*tuned*' to recognize and produce only a subset of possible sounds Demonstration: 1) Record your voice 2) Play it backwards 3) Imitate and record the backward sounds 4) Play that backwards

You're doing most of the changing of the sounds that come out of your mouth to change these different phonemes by doing what?

By manipulating where your tongue is in your mouth as you pass air out of your mouth!

Phoneme's are to auditory domain as __________ are to visual domain

Geons! They are the building blocks of the visual scene, just as phonemes are the building blocks of speech

Solution to Variability: Categorical Perception Why does this help simplify perception?

Helps simplify perception because it collapses the near-infinite variability that we encounter across different speakers and conditions into a finite set of possible phonemes

Powerpoint and segmenting words

NOT good at segmenting words! It's okay when you go slow, but when you speak fast it gets really confused

Phonemes in other languages?

The number of phonemes in other languages varies: 11 in Hawaiian and 60 in some African dialects - the point is that they are all used in the same way, just combined in different patterns to produce the speech sounds inherent in each language!

The segmentation problem: what does this all imply?

The fact that you can easily resolve each word, despite the continuous nature of the signal, implies that top-down knowledge of word/sentence structure is guiding perception - e.g., "I owe you a yo-yo" is a continuous signal across time that is totally uninterrupted!!

Solution to Variability: Categorical Perception

This occurs when a wide range of acoustic cues (patterns of input) results in the perception of a limited number of sound categories

Wernicke's area/Wernicke's aphasia

Wernicke's area: discovered a few years after Broca's area was, in a similar way (a specific place in the brain) - Wernicke's aphasia: individuals have damage in Wernicke's area (in the temporal lobe) - They speak fluently but the content is disorganized and not meaningful - They also have difficulty understanding others - When trying to say "The dog needs to go out so I will take him for a walk," a patient produced: "You know that smoodle pinkered and that I want to get him round and take care of him like you want before" (word salad)

Why do other foreign languages sound like gibberish?

Everything is run together, and you don't have any top-down knowledge to help you break those different sounds/phonemes down into different words

General aphasia, induced by TMS

A figure-8 coil delivers a magnetic pulse, zapping the subject right above where they think Broca's area is in the frontal cortex -- his job is to sit there and read a passage out of a book, and when you hear a click, that's the TMS -- it instantly turns speech on and off

The variability problem: (1) coarticulation

Overlap between articulation of neighboring phonemes: 'd' looks different depending on the vowel sound that follows it ---> e.g., "di" vs. "du": the first part of each is not identical even though they are both d's... we make the d sound based on what we know will follow it. A computer has to be on a mission to figure out what the sound at the beginning of each of the letters is, to know that the first D should be perceived/decoded the same as the other D

Experience Dependent Plasticity

• Before age 1, human infants can tell the difference between the sounds that are used to create all possible languages
• The brain becomes "tuned" to respond best to speech sounds that are in the environment
• Other sound differentiation disappears when there is no reinforcement from the environment (if you never hear certain sounds, you lose the ability to process them)
- This is why it's way more difficult to learn a language later in life! For people who grew up in Japan, for example, "list" and "wrist" are really hard to distinguish. In preferential-looking studies, children born in Japan can tell the difference between "list" and "wrist", but after about a year that ability begins to disappear, because there is no distinction between those two sounds in the language they commonly hear. Everyone is born as a blank slate, but then you begin to lose/prune away your ability to distinguish phonemes that you don't come into contact with on a daily basis. This is also why it is so hard to get rid of an accent!

The variability problem: 1) Coarticulation For example, suppose you say the word "happy" (5 steps, plus: how long does the whole word take to utter?)

• Before you say anything, you will have moved your tongue into position for a • Then, while you are saying h, it will sound a bit like a • While you are saying a, you will also be closing your lips for pp • While your lips are together for pp (occlusion), you will be moving your tongue to where you want it for y • Finally, in order to say y, you will be opening your lips after pp • The whole word will usually be uttered in less than half a second

Auditory-visual speech perception - The McGurk effect

• Visual stimulus shows a speaker saying "ga-ga"
• Auditory stimulus has a speaker saying "ba-ba"
• An observer watching and listening hears "da-da", which is the midpoint between "ga" and "ba"
• An observer with eyes closed will hear "ba"
- Like the visual-auditory interactions we have talked about, there are a lot of areas in the brain, primarily around Wernicke's area, that are sensitive to both auditory and visual inputs and try to combine the two into a consistent sensory percept. When you play tricks like this, the neurons trying to combine those two pieces of information do the best they can, BUT they will often misestimate what the auditory signal is, because the visual input overrides or interferes with the auditory input!

