Speech Acoustics Notes

R. Port

March 24, 2008

1. Speech Spectra and Waves

Speech sounds can be represented graphically either as a waveform or as a sound spectrogram. Waveforms are harder to interpret because they contain structure at a very broad range of time scales, so a sound spectrogram is usually easier to read. This page reviews the most prominent features of sound spectrograms.

Sound Spectrogram: a graphic display of the frequency components that add together to make a sound, shown over an interval of time (up to 2 seconds or so).
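
A display of this kind can be thought of as a sequence of short-time spectra laid side by side. The sketch below is a minimal illustration in plain Python, not the method used to produce the textbook figures. The function names, the Hann window, the 8000 Hz sampling rate and the two-tone test signal are all assumptions made for the example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the DFT of one frame, for bins 0 .. N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(signal, frame_len, hop):
    """One magnitude spectrum per frame; time runs along the outer list."""
    # Hann window (an assumed choice) to reduce leakage between bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames


def tone(freq_hz, n_samples, rate_hz):
    """A pure sine tone, used here only as a stand-in for real speech."""
    return [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(n_samples)]


rate = 8000  # samples per second (assumed)
signal = tone(500, 400, rate) + tone(1500, 400, rate)

spec = spectrogram(signal, frame_len=80, hop=80)
bin_hz = rate / 80  # frequency spacing between DFT bins
# Frequency of the strongest component in each frame.
peaks = [bin_hz * max(range(len(m)), key=m.__getitem__) for m in spec]
print(peaks)
```

Each entry of `spec` corresponds to one vertical slice of a spectrogram; plotting the magnitudes as darkness against time and frequency gives the familiar picture.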

Prominent Features: See Figure 8.3 (p. 186) in the Ladefoged text (edition 5). Note in the wide-band spectrogram: formant 1 (F1), formant 2 (F2), formant 3 (F3). In Figs 8.7 and 8.8 (pp. 192-193), note the stop closures and aspiration intervals (voice-onset time). In Fig 8.10, note the fricative noise and the vocal fold pulses (or, loosely, "voicing"). In Fig 8.16 (p. 203), note the wide-band and narrow-band spectrograms of the same utterance. In the narrow-band spectrogram, note formants 1-3 (now harder to see), stop closure intervals, fricative noise and harmonics of the fundamental frequency (which indicate the F0 pattern).
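
Why harmonics show up only in the narrow-band display can be sketched with a common rule of thumb: the analysis bandwidth is roughly the reciprocal of the analysis window duration. The code below is an illustration under that assumption. The function names, the 3 ms and 25 ms window durations and the 100 Hz F0 are illustrative choices, not values from the notes or the textbook.

```python
def analysis_bandwidth_hz(window_s):
    """Rule-of-thumb bandwidth for an analysis window of the given duration."""
    return 1.0 / window_s


def resolves_harmonics(window_s, f0_hz):
    """Harmonics are spaced F0 apart, so they separate only if the
    analysis bandwidth is narrower than F0."""
    return analysis_bandwidth_hz(window_s) < f0_hz


f0 = 100.0  # assumed fundamental frequency in Hz
for label, window_s in [("wide-band", 0.003), ("narrow-band", 0.025)]:
    bw = analysis_bandwidth_hz(window_s)
    print(label, round(bw), "Hz bandwidth, harmonics resolved:",
          resolves_harmonics(window_s, f0))
```

A short window blurs the harmonics together but is short enough in time to show individual vocal fold pulses; a long window does the reverse.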

2. Vowels. Regions of strong harmonic energy (with visible glottal pulsing). There are two useful graphs you need to know to understand vowel basics. The positions of F1 and F2 are the main variables.

Frequency X Time Graph. Look at Figs 8.2, 8.3 and 8.4. If we pronounce the peripheral vowels slowly in order from [i] to [a] to [u], then F1 starts low (for [i]), rises to a maximum for [a] and then falls back to a low value for [u]. Over the same series, F2 begins at its maximum value for [i] and falls monotonically through [a] to its lowest value for [u]. Remember this image. You will be asked to draw it eventually.
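
The shape of the two curves can be checked against numbers. The sketch below uses approximate adult male averages of the kind commonly cited from Peterson and Barney (1952); these values are an assumption brought in for illustration and do not come from these notes or from the textbook figures.

```python
# (vowel, F1 in Hz, F2 in Hz): approximate, assumed values
VOWELS = [("i", 270, 2290), ("a", 730, 1090), ("u", 300, 870)]

f1 = [row[1] for row in VOWELS]
f2 = [row[2] for row in VOWELS]

# F1: low for [i], maximum for [a], low again for [u].
f1_peaks_at_a = f1[0] < f1[1] > f1[2]
# F2: falls at every step from [i] through [a] to [u].
f2_falls = all(a > b for a, b in zip(f2, f2[1:]))

for name, first, second in VOWELS:
    print(f"[{name}]  F1 = {first:4d} Hz   F2 = {second:4d} Hz")
print("F1 rises then falls:", f1_peaks_at_a)
print("F2 falls monotonically:", f2_falls)
```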

F1 X F2 Graph (Fig 8.5). If F1 is plotted against F2 on a plane, then [i], [a] and [u] occupy the corners of a triangle. All the other vowels lie within this triangle. If you rotate the axes the right way (F1 increasing downward, F2 increasing to the left), this triangle resembles the triangle of the auditory vowel space and the articulatory space.
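
The axis rotation amounts to flipping the sign of both formants before plotting. The sketch below assumes the usual convention (F2 on the horizontal axis decreasing to the right, F1 on the vertical axis increasing downward) and uses approximate, assumed formant values of the kind commonly cited for adult male speakers; neither the function name nor the numbers come from the notes.

```python
# vowel -> (F1 in Hz, F2 in Hz): approximate, assumed values
FORMANTS = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}


def to_chart(f1_hz, f2_hz):
    """Map (F1, F2) to ordinary plot coordinates (x to the right, y upward)
    so that high front vowels land at the upper left."""
    return (-f2_hz, -f1_hz)


chart = {vowel: to_chart(f1, f2) for vowel, (f1, f2) in FORMANTS.items()}

for vowel, (x, y) in chart.items():
    print(f"[{vowel}]  x = {x:6d}   y = {y:5d}")
```

With these coordinates [i] sits at the upper left, [u] at the upper right and [a] at the bottom, which matches the layout of the traditional vowel chart.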

Diphthongization. See Fig 8.17 (p. 204). Note that most English vowels are diphthongized to some degree, so the steady-state descriptions above are only approximate.

3. Stops: temporal regions with low energy (since the vocal tract is stopped) -- usually 50-120 ms in duration.

Voicing. In principle, vowels and consonants show glottal pulsing when voiced and no pulsing when voiceless. But in English the distinction is more complicated, since timing plays a major role: a longer preceding vowel (V) and a shorter consonant (C) for voiced, a shorter V and a longer C for voiceless.
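
One way to express the timing cue as a single number is the ratio of consonant closure duration to preceding vowel duration. The sketch below is only an illustration of that idea: the function name is hypothetical and the durations are made-up values chosen to fit the description above, not measurements.

```python
def closure_to_vowel_ratio(vowel_ms, closure_ms):
    """Closure duration divided by preceding vowel duration."""
    return closure_ms / vowel_ms


# Made-up illustrative durations in milliseconds (not measured data).
voiced_token = {"vowel_ms": 160, "closure_ms": 60}      # longer V, shorter C
voiceless_token = {"vowel_ms": 120, "closure_ms": 100}  # shorter V, longer C

voiced_ratio = closure_to_vowel_ratio(**voiced_token)
voiceless_ratio = closure_to_vowel_ratio(**voiceless_token)

print("voiced C/V ratio:   ", round(voiced_ratio, 3))
print("voiceless C/V ratio:", round(voiceless_ratio, 3))
```

A smaller ratio points toward a voiced stop and a larger one toward a voiceless stop; where the boundary falls is not specified in these notes, so no threshold is assumed here.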

Place of Articulation. Place cues are subtle on sound spectrograms. See Fig 8.7 and 8.8 (p. 192-193).

Bilabial - Locus of 2nd and 3rd formants comparatively low.
Alveolar - Locus of 2nd formant about 1700-1800 Hz.
Velar - Usually high locus of 2nd formant. Common origin of 2nd and 3rd formant transitions.
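
The three rules of thumb above can be written as a lookup on the second-formant locus. This is a deliberately crude sketch: the function name is hypothetical, the only numbers taken from the notes are the 1700-1800 Hz alveolar band, and treating everything below that band as bilabial and everything above it as velar is a simplifying assumption. Real place cues are subtler and depend on the neighboring vowel.

```python
def place_from_f2_locus(f2_locus_hz, low_hz=1700.0, high_hz=1800.0):
    """Guess place of articulation from the F2 locus alone (toy rule)."""
    if f2_locus_hz < low_hz:
        return "bilabial"   # comparatively low locus
    if f2_locus_hz <= high_hz:
        return "alveolar"   # locus about 1700-1800 Hz
    return "velar"          # usually high locus


for locus in (900.0, 1750.0, 2500.0):
    print(locus, "Hz ->", place_from_f2_locus(locus))
```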

4. Fricatives. When a narrow constriction is made and air is forced through it under high pressure (due to a closed velum and a contracting chest cavity), the airflow becomes turbulent in the constriction, producing sound at a wide range of frequencies.

Place of Articulation. See Figs 8.9 and 8.10. The [f] has a broad spectrum and weak noise energy (since there is no resonating tube). The fricatives [s] and [z] have very high-pitched energy (above 3.5 kHz). The other fricatives have a center frequency related to the length of the cavity in front of the constriction: the longer the cavity, the lower the pitch. For [sh], the energy peak lies between F2 and F3.
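
The relation between front-cavity length and pitch can be sketched with a simple tube model. The code below assumes the front cavity behaves like a uniform tube closed at the constriction and open at the lips, so that its lowest resonance is the quarter-wavelength frequency c / (4L). This model, the speed of sound value and the cavity lengths are assumptions for illustration and are not stated in the notes.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate value for warm, moist air


def front_cavity_resonance_hz(length_cm):
    """Lowest resonance of a tube closed at one end: c / (4 * L)."""
    return SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)


# Illustrative cavity lengths in cm (assumed, not measured).
for length_cm in (1.5, 2.0, 3.0, 4.0):
    print(f"{length_cm:.1f} cm -> {front_cavity_resonance_hz(length_cm):.0f} Hz")
```

Under these assumptions a 2 cm front cavity resonates at about 4.4 kHz, and lengthening the cavity lowers the resonance, in line with the description above.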

Voicing. 1. Fricatives have weaker acoustic energy when voiced than when voiceless (due to lower air flow). 2. The other cues resemble those of the stops.

5. Nasals. Nasal stops usually look weaker in energy than vowels and are usually steady-state. Nasalization of vowels is difficult to see unless you are directly comparing the same vowel with and without nasalization.

6. Glides. The glides and resonants show strong formants sweeping up or down. English [r] has F3 dipping quite low (usually below 2 kHz). The [l] usually has F2 low but F3 high, while [w] has both F2 and F3 lowered.
