Physics of Speech

Once set in vibratory motion, the vocal folds create a series of movements in the vocal tract above the larynx (the rate at which the vocal folds vibrate is recognized as the fundamental frequency of the sound). An object will amplify vibrations [2] that are close to its own natural frequencies. In speech, some frequencies are dampened while others are enhanced, in accordance with the resonant properties of the tract (its cavities and tissues).

The discussion of resonance in speech production leads us to a well-known theory of vowel production: the source-filter theory, which postulates that the vocal folds are the source of the sound; after the sound is made, it passes through a filter shaped by the vocal tract cavities (Ladefoged, Elements 103). This filter is “frequency-selective and constantly modifies the spectral characteristics of sound sources during articulation” (Clark, Introduction 217), changing shape as the articulators move (218).

However, the vocal tract is not the only filter involved. After the sound leaves the vocal tract, it is modified by the air in the “outside world” [3], and it is also affected by the physical properties of the head, which “functions as a … reflecting surface … [,] a spherical baffle of about 9 cm radius” (Clark, Introduction 221).

The currently accepted theory [1] of phonation is the aerodynamic myoelastic theory: the creation of sound is explained by taking into account aerodynamic forces, muscle activity, tissue elasticity and “the mechanically complex nature of the vocal fold tissue structure” (Clark, Introduction 37).

The speech mechanism in vowels can be described by a model that uses the physical properties of tubes. [4] A tube is a simple apparatus that, when attached to a source of sound, can emit harmonic frequencies [5]. With a sound speaker attached at one end, the tube acts as a resonator that “has an infinite number of resonances, located at frequencies given by odd-quarter wavelength” (Kent and Read 14). The resonant frequencies of a tube closed at one end are calculated with the following formula (Johnson 96):

Fn = (2n – 1)c/4L

where n is an integer, L is the length of the tube and c is the speed of sound (about 35,000 cm/sec). This formula is derived from the definition of frequency (f) as the speed of sound (c) divided by the wavelength [6] (λ):

f = c/λ

A tube is an approximation of the shape of the vocal tract, from the larynx to the lips. The acoustic energy is supplied by the vocal cords, which are located at the lower, closed end of the apparatus. This model is used to calculate average resonant frequencies in a configuration of the vocal tract that has a “uniform cross-sectional area” (Kent and Read 15), as in the vowel schwa [ə] (see Johnson 97). Of course, this is an idealised and simplified representation, but it is useful because in this example “the configuration of the vocal tract approximates a parallel-side tube … closed at one end (the larynx) and open at the other (the lips)” (Clark, Introduction 218).

As an example, we can insert L = 17.5 cm, the average length of the human vocal tract [7] from glottis to lips (Kent and Read 15), into the formula. In this case the first formant, or the first resonance frequency, occurs at 500 Hz, the second at 1500 Hz, the third at 2500 Hz, and so on. Stevens cites Goldstine’s estimate of the vocal tract, stating that the average length in females is 14.1 cm (25). The calculated results for this sample length are then F1 = 620.5 Hz, F2 = 1861.7 Hz and F3 = 3102.8 Hz (more about formant calculation/synthesis).
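These values can be reproduced with a short script; this is an illustrative sketch (the function name and rounding are mine), using c = 35,000 cm/s as in the formula above:

```python
C = 35_000  # speed of sound in cm/s, as in the text

def tube_resonances(length_cm, n_formants=3):
    """First resonances (Hz) of a tube closed at one end: Fn = (2n - 1)c / 4L."""
    return [(2 * n - 1) * C / (4 * length_cm) for n in range(1, n_formants + 1)]

print(tube_resonances(17.5))                         # [500.0, 1500.0, 2500.0]
print([round(f, 1) for f in tube_resonances(14.1)])  # [620.6, 1861.7, 3102.8]
```

The 17.5 cm tube gives the familiar 500/1500/2500 Hz pattern; the 14.1 cm female average gives the higher formants cited above (620.5 Hz in the text is the same value truncated rather than rounded).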

However, this neutral position of the vocal tract can account for only a small number of sounds. Extended, the model of vowel production becomes more complicated, but it explains the basic physics behind vowel production. For example, in the pronunciation of the back vowel /ɑ/, the tongue divides the vocal tract and makes two tubes above the larynx. The first tube extends from the pharynx down to the glottis, where it is closed, and the second from the pharynx to the lips – and the tubes are roughly the same length (Ladefoged, Elements 123). The resonant frequency of each idealised tube will be double the resonant frequency of the whole tube, since each is half its length. If we take our example of 14.1 cm for females and enter the halved length into the formula, the first resonant frequency will be at about 1241 Hz, which is (for the sake of convenience) also the frequency of the second tube. However, the air outside the mouth cavity will interact with the sound, and the configuration of the pharynx will affect the first tube’s frequencies, which means that one resonance will be lower and the other higher – resembling the results measured in samples of the spoken vowel.
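The doubling can be checked with the same quarter-wavelength formula (an illustrative Python fragment, assuming two equal tubes of half the 14.1 cm length):

```python
C = 35_000  # speed of sound, cm/s

def closed_tube_f1(length_cm):
    """Lowest resonance (Hz) of a tube closed at one end: F1 = c / 4L."""
    return C / (4 * length_cm)

print(round(closed_tube_f1(14.1), 1))      # 620.6: the whole tract
print(round(closed_tube_f1(14.1 / 2), 1))  # 1241.1: a half-length tube, double the above
```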

In other vowels the configuration of the tract becomes even more complex, because the tongue moves and changes the shape of the cavities, introducing other models, such as the Helmholtz resonator. For example, in the production of [i], the tongue makes a small-diameter constriction between the tubes, in which a volume of air significantly contributes to the overall “shape” of the vowel. This volume of air must be taken into account when calculating the frequencies of the tubes (126).
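A Helmholtz resonator’s frequency is given by the standard formula f = (c/2π)·√(A/(V·l)), where A is the cross-sectional area of the constriction (the neck), l its length and V the volume of the back cavity. The sketch below uses purely illustrative values, not measurements from the cited sources:

```python
import math

C = 35_000  # speed of sound, cm/s

def helmholtz_frequency(neck_area_cm2, neck_length_cm, cavity_volume_cm3):
    """Resonant frequency (Hz) of a Helmholtz resonator: f = (c / 2*pi) * sqrt(A / (V * l))."""
    return (C / (2 * math.pi)) * math.sqrt(
        neck_area_cm2 / (cavity_volume_cm3 * neck_length_cm))

# Illustrative (assumed) values for a narrow constriction over a back cavity:
print(round(helmholtz_frequency(0.3, 2.0, 60.0), 1))  # 278.5 Hz
```

A frequency in this region is plausible for the low first formant of a close vowel such as [i], which is why the constriction’s air volume cannot be ignored.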

Although simplified, the calculations from acoustic theories provide strong evidence in favour of the working principles, the proof being the general correlation between the calculated and the measured results (Kent and Read 22).

[1] According to Clark, in his book published in 1990.

[2] In speech the vibration usually originates in the vocal folds, but when a person is unable to produce sound with the vocal folds, usually because of illness, other means can be employed (Pinker, Instincts 165).

[3] This is the “radiation factor/impedance”, a filter that intensifies high frequencies by 6 dB per octave. Within the pulse coming from the vocal folds, spectral peaks decrease by about 12 dB per octave. Thus, “these two … factors account for a -6dB/octave slope in the output spectrum” (Ladefoged, Elements 105). Such a sharp fall of the energy peak also means that “the intensity of the harmonics falls quite rapidly at high frequencies” (Clark, Introduction 212). It is then logical that most of the significant data in a sound signal is below 5,000 Hz, given that the upper hearing limit in humans is 20,000 Hz.
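The combined effect of the two slopes can be sketched numerically (an illustrative Python fragment; the function name is mine):

```python
def relative_level_db(octaves_above_f0):
    """Approximate output level in dB relative to the fundamental.

    Glottal source: about -12 dB/octave; lip radiation: about +6 dB/octave;
    combined: about -6 dB/octave in the output spectrum.
    """
    source_slope = -12    # dB per octave
    radiation_slope = +6  # dB per octave
    return (source_slope + radiation_slope) * octaves_above_f0

for octave in range(5):
    print(f"{octave} octaves above f0: {relative_level_db(octave)} dB")
```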

[4] In the 1960s Fant devised “nomographs” – diagrams that can be used to calculate the first four formants by using “the lengths of the resonators and their cross-sectional areas” (Clark, Introduction 222). The “nomographs” are quite famous in the history of acoustic research, but we will not expound them in detail in this paper. However, it is worth noting that “the two tube representation is only a crude approximation of the complex resonant cavity system of the human vocal tract during vowel production” (222).

[5] Mathematically, harmonics are the integer multiples of the fundamental frequency. Harmonics that correspond to the natural frequencies of the object are the resonance frequencies. Various parts of the vocal tract act as resonators, so some frequencies of the sound are enhanced or dampened by the resonant properties of the tissues and vocal cavities. The enhanced frequencies of the sound are called formants, and they are visible in the spectrogram as dark bands.
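A tiny illustration (the numbers are assumptions, not measurements): harmonics are integer multiples of f0, and those falling near a resonance centre are the ones enhanced:

```python
f0 = 110         # Hz, an assumed fundamental frequency
formant = 500    # Hz, an assumed formant (resonance) centre
bandwidth = 100  # Hz, an assumed effective bandwidth

harmonics = [n * f0 for n in range(1, 10)]  # integer multiples of f0
enhanced = [h for h in harmonics if abs(h - formant) <= bandwidth / 2]
print(harmonics)  # [110, 220, 330, 440, 550, 660, 770, 880, 990]
print(enhanced)   # [550]: the harmonic boosted by this resonance
```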

[6] “The distance, measured in the direction of propagation, between two points in the same phase in consecutive cycles of a wave. Symbol: λ” (Trask, Dictionary 1995). Ladefoged (Elements 115) gives an insightful example: if a sound has a frequency of 350 Hz, it will be heard for 1 s at a distance of 350 m, since sound propagates at about 350 m/s (in common conditions); in this case the wavelength of the sound is 1 m (there are 350 peaks of sound along the 350 m distance).
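Ladefoged’s example can be verified directly from f = c/λ (a trivial check in Python):

```python
c = 35_000  # speed of sound in cm/s, as used in the text
f = 350     # Hz

wavelength_cm = c / f
print(wavelength_cm)  # 100.0 cm, i.e. 1 m
```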

[7] Lass gives 15 cm as an average distance in males (Lass, Experimental 33). Clark (Introduction) provides insightful results reached by Pickett: “The length of a woman’s tract is about 80–90 per cent of a man’s, while child’s, depending on age, may be around 50 per cent of a man’s” (219).

This post is based on a draft for one of the introductory chapters in my paper.

Sound (Related to Speech)

Sound is a form of energy (Crystal 32). It is a series of pressure fluctuations in a medium (Johnson 4). In speech the medium is usually air, although sound can also propagate through solid objects and water, for example. Once the air particles are energised by the vibration of the vocal folds, a series of rarefaction and compression events begins. Compression occurs when particles are shifted closer to each other, which increases the density within the medium. Rarefaction is the opposite: particles move apart, so the density in the medium decreases.

Compression, rarefaction, and other terms related to acoustics are often explained through a simple device – a pendulum. A pendulum, or a swing, is “a weight hung from a fixed point so that it can swing freely” (Oxford Dictionary). Once set in motion it will oscillate between two maximum points and its central, equilibrium, position.

A simple pendulum with minimum, maximum and equilibrium points

Here is a graphical representation of a pendulum. The point E is the equilibrium, while the points M1 and M2 mark the maximum points on both sides of the pendulum. The swinging motion from E to M1, then back to E and up to M2, can be shown in the coordinate system as a sinusoid. The figure shows such a sinusoid, with a series of maximum and minimum swinging points. The crossing point of the sinusoid and the line shows the phase in oscillation when the pendulum reaches its starting point E. Particles do not travel through a medium; instead, they create a propagating pressure fluctuation: “A sound wave is a travelling pressure fluctuation that propagates through any medium that is elastic enough to allow molecules to crowd together and move apart” (Johnson 3). In other words, while each particle moves back and forth and acts “like the bob of a pendulum … the waves of compression move steadily outward” (Ladefoged, Elements 8). Here is an animation of the air molecules in a sound wave propagation.

Combined, a pendulum and a sinusoid illustrate the properties of sound waves, and they help explain the terminology related to the physics of speech. For example, the distance between points E and M1 (or E and M2) is the amplitude. It shows the maximum oscillation points of the particles or, in sound, “the extent of maximum variation in air pressure” (Ladefoged, Elements 14). A pendulum’s period (or cycle) is the trajectory from E to M1, to M2 and back to E. The number of such periods in a second is the frequency, measured in hertz (Hz). A pendulum with one oscillation per second has a frequency of 1 Hz (equation 1). A sound of 100 Hz has an identifiable part that repeats once in a hundredth of a second.

1 Hz = 1/s
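The relation between period and frequency can be sketched as (an illustrative Python fragment):

```python
def frequency_hz(period_s):
    """Frequency (Hz) is the reciprocal of the period (seconds)."""
    return 1 / period_s

print(frequency_hz(1.0))   # 1.0: one oscillation per second is 1 Hz
print(frequency_hz(0.01))  # 100.0: a 100 Hz sound repeats every hundredth of a second
```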

The energy of a sound wave depends on the force that created it: the greater the energy used in making the sound wave, the greater the pressure level it creates in the medium. The energy of a sound wave is related to its amplitude: a very strong wave will have a large amplitude, and vice versa. The sound pressure, or intensity, is measured in decibels (dB).

The human ear is very sensitive to pressure variations, with a range estimated at 10^13 units of intensity (Crystal 36). For easier reference, a logarithmic scale is used. Thus, a ratio of 10^13 is scaled to 130 dB (36).
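The scaling from an intensity ratio of 10^13 to 130 dB follows from the decibel definition for intensity, 10·log10(ratio) (a quick check in Python):

```python
import math

intensity_ratio = 10 ** 13  # the ear's estimated intensity range
level_db = 10 * math.log10(intensity_ratio)
print(level_db)  # 130.0
```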

A simple sinusoid below is an abstraction of a simple periodic sine wave. For its description, three items are needed: amplitude, frequency and phase [1] (Johnson 7). From the picture we see that the frequency of the sound is 1 per unit of time, while the amplitude reaches its peaks at 2 and -2 on the vertical scale. Unlike simple periodic waves, complex periodic waves “are composed of at least two sine waves” (8). One such complex wave has a pressure oscillation (an amplitude) that results from the pressure oscillations of at least two waves (Ladefoged, Elements 37) and, of course, from the phases of the waves involved. Every complex wave can be seen as composed of several simple waves, and the merit of such a model is that “any complex waveform can be decomposed into a set of sine waves having particular frequencies, amplitudes and phase relations” (Johnson 11). The process of “breaking a complex wave down into its sinusoidal components” (Clark 203) is well known in physics and is called Fourier analysis, named after the scientist who “developed its mathematical basis” (203) in the 19th century.
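As a stdlib-only sketch (my own illustrative code, not from the cited sources), a complex wave built from two sine components can be decomposed again with a discrete Fourier transform, recovering the amplitude of each component:

```python
import math

# One second of a complex wave: a 100 Hz sine plus a weaker 300 Hz sine.
RATE = 1000  # samples per second
signal = [math.sin(2 * math.pi * 100 * t / RATE) +
          0.5 * math.sin(2 * math.pi * 300 * t / RATE) for t in range(RATE)]

def dft_magnitude(samples, freq_hz):
    """Amplitude of the sinusoidal component at freq_hz (freq must divide RATE)."""
    n_samples = len(samples)
    k = freq_hz * n_samples // RATE  # DFT bin index for this frequency
    re = sum(s * math.cos(2 * math.pi * k * n / n_samples) for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * n / n_samples) for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n_samples

print(round(dft_magnitude(signal, 100), 2))  # 1.0
print(round(dft_magnitude(signal, 300), 2))  # 0.5
print(round(dft_magnitude(signal, 200), 2))  # 0.0: no 200 Hz component present
```

The analysis finds exactly the two components the wave was built from, and nothing at the frequency that was never added, which is the essence of Fourier decomposition.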

A sinusoid graph
A sinusoid with equilibrium, maximum and minimum points corresponding to the pendulum movements

The second group of waves is aperiodic waves, characterised by the lack of a repetitive pattern. Two types of waves are grouped under the term aperiodic: white noise and transients. White noise has a completely random waveform, while the waveform of a transient does not repeat; in speech, an example of white noise is a fricative such as [s] (Johnson 12). Aperiodic sounds can also be subjected to Fourier analysis.

Sometimes pressure fluctuations in the form of sound hit an object and cause it to vibrate. The vibrations occur if the acting frequency is within the “effective frequency range”, or resonator bandwidth (Ladefoged, Elements 68). Such induction of vibrations by another vibrating object is called resonance. Every object has a specific range of frequencies it can respond to, and those frequencies correspond to the dominant frequencies of the sound the object can create – or, as Ladefoged explains: “… [T]he resonance curve of a body has the same shape as its spectrum” (65). In speech, the speech organs function as resonators: they filter (enhance and dampen) properties of waves, recognised as the speech sounds.

[1] Phase is “the timing of the waveform relative to some reference point” (Johnson 8).

You can get SVG versions of the images (click for the pendulum or for the sinusoid).


The Speech Organs and Airstream

Speech is produced by the speech organs, where the airstream causes the vocal folds to vibrate (this applies to the egressive airstream mechanism). The created sound then moves through the articulatory system, attaining its final form – one of the sounds used in the language of the speaker. This text is an overview of what happens to the air on its way out of the vocal tract.

The air from the lungs enters the larynx, a structure that consists of several cartilages: the thyroid, cricoid and arytenoid (Ogden, Introduction 40). The larynx is about 11 cm long and about 2.5 cm in diameter (Clark 30). The angle formed by the sides of the thyroid cartilage is 90° in males and 120° in females (30). This physical difference intrinsically influences the voice quality (though the quality can be culturally influenced as well [1]).

a graph showing the most relevant elements of the vocal tract
The vocal tract (Ogden 10)

Above the larynx is the epiglottis, a leaf-shaped cartilage that closes the airway during swallowing, thus protecting sensitive tissue. The larynx houses the vocal folds, “typically about 17 to 22 mm long in males and about 11 to 16 mm long in females” (32). The cartilage structure that surrounds the vocal folds, and the vocal folds themselves, form the glottis, a “laryngeal valve aperture” (32).

Above the epiglottis is the pharynx, a muscular passage that connects the oral cavity, the larynx and the velum. The pharynx is passively involved in speech (42): it modifies the size of the space between the oral cavity and the larynx. The velum, a structure of soft tissue, sits above the pharynx. It directs the airflow in speech: when raised, it closes the velopharyngeal port, an opening to the nasal cavity [2] (46).

The oral cavity is the space in the vocal tract whose size and shape humans can control the most (O’Connor, Phonetics 34), which makes it critical for “determining the phonetic qualities of speech sounds” (Clark, Introduction 47). The oral cavity is the space between the lips (anteriorly [3]), the palatoglossus muscle (posteriorly), the tongue (inferiorly) and the roof of the mouth (superiorly) (47). The lips, the tongue and the angle of the mandible have an important role in speech sound production, although not of equal importance (for example, it is possible to make a distinctive sound with the mandible fixed) (47). Considering the complex muscular and neural structure of the mobile parts that surround the oral cavity, it is no surprise “that the characteristics of vowels depend on the shape of the open passage above the larynx” (Jones, Outline 29). Of course, this refers not only to vowels but to all speech sounds; what makes vowels interesting, however, is the lack of any closure in the passages, so their quality is conditioned by the shape of the passages, or the “inherent properties of the cavities” (Crystal 27).

When the tongue is moved backwards or forwards, the space in the pharyngeal region changes, and with the movement upwards and downwards (usually followed by mandible movement) the space defined by the hard palate and the tongue changes in volume and shape (Stevens 22). According to Johnson the volume [4] of the vocal tract is about 170 cm³ in males and 130 cm³ in females; when the mandible is lowered by about 1 cm (the average in speech), the volume increases to 190 cm³ and 150 cm³, respectively (24). Citing Goldstine, Johnson gives 14.1 cm as the average vocal tract length in adult females, 6.3 cm for the pharynx and 7.8 cm for the oral cavity. In males the values are 16.9 cm, 8.9 cm and 8.1 cm, respectively (25). This shows that the oral cavity is almost the same length in both sexes, while the differences lie in the length of the pharyngeal region (25).

The physiology of the vocal tract links anatomy with phonetics: it describes, in terms of mechanics, the properties and dimensions of the environment where speech sounds are created.

[1] “There are cultural effects too: in English-speaking cultures, it is common for males to enhance their intrinsically lower f0 by lowering their larynx, and for females to enhance their intrinsically higher f0.” (Ogden, Introduction 46)

[2] The velopharyngeal port is very important in discussing nasal sounds, where the air stream has a complex path that includes several cavities and requires an intricate physical model.

[3] Anterior/posterior – in anatomy, toward the front or the back of the body.

[4] The values refer to the measurements when the vocal tract is in the neutral configuration.
