Some Definitions of Vowel Sounds

Vowels are speech sounds (1) during whose production “the tongue is held at such a distance from the roof of the mouth that there is no perceptible frictional noise” and “a resonance chamber is formed which modifies the quality of tone” (Jones, Pronunciation 12). Gimson defines vowels (2) as a “category of sounds … normally made with a voiced egressive air-stream, without any closure or narrowing such as would result in the noise component characteristic of many consonantal sounds” (Introduction 35).

Fant gave a list of several correlates used in speech sound classification (Speech 153-155). What follows is a compiled overview of the properties a sound should have, according to Fant, to be classified as a vowel. The first condition is that a vowel must have sound energy visible in the sound spectrum, and that the acoustic energy originates from the vibration of the vocal folds. A vowel should also have a “vowelike” correlate in speech production, which means an unobstructed passage of the airstream. Waveform analysis of a “vowelike sound” implies that “at last F1 and F2 [are] detectable”, while F3 should be visible if F1/F2 are not at their lower ends (156). To classify a vowel as a diphthong, the speech sound must satisfy the “glide” correlate, which in the production context means “moderate speed within a segment”, seen as a “relatively slow [spectrum change] rate but faster than for mere combination of two vowels” (156). The picture below shows a spectrogram of a diphthong satisfying Fant’s requirements for the classification.

black and white spectrogram of diphthong /ɑɪ/
Spectrogram of diphthong /ɑɪ/ as spoken in the word "dies" by a female Received Pronunciation speaker

We will give one more description of vowels (3), this one by Laver, who says that two of the distinctions for classifying speech sounds are place of articulation and degree of stricture, both related to the medial phase of a segment. Place of articulation refers to “the location of the articulatory zone in which the active articulator is closest to the passive articulator during the medial phase of a segment” (166). Degree of stricture identifies the degree of closure between the two articulators in the medial phase. Thus, he defines vowels as a group of sounds articulated in places of neutral articulation (167), when “the potential active articulators … lie in their neutral anatomical position” (166) opposite their passive articulators. In the discussion of degree of stricture, Laver says that in resonants “the stricture is one of open approximation” (168), allowing an unrestrained passage of energy from the vocal folds.


1 They are also discussed in terms of being “purely linguistic units, counters which do a certain job, irrespective of how they sound” (O’Connor, Phonetics 199), but that is a more phonological approach.

2 Gimson refers to vowels in the introductory chapters as “the vowel type” of sounds, “described in mainly auditory terms” (Introduction 35). When discussing the vowel versus the consonant distinction he notes: “It will be found that the phonemes of a language usually fall into two classes, those which are typically central (or nuclear) in the syllable and those which are non-central (or marginal). The term ‘vowel’ can then be applied to those phonemes having the former function and ‘consonant’ to those having the latter.” (53).

3 Laver (pp. 167-172) gives a detailed description of several articulation aspects.

This post is based on a draft for one of the introductory chapters in my paper. For cited works, please visit the page Books & References.

Physics of Speech

Once set in vibratory motion, the vocal folds create a series of movements within the vocal tract above the larynx (the rate at which the vocal folds vibrate is recognized as the fundamental frequency of the sound). An object will amplify vibrations [2] that are close to its own natural frequencies. In speech, some frequencies are dampened while others are amplified, in accordance with the resonant properties of the tract (its cavities and tissues).

The discussion about resonance in speech production leads us to a well-known theory of vowel production: the source-filter theory, which postulates that the vocal folds are the source of the sound; after the sound is made, it passes through a filter shaped by the vocal tract cavities (Ladefoged, Elements 103). This filter is “frequency-selective and constantly modifies the spectral characteristics of sound sources during articulation” (Clark, Introduction 217), changing as articulation proceeds (218).

However, the vocal tract is not the only filter involved: after the sound leaves the vocal tract, it is further modified by the air in the “outside world” [3] and by the physical properties of the head, which “functions as a … reflecting surface … [,] a spherical baffle of about 9 cm radius” (Clark, Introduction 221).

The currently valid theory [1] of phonation is the aerodynamic myoelastic theory: the creation of sound is explained by taking into account aerodynamic forces, muscle activity, tissue elasticity and “the mechanically complex nature of the vocal fold tissue structure” (Clark, Introduction 37).

The speech mechanism in vowels can be described by a model that uses the physical properties of tubes. [4] A tube is a simple apparatus that, attached to a source of sound, can emit harmonic frequencies [5]. With a loudspeaker at one end, the tube acts as a resonator that “has an infinite number of resonances, located at frequencies given by odd-quarter wavelength” (Kent and Read 14). The resonant frequencies of a tube closed at one end are calculated with the following formula (Johnson 96):

Fn = (2n – 1)c/4L

where n is an integer, L is the length of the tube and c is the speed of sound (about 35,000 cm/s). The formula follows from the definition of frequency (f) as the speed of sound (c) divided by the wavelength [6] (∆):

f = c/∆

A tube closed at one end resonates when an odd number of quarter wavelengths fits its length, i.e. ∆ = 4L/(2n – 1), which gives the formula above.

A tube is an approximation of the shape of the vocal tract, from larynx to lips. The acoustic energy is supplied by the vocal folds, which are located at the lower, closed end of the apparatus. This model is used to calculate average resonant frequencies in a configuration of the vocal tract that has a “uniform cross-sectional area” (Kent and Read 15), as in the vowel schwa [ə] (see: Johnson, 97). Of course, this is an idealised and simplified representation, but it is useful because in this example “the configuration of the vocal tract approximates a parallel-side tube … closed at one end (the larynx) and open at the other (the lips)” (Clark, Introduction 218).

As an example, we can insert into the formula L = 17.5 cm, the average length of the human tract [7] from glottis to lips (Kent and Read 15). In this case the first formant, or first resonance frequency, occurs at 500 Hz, the second at 1500 Hz, the third at 2500 Hz, and so on. Stevens cites Goldstein’s estimate of the average vocal tract length in females as 14.1 cm (25). The calculated results for this length are F1 = 620.5 Hz, F2 = 1861.7 Hz and F3 = 3102.8 Hz (more about formant calculation/synthesis).
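The numbers above are easy to verify by computing the quarter-wave formula directly. This is a minimal sketch (the function name and defaults are my own, not from the cited sources):

```python
def tube_resonances(length_cm, n_formants=3, c=35000):
    """Resonances (Hz) of a tube closed at one end: Fn = (2n - 1) * c / (4L).

    c is the speed of sound in cm/s, length_cm the tube length in cm.
    """
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, n_formants + 1)]

print(tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
print(tube_resonances(14.1))  # approximately [620.6, 1861.7, 3102.8]
```

The second call reproduces the figures given for the 14.1 cm female tract to one decimal place.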

However, this neutral position of the vocal tract can account for only a small number of sounds. Extended, the model of vowel production becomes more complicated, but it explains the basic physics behind vowel production. For example, in the pronunciation of the back vowel /ɑ/, the tongue divides the vocal tract into two tubes above the larynx. The first tube extends from the pharynx down to the glottis, where it is closed, and the second from the pharynx to the lips; the tubes are roughly the same length (Ladefoged, Elements 123). Since each idealised tube is half as long as the whole tract, its resonant frequency is double that of the whole tube. If we take our example of 14.1 cm for females, each tube is about 7.05 cm long, and the formula gives a first resonant frequency of about 1241 Hz, which is (for the sake of convenience) the same for both tubes. However, the air outside the mouth cavity will interact with the sound, and the configuration of the pharynx will affect the first tube’s frequencies, which means that one resonance will be lower and the other higher, resembling the results measured in samples of the spoken vowel.
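The doubling step can be checked numerically with the same quarter-wave relation; the helper name below is my own:

```python
def quarter_wave_f1(length_cm, c=35000):
    """First resonance (Hz) of a tube closed at one end: F1 = c / (4L)."""
    return c / (4 * length_cm)

whole_tract = quarter_wave_f1(14.1)       # about 620.6 Hz for the full tract
half_tube = quarter_wave_f1(14.1 / 2)     # about 1241.1 Hz: halving L doubles F1
print(whole_tract, half_tube)
```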

In other vowels the configuration of the tract becomes even more complex, because the tongue moves and changes the shape of the cavities, calling for other models, such as the Helmholtz resonator. For example, in the production of [i] the tongue makes a small-diameter constriction between the tubes, in which a volume of air significantly contributes to the overall “shape” of the vowel. This volume of air must be taken into account when calculating the frequencies of the tubes (126).
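The Helmholtz resonance has a standard textbook form, f = (c/2π)·√(A/(V·L)), where A and L are the cross-sectional area and length of the constriction (the “neck”) and V is the volume of the cavity behind it. The sketch below uses that general formula with illustrative values of my own choosing, not measurements from Ladefoged:

```python
import math

def helmholtz_frequency(neck_area_cm2, neck_length_cm, volume_cm3, c=35000):
    """Helmholtz resonance: f = (c / 2*pi) * sqrt(A / (V * L)), CGS units."""
    return (c / (2 * math.pi)) * math.sqrt(
        neck_area_cm2 / (volume_cm3 * neck_length_cm))

# Hypothetical [i]-like constriction: 0.3 cm^2 neck area, 2 cm neck length,
# 60 cm^3 back-cavity volume
print(round(helmholtz_frequency(0.3, 2, 60)))  # 279
```

With these (invented but plausible) dimensions the resonance comes out near 279 Hz, i.e. in the region of the low F1 typical of [i].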

Although simplified, the calculations from acoustic theories provide strong evidence in favour of the working principles, the proof being the general correlation between the calculated and the measured results (Kent and Read 22).

[1] According to Clark and his book published in 1990.

[2] In speech, the origin of vibration is usually the vocal folds, but when a person is unable to produce sound with the vocal folds, usually because of illness, other means can be employed (Pinker, Instincts 165).

[3] This is the “radiation factor/impedance”, a filter that intensifies high frequencies by 6 dB per octave. Within the pulse coming from the vocal folds, the spectrum falls off at about -12 dB per octave. Thus, “these two … factors account for a -6dB/octave slope in the output spectrum” (Ladefoged, Elements 105). Such a sharp fall of the energy peak also means that “the intensity of the harmonics falls quite rapidly at high frequencies” (Clark, Introduction 212). It is then logical that most of the significant data in a sound signal lies below 5,000 Hz, even though the upper hearing limit in humans is about 20,000 Hz.

[4] In the 1960s Fant devised “nomographs” – diagrams that can be used to calculate the first four formants by using “the lengths of the resonators and their cross-sectional areas” (Clark, Introduction 222). The “nomographs” are quite famous in the history of acoustic research, but we will not expound them in detail in this paper. However, it is worth noting that “the two tube representation is only a crude approximation of the complex resonant cavity system of the human vocal tract during vowel production” (222).

[5] Mathematically, harmonics are the integer multiples of the fundamental frequency. Harmonics that coincide with the natural frequencies of the resonating object are enhanced as resonance frequencies. Various parts of the vocal tract act as resonators, so some frequencies of the sound are enhanced or dampened by the resonant properties of the tissues and vocal cavities. The enhanced frequencies of the sound are called formants, and they are visible in a spectrogram as dark bands.
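The arithmetic in this footnote can be sketched in a few lines; the function name and the 5,000 Hz default (the limit mentioned in footnote 3) are my own:

```python
def harmonics(f0, upper_hz=5000):
    """Integer multiples of the fundamental frequency f0, up to upper_hz."""
    return [n * f0 for n in range(1, int(upper_hz // f0) + 1)]

print(harmonics(220, 1000))  # [220, 440, 660, 880]
```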

[6] “The distance, measured in the direction of propagation, between two points in the same phase in consecutive cycles of a wave. Symbol: ∆” (Trask, Dictionary 1995). Ladefoged (Elements 115) gives an insightful example: a sound with a frequency of 350 Hz produces 350 cycles in one second, and since sound propagates at about 350 m/s (in common conditions), those 350 peaks are spread over a distance of 350 m; the wavelength of the sound is therefore 1 m.
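Ladefoged’s example amounts to a single division; a minimal sketch:

```python
def wavelength_m(frequency_hz, speed_m_s=350):
    """Wavelength (m) = speed of sound (m/s) / frequency (Hz)."""
    return speed_m_s / frequency_hz

print(wavelength_m(350))  # 1.0
```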

[7] Lass gives 15 cm as the average distance in males (Lass, Experimental 33). Clark (Introduction) provides insightful results reached by Pickett: “The length of a woman’s tract is about 80–90 per cent of a man’s, while a child’s, depending on age, may be around 50 per cent of a man’s” (219).

This post is based on a draft for one of the introductory chapters in my paper.
Previous text: Sound (Related to Speech)

Formant synthesis application

Jonas Beskow at the Centre for Speech Technology, KTH Stockholm, wrote Formant Synthesis Demo, a free computer programme that runs on Windows and Linux (and on any other OS for which the application can be compiled from the open source code the author has kindly uploaded).

The programme synthesises the F1, F2, F3 and F4 formants from several source waveforms (rectangle, triangle, sine, sampled and noise). It “demonstrates formant-based synthesis of vowels in real time, in the spirit of Gunnar Fant’s Orator Verbis Electris (OVE-1) synthesiser of 1953” (from the About window).

“Formants are defined by Fant as ‘the spectral peaks of the sound spectrum |P(f)|’ of the voice. Formant is also used to mean an acoustic resonance, and, in speech science and phonetics, a resonance of the human vocal tract. It is often measured as an amplitude peak in the frequency spectrum of the sound, using a spectrogram or a spectrum analyzer, though in vowels spoken with a high fundamental frequency, as in a female or child voice, the frequency of the resonance may lie between the widely-spread harmonics and hence no peak is visible. In acoustics, it refers to a peak in the sound envelope and/or to a resonance in sound sources, notably musical instruments, as well as that of sound chambers” — Wikipedia.

Formant Synthesis Demo
The window of the Formant Synthesis Demo

The download link is on the Formant Synthesis Demo site.