Phonetics R, Praat code in GPL3, Paper and Data to Download

This post brings the R, Praat and Python code I used to write my Phonetics MA paper, as well as the paper itself and the acquired data, all available to download. I won’t go into too many details about the downloads, but I hope they will be of some use to people searching for similar things or approaches, or simply to anyone who wants to see how useful free and open source software is to researchers.

R, Python, Praat Code

The R, Python, and Praat code is hosted on Github under the label r-diphthongs-sr-en (here is the zipped version, which may not be up to date, but is not too different). The software tool that took me the most time to write was a set of scripts in the R language. It was designed to load the data I acquired with Praat, to list tables and to create plots (the R plots and diphthongs you can see here). The code takes the length, pitch, formants and intensity of diphthongs as the input.

Data: Diphthong Measurements, RP Speaker versus ESL Speakers

Praat TextGrid, drawn below waveform
A TextGrid with segments for the word/diphthong length, and the reference points in the constituent vowels for data measurement.

In my research I compared the lengths, formants, intensity and pitch of the selected diphthongs, as pronounced by a group of female ESL speakers (native language Serbian), with those of a reference RP speaker. The data (see it here or at the above links) was extracted by using Praat TextGrids (this is how I checked them), and if you’re interested in the methods and techniques I used to segment the files, you can see this chapter (the link to the full paper is below). The data linked contains 8 diphthongs in 2 contexts (short/long), as recorded and pronounced by 15 ESL speakers and 1 RP speaker. The diphthongs were pronounced within 32 words (I wrote this script to select the corpus).

MA Paper: “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics”

The paper is titled “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics” and you will find the most current (but not error-free) version here:

So, Why Put All This Online?

Most of the code here is tailor-made for my research, and I am aware that it cannot be used as-is in some other project. However, I believe it is a very useful heap of ideas. For example, the Praat scripts and TextGrids show some advanced tips for data extraction and control, which are backed up by a phonetic discussion about segmentation (itself a demanding task). The Python code is used for corpus search and integrates a script from the NLTK toolkit to verify the sound signal annotations (as well as for the control of recording, but about that some other time). Finally, the R scripts show how a custom-made project is limited only by imagination, and how simple operations and filtering can significantly contribute to the final result (what I’m saying here is: don’t use Excel, learn R).

I also firmly believe that data, especially scientific data (even in such a humble work as an MA paper), should be free, and that ideas should be free. Moreover, I have in mind Ladefoged’s words from his Phonetic Data Analysis:

After you have written everything, I hope you will publish a complete account of the work, even if it is only on your web site. Private knowledge does the world no good. … In addition, make sure that your data is stored in such a way that it can be found and used by others. (p 192)


Resonant frequencies and the vocal tract length

This post is about the resonant frequencies of a tube, in the context of speech and the neutral vocal configuration. Two formulas are given: the first calculates the resonant frequencies when the length is known, and the second calculates the length when the frequency of a formant is known. Finally, there is a real-life example: a calculation of a speaker’s vocal tract length after measuring the formants in schwa.

The speech mechanism in vowels is described by a model that uses the physical properties of tubes. A tube is a simple apparatus that, if attached to a source of sound, can emit harmonic frequencies. When a loudspeaker is attached at one end, the tube acts as a resonator that “has an infinite number of resonances, located at frequencies given by odd-quarter wavelength” (Kent and Read 14). The resonant frequencies of a tube closed at one end are calculated by using this formula (Johnson 96):

$$F_n = \frac{(2n - 1)\,c}{4L}$$

Where F_n is the n-th resonant frequency, n is an integer (1, 2, 3, …), L is the length of the tube and c is the speed of sound (about 35,000 cm/sec).

This was very interesting to me, so I decided to experiment with the formula in R. The purpose was to calculate the average frequencies of a vocal tract in the neutral configuration (a position of the vocal organs in which a tube without obstacles is created from the larynx to the lips). So, the formula written above looks like this in R:

freq <- ((2*i-1)*35000)/(4*tract.len)

For a given speed of sound c=35000, the formant number i and the tract length, we can calculate estimated formant values. As an example, we can insert L = 17.5 cm into the formula, the average length of the human vocal tract from glottis to lips (15). In this case the first formant, or the first resonance frequency, occurs at 500 Hz, the second at 1500 Hz, the third at 2500 Hz, and so on. Here is the output from the R code located here:

> Resonance(17.5)
Tract length is 17.5 cm.
formant 1: 500 Hz
formant 2: 1500 Hz
formant 3: 2500 Hz
formant 4: 3500 Hz
formant 5: 4500 Hz
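The function behind that output is only linked, not shown, so here is a minimal sketch of how such a calculation could look. The name resonance_freqs and its arguments are my own placeholders, not the names used in the linked code; the only thing taken from the post is the formula itself.

```r
# Hypothetical helper (a sketch, not the linked code): resonances of a tube
# closed at one end, F_n = (2n - 1) * c / (4 * L).
resonance_freqs <- function(tract_len, n_formants = 5, c_sound = 35000) {
  n <- seq_len(n_formants)                     # formant numbers 1..n_formants
  ((2 * n - 1) * c_sound) / (4 * tract_len)    # frequencies in Hz
}

freqs <- resonance_freqs(17.5)

# Print one line per formant, similar to the listing above
cat(sprintf("formant %d: %.0f Hz\n", seq_along(freqs), freqs), sep = "")
```

Because the formula is vectorised over n, all five formants come out of a single expression and no loop is needed.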

Of course, we can reverse the calculation; by entering formant frequency and the order of the formant we can calculate an average length:

prep <- 35000*((formant/2)-0.25)
length <- (prep/freq)

This is the result of the Length function in the code:

> Length(1000, 1)
Estimated tract length is 8.75 cm, where formant number 1 has value of 1000 Hz.

This length corresponds to vocal tract lengths measured in infants.
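The reverse calculation is simply the tube formula solved for L, that is L = (2n − 1)c / (4F), which is the same thing as c(n/2 − 0.25) / F in the two lines above. A small sketch follows; the function name tract_length is hypothetical and is not the Length function from the linked code.

```r
# Hypothetical helper: tube length from one formant, L = (2n - 1) * c / (4 * F_n)
tract_length <- function(freq, formant = 1, c_sound = 35000) {
  ((2 * formant - 1) * c_sound) / (4 * freq)   # length in cm
}

# Same value as the two-step version used in the post
prep <- 35000 * ((1 / 2) - 0.25)
stopifnot(isTRUE(all.equal(tract_length(1000, 1), prep / 1000)))

tract_length(1000, 1)   # 8.75
```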

spectrogram and waveform
A spectrogram and waveform near the end of the word "abjured". The three red lines show formants, while the vertical line shows the measurement point. Analysed in Praat.

To make the calculations even more interesting, we can measure the frequency of the first formant of speakers, and then “calculate” the lengths of their vocal tracts. Here is an example: we recorded a speaker and examined the sound data. Since the schwa sound is pronounced in (approximately) the neutral configuration, we measured the formants where this sound (IPA: ə) was articulated. In this case, that was near the end of the word abjured /əbˈdʒʊəd/. The first three formant values in the sample female speaker were:

Time_s   F1_Hz   F2_Hz   F3_Hz
4.633178   549.304326   1750.098455   2915.885791

If we enter 549.3 Hz in the second formula, we get:

> Length(549.304326,1)
Estimated tract length is 15.92 cm, where formant number 1 has value of 549.3043 Hz.

This is, it seems, an acceptable value for this speaker.
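As a rough cross-check, the same formula can be applied to all three measured formants, not only the first. The sketch below reads the Praat row shown above and computes one length estimate per formant. This goes a step beyond what the post does: a real schwa is only approximately a uniform tube, so the three estimates are not expected to agree exactly.

```r
# The measurement row copied from the Praat output above
schwa <- read.table(header = TRUE, text = "
Time_s   F1_Hz   F2_Hz   F3_Hz
4.633178   549.304326   1750.098455   2915.885791
")

freqs <- unlist(schwa[1, c("F1_Hz", "F2_Hz", "F3_Hz")])
n <- 1:3                                          # formant numbers

# L = (2n - 1) * c / (4 * F_n), with c = 35000 cm/sec as in the post
lengths_cm <- ((2 * n - 1) * 35000) / (4 * freqs)

round(lengths_cm, 1)   # roughly 15.9, 15.0 and 15.0 cm
mean(lengths_cm)       # a single averaged estimate
```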

The measurements and image were obtained by using Praat, free phonetic software. The calculation and the code example were written in the R programming language.

English Diphthongs

A diphthong is defined by Jones as “a sound made by gliding from one vowel to another … represented phonetically by sequence of two letters” (Pronunciation 22). A sound realised as a diphthong marks “a change from one vowel quality to another, and the limits of the change are roughly indicated by the two vowel symbols” (O’Connor, Phonetics 155). It is important to note that even though a diphthong is “… phonetically a vowel glide or a sequence of two vowel segments [it] … functions as a single phoneme” (220).

Vowels are speech sounds during whose production “the tongue is held at such a distance from the roof of the mouth that there is no perceptible frictional noise” and “a resonance chamber is formed which modifies the quality of tone” (Jones, Pronunciation 12). Gimson defines vowels as a “category of sounds … normally made with a voiced egressive air-stream, without any closure or narrowing such as would result in the noise component characteristic of many consonantal sounds” (Introduction 35).

The critical property of the diphthongal realisation of a sound is that “the organs of speech perform a clearly perceptible movement” (Jones, Outline 63). Gimson notes that diphthongs, or “diphthongal vowel sounds” (Introduction 39), are sounds “which have a considerable voluntary glide”. They are “the sequences of vocalic elements … which form a glide within one movement” (126).

Centering Diphthongs on the Cardinal Diagram
Centering Diphthongs in RP

The movement in a diphthong starts from the first element, which is usually a pure vowel (127) and reaches an approximate value of a vowel indicated by the second element or “the point in the direction of which the glide is made” (126). The point of direction, whether on the cardinal vowel diagram, or the tongue in the mouth, enables classification of the RP diphthongs into two groups: closing and centring (Jones, Pronunciation 23-24):

The first element in RP diphthongs is usually [ɪ, e, a, ʊ, ə], while the second is [ɪ, ʊ, ə] (Gimson, Introduction 126). However, one of the characteristics of diphthongs is great regional variety (not discussed here).

Classification of diphthongs into closing and centring
Type        Constituent vowels
Closing     eɪ, ɔʊ, ɑɪ, ɑʊ, ɔɪ
Centring    ɪə, ɛə, ɔə, ʊə

Diphthongs can also be divided into groups based on the vowel to which they gravitate in the second element. Thus, we have groups that have /ɪ/, /ʊ/ and /ə/ as the second element.

Long vowels / diphthongs:
[ɪ] eɪ, aɪ, ɔɪ, ʊɪ
[ʊ] əʊ, ɑʊ
[ə] ɪə, ɛə, ɔə, ʊə
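For anyone who wants to work with this grouping in R, here is a small sketch that stores each diphthong as a pair of elements and groups the pairs by the second one. The data frame layout and the column names are my own choice for illustration; the symbols are the ones listed above.

```r
# Each diphthong as (first element, second element), as listed above
rp <- data.frame(
  first  = c("e", "a", "ɔ", "ʊ", "ə", "ɑ", "ɪ", "ɛ", "ɔ", "ʊ"),
  second = c("ɪ", "ɪ", "ɪ", "ɪ", "ʊ", "ʊ", "ə", "ə", "ə", "ə"),
  stringsAsFactors = FALSE
)
rp$diphthong <- paste0(rp$first, rp$second)

# Group the diphthongs by the vowel they glide towards
groups <- split(rp$diphthong, rp$second)

lengths(groups)   # number of diphthongs in each group
```

Keeping the two elements in separate columns avoids cutting IPA strings by character position, which can be fragile when the encoding of the session is not UTF-8.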

In this post we are focused on Received Pronunciation, and the examples about the sounds do not include different variants of pronunciation (whether in the UK itself, or the USA, AU or other). (Here are the RP vowels of English, placed on vowel diagram, based on the overview in O’Connor’s Phonetics.)

English Diphthongs

Diphthongs /eɪ/ and /aɪ/

Diphthong /eɪ/ starts “from slightly below the half-close front position and moves in the direction of RP /ɪ/” (Gimson, Introduction 128). The beginning of this diphthong is between cardinals [e] and [ɛ]. The first element of the diphthong /aɪ/ “varies from central to front” (O’Connor 167) or, in Gimson’s description, it is “slightly behind the front open position i.e. C[ä]” (Introduction 129). The glide ends with RP /ɪ/ position.

Diphthong /ɔɪ/

The first element of /ɔɪ/ in RP is pronounced very close to cardinal [ɔ] and the second, after the configuration changes, is close to the pronunciation of /ɪ/ (O’Connor, Phonetics 169). In this glide “the range of closing … is not as great as in /aɪ/ …” and “the jaw movement … may not … be as marked as in the case of /aɪ/” (Gimson, Introduction 131). This diphthong can be seen as asymmetrical in the RP system, since it is the “only glide of this type with a back starting point” (132).

Diphthong /əʊ/

The realisation of diphthong /əʊ/ starts with the articulators positioned for “typical RP [ɜ:] position”, while afterwards the tongue moves “slightly up and back to RP [ʊ], but the starting point may vary …” (O’Connor 167). In conservative pronunciation this diphthong starts “in a more retracted region”, near centralised (or centralised-open) [o], “and the whole glide is accompanied by increasing lip-rounding” (Gimson, Introduction 133). In an affected variant, the diphthong starts with a more centralised-closed [ɜ] position (134). Also, “in many speakers of general RP, the 1st (central) element is so long that there may rise for a listener a confusion between /əʊ/ and /ɜ:/, especially when [ɫ] follows, e.g. goal, girl … ” (134).

Diphthong /ɑʊ/

The diphthong /ɑʊ/ starts “further back than /aɪ/ and changes towards RP /ʊ/” (O’Connor, Phonetics 168); Gimson describes it as starting “slightly more fronted … than RP /ɑ:/” (Introduction 136). Another dominant diphthong in the back region is /əʊ/, so /ɑʊ/ has to be pronounced with a perceivable difference – for this reason no raising is possible without losing the contrast, and so “fronting or retraction” (136) prevails in the variants of /ɑʊ/.

Diphthong /ɪə/

This is one of the centring diphthongs (/ɪə/, /ɛə/ and /ʊə/). Diphthong /ɪə/ starts with the tongue positioned for /ɪ/. In the second part of the pronunciation, the movement is of two types. The first is “the more open variety of /ə/ when /ɪə/ is final in the words”, while in the second type, in non-final positions, the movement is not so extensive (Gimson, Introduction 142). The two pronunciations are, in essence, “two main allophones of /ɪə/ in RP, corresponding to those of /ə/” (O’Connor, Phonetics 170).

Diphthong /ɛə/

Diphthong /ɛə/ “starts at cardinal /ɛ/ or below and moves to more central but equally open position” (171). Gimson adds that, when final, /ə/ acquires a more open position, while in the cases when /ɛə/ is “closed by a consonant”, /ə/ is of “mid … type” (Introduction 143). The variants differ mostly in the degree of openness of the first element (143).

Diphthong /ʊə/

The glide /ʊə/ has “coalesced with /ɔ:/ for most RP speakers” (Gimson, Introduction 145) and “[a] monophthongal pronunciation is … found regularly before /r/ in, e.g. alluring, furious, having the quality of the diphthong’s beginning point” (O’Connor, Phonetics 172). Gimson also gives an overview of the monophthongal pronunciation, such as in the words your, Shaw or sure, but warns “that such lowering or monophthongization of /ʊə/ is rarer in case of less commonly used monosyllabic words such as moor, tour, dour” (Introduction 145). The diphthong is pronounced with the first element around /ʊ/, while the second element reaches a “more open type of /ə/” (144).

Notes about Length and Targets

The closing diphthongs in the cardinal diagram
Closing diphthongs in English

With the exception of the falling diphthongs, “most of the height and stress associated [with the sound] is concentrated on the 1st element, the 2nd element being only lightly sounded” (126). The length of the diphthongs is the same as in long pure vowels, which means they are affected by the same syllabic fortis and lenis rules.

Harrington describes a study, based on the hypotheses by Pols and applied to American English by Cottinfield, about the classification of diphthongs and the importance of the targets for that classification. The first hypothesis is about “dual target” (or onset plus offset), the second about “onset plus slope”, while the third involves “onset plus direction”. According to the first hypothesis, “both diphthong targets are critical for identification [of a diphthong]”, while the second claims that “quality is presumed to depend on the first target”; the third hypothesis postulates that “the first target and the direction of spectral movement” are the biggest contributors in diphthong recognition (Techniques 66).
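To make the three hypotheses a little more concrete, here is one possible way to turn an onset and an offset measurement into the three kinds of description. This is my own reading of the hypotheses, not the procedure from the study, and the formant values and duration are made-up illustrative numbers, not measured data.

```r
# Illustrative, made-up values (Hz and seconds); NOT measured data
onset    <- c(F1 = 700, F2 = 1300)   # first target
offset   <- c(F1 = 400, F2 = 2000)   # second target
duration <- 0.20                     # time between the two targets

delta <- offset - onset              # formant change from onset to offset

# Hypothesis 1, "dual target": both targets describe the diphthong
dual_target <- c(onset = onset, offset = offset)

# Hypothesis 2, "onset plus slope": first target plus rate of change (Hz/sec)
onset_slope <- c(onset, slope = delta / duration)

# Hypothesis 3, "onset plus direction": first target plus the direction of
# movement in the F2/F1 plane, here expressed as an angle in degrees
direction_deg   <- atan2(delta[["F1"]], delta[["F2"]]) * 180 / pi
onset_direction <- c(onset, direction_deg = direction_deg)
```

The point of the sketch is only that the three descriptions keep different information: the first keeps where the glide ends, the second how fast the formants move, and the third only which way they move.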


[1] The figures in the text were derived from O’Connor’s Phonetics.

Need a vowel chart with English monophthongs and diphthongs in SVG format? It’s here.