srmorph: Serbian Morphology in Python

I am continuing my interest in linguistics and programming with an experiment in morphology, the srmorph project. It is a pilot endeavour I use to test ideas about parsing my native language (Serbian) at the word level and, later, at the syntactic level. This post describes the work in progress.

What Can Be Seen, Searched, Parsed?

For the time being, the project has only a Web/AJAX interface, which allows:

Affixes as Basics

At the foundation of srmorph are Serbian affixes. I always wanted to write a parser that would work by first examining words at the level of prefixes and suffixes (infixes are a somewhat tougher problem). Therefore, the analysis is for now based on identifying affixes.

Environment and Data Format

The environment is the Python 3 programming language, and the grammar data format is built on Python classes themselves. The uninstantiated classes are the actual data containers; once they inherit from the main meta classes, they become useful for parsing. For example, a class containing declension suffixes looks like this:

class AffNounDeclension0(MAffix):
    """Suffix. Example: 'доктор'. Ref. Klajn:51."""
    pos = 'MNoun'
    place = 'end'
    process = ('inflection', 'declension')
    subtype = 1
    gender = 'm'
    suffix = {0:'', 1:'а', 2:'у', 3:'а', 4:'е', 5:'ом', 6:'у'}
    blendswith = ('nonpalatal',)

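The attribute suffix lists seven endings, one for each case of the singular, that are attached to some masculine nouns in Serbian (Croatian, Bosnian). The attribute pos identifies the word class, here a noun; the remaining attributes describe where and how the affix is applied.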
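To make the idea of classes as data containers concrete, here is a minimal sketch. The MAffix base class below is a hypothetical stand-in written only for this illustration (the real srmorph base class is not shown in this post), and the forms method is likewise an assumption about what such a base class could offer.

```python
class MAffix:
    """Hypothetical stand-in for the srmorph base class (an assumption)."""
    suffix = {}

    @classmethod
    def forms(cls, stem):
        """Return {case index: stem + ending} for every ending in the class."""
        return {case: stem + ending for case, ending in cls.suffix.items()}


class AffNounDeclension0(MAffix):
    """Suffix. Example: 'доктор'. Ref. Klajn:51."""
    pos = 'MNoun'
    place = 'end'
    process = ('inflection', 'declension')
    subtype = 1
    gender = 'm'
    suffix = {0: '', 1: 'а', 2: 'у', 3: 'а', 4: 'е', 5: 'ом', 6: 'у'}
    blendswith = ('nonpalatal',)


# The class is used directly, without creating an instance.
forms = AffNounDeclension0.forms('доктор')
for case, form in forms.items():
    print(case, form)
```

Because the data lives in class attributes, adding a new declension pattern would only require writing another small class; no instances are needed.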

Parsing and Website

The inherited Serbian affix classes (60+) are so far parsed functionally. I have set up a dynamic website which shows some of the things that can be done by parsing. For now the algorithm is rather straightforward, until further filtering is introduced at the word class level.
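A straightforward suffix lookup of this kind might look like the sketch below. It is not the actual srmorph algorithm: MAffix is again a hypothetical stand-in, the candidates function is invented for illustration, and only the one affix class from this post is included.

```python
class MAffix:
    """Hypothetical stand-in for the srmorph base class (an assumption)."""
    suffix = {}


class AffNounDeclension0(MAffix):
    pos = 'MNoun'
    suffix = {0: '', 1: 'а', 2: 'у', 3: 'а', 4: 'е', 5: 'ом', 6: 'у'}


def candidates(word):
    """Yield (class name, case index, stem) for each non-empty ending that fits."""
    # Every direct subclass of MAffix is treated as one affix data container.
    for cls in MAffix.__subclasses__():
        for case, ending in cls.suffix.items():
            if ending and word.endswith(ending) and len(word) > len(ending):
                yield cls.__name__, case, word[:-len(ending)]


hits = list(candidates('доктором'))
print(hits)
```

A word such as 'доктора' matches two endings of the same class (indices 1 and 3), which is the kind of ambiguity that further filtering would have to resolve.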

Once reasonably developed, the project will become open source.

[Screenshot: details about the affix “na” in Serbian, showing all classes with the suffix “na”]

Checking Praat’s TextGrids in Python

A TextGrid file contains data about the intervals, segments, times, etc. of the corresponding signal file (audio in wav, mp3, aif…). Because grids are plain text, they can be analysed, checked, extracted, or parsed automatically.
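To illustrate what plain text makes possible, here is a minimal sketch that pulls tier names and labels out of a TextGrid in Praat's long text format using regular expressions. It is a simplified stand-in for illustration, not a full parser: it ignores times and point tiers, and the tier names "words" and "diphthongs" in the sample are assumptions.

```python
import re

# A tiny hand-written sample in Praat's long text format.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.0
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "dice"
        intervals [2]:
            xmin = 0.5
            xmax = 1.0
            text = ""
    item [2]:
        class = "IntervalTier"
        name = "diphthongs"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "ay_s"
        intervals [2]:
            xmin = 0.5
            xmax = 1.0
            text = ""
'''


def read_tiers(text):
    """Return {tier name: [interval labels]} from long-format TextGrid text."""
    tiers = {}
    # Each tier begins with a line such as "item [1]:"; the header comes first.
    for chunk in re.split(r'item \[\d+\]:', text)[1:]:
        name = re.search(r'name = "(.*)"', chunk).group(1)
        tiers[name] = re.findall(r'text = "(.*)"', chunk)
    return tiers


tiers = read_tiers(SAMPLE)
print(tiers)
```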

If you are a linguist or phonetician, you might be using Praat, a small but very powerful programme for phonetic analysis. Chances are you have a lot of speakers and recordings. You will probably segment the signals in Praat and save the segmentation in TextGrids.

Thanks to Margaret Mitchell and Steven Bird, who contributed the Praat TextGrid parser to the Natural Language Toolkit (NLTK), automated analysis is now much easier.

The TextGrid parser is part of NLTK and is located here.

I am grateful to the authors, because they saved me a lot of time during segmentation checks. All that was needed was a Python script that uses the above code to load the TextGrid content and then runs a set of checks for each file/speaker.

Checking file  03-speaker-im.TextGrid
    Checking proper tier names...
    Checking if tiers contain 32 items...
    Checking if all tiers have valid text...
    Checking if the diphthongs have pairs...
    Checking if all words are present...
    Checking if the words and diphthongs match...
Mismatch: "ay_l" not allowed in "dice", at position 24.
It should say "ay_s".

Here, for example, my script warned me that I had a wrong label for a diphthong in file number 3. Spotting that “manually” would require a lot of time and attention.
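Checks of this kind can be sketched in a few lines. The example below is not my actual script: it assumes the tiers have already been loaded into a dictionary of label lists, the tier names are assumptions, and the table of expected labels is hypothetical (only the pair "dice" and "ay_s" comes from the log above). Positions are counted from 1, which is also an assumption.

```python
# Hypothetical input: tier name -> list of labels, as a TextGrid reader might return.
tiers = {
    'words': ['dial', 'dice'],
    'diphthongs': ['ay_l', 'ay_l'],
}

# Hypothetical table of the diphthong label each word should carry.
EXPECTED = {'dial': 'ay_l', 'dice': 'ay_s'}


def check_tier_names(tiers, names):
    """Report tiers that are missing or unexpected."""
    missing = sorted(set(names) - set(tiers))
    extra = sorted(set(tiers) - set(names))
    return ([f'Missing tier: "{n}".' for n in missing]
            + [f'Unexpected tier: "{n}".' for n in extra])


def check_item_count(tiers, count):
    """Report tiers that do not contain exactly `count` items."""
    return [f'Tier "{name}" has {len(labels)} items, expected {count}.'
            for name, labels in tiers.items() if len(labels) != count]


def check_pairs(words, diphthongs, expected):
    """Report diphthong labels that do not match the word at the same position."""
    messages = []
    for position, (word, label) in enumerate(zip(words, diphthongs), start=1):
        wanted = expected.get(word)
        if wanted is not None and label != wanted:
            messages.append(
                f'Mismatch: "{label}" not allowed in "{word}", '
                f'at position {position}.\nIt should say "{wanted}".')
    return messages


problems = (check_tier_names(tiers, ['words', 'diphthongs'])
            + check_item_count(tiers, 2)
            + check_pairs(tiers['words'], tiers['diphthongs'], EXPECTED))
for line in problems:
    print(line)
```

Each check returns a list of messages instead of printing directly, so the same functions can be run over every file and speaker and the results collected in one report.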

I hope this post helps other researchers. Here is the Python script I wrote for my phonetic research.