
srmorph: Serbian Morphology in Python

2013 January 19
by Romeo Mlinar

My interest in linguistics and programming continues with an experiment in morphology: the srmorph project. It is a pilot endeavour I use to test ideas about parsing my native language (Serbian) at the word level and, later, at the syntactic level. This post is about the work in progress.

What Can Be Seen, Searched, Parsed?

For the time being, the project has only a Web/AJAX interface, available at http://srmorph.languagebits.com/.

Affixes as Basics

At the foundation of srmorph are Serbian affixes. I always wanted to write a parser that would work by first examining words at the level of prefixes and suffixes (infixes are a somewhat tougher problem). Therefore, the analysis is for now based on identifying affixes.

Environment and Data Format

The environment is the Python 3 programming language, while the grammar data format is built around Python classes themselves. The uninstantiated classes are the actual data containers, and after they inherit from the main meta classes, they become useful for parsing. For example, a class containing declension suffixes looks like this:

class AffNounDeclension0(MAffix):
    """Suffix. Example: 'доктор'. Ref. Klajn:51."""
    pos = 'MNoun'
    place = 'end'
    process = ('inflection', 'declension')
    subtype = 1
    gender = 'm'
    suffix = {0:'', 1:'а', 2:'у', 3:'а', 4:'е', 5:'ом', 6:'у'}
    blendswith = ('nonpalatal',)

The suffix attribute lists the seven endings attached to some masculine nouns in Serbian (Croatian, Bosnian). The pos attribute identifies the word class, here a noun, and so on for the other attributes.
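To illustrate how such a class can serve as a data container, here is a minimal sketch of matching a word against the endings. The MAffix stand-in and the match_suffixes function are my assumptions for this example, not the actual srmorph code, and the empty ending is skipped because it would match every word.

```python
class MAffix:
    """Hypothetical stand-in for the srmorph base class (the real one is not shown here)."""
    suffix = {}


class AffNounDeclension0(MAffix):
    """Endings copied from the declension class shown above."""
    pos = 'MNoun'
    place = 'end'
    suffix = {0: '', 1: 'а', 2: 'у', 3: 'а', 4: 'е', 5: 'ом', 6: 'у'}


def match_suffixes(word, affix_class):
    """Return (stem, key, ending) for every non-empty ending that closes the word."""
    matches = []
    for key, ending in sorted(affix_class.suffix.items()):
        # The empty ending (key 0) is skipped: every word "ends" with it.
        if ending and word.endswith(ending):
            matches.append((word[:-len(ending)], key, ending))
    return matches


print(match_suffixes('доктором', AffNounDeclension0))
print(match_suffixes('доктора', AffNounDeclension0))
```

Note that one ending can appear under several keys (here 'а' and 'у'), so a single word may yield more than one candidate analysis; narrowing those down would need the further filtering that is still to come.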

Parsing and Website

The inherited Serbian affix classes (60+) are so far parsed functionally. I have set up a dynamic website at http://srmorph.languagebits.com/ which shows some of the things that can be done by parsing. For now the algorithm is rather straightforward, and it will stay so until further filtering is introduced at the word-class level.
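One straightforward way to search such data is to walk over every class that inherits from the base class and ask which ones contain a given ending. The sketch below is only an assumption about how this could be done: MAffix is a stand-in, AffToyExample is invented for the example and is not real grammar data, and the function names are mine.

```python
class MAffix:
    """Hypothetical stand-in for the srmorph base class."""
    suffix = {}


class AffNounDeclension0(MAffix):
    """Endings copied from the declension class shown earlier in this post."""
    suffix = {0: '', 1: 'а', 2: 'у', 3: 'а', 4: 'е', 5: 'ом', 6: 'у'}


class AffToyExample(MAffix):
    """Invented for this sketch only; not real grammar data."""
    suffix = {0: 'у', 1: 'на'}


def all_affix_classes(base=MAffix):
    """Recursively collect every class that inherits from base."""
    found = []
    for sub in base.__subclasses__():
        found.append(sub)
        found.extend(all_affix_classes(sub))
    return found


def classes_with_suffix(ending, base=MAffix):
    """Return the sorted names of all affix classes that list the given ending."""
    names = {cls.__name__ for cls in all_affix_classes(base)
             if ending in cls.suffix.values()}
    return sorted(names)


print(classes_with_suffix('у'))
print(classes_with_suffix('на'))
```

Because the classes are never instantiated, adding a new affix is just a matter of writing another class; the search picks it up through inheritance.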

Once reasonably developed, the project will become open source.

Screenshot: details about the affix “na” in Serbian (all classes that contain the suffix “na”)

Jotpub – Minimalistic Language Tests Platform

2012 September 7

I am happy to announce a new website project, following several months of development: Jotpub.com, titled Minimalistic Language Tests Platform. Jotpub is a website that enables anyone, from most modern devices, to solve, create, and share language tests, see the score, and keep track of the solved tests, all for free.

Purpose

Figure: Jotpub logo (a green circle with the white letters JP in it)

The purpose of Jotpub is to be a simple, creative, open, non-obstructive, and trusted solution for language practice through online tests. It is simple in design and idea: there are too many complex learning platforms, and Jotpub is not trying to be one. It is open, so all interested visitors can solve tests and check their answers, and thus non-obstructive. It encourages creativity by giving all registered users the opportunity to make and share tests (registration is free, and necessary so that tests can have an author). “Trusted” means that the author (and the editor) pays close attention to the sources of the selected tests and texts, and that only checked, reliable content is visible to learners. All in all, Jotpub is an attempt to reach, educate, and help language learners (and their teachers). And, yes, it’s free.

Features

The main features of Jotpub are focused on solving, creating, and sharing language tests. Solving is based on entering, checking, and revealing answers to “traditional” questions: fill in the blanks, true/false, and multiple choice with either one or several correct answers. Try solving The Indefinite Article Basics or Articles from “Golden Bird” (a lengthy one). Creating is done on a special page of the website (Jotpub Test Maker) that uses a simple interface to create new language tests. Sharing is made possible by publishing a page address (for now by using a third-party solution), and all authors can share the link to their tests.

How Language Tests on Jotpub Function

Figure: Fill-in-the-blanks, one of the question types on Jotpub (text with lozenges for entering answers)

The language tests are organized in two ways: by categories and by tags. The categories follow a strict linguistic division, for example Grammar > Articles > All articles, while tags represent a looser, mixed labeling (in the tag listings, the categories are tags as well). Users can browse the tests and then select one to solve. After entering the answers and clicking on Check my answers, the analysed test will appear, showing correct, incorrect, and missing answers. Users can now redo the test, or reveal the answers (this option is available only when tests are first attempted).
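The three outcomes of checking can be pictured with a small sketch. This is not the Jotpub code: the function name is hypothetical, and ignoring letter case and surrounding spaces when comparing is my assumption.

```python
def check_answers(expected, given):
    """Label each expected answer as 'correct', 'incorrect', or 'missing'."""
    results = []
    for i, right in enumerate(expected):
        # An answer that was never entered counts the same as an empty one.
        answer = given[i].strip() if i < len(given) else ''
        if not answer:
            results.append('missing')
        elif answer.casefold() == right.casefold():
            # Assumption: comparison ignores letter case.
            results.append('correct')
        else:
            results.append('incorrect')
    return results


print(check_answers(['an', 'a', 'the'], ['An', 'the', '']))
```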

If users are registered, their tests are saved in Testbook, so they can redo them later. Also, each listing has small symbols that show solved and attempted tests.

Jotpub has a clean print style, so only the information relevant for solving the test appears on paper. You can copy a printed test and share it with your learners. You can also print solved tests.

Creating Language Tests

Figure: Jotpub Test Maker, an example of creating a multiple-select question

Test creation is done in the Jotpub Test Maker, the page anyone can use, and the one I use to make all tests on Jotpub. After users register, they will be able to load the Test Maker and enter their questions. Great attention was given to making this as simple as possible. For example, to make a blank, just enter the sentence and enclose any word (or a letter) in square brackets ([like this]); for a multiple-choice question, select either one or several correct answers, which renders two different types of questions.
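The square-bracket notation is easy to process. The following sketch shows one possible way to turn such a sentence into a text with gaps plus the list of expected answers; it is an illustration only, and the function name and the four-underscore gap marker are my assumptions, not the way Jotpub actually does it.

```python
import re

# Anything between square brackets (without nested brackets) is a blank.
BLANK = re.compile(r'\[([^\[\]]+)\]')


def parse_blanks(sentence):
    """Return the sentence with gaps in place of blanks, and the expected answers."""
    answers = BLANK.findall(sentence)
    display = BLANK.sub('____', sentence)
    return display, answers


display, answers = parse_blanks('She is [an] engineer and [a] mother.')
print(display)
print(answers)
```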

Tests are at first unlisted. This means they are not available to all users. After a test is checked, it may be listed and made visible to all visitors. However, any created test can be shared if its creator shares a link to it.

Further Plans For Jotpub

There are several plans regarding the future development of Jotpub. Some are focused on stability and optimization, and others on new features (particularly statistics). For now, the primary task is to see what users think about the website, and what they would like to see changed, improved, or added. You are invited to be one of the visitors and solve some tests!

 

Phonetics R, Praat code in GPL3, Paper and Data to Download

2012 May 20
by Romeo Mlinar

This post brings the R, Praat, and Python code I used for my Phonetics MA paper, as well as the paper itself and the acquired data, all available to download. I won’t go into too many details about the downloads, but I hope they will be of some use to people searching for similar things and approaches, or simply to those who want to see how useful free and open source software is to researchers.

R, Python, Praat Code

The R, Python, and Praat code is hosted on Gitorious under the label r-diphthongs-sr-en (here is the zipped version, which may not be up to date, but should not differ much). The software tool that took me the most time to write was a set of scripts in the R language. It was designed to load the data I acquired with Praat, list tables, and create plots (you can see the R plots and diphthongs here). The code takes the length, pitch, formants, and intensity of diphthongs as input.

Data: Diphthong Measurements, RP Speaker versus ESL Speakers

Figure: A Praat TextGrid, drawn below the waveform, with segments for the word/diphthong length and the reference points in the constituent vowels for data measurement.

In my research I compared the lengths, formants, intensity, and pitch of the selected diphthongs, as pronounced by a group of female ESL speakers (native language Serbian), with those of a reference RP speaker. The data (see it here or at the above links) was extracted by using Praat TextGrids (this is how I checked them), and if you’re interested in the methods and techniques I used to segment the files, you can see this chapter (the link to the whole paper is below). The linked data contains 8 diphthongs in 2 contexts (short/long), as recorded and pronounced by 15 ESL speakers and 1 RP speaker. The diphthongs were pronounced within 32 words (I wrote this script to select the corpus).

MA Paper: “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics”

The paper is titled “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics” and you will find the most current (but not error-free) version here: http://www.languagebits.com/files/ma-paper/

So, Why Put All This Online?

Most of the code here is tailor-made for my research, and I am aware that it cannot be used as-is in some other project. However, I believe it is a very useful heap of ideas. For example, the Praat scripts and TextGrids show some advanced tips for data extraction and control, which are backed up by a phonetic discussion about segmentation (itself a demanding task). The Python code is used for corpus search and integrates a script from NLTK (the Natural Language Toolkit) to verify the sound signal annotations (as well as for the control of recording, but about that some other time). Finally, the R scripts show how a custom-made project is limited only by imagination, and how simple operations and filtering can significantly contribute to the final result (what I’m saying here is: don’t use Excel, learn R).

I also firmly believe that data, especially scientific data (even in such a humble work as an MA paper), should be free, and that ideas should be free. Moreover, I have in mind Ladefoged’s words from his Phonetic Data Analysis:

After you have written everything, I hope you will publish a complete account of the work, even if it is only on your web site. Private knowledge does the world no good. … In addition, make sure that your data is stored in such a way that it can be found and used by others. (p 192)

Cheers!