Why I started learning Java

I am learning Java. This is why.

My job applications were rejected several times on the grounds that I lacked “formal programming knowledge”. I have done some nice amateur open source work and tools, and proved myself in several technologies, but that was not good enough to convince HR departments that I am worthy of a junior position. I will not go into details about the quality of the jobs I applied for, the context of the applications or the experiences related to them; to me the rejections were enough to do something. I had to choose, and I wanted to choose, a programming language to study as formally as I could, and this is why I decided to dedicate my time and money to Java.

Open and free versus closed and proprietary

I have always liked the open source world. I have learned so much from it and got introduced to some great people. I also believe that sharing knowledge and allowing creative freedom is a good thing that has its rightful place in today’s consumerist/corporate world. Java is open and runs on almost anything. Chosen: Java. Rejected: C#

Java is in demand

True, Java is not the freshest thing around in the IT world, but it is relevant and in demand. I browsed through job advertisements and compared Java with other technologies: Java seemed to be consistently present throughout the years. Chosen: Java. Rejected: C#, PHP

Java is versatile

One of the courses offered to me was a course in PHP. I have never been a fan of PHP, but I worked with it when I had to, and there are some awesome projects written in it (this CMS for example!). Still, I did not feel I could learn the things I wanted (OO and some advanced meta stuff) the way I could with Java. Another option was C++, which was not really my cup of tea (I am not interested in low-level languages). Chosen: Java. Rejected: C#, PHP, C++

Java is corporate

Now, this was an interesting moment for me: Java is a corporate technology. My programming projects have been either related to academic research or to open source / startups. I have not had a chance to see how programming works within a corporate context, and Java seemed like a perfect way into that world. It may not be the perfect world, but it dominates, and it would be foolish to ignore it. After all, all those startups are hoping to become corporate leaders. Also, there have been moments when I felt more at ease with a stricter corporate frame of mind than with the all-over-the-place mindset of a fresh startup. Chosen: Java. Rejected: PHP

So, what about Python?

I had not found a suitable course where Python is studied formally. Even if I had, that would not have affected my decision to choose Java. I love Python and I keep coming back to it (most of the projects I wrote about on this site are Python-related), but sometimes it’s good to get a flavour of a different mental setup and learn new techniques. I experienced that, and loved it, when I had to learn the basics of R programming; now I am looking forward to the same excitement in Java. Learning a new programming language (a formal language) is similar to learning a new language (a natural language): you get a chance to see reality from different angles and get to know a different culture.

So, off I go to the Java adventures.

srmorph: Serbian Morphology in Python

My interest in linguistics and programming continues with an experiment in morphology: the srmorph project. It is a pilot endeavour I use to test ideas about parsing my native language (Serbian) at the word level and, later, at the syntactic level. This post is about the work in progress.

What Can Be Seen, Searched, Parsed?

For the time being, the project has only a Web/AJAX interface at http://srmorph.languagebits.com/ which allows:

Affixes as Basics

At the foundation of srmorph are Serbian affixes. I always wanted to write a parser that would work by first examining words at the level of prefixes and suffixes (infixes are a somewhat tougher problem). Therefore, the analysis is for now based on identifying affixes.

Environment and Data Format

The environment is the Python 3 programming language, and the grammar data format is built around Python classes themselves. The uninstantiated classes are the actual data containers, and once they inherit from the main meta classes, they become useful for parsing. For example, a class containing declension suffixes looks like this:

class AffNounDeclension0(MAffix):
    """Suffix. Example: 'доктор'. Ref. Klajn:51."""
    pos = 'MNoun'
    place = 'end'
    process = ('inflection', 'declension')
    subtype = 1
    gender = 'm'
    suffix = {0:'', 1:'а', 2:'у', 3:'а', 4:'е', 5:'ом', 6:'у'}
    blendswith = ('nonpalatal',)

The attribute suffix lists the seven endings attached to some masculine nouns in Serbian (Croatian, Bosnian). The attribute pos identifies the word class, here a noun, and so on.
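To illustrate how uninstantiated classes can serve as data containers for suffix matching, here is a minimal sketch. It is not the actual srmorph code: the MAffix base class below is a hypothetical stand-in, and the match and candidates helpers are names invented for this example. The sketch also assumes that the empty ending is skipped, since it would match every word.

```python
class MAffix:
    """Hypothetical stand-in for the project's base class (an assumption)."""
    place = 'end'
    suffix = {}

    @classmethod
    def match(cls, word):
        """Return (index, ending) pairs whose non-empty ending closes the word."""
        return [(i, s) for i, s in sorted(cls.suffix.items())
                if s and word.endswith(s)]


class AffNounDeclension0(MAffix):
    """Data container: the class is never instantiated."""
    pos = 'MNoun'
    place = 'end'
    suffix = {0: '', 1: 'а', 2: 'у', 3: 'а', 4: 'е', 5: 'ом', 6: 'у'}


def candidates(word, base=MAffix):
    """Collect every direct subclass of base with at least one matching ending."""
    found = {}
    for cls in base.__subclasses__():
        hits = cls.match(word)
        if hits:
            found[cls.__name__] = hits
    return found


print(candidates('доктором'))
```

Because the data lives in class attributes, adding a new affix group only means defining another subclass; the lookup picks it up through __subclasses__() without any registration step. Note that __subclasses__() returns direct subclasses only, so a deeper hierarchy would need a recursive walk.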

Parsing and Website

The inherited Serbian affix classes (60+) are so far parsed functionally. I have set up a dynamic website at http://srmorph.languagebits.com/ which shows some of the things that can be done by parsing. For now the algorithm is rather straightforward, until further filtering is introduced at the word-class level.

Once reasonably developed, the project will become open source.

Screenshot: details about the affix “na” in Serbian (all classes where the suffix is “na”)

Phonetics R, Praat code in GPL3, Paper and Data to Download

This post brings the R, Praat and Python code I used to write my Phonetics MA paper, as well as the paper itself to download, plus the acquired data. I won’t go into too many details about the downloads, but I hope they will be of some use to people searching for similar things or approaches, or simply to those who want to see how useful free and open source software is to researchers.

R, Python, Praat Code

The R, Python, and Praat code is hosted on Github under the label r-diphthongs-sr-en (here is the zipped version, which may not be up to date, but is not too different). The software tool that took me the most time to write was a set of scripts in the R language. It was designed to load the data I acquired with Praat and to list tables and create plots (the R plots and diphthongs you can see here). The code takes the length, pitch, formants and intensity of diphthongs as its input.

Data: Diphthong Measurements, RP Speaker versus ESL Speakers

Figure: a Praat TextGrid, drawn below the waveform, with segments for the word/diphthong length and the reference points in the constituent vowels used for data measurement.

In my research I compared the lengths, formants, intensity and pitch of the selected diphthongs, as pronounced by a group of female ESL speakers (native language Serbian), with those of a reference RP speaker. The data (see it here or at the above links) was extracted by using Praat TextGrids (this is how I checked them), and if you’re interested in seeing which methods and techniques I used to segment the files, you can see this chapter (the link to the full paper is below). The data linked contains 8 diphthongs in 2 contexts (short/long), as recorded and pronounced by 15 ESL speakers and 1 RP speaker. The diphthongs were pronounced within 32 words (I wrote this script to select the corpus).

MA Paper: “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics”

The paper is titled “Pronunciation of English Diphthongs by Speakers of Serbian: Acoustic Characteristics” and the most current (but not error free) version you will find here: http://www.languagebits.com/files/ma-paper/

So, Why Put All This Online?

Most of the code here is tailor-made for my research, and I am aware that it cannot be used as-is in some other project. However, I believe it is a very useful heap of ideas. For example, the Praat scripts and TextGrids show some advanced tips for data extraction and control, which are backed up by a phonetic discussion about segmentation (itself a demanding task). Python is used for corpus search and integrates a script from the NLTK toolkit to verify the sound signal annotations (as well as for the control of recording, but about that some other time). Finally, the R scripts show how a custom-made project is limited only by imagination, and how simple operations and filtering can significantly contribute to the final result (what I’m saying here is: don’t use Excel, learn R).

I also firmly believe that data, especially scientific data (even in such a humble work as an MA paper), should be free, and that ideas should be free. Moreover, I have in mind Ladefoged’s words from his Phonetic Data Analysis:

After you have written everything, I hope you will publish a complete account of the work, even if it is only on your web site. Private knowledge does the world no good. … In addition, make sure that your data is stored in such a way that it can be found and used by others. (p 192)