Monday, 23 February 2009

Hawai‘i, diacritics, and computer-meaningful statements in written language

A lovely piece on diacritics from pipwerks.

An ‘okina usually indicates a glottal stop, which is very important in the pronunciation of Hawaiian words.  The name Hawai‘i is a great example: the ‘okina indicates the name is pronounced hahwhy-ee instead of hahwhy. When you hear a native pronounce the name, there's usually a very short hard pause between the why and ee syllables.

Unfortunately, the two Hawaiian diacriticals are not used by European languages, which means they're difficult to accurately represent on a standard US qwerty keyboard. In most printed publications, the authors simply omit the diacriticals altogether — the very reason you usually see the name Hawaii, and not Hawai‘i.

It is easy for English speakers to forget that in many (if not most) languages, the funny little extra marks on the page aren't just there to look nice. They constitute meaningful units of language. There are many cases where their presence or absence changes the meaning of a word (one from Spanish is on the tip of my tongue, but I just can't catch it!). This has huge implications for computer use of language. Few examples apply to English, but there are some - 'can't' is a contraction of 'cannot', but 'cant' is a secret language (or an architectural feature - love those homographs).
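To make the computer side of this concrete, here is a minimal Python sketch of what happens when software "helpfully" throws the marks away. The Spanish pairs are illustrations I am sure of (papá is "dad" and papa is "potato"; sí is "yes" and si is "if"), not necessarily the example meant above, and strip_diacritics is a made-up helper name, not a standard function.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Pairs of distinct Spanish words that differ only by an accent.
pairs = [("pap\u00e1", "papa"), ("s\u00ed", "si")]

for marked, plain in pairs:
    # After stripping, the two different words become the same string.
    print(marked, "->", strip_diacritics(marked), strip_diacritics(marked) == plain)
```

Each line should end in True: once the marks are gone, the program can no longer tell the two words apart. Note that the apostrophe in "can't" is an ordinary character rather than a combining mark, so this particular helper leaves it alone.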

This is close to my heart because my first name is Zoë, and finding those damn dots in different bits of software is one of the banes of my life. 

My name is Greek, which means the dots are not an umlaut, which is German, but a dieresis. The dieresis tells the reader to 'pronounce these two adjacent vowels separately', whereas the umlaut tells the reader how to pronounce the vowel marked (ä, ö, and ü are pronounced differently from a, o, and u).
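As far as Unicode is concerned, the two marks are the same thing: there is one set of characters, named after the dieresis, and it serves for the German umlaut as well. On top of that, the same visible name can be stored in two different ways. A small Python sketch using the standard unicodedata module (the variable names are just for illustration):

```python
import unicodedata

# The precomposed letters are all named "... WITH DIAERESIS",
# whether the language treats the dots as a dieresis or an umlaut.
print(unicodedata.name("\u00eb"))  # LATIN SMALL LETTER E WITH DIAERESIS
print(unicodedata.name("\u00f6"))  # LATIN SMALL LETTER O WITH DIAERESIS

# "Zoë" stored as three characters (precomposed ë) ...
composed = unicodedata.normalize("NFC", "Zoe\u0308")
# ... and as four characters (plain e followed by COMBINING DIAERESIS).
decomposed = unicodedata.normalize("NFD", "Zo\u00eb")

print(len(composed), len(decomposed))  # 3 4
print(composed == decomposed)          # False

# Normalising both to the same form makes them compare equal again.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)                  # True
```

This is one plausible reason a name with dots can fail to match itself in a search box or a database: both versions look identical on screen, but a naive comparison sees different strings unless the software normalises first.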

In terms of computer readability, that's not too much of a problem - it's a pronunciation issue, and (most) computers don't talk. English speakers are very lucky in that there are almost no non-alphabet characters in our written language that have the capacity to change word meaning, but I can't help but think that this makes for some lazy programming and difficult-to-use text editors. Arabic-speaking programmers must tear their hair out when faced with written-for-English text editors. (As with Hebrew, writing in vowels in Arabic is optional, as the reader should be able to work out the word meaning by context - not handy for an Arabic Google.)

(Related: lots of people know that Hawai‘i has only eight consonants, but did you know that across the Pacific, the number of consonants actually decreases as you move westward? Polynesian languages (a subset of the Austronesian languages) are spread from NZ through the whole Pacific, all the way to Hawai‘i. At every new island/language area from NZ west, you get a few fewer consonants, and they get softer, too. There's a nice map here: http://www.ecai.org/austronesiaweb/Maps/All-austronesia-area/All-austronesia-area.jpg)
