Arabic: the big problem


I don't know why I'm writing this.
I've been involved with Arabic for the last few years.
During the last few weeks I've started to realize the problems with Arabic, not only when it comes to text processing by the computer, but for non-native Arabic speakers and even for native Arabic speakers like us.
Arabic is a somewhat complex language which requires certain capabilities from the rendering backend to be displayed correctly.
One of these capabilities is "shaping". To simplify things: we have 28 letters, and some of the letters take a different shape according to their position in the word: initial, medial, final or isolated.
The rendering backend should be aware of this to be able to display Arabic correctly.
Arabic letters are stored in text files in the isolated form; the rendering backend must then "interpret" the letters and join them correctly, otherwise we'll have non-joined letters.
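To make shaping concrete: Unicode encodes both the abstract letters and, for legacy reasons, their positional presentation forms. A small sketch using Python's standard unicodedata module shows the four forms of the letter BEH (the names printed come straight from the Unicode character database):

```python
import unicodedata

# The letter BEH is stored in text as its abstract code point U+0628;
# its four positional glyphs live in the Arabic Presentation Forms-B block.
beh_forms = {
    "isolated": "\uFE8F",
    "final":    "\uFE90",
    "initial":  "\uFE91",
    "medial":   "\uFE92",
}

print(unicodedata.name("\u0628"))  # ARABIC LETTER BEH
for position, char in beh_forms.items():
    print(position, unicodedata.name(char))
```

A shaping engine is essentially a mapping from the abstract letters (what is stored in the file) to the right positional form for each letter, based on its neighbours.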
The second thing is BiDi, the "bidirectional algorithm". Now imagine when you embed Arabic strings within English strings, or vice versa. And not only with English: this happens with any language written from left to right. The rendering backend must also be able to resolve this situation and reorder the text segments to obtain a visually correct order.
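As a toy illustration of going from logical (typing) order to visual order, here is a Python sketch that simply reverses each run of right-to-left characters. This is NOT the real Unicode Bidirectional Algorithm (UAX #9), which also handles digits, neutrals and nested embedding levels; it only shows the basic idea of run reordering:

```python
import unicodedata
from itertools import groupby

def naive_visual_order(logical: str) -> str:
    """Toy reordering: reverse each run of right-to-left characters.
    The real bidi algorithm (UAX #9) is far more involved."""
    def is_rtl(ch):
        # 'R' = Hebrew etc., 'AL' = Arabic letters
        return unicodedata.bidirectional(ch) in ("R", "AL")
    visual = []
    for rtl, run in groupby(logical, key=is_rtl):
        run = list(run)
        visual.extend(reversed(run) if rtl else run)
    return "".join(visual)

# "abc" followed by three Arabic letters, stored in logical order:
logical = "abc \u0628\u062A\u062C"
print(naive_visual_order(logical))  # the Arabic run comes out reversed
```

Real implementations (fribidi, Pango, Qt) follow the full specification instead of this shortcut.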

Now there is another thing which requires the rendering system to take care: diacritics, or what we call in Arabic "tashkeel", the accents.

Before I talk about problems regarding the software we are using: I was talking yesterday with a cool British guy who was trying to teach himself Arabic by means of flash cards. But he discovered that the letters, even in the Arabeyes wordlist, are NOT accented, so he'll have problems pronouncing the words. I don't write the accents myself, so I can't blame anyone.

Now let me state the problems with GNU/Linux. I can't really separate the GNU/Linux problems from the problems in other operating systems, but I'll be mainly talking about GNU/Linux.

A GNU/Linux system is composed of several components working together.
The desktop is composed of the X server, or the X window system, which is the layer responsible for interfacing with the hardware. It gives you a plain desktop, on which the desktop environments start to put the background, the icons, a panel, and so on.
You open applications in "windows"; the position of these windows is controlled by the "window manager".

With an open system like GNU/Linux you'll find yourself having multiple window managers and multiple desktop environments, and you can even assemble your own desktop from several components.
The most popular desktop environments are KDE and GNOME.

KDE is written in C++ using the Qt toolkit, While GNOME is written in C using the GTK+ toolkit.

A few years ago we didn't have proper Arabic support in either toolkit. Sometime later GTK+ hit version 2.0 and Qt hit 3.0, and both brought good Arabic support in terms of bidi and shaping. Some small problems remained, but they are almost solved by now. One of them was GTK+ not supporting the letter accents; this remained a problem for a long time, but it has been solved.
I'll be talking mainly about GTK+ since I'm more familiar with it.
As we said before, we must have a rendering backend. The rendering backend draws strings on the toolkit widgets; it might be incorporated in the toolkit, like in Qt, or separated into another library, like in GTK+, which uses "pango" as its rendering backend. Since there is the X layer below the GUI toolkits, X provides functions to draw strings too; I'll be addressing this later.
Now GTK+ uses UTF-8 internally to represent strings. UTF-8 is one of the Unicode Transformation Formats; it is a multi-byte encoding.
Now let's try to explain this.
How do you map a certain character stored in a file to the corresponding character in the font? Characters are stored in files as numbers. Since a character was a byte, and a byte is 8 bits, we couldn't have more than 2^8 = 256 characters, which might be enough to represent a language or two at the same time but is not enough to represent all the languages at once. Thus we had something called the encoding: we'd have a font with the Arabic letters, another one with the Hebrew letters, another one with Greek letters, and so on. We could only use one font at a time to represent the characters; this is the encoding.
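To make the one-encoding-at-a-time limitation concrete, here's a small Python sketch (standard library codecs only) showing that the very same byte value means a different letter under different legacy 8-bit encodings:

```python
# Under a legacy 8-bit encoding, a byte only has meaning relative to
# the encoding (and the matching font) you picked.
beh = "\u0628"  # ARABIC LETTER BEH

# In the Arabic encoding ISO-8859-6, BEH is the single byte 0xC8:
print(beh.encode("iso-8859-6"))      # b'\xc8'

# The same byte 0xC8 is a completely different character elsewhere:
print(b"\xc8".decode("iso-8859-1"))  # a Latin letter (È)
print(b"\xc8".decode("iso-8859-7"))  # a Greek letter (Θ)
```

So a file full of 0xC0-range bytes is Arabic, Latin or Greek depending entirely on which encoding the reader assumes, which is exactly the problem Unicode solves.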
With the Unicode standard we now have enough space to represent all the languages with one encoding.
UTF-8 is one of the Unicode representations: a character might be 1, 2, 3 or 4 bytes. Arabic falls into the 2-byte segment (code points 0x06XX).
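A quick Python check of the multi-byte nature of UTF-8, using nothing beyond the standard string methods:

```python
beh = "\u0628"                  # ARABIC LETTER BEH, code point 0x0628
utf8 = beh.encode("utf-8")

print(utf8)                     # b'\xd8\xa8'
print(len(beh), len(utf8))      # 1 character, 2 bytes

# ASCII stays a single byte, so plain English text is unchanged:
print("a".encode("utf-8"))      # b'a'
```

This is why UTF-8 took off: existing ASCII files are already valid UTF-8, while Arabic and other scripts simply use more bytes per character.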
So now we have toolkits capable of rendering Arabic and applying bidi and shaping. Not all the toolkits can do this, but I'm talking about the major two.

The bidi might be very complex when we have an Arabic string in which we embed an English string, into which we embed another Arabic string. Here come the Unicode control characters: they are used to help the rendering backend resolve the bidi correctly. Though we have no keys on the keyboard to input them, GTK+ text widgets have a submenu in the right-click context menu to allow inputting them; AFAIK this is not present in Qt ATM.
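These control characters are invisible, but Python's unicodedata module can list them by name; the code points below are the standard Unicode directional controls:

```python
import unicodedata

# Directional control characters: no glyph of their own,
# they only steer the bidi algorithm.
controls = ["\u200E", "\u200F", "\u202A", "\u202B", "\u202C"]
for ch in controls:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

For example, inserting U+200F (RIGHT-TO-LEFT MARK) next to a neutral character like punctuation makes it behave as if it sat inside Arabic text.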
I think this was a fast overview of the current state regarding the desktop; let's try to talk about the problems.
1) No standards: there are no standards on how to normalise Arabic letters, so it's not implemented in any database server. It can be done using fuzzy search, but fuzzy searching is slow and requires the application to know about it.
2) Letter accents: they are not used, thus making non-native Arabic speakers unable to pronounce the words correctly.
3) The famous lam-alef problem.
4) Translation: Arabic-to-English translation requires a high level of artificial intelligence. I think no one will do this for us; we must do it either via research centers in the universities or via graduation projects by computer science students.
5) OCR and voice recognition: same as above, with the additional point that no Arabic website follows the accessibility guidelines, so we can't apply OCR technology even where it exists.
6) We have no good open source font till now; the KACST fonts are fine, but the English glyphs are ugly.
7) Multiple implementations: each application implements bidi and/or shaping by itself. We have a bidi library, "fribidi", which is used by Pango; Pango implements the shaping algorithm itself. Qt implements its own bidi and shaping; OpenOffice and Mozilla both implement theirs. We don't have a library for shaping till now; fribidi was supposed to implement this, but I don't know why it hasn't happened yet. So if we find a problem with the bidi, or the Unicode standard changes, we'll have to go through all the applications to fix it. Some applications don't implement bidi and shaping at all, so we are required to code the shaping into each of them, increasing the mess!
8) Xft problems: this is somehow related to the above problem. X itself doesn't implement bidi and shaping; Xft doesn't; Xlib doesn't. So any application not using Pango or Qt must be patched for Arabic to work correctly, which is not the Right Thing (tm). Or we must implement bidi and shaping in Xlib or Xft, but doing that would break Qt, Mozilla, OpenOffice, and the Pango Xft rendering backends.
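On the normalization, accents and lam-alef points above, Python's unicodedata module hints at what a solution could build on: NFKC normalization decomposes the lam-alef presentation form back into its two letters, and the tashkeel (combining marks) can be stripped for accent-insensitive matching. This is a sketch of the idea, not a standard:

```python
import unicodedata

# The lam-alef ligature has its own presentation-form code point;
# NFKC decomposes it back to LAM + ALEF.
lam_alef = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(unicodedata.normalize("NFKC", lam_alef) == "\u0644\u0627")  # True

def strip_tashkeel(text: str) -> str:
    """Drop combining marks (fatha, damma, kasra, ...) so that
    accented and unaccented spellings compare equal."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

word = "\u0643\u064E\u062A\u064E\u0628"  # KAF+FATHA, TEH+FATHA, BEH
print(strip_tashkeel(word) == "\u0643\u062A\u0628")  # True
```

A database server that applied something like this consistently at index time would solve much of the normalization problem described above; the missing piece is an agreed standard for which transformations to apply.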

I hope this was a useful quick overview of the Arabic language in general, from a programmer's point of view!
