Arabic, the big problem.


Submitted by msameer on Sun, 02/01/2005 - 12:04pm

I don't know why I'm writing this.
I've been involved with Arabic computing for the last few years.
During the last few weeks I've started to realize the problems with Arabic, not only when it comes to text processing by the computer, but also for non-native Arabic speakers, and even for native Arabic speakers like us.
Arabic is a somewhat complex language which requires certain capabilities from the rendering backend to be displayed correctly.
One of these capabilities is "shaping". To simplify: we have 28 letters, and some letters take a different shape according to their position in the word: initial, medial, final or isolated.
The rendering backend must be aware of this to display Arabic correctly.
Arabic letters are stored in text files in their isolated form; the rendering backend must then interpret the letters and join them correctly, otherwise we'll end up with non-joined letters.
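To make the shaping idea concrete, here is a small Python sketch using the standard library. It shows the four positional presentation forms that Unicode defines for one letter (BEH), and how compatibility normalization maps all of them back to the single base character that actually gets stored in a file:

```python
import unicodedata

# The Arabic letter BEH (U+0628) is stored once in text files, but the
# Unicode "Arabic Presentation Forms-B" block defines four shaped
# glyphs for it, one per position in the word.
forms = {
    "isolated": "\uFE8F",
    "final":    "\uFE90",
    "initial":  "\uFE91",
    "medial":   "\uFE92",
}

for position, glyph in forms.items():
    print(position, unicodedata.name(glyph))

# Compatibility normalization maps every shaped form back to the base
# letter U+0628 -- which is what a shaping engine works from.
assert all(unicodedata.normalize("NFKC", g) == "\u0628"
           for g in forms.values())
```

The rendering backend effectively does the reverse: it reads a run of base letters and picks the initial/medial/final/isolated glyph for each according to its neighbours.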
The second thing is BiDi, the bidirectional algorithm. Imagine embedding an Arabic string within an English string, or vice versa. This doesn't happen only with English, but with any language written from left to right. The rendering backend must be able to resolve this situation and reorder the text segments to obtain a visually correct order.
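The bidirectional algorithm works from a per-character direction class that Unicode assigns to every code point. A quick Python illustration of those classes, which is what a BiDi implementation like fribidi starts from:

```python
import unicodedata

# Every Unicode character carries a bidirectional class that the BiDi
# algorithm uses when reordering text for display:
#   L  = left-to-right letter (Latin)
#   AL = right-to-left Arabic letter
#   EN = European number, AN = Arabic number
print(unicodedata.bidirectional("a"))       # L
print(unicodedata.bidirectional("\u0628"))  # AL (Arabic letter BEH)
print(unicodedata.bidirectional("1"))       # EN
print(unicodedata.bidirectional("\u0661"))  # AN (Arabic-Indic digit one)
```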

Now there is another thing the rendering system has to take care of: diacritics, or what we call in Arabic "tashkeel", the accents.

Before I talk about problems regarding the software we are using: I was talking yesterday with a cool British guy who was trying to teach himself Arabic by means of flash cards. But he discovered that the letters (even in the Arabeyes wordlist) are NOT accented, so he'll have problems pronouncing the words. I don't write the accents myself, so I can't blame anyone.

Now let me state the problems with GNU/Linux. I can't completely separate the GNU/Linux problems from those of other operating systems, but I'll be mainly talking about GNU/Linux.

A GNU/Linux system is composed of several components working together.
The desktop is built on the X server, or the X Window System, which is the layer responsible for interfacing with the hardware. It gives you a plain desktop, on which the desktop environment puts the background, the icons, a panel, and so on.
You open application "windows"; the position of these windows is controlled by the "window manager".

With an open system like GNU/Linux you'll find yourself having multiple window managers and multiple desktop environments, and you can even assemble your own desktop from several components.
The most popular desktop environments are KDE and GNOME.

KDE is written in C++ using the Qt toolkit, while GNOME is written in C using the GTK+ toolkit.

A few years ago we didn't have real Arabic support in either toolkit. Sometime later GTK+ hit version 2.0 and Qt hit 3.0, and both brought good Arabic support in terms of bidi and shaping. Some small problems remained, but they are almost solved by now. One of them was GTK+ not supporting the letter accents; this remained a problem for a long time, but it has been solved.
I'll be talking mainly about GTK+ since I'm more familiar with it.
As we said before, we must have a rendering backend. The rendering backend draws strings on the toolkit widgets. It might be incorporated into the toolkit, as in Qt, or separated into another library, as with GTK+, which uses "Pango" as its rendering backend. Since the X layer sits below the GUI toolkits, X provides functions to draw strings too; I'll address this later.
Now, GTK+ uses UTF-8 internally to represent strings. UTF-8 is one of the Unicode Transformation Formats, and it is a multi-byte encoding.
Let's try to explain this.
How do you map a character stored in a file to the corresponding character in a font? Characters are stored in files as numbers. A character used to be a single byte, and a byte is 8 bits, so we couldn't have more than 2^8 = 256 characters. That might be enough to represent a language or two at the same time, but it is not enough to represent all languages at once. Thus we had the notion of an "encoding": a font with the Arabic letters, another with the Hebrew letters, another with the Greek letters and so on, and we could only use one of them at a time to interpret the bytes.
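You can still see this one-language-per-byte world through the legacy code pages. A small sketch (using ISO-8859-6, the standard 8-bit Arabic code page, as the example):

```python
# Legacy 8-bit code pages: the same Arabic letter becomes a different
# single byte in each regional encoding, and an encoding that lacks
# the letter simply cannot represent it.
beh = "\u0628"  # Arabic letter BEH

print(beh.encode("iso-8859-6"))  # one byte in the Arabic code page
print(beh.encode("cp1256"))      # one byte in the Windows Arabic code page

try:
    beh.encode("iso-8859-1")     # Latin-1 has no Arabic letters at all
except UnicodeEncodeError as e:
    print("not representable:", e.reason)
```

This is exactly why mixing Arabic, Hebrew and Greek in one 8-bit file was impossible without tricks, and why Unicode was needed.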
With the Unicode standard we now have enough space to represent all languages within one character set.
UTF-8 is one of the Unicode encodings; a character might be 1, 2, 3 or 4 bytes. The Arabic block (U+0600 to U+06FF) falls into the 2-byte range.
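A quick Python demonstration of the variable width: ASCII stays one byte, an Arabic letter takes two, and characters further up take three or four:

```python
# UTF-8 is variable-width: ASCII stays 1 byte, the Arabic block
# (U+0600-U+06FF) needs 2 bytes, higher code points need 3 or 4.
for ch in ["a", "\u0628", "\u20AC"]:  # Latin a, Arabic BEH, Euro sign
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# U+0628 becomes d8 a8: the two-byte pattern 110xxxxx 10xxxxxx
# carrying the 11 bits of 0x628.
```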
So now we have toolkits capable of rendering Arabic and applying bidi and shaping. Not all toolkits can do this, but I'm talking about the major two.

Bidi can get very complex when we have an Arabic string in which we embed an English string, and then embed another Arabic string inside that English one. Here come the Unicode control characters: they are used to help the rendering backend resolve the bidi correctly. We have no keys on the keyboard to input them, but GTK+ text widgets have a submenu in the right-click context menu that allows inputting them. AFAIK this is not present in Qt at the moment.
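These are the invisible direction-control characters in question; a short Python listing of the most common ones:

```python
import unicodedata

# The invisible BiDi control characters. Inserting one of these gives
# the rendering backend an explicit direction hint when the implicit
# algorithm resolves nested runs the wrong way.
controls = {
    "\u200E": "LEFT-TO-RIGHT MARK",
    "\u200F": "RIGHT-TO-LEFT MARK",
    "\u202A": "LEFT-TO-RIGHT EMBEDDING",
    "\u202B": "RIGHT-TO-LEFT EMBEDDING",
    "\u202C": "POP DIRECTIONAL FORMATTING",
}
for ch, expected in controls.items():
    assert unicodedata.name(ch) == expected
    print(f"U+{ord(ch):04X} {expected}")
```

For instance, a RIGHT-TO-LEFT MARK (U+200F) placed after a trailing Arabic word in a left-to-right paragraph keeps the following punctuation on the expected side.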
That was a fast overview of the current state of the desktop; now let's try to talk about the problems.
1) No standards: there is no standard on how to normalise Arabic letters, so it's not implemented in any database server. It can be done using fuzzy search, but fuzzy searching is slow and requires the application to know about it.
2) Letter accents: they are not used, thus leaving non-native Arabic speakers unable to pronounce the words correctly.
3) The famous lam-alef problem.
4) Translation: Arabic-to-English translation requires a high level of artificial intelligence. I think no one will do this for us; we must do it either via research centers in the universities or via graduation projects by computer science students.
5) OCR and voice recognition: same as above, with the additional point that no Arabic website follows the accessibility guidelines, so we can't apply OCR technology even where it exists.
6) We have no good open source font till now; the KACST fonts are fine, but their English glyphs are ugly.
7) Multiple implementations: each application implements bidi and/or shaping by itself. We have a bidi library, "fribidi", which is used by Pango, and Pango implements the shaping algorithm. Qt implements its own bidi and shaping; OpenOffice and Mozilla each implement their own bidi and shaping too. We have no library for shaping till now; fribidi was supposed to implement this, but I don't know why it hasn't happened yet. So if we find a problem with the bidi, or the Unicode standard changes, we'll have to go through all the applications to fix it. Some applications don't implement bidi and shaping at all, so we are required to code the shaping into each of them, increasing the mess!
8) Xft problems: this is somehow related to the above. X itself doesn't implement bidi and shaping; neither does Xft, nor Xlib. So any application not using Pango or Qt must be patched for Arabic to work correctly, which is not the Right Thing (tm). Alternatively we could implement bidi and shaping in Xlib or Xft, but doing so would break Qt, Mozilla, OpenOffice and the Pango Xft rendering backend.
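On the normalization point in 1): Unicode decomposition already gets you part of the way, since the hamza-carrying alef variants canonically decompose into a bare alef plus a combining mark, and the tashkeel marks are combining characters too. A crude sketch in Python (not a substitute for a real database collation, just an illustration of the idea):

```python
import unicodedata

def normalize_arabic(s: str) -> str:
    """Crude normalization: decompose, then drop combining marks
    (hamza above/below, tashkeel), so alef variants and vocalized vs.
    unvocalized spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# U+0623 (alef with hamza above) decomposes to alef + combining hamza
assert unicodedata.normalize("NFD", "\u0623") == "\u0627\u0654"
assert normalize_arabic("\u0623") == normalize_arabic("\u0627")

# A fatha (U+064E) is a combining mark, so vocalized text matches the
# unvocalized spelling as well.
assert normalize_arabic("\u0643\u064E\u062A\u064E\u0628") == "\u0643\u062A\u0628"
```

Real Arabic search needs more than this (teh marbuta, alef maqsura, lam-alef handling), which is exactly why a shared, standardized implementation would help.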

That, in my opinion, is a quick overview of the Arabic language in general, from a programmer's point of view!

Comments

Submitted by Alaa (not verified) on Sun, 02/01/2005 - 12:48pm

BIDI is needed for numbers: while arabic text flows from right to left, numbers flow from left to right as in latin languages, so BIDI is required even in unilingual texts.

another thing is that the BIDI standard sucks, it makes absolutely no sense. why is a sentence split and reordered if the line direction is left to right and the sentence starts with right-to-left characters, then includes left-to-right characters, then more right-to-left chars?

I mean if you write

arabic1 english arabic2

this will be displayed as

english arabic1 arabic2 (or something similar but equally borked)

then there is the weird bracket behavior which makes writing xml or latex with arabic embedded in it a major pain.

cheers,
Alaa

Submitted by msameer on Sun, 02/01/2005 - 1:06pm

No alaa, If you write:
arabic1 english arabic2 it'll be displayed as:
arabic2 english arabic1
but aligned to the right side
I think it makes sense ?!

Submitted by Anonymous (not verified) on Sun, 02/01/2005 - 8:48pm

what I'm talking about is when the line direction is left to right.

ie like all html that does not have a dir attribute.

Submitted by msameer on Sun, 02/01/2005 - 8:53pm

i think the browsers must implement the correct rearrangement of text, regardless of the alignment.

Submitted by Anonymous (not verified) on Mon, 03/01/2005 - 1:00pm

this is beside the point, the algorithm states that this is how it should be done if the line direction is left to right.

imagine you're writing an arabic to english dictionary first word will be arabic but the line direction will be left to right.

now if it so happens that you need to switch languages in the middle of the definition (to refer to another related arabic word masalan) things will break unless you embed control chars.

the algorithm makes no sense in these situations. and don't tell me bracket rules don't bother you, ever tried writing xml with arabic inside?

I have to insert blank lines all around the tags to avoid confusion.

Submitted by Anonymous (not verified) on Mon, 21/02/2005 - 8:18pm

Came across something that reminded me of part of the problem you wrote about with Arabic. Just thought I'd bring this to your attention.

"As one types, substitutions are made dynamically as the context changes. Not only do varied ligatures appear, but subtle changes to glyph exit and entry strokes ensure an attractive, flowing text – both between letters and at word beginnings and endings."
http://store.adobe.com/type/browser/landing/bickham.html

