Baghdad, The yet to be.

Submitted by msameer on Sun, 12/02/2006 - 3:36pm

As of today, the 12th of February 2006, the CVS holds the new Baghdad code. I've dumped the old code and data files.
The old code was not accurate: a lot of correct words were flagged as incorrect and vice versa. Besides, using a SQL database as the back-end storage engine wasn't really a good thing in my opinion.

I had a lot of ideas in mind on how to implement the new thing.

1) A complete word list with all the Arabic words. This is tedious and the list would be huge.
2) Continue using the old data set. I'd have to understand how it was generated, then extend and/or correct it. I'm not a linguist and I think my Arabic sucks ;-)
3) Try to identify whether the word fed into the checker is a verb, a noun or whatever, then try to get the root by removing prefixes, suffixes and unneeded letters from the stem. This wasn't easy because I couldn't find rules to classify the words, and I could find an exception to every rule I came up with.
4) As above, but without strict rules. Instead, try to apply all the rules and then classify according to probability.
5) Ahmad Gharbeia had an idea: I shouldn't inspect each word separately, I should take the previous word into account too. This looks fine, but the problem is that I need to build Baghdad as a library other applications can use, and I'm not willing to implement support for every application out there. Therefore I'll create a plug-in for enchant, as it seems to be the upcoming standard (according to Islam), and this will allow me to support both GNOME and KDE.
6) After talking with Amr Gharbeia, we decided to use a set of regular expressions to extract the root of a word and then match this root against a word list. Of course this won't always work, so I'll have to add some AI to handle it, but this is fine with me.
7) I extended the above idea. For the regular expressions, each file is named after the number of letters in the words that will be matched against the expressions in it. Each line in the file is composed of 2 parts: the regular expression itself, followed by the number of letters returned by a match.
For the word lists, each file is named after the number of letters of the words in it, so whenever a regular expression matches, we know which word list to check.
With this method we avoid running a word against every regular expression we have, which saves time and processing power, and knowing which file to check for the word gains a bit of speed as well.
I know this algorithm needs a lot of optimization, but I'll leave that until it becomes slow.
I'll also implement a cache of correct and incorrect words with their suggestions.
I'll also implement a separate word list for words with no Arabic roots. In the end we'll be using a hybrid model, as I discussed with Alaa.

Now what does this mean?

It means that we might end up with a FLOSS spell checker after all these years.

It also means that we'll need to start building the regular expressions and word lists, which is not something I really like to do. Under other conditions I'd have kept my mouth shut, because I'm asking people to do something I won't be doing myself. But as I'm doing all the coding, I guess I have the right to say that.

It also means that if the above algorithm fails, we'll have to live without an Arabic spell checker.

Now let me state the current situation:
1) Main library: works. Not all the correction algorithms are there, but it's in the same state as the release before the rewrite.
2) Data files: 8 words and 3 regular expressions, so we can detect 24 words.
3) Enchant library: Completed.

The library still needs a lot of work before we get private dictionary support, encodings other than UTF-8, diacritics stripping and relaxed spell checking rules. But the word list is the most important thing right now, in my opinion.
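
As a rough idea of what diacritics stripping could look like, here is a small Python sketch. It assumes "diacritics" means the Arabic short-vowel marks (tashkeel) in the Unicode range U+064B to U+0652; which marks Baghdad should actually strip is still an open choice, and the function name is made up.

```python
import re

# Assumption: only the tashkeel marks U+064B..U+0652 (fathatan through
# sukun) are stripped. Letters such as alef with hamza are left untouched.
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return text with the Arabic short-vowel marks removed."""
    return _TASHKEEL.sub("", text)


# kaf + fatha, teh + fatha, beh + fatha
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)  # kaf, teh, beh
```

Stripping before the lookup would let a vocalized word and its bare form hit the same regular expressions and word lists.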


Submitted by diaa (not verified) on Sun, 12/02/2006 - 4:14pm

keep it up
