Why aspell and Not Baghdad?

Submitted by msameer on Fri, 24/03/2006 - 2:34pm

It all started with a question from Amr Gharbeia:
"Why are we working on an Arabic spell checker? Why not a wordlist for aspell?"
That's a simple question, but it changed a lot of things.
The answer could be: because aspell won't work with Arabic.
This triggered another question: who said so? Did you test?
Mo. Elzubeir started the Duali project, and I assumed he had done the testing and discovered that it wouldn't work.
But did I do the testing? No.

Too bad I wasted months doing something because someone didn't do the testing. I should've done my share of the testing too.

So it's been four years without a spell checker just because some people are lame? Too bad ;-)

Also, given that Dwayne of translate.org.za suggested the word list approach and said that it would work, we can see that our approach was wrong.
Now Baghdad can continue, but not as a spell checker. It can be a grammar checker, or an engine that understands the sentence, or anything else, but not a spell checker.

For Baghdad to become that, we need a dataset in a specific format, and I'm sure no one will help generate it. All those voices out there crying about the absence of an Arabic spell checker didn't, and won't, move or help generate the word list. I can code, but the one thing I can't do is the word list.

Now that I have an Arabic wordlist generated from the words of the holy Quran, and given that it worked like a charm, I can say that all we need is the word list.
Too bad I assumed that aspell wouldn't work for Arabic.
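
To give an idea of what "generating a wordlist from a text" can look like, here is a minimal sketch in Python. It is not the script I actually used. The function names (`strip_marks`, `build_wordlist`) are made up for illustration, and it assumes that dropping the diacritics and the tatweel and keeping only runs of basic Arabic letters is good enough for a first pass.

```python
import re
import unicodedata

# Runs of letters from the basic Arabic block (hamza U+0621 through yeh U+064A).
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def strip_marks(text):
    """Remove tatweel (U+0640) and combining marks such as the harakat."""
    text = text.replace("\u0640", "")
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def build_wordlist(text):
    """Return the unique Arabic words found in text, sorted by code point."""
    return sorted(set(ARABIC_WORD.findall(strip_marks(text))))


if __name__ == "__main__":
    sample = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C \u0643\u062A\u0627\u0628"
    # One word per line, which is the usual shape of a plain wordlist file.
    print("\n".join(build_wordlist(sample)))
```

The vocalized and unvocalized spellings of the same word collapse into one entry here, which is a design choice: whether the list should keep diacritics at all is an open question this sketch does not settle.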

Now for the wordlist we want, I'd say that:
* It must come from modern Arabic as it is actually used. Arabic is full of ancient words; if you know them, then you don't need a spell checker :-)
* It must be correct.

That's why I still object to using the Buckwalter dataset: we don't know whether it's 100% correct, and we still don't know how many ancient words it contains.

For the same reasons, I haven't released the list generated from the Quran. I'm sure it's correct, but the Quran is special: some words are written in a different way, and I don't know which ones, nor do I have the time or knowledge to proofread it.

Now for the dataset Baghdad needs to be a grammar checker:
We need a table that lists all the Arabic words, which derivation rules apply to each of them, and their position in the sentence. (This was a problem with my previous approach: a word can be derived correctly and still be wrong because it is not in the Arabic dictionary.)
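
As a rough picture of such a table, here is a small Python sketch. Everything in it is hypothetical: the `WordEntry` fields, the rule labels and the position labels are placeholders, not Baghdad's real format. The point it shows is that a derivation is accepted only when the table lists that rule for that word, instead of accepting anything a rule can produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    """One row of the (hypothetical) table."""
    word: str
    rules: frozenset = frozenset()      # derivation rules known to apply
    positions: frozenset = frozenset()  # places the word may take in a sentence


def derivation_allowed(table, word, rule):
    """True only if the word is in the table and the rule is listed for it."""
    entry = table.get(word)
    return entry is not None and rule in entry.rules


# Placeholder data: the word and labels are invented for the example.
table = {
    "ktb": WordEntry("ktb", frozenset({"rule_1"}), frozenset({"verb"})),
}
```

With this shape, `derivation_allowed(table, "ktb", "rule_2")` is false even if "rule_2" would mechanically produce something, which is exactly the case my previous approach got wrong.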

At the moment what we need is a wordlist, or some modern Arabic text, and I'd be glad to maintain the list after that.

Comments

Submitted by amr@www.gharbeia.net on Sat, 25/03/2006 - 10:44am

I should be working on the wordlist. It is sad how we tend to put off the important things so we can do the urgent things. In the meantime, we need Arabic text in bulk: all we can get.

Submitted by phaeronix (not verified) on Sat, 25/03/2006 - 6:18pm

Well, after some thought, msameer made a wordlist from a copy of the Quran.

How many words were in there, Mohammed? 40,000, I think.
Not enough, but a good start. I am trying to include it in a usable way in the next Phaeronix so we can have some wider testing.
I had the idea of a reference text against which we can measure accuracy (as a percentage of detected and not detected words).
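
One way to read that accuracy idea, as a rough Python sketch: take a reference text split into words known to be correct and words known to be misspelled, then count how many of each the wordlist flags. The function name `accuracy` and the two-list input are assumptions made for the example; nothing here is an agreed format.

```python
def accuracy(wordlist, correct_words, misspelled_words):
    """Percentages of misspellings detected and of correct words wrongly flagged."""
    known = set(wordlist)
    # A word is "flagged" when it is absent from the wordlist.
    detected = sum(1 for w in misspelled_words if w not in known)
    false_alarms = sum(1 for w in correct_words if w not in known)

    def pct(part, total):
        return 100.0 * part / total if total else 0.0

    return {
        "detected_pct": pct(detected, len(misspelled_words)),
        "false_alarm_pct": pct(false_alarms, len(correct_words)),
    }
```

Reporting the two numbers separately matters: a tiny wordlist would "detect" every misspelling while also flagging most correct words.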

Msameer, can you create that wiki page? We need to start collecting ideas.

Submitted by Ethan Bradford (not verified) on Mon, 15/05/2006 - 11:28pm

Gokalp Yapici and I (with some help from Mohammed) have submitted the data for Aspell Arabic to Kevin Atkinson. Our testing shows that it works very well indeed.

The first version uses exactly the Buckwalter data, no more and no less. After that's installed in Aspell's code repository, we'll update it with a version that adds several words to handle the (remarkably few) words that came up as missing in our testing. The second version will also have a sounds-alike file created by Mohammed.

So before you do a lot of work on another approach to Arabic spell checking, I recommend you try this out and see if it works for you.
