ARTICLE IN PRESS

Available online at www.sciencedirect.com

Computer Speech and Language xxx (2008) xxx–xxx
www.elsevier.com/locate/csl

Arabic diacritic restoration approach based on maximum entropy models

Imed Zitouni *, Ruhi Sarikaya

IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, United States

Received 19 December 2006; received in revised form 11 December 2007; accepted 3 June 2008

Abstract

In modern standard Arabic and in dialectal Arabic texts, short vowels and other diacritics are omitted; exceptions are made for important political and religious texts and for scripts intended for beginning students of Arabic. Scripts without diacritics carry considerable ambiguity, because many words with different diacritic patterns appear identical in a diacritic-less setting. In this paper we present a maximum entropy approach for restoring short vowels and other diacritics in an Arabic document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based, and part-of-speech tag features. The combination of these feature types leads to a high-performance diacritic restoration model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In a case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%. We also compare our approach to previously published techniques and demonstrate its effectiveness in restoring diacritics in different kinds of data, such as dialectal Iraqi Arabic scripts.
© 2008 Published by Elsevier Ltd.


Keywords: Arabic diacritic restoration; Vowelization; Maximum entropy; Finite state transducer


1. Introduction


Semitic languages such as Arabic and Hebrew are less studied than English for computer speech and language processing. In recent years, however, Arabic in particular has been receiving tremendous attention. Arabic text is typically presented without short vowels and other diacritic marks, which are placed either above or below the graphemes. The process of adding vowels and other diacritic marks to Arabic text is called diacritization or vowelization. Vowels help define the sense and meaning of a word and indicate how it should be pronounced. However, the use of vowels and other diacritics has lapsed in modern Arabic writing: modern Arabic texts are composed of scripts without short vowels and other diacritic marks. This often leads to considerable ambiguity, since several words that have different diacritic patterns may appear identical

* Corresponding author. Tel.: +1 914 945 1346. E-mail addresses: [email protected] (I. Zitouni), [email protected] (R. Sarikaya).

0885-2308/$ - see front matter © 2008 Published by Elsevier Ltd. doi:10.1016/j.csl.2008.06.001

Please cite this article in press as: Zitouni, I., Sarikaya, R., Arabic diacritic restoration approach based ..., Computer Speech and Language (2008), doi:10.1016/j.csl.2008.06.001


in a diacritic-less setting. Educated modern Arabic speakers are able to accurately restore diacritics in a document, based on the context and their knowledge of the grammar and the lexicon of Arabic. However, a text without diacritics is a source of confusion for beginning readers and people with learning disabilities. It is also problematic for natural language processing applications such as text-to-speech or speech-to-text, where the lack of diacritics adds another layer of ambiguity when processing the data. For example, full vocalization of text is required for text-to-speech applications, where the mapping from graphemes to phonemes is not simple compared to languages such as English and French, in which there is, in most cases, a one-to-one relationship. Also, using data with diacritics has been shown to improve the accuracy of speech recognition applications (Afify et al., 2004). Currently, text-to-speech, speech-to-text, and other applications use data in which diacritics are placed manually, a tedious and time-consuming practice. A diacritization system that restores the diacritics of scripts, i.e., supplies the full diacritical markings, would be of interest to these applications. It would also greatly benefit non-native speakers and sufferers of dyslexia, and could assist in restoring the diacritics of children's books and poetry books, a task that is currently done manually. We recently proposed a statistical approach that restores short vowels and other diacritics using a maximum entropy framework (Zitouni et al., 2006). We present in this paper an in-depth analysis of this approach, and we also investigate its effectiveness in processing different kinds of data, such as dialectal Iraqi Arabic and modern standard Arabic. We also compare our approach to other published competitive techniques, such as the finite state transducer approach of (Nelken and Stuart, 2005).
The approach we propose ensures highly accurate restoration of diacritics and eliminates the cost of the manually diacritized text required by several applications. We cast the diacritic restoration task as a sequence classification problem. The proposed approach is based on the maximum entropy framework, in which several diverse sources of information are employed; the model implicitly learns the correlation between these types of information and the output diacritics. In the next section, we present the set of diacritics to be restored and the ambiguity we face when processing non-diacritized text. Section 3 gives a brief summary of related work. Section 4 presents our diacritization model; we explain the training and decoding process as well as the different feature categories employed to restore the diacritics. Section 5 describes a clearly defined and replicable split of the LDC's Arabic Treebank Parts 1, 2 and 3 corpus, used to build and evaluate the system, so that reproduction of the results and future comparisons can be accurately established. Section 6 studies the performance of our technique on a smaller data set drawn from a single source, An Nahar News Text: we use only LDC's Arabic Treebank Corpus Part 1 v3.0, again with a clearly defined and replicable split of the data. Section 7 reports a comparison of our approach to the finite state machine modeling technique that showed promising results in (Nelken and Stuart, 2005). Section 8 presents the effectiveness of our approach in processing dialectal Arabic text such as Iraqi, which has a different structure and annotation convention compared to the modern standard Arabic used in LDC's Arabic Treebank corpus. Finally, Section 9 concludes the paper and discusses future directions.


2. Arabic diacritics


The Arabic alphabet consists of 28 letters that can be extended to a set of 90 by additional shapes, marks, and vowels (Tayli and Al-Salamah, 1990). The 28 letters represent the consonants and long vowels such as ا and ى (both pronounced /a:/), ي (pronounced /i:/), and و (pronounced /u:/). Long vowels are constructed by combining ا, و, and ي with the short vowels. The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters but by diacritics. A diacritic is a short stroke placed above or below the consonant. Table 1 shows the complete set of Arabic diacritics. We split the Arabic diacritics into three sets: short vowels, doubled case endings, and syllabification marks. Short vowels are written as symbols either above or below the letter in text with diacritics, and dropped altogether in text without diacritics. There are three short vowels:


• fatha: represents the /a/ sound; written as an oblique dash over a letter (cf. fourth row of Table 1).
• damma: represents the /u/ sound; written as a loop over a letter that resembles the shape of a comma (cf. fifth row of Table 1).
• kasra: represents the /i/ sound; written as an oblique dash under a letter (cf. sixth row of Table 1).




Table 1
Arabic diacritics on the consonant ت (pronounced /t/)

Category                           Name               Meaning/Pronunciation
Short vowels                       fatha              /a/
                                   damma              /u/
                                   kasra              /i/
Doubled case ending ("tanween")    tanween al-fatha   /an/
                                   tanween al-damma   /un/
                                   tanween al-kasra   /in/
Syllabification marks              shadda             consonant doubling
                                   sukuun             vowel absence

The doubled case ending diacritics are vowels used at the end of words; the term "tanween" is used to express this phenomenon. Tanween marks indefiniteness and is manifested in the form of case marking, or in conjunction with case marking as the bearer of the tanween. As with the short vowels, there are three different tanween diacritics: tanween al-fatha, tanween al-damma, and tanween al-kasra. They are placed on the last letter of the word and have the phonetic effect of placing an "N" at the end of the word. Text with diacritics also contains two syllabification marks:


• shadda: a gemination mark placed above an Arabic letter; it denotes the doubling of the consonant. The shadda is usually combined with a short vowel.
• sukuun: written as a small circle above a letter; it marks the boundary between syllables or the end of verbs in the jussive mood, and indicates that the letter does not carry a vowel.

Table 2 shows an Arabic sentence transcribed with and without diacritics. In modern Arabic, writing scripts without diacritics is the most natural way. Exceptions are made for important political and religious texts, as well as scripts for beginning students of the Arabic language, where documents contain diacritics. In a diacritic-less setting, many words with different vowel patterns may appear identical, which leads to considerable ambiguity at the word level. The word كتب, for example, has 21 possible forms that have valid interpretations when adding diacritics (Kirchhoff and Vergyri, 2005): it may be interpreted as the verb "to write" in كَتَبَ (pronounced /kataba/), or as the noun "books" in كُتُبٌ (pronounced /kutubun/). A study conducted by Debili et al. (2002) shows that there is an average of 11.6 possible diacritizations for every non-diacritized word when analyzing a text of 23,000 script forms. Arabic diacritic restoration is a non-trivial task, as noted in (El-Imam, 2003). Native speakers of Arabic are able, in most cases, to accurately vocalize words in text based on their context, the speaker's knowledge of


Table 2
The same Arabic sentence without (upper row) and with (middle row) diacritics. The English translation is shown in the third row.



the grammar, and the lexicon of Arabic. Our goal is to convert knowledge used by native speakers into features and incorporate them into a maximum entropy model. We assume that the input text to be diacritized does not contain any diacritics.
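Producing such diacritic-free input from diacritized text amounts to deleting the combining marks of Table 1, which occupy the contiguous Unicode range U+064B (tanween al-fatha) through U+0652 (sukuun). A minimal sketch, using the /kataba/ example from Section 2:

```python
# Diacritics of Table 1 occupy a contiguous Unicode block:
# U+064B..U+064D tanween, U+064E..U+0650 short vowels,
# U+0651 shadda, U+0652 sukuun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(text):
    """Remove short vowels, tanween, shadda, and sukuun from Arabic text."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

print(strip_diacritics("كَتَبَ"))  # /kataba/ -> the bare form كتب
```

Training pairs for the classifier can be produced the same way: the stripped string supplies the input characters, and the removed marks supply the reference labels.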


3. Previous work


Diacritic restoration has been receiving increasing attention and has been the focus of several studies. In (El-Sadany and Hashish, 1988), a rule-based method that uses a morphological analyzer for vowelization was proposed. Another rule-based grapheme-to-sound conversion approach appeared in 2003 (El-Imam, 2003). The main drawback of these rule-based methods is that it is difficult to keep the rules up to date and to extend them to other Arabic dialects; new rules are also required because of the changing nature of any "living" language. More recently, there have been several studies that use alternative approaches for the diacritization problem. In (Emam and Fisher, 2004), an example-based hierarchical top-down approach is proposed: first, the training data is searched hierarchically for a matching sentence. If there is a matching sentence, the whole utterance is used; otherwise the data is searched for matching phrases, then words, to restore diacritics. If there is no match at all, character n-gram models are used to diacritize each word in the utterance. In (Vergyri and Kirchhoff, 2004), diacritics in conversational Arabic are restored by combining morphological and contextual information with an acoustic signal. Diacritization is treated as an unsupervised tagging problem, where each word is tagged as one of the many possible forms provided by Buckwalter's morphological analyzer (Buckwalter, 2002), and the Expectation Maximization (EM) algorithm is used to learn the tag sequences. Gal (2002) used an HMM-based diacritization approach; this method is a white-space-delimited, word-based approach that restores only short vowels (a subset of all diacritics). Most recently, a weighted finite state machine based algorithm was proposed (Nelken and Stuart, 2005). This method employs characters and larger morphological units in addition to words.
Among all the previous studies, this one is the most sophisticated in terms of integrating multiple information sources and formulating the problem as a search task within a unified framework, and it shows competitive accuracy when compared to previous studies. In their algorithm, a character-based generative diacritization scheme is enabled only for words that do not occur in the training data; it is not clearly stated in the paper whether the method predicts the diacritics shadda and sukuun. Even though the methods proposed for diacritic restoration have been maturing and improving over time, they are still limited in terms of coverage and accuracy. In the approach we present in this paper, we propose to restore the most comprehensive list of diacritics used in Arabic text. Our method differs from the previous approaches in the way the diacritization problem is formulated and in how multiple information sources are integrated. We view diacritic restoration as sequence classification: given a sequence of characters, our goal is to assign a diacritic to each character. Our approach is based on the Maximum Entropy (henceforth MaxEnt) technique (Berger et al., 1996). MaxEnt can be used for sequence classification by converting the activation scores into probabilities (through the soft-max function, for instance) and using the standard dynamic programming search algorithm (also known as Viterbi search). The literature contains several other approaches to sequence classification (McCallum et al., 2000; Lafferty et al., 2001). The conditional random fields method presented in (Lafferty et al., 2001) is essentially a MaxEnt model over the entire sequence: it differs from plain MaxEnt in that it models the sequence information, whereas MaxEnt makes a decision for each state independently of the other states.
The approach presented in (McCallum et al., 2000) combines MaxEnt with hidden Markov models to allow observations to be presented as arbitrary overlapping features, and defines the probability of state sequences given observation sequences. We report in Section 7 a comparative study between our approach and the most competitive diacritic restoration method, which uses a finite state machine algorithm (Nelken and Stuart, 2005). The MaxEnt framework was successfully used to combine a diverse collection of information sources and yielded a highly competitive model, as we describe in the next section. Even though it is not conventional to reference papers published after submitting a manuscript, we mention a method (Habash and Rambow, 2007) that was published while this manuscript was in


review. The work described in (Habash and Rambow, 2007) is similar to that of (Vergyri and Kirchhoff, 2004) in that the diacritization problem is cast as choosing the correct diacritization, among all possible diacritizations of a given word, provided by the Buckwalter analysis. Since that method relies on having all possible diacritizations of a word, it has problems dealing with words that are not covered by the Buckwalter analysis.


4. Automatic diacritization


The performance of many natural language processing tasks, such as shallow parsing (Zhang et al., 2002) and named entity recognition (Florian et al., 2004), has been shown to depend on integrating many sources of information. Given the stated focus of integrating many feature types, we selected the MaxEnt classifier. MaxEnt has the ability to integrate arbitrary types of information and make a classification decision by aggregating all information available for a given classification.


4.1. Maximum entropy classifiers

We formulate the task of restoring diacritics as a classification problem, in which we assign to each character in the text a label (i.e., a diacritic). Before formally describing the method,¹ we introduce some notation: let $Y = \{y_1, \ldots, y_n\}$ be the set of diacritics to predict or restore, $X$ be the example space (i.e., strings of characters), and $F = \{0,1\}^m$ be a feature space. Each example $x \in X$ has an associated vector of binary features $f(x) = (f_1(x), \ldots, f_m(x))$. The goal of the process is to associate each example $x \in X$ with a probability distribution over the labels from $Y$ (if we are interested in soft classification) or with one label $y \in Y$ (if we are interested in hard classification). Soft classification means that a likelihood is attributed to every label, whereas hard classification stands for predicting the most likely label in the current context. For our purposes, the classification itself can be viewed as a function

$$h : X \times Y \to [0, 1] \quad (1)$$

such that

$$\sum_{i=1}^{n} h_i(x) = 1 \quad \forall x \in X \quad (2)$$

In this context, $h_i(x)$ (also denoted $h(x, y_i)$) can be viewed as a conditional probability distribution, $p(y_i \mid x)$. In most cases, we will be interested in finding the mode of the distribution $h_i(x)$, i.e., making a "hard" classification decision

$$\hat{h}(x) = \arg\max_{y_i \in Y} h_i(x)$$

In a supervised framework, like the one we are considering here, one has access to a set of training examples $T \subseteq X$ together with their classifications: $T = \{(x_1, y_1), \ldots, (x_k, y_k)\}$. To evaluate performance, we also set aside a different subset of labeled examples $E = \{(x_1, y_1), \ldots, (x_p, y_p)\} \subseteq X$, which is our development test data. The MaxEnt algorithm associates a set of weights $\{\alpha_{ij}\}_{i=1\ldots n,\ j=1\ldots m}$ with the features $f_j$; the higher the absolute value, the heavier the impact a particular feature has on the overall model. To have a fully functional system, one has to obtain the "proper" values for the $\alpha_{ij}$ parameters. These weights are estimated during the training phase to maximize the likelihood of the data (Berger et al., 1996). Given these weights, the model computes the probability distribution over labels for a particular example $x$ as follows:

$$P(y_i \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{m} \alpha_{ij}^{f_j(x)}, \qquad Z(x) = \sum_{i} \prod_{j} \alpha_{ij}^{f_j(x)} \quad (3)$$

¹ This is not meant to be an in-depth introduction to the method, but a brief overview to familiarize the reader with it.
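For intuition, Eq. (3) is a normalized product of feature weights, equivalently a softmax over linear scores of the fired features. The sketch below computes such a distribution for one character; the feature names and weight values are invented for illustration and are not the paper's actual features:

```python
import math

def maxent_distribution(fired, weights, labels):
    """P(y|x): exponentiate the summed weights of fired features, normalize."""
    scores = {y: sum(weights[y].get(f, 0.0) for f in fired) for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # normalization factor Z(x)
    return {y: math.exp(scores[y]) / z for y in labels}

# Hypothetical binary features that fired for one character of a word,
# with hypothetical trained weights for two candidate diacritics.
labels = ["fatha", "damma"]
weights = {
    "fatha": {"prev_char=k": 1.2, "pos=NOUN": 0.3},
    "damma": {"prev_char=k": 0.1, "pos=NOUN": 0.9},
}
dist = maxent_distribution(["prev_char=k", "pos=NOUN"], weights, labels)
print(max(dist, key=dist.get))  # hard classification: arg max over labels
```

Any new information source is added by extending the feature dictionaries; the form of the computation does not change.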


where $Z(x)$ is a normalization factor. Most of the time we prefer writing Eq. (3) in a form where the parameters appear in exponent form:

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{m} \alpha_{ij}^{f_j(x)} = \frac{1}{Z(x)} \prod_{j=1}^{m} \left(e^{\lambda_{ij}}\right)^{f_j(x)} \quad (4)$$

$$= \frac{1}{Z(x)} \exp\left[\sum_{j=1}^{m} \lambda_{ij} f_j(x)\right] \stackrel{\mathrm{def}}{=} P_{\lambda}(y \mid x) \quad (5)$$

The two formulae are equivalent. Since there is no restriction on the type of features, this model easily integrates different and arbitrary types of features, simply by adding them to the feature pool. Using the training data, ideally we would want to estimate the model parameters $\{\lambda_{ij}\}$ so as to minimize the empirical error

$$\mathrm{Prob}(\hat{h}(x) \neq c(x)) \approx \frac{1}{T} \sum_{i=1}^{T} I(\hat{h}(x_i) \neq y_i)$$

where $I(x) = 1$ if $x$ is true and 0 otherwise, and $c(x)$ denotes the true classification of $x$. Unfortunately, this empirical value is not easy to optimize directly, though there are statistical methods that can accomplish this goal under certain assumptions (for instance, the Winnow method described by Littlestone (1988), for the case when the classification space is linearly separable). Instead, we look for the parameter set that maximizes the log-likelihood of the data:

$$\log \prod_{i=1}^{N} P(y_i \mid x_i) = \sum_{i=1}^{N} \log P(y_i \mid x_i) \quad (6)$$

in other words, we are looking for the solution to the problem

$$\hat{P} = \arg\max_{\lambda} \sum_{i=1}^{N} \log P_{\lambda}(y_i \mid x_i) \quad (7)$$

Since the value in Eq. (6) is a convex function of the parameters $\{\lambda_{ij}\}$, as one can easily check, finding the exponential model with maximum data likelihood becomes a classical optimization problem with a unique solution. Several methods have been proposed to find the optimum, such as generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), improved iterative scaling (IIS) (Berger et al., 1996), the limited-memory Broyden–Fletcher–Goldfarb–Shanno approximate gradient procedure (BFGS) (Liu and Nocedal, 1989), and sequential conditional generalized iterative scaling (SCGIS) (Goodman, 2002). Describing these methods is beyond the scope of this paper, and we refer the reader to the cited material.

While the MaxEnt method can integrate multiple feature types seamlessly, in certain cases it can overestimate its confidence in low-frequency features. Let us clarify through an example. Assume that we are interested in computing the probability of getting heads while tossing a (possibly unbiased) coin, and that we tossed it 4 times and got heads 4 times and no tails. Furthermore, assume that our model has two features: $f_1(x, y)$ = "is $y$ 'heads'" and $f_2(x, y)$ = "is $y$ 'tails'". Then our constraints would be $E_{\hat{p}}[f_1] = 1$ and $E_{\hat{p}}[f_2] = 0$, which in turn forces the model to always predict that heads will show up with probability 1, which is, of course, premature after only 4 tosses. The problem comes from enforcing a hard constraint on a feature whose estimate is not reliable enough. Several adjustments can be made to address this issue, such as regularization by adding Gaussian priors (Stanley and Ronald, 2000) or exponential priors (Goodman, 2004) to the model, using fuzzy MaxEnt boundaries (Khudanpur, 1995), or using MaxEnt with inequality constraints (Kazama and Jun'ichi, 2003).

In this paper, to estimate the optimal $\alpha_{ij}$ values, we train our MaxEnt model using the sequential conditional generalized iterative scaling (SCGIS) technique (Goodman, 2002). To overcome the problem of overestimating confidence in low-frequency features, we use the regularization method based on adding Gaussian priors described in Stanley and Ronald (2000).² Intuitively, this measure models parameters as being close to 0 in value, unless the data suggests they are not. After computing the class probability distribution, the chosen diacritic is the one with the highest a posteriori probability. The decoding algorithm, described in Section 4.2, performs sequence classification through dynamic programming.

² Note that the resulting model cannot really be called a maximum entropy model, as it does not yield the model with maximum entropy (the second term in the product), but rather a maximum a-posteriori model.
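The coin-tossing example can be made concrete. The sketch below fits a single weight by gradient ascent (a simplification for illustration: the paper trains with SCGIS, not gradient ascent), with and without a Gaussian prior on the weight:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_heads_prob(n_heads, n_tails, prior_var=None, lr=0.1, steps=2000):
    """MLE / MAP estimate of p(heads) = sigmoid(lam) for a Bernoulli model.
    With prior_var set, the objective gains a Gaussian prior -lam^2/(2*var)."""
    lam = 0.0
    n = n_heads + n_tails
    for _ in range(steps):
        grad = n_heads - n * sigmoid(lam)   # gradient of the log-likelihood
        if prior_var is not None:
            grad -= lam / prior_var         # gradient of the Gaussian prior
        lam += lr * grad
    return sigmoid(lam)

# 4 heads, 0 tails: unregularized training drives p(heads) toward 1,
# while the Gaussian prior keeps the weight (hence p) away from the extreme.
p_mle = fit_heads_prob(4, 0)
p_map = fit_heads_prob(4, 0, prior_var=1.0)
print(round(p_mle, 3), round(p_map, 3))
```

With the prior, the weight settles at a finite value and the estimated probability stays well below 1, exactly the shrinkage toward 0 described above.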


4.2. Search to restore diacritics


We are interested in finding the diacritics of all characters in a script or a sentence. These diacritics have strong interdependencies, which cannot be properly modeled if the classification is performed independently for each character. We view this problem as sequence classification, in contrast to an example-based classification problem: given a sequence of characters in a sentence $x_1 x_2 \ldots x_L$, our goal is to assign a diacritic (label) to each character, resulting in a sequence of diacritics $y_1 y_2 \ldots y_L$. Taking the example space to be formed of words, instead of characters, and applying the same procedure is difficult: such a space has a very high dimensionality, and we would quickly run into data sparseness problems. Instead, we apply the Markov assumption, which states that the diacritic associated with character $i$ depends only on the diacritics associated with the characters at positions $i-k+1, \ldots, i-1$, where $k$ is usually equal to 3. Given this assumption, and the notation $x_1^L = x_1 \ldots x_L$, the conditional probability of assigning the diacritic sequence $y_1^L$ to the character sequence $x_1^L$ becomes

$$p(y_1^L \mid x_1^L) = p(y_1 \mid x_1^L)\, p(y_2 \mid x_1^L, y_1) \cdots p(y_L \mid x_1^L, y_{L-k+1}^{L-1}) \quad (8)$$

and our goal is to find the sequence that maximizes this conditional probability

$$\hat{y}_1^L = \arg\max_{y_1^L} p(y_1^L \mid x_1^L) \quad (9)$$

While we restricted the conditioning on the classification tag sequence to the previous $k$ diacritics, we do not impose any restrictions on the conditioning on the characters: the probability is computed using the entire character sequence $x_1^L$. In practical situations, though, features only examine a limited context around the character of interest, but they are allowed to "look ahead", i.e., to examine features of the characters succeeding the current one. Under the constraint described in Eq. (8), the sequence in Eq. (9) can be identified efficiently. To obtain it, we create a classification tag lattice (also called a trellis), as follows:

270 271 272 273 274 275 276 277

• Let $x_1^L$ be the character input sequence and $S = \{s_1, s_2, \ldots, s_M\}$ be an enumeration of $Y^k$ ($M = |Y|^k$); we call an element $s_j$ a state. Every such state corresponds to a labeling of $k$ successive characters. We find it useful to think of an element $s_i$ as a vector with $k$ elements. We use the notation $s_i[j]$ for the $j$th element of such a vector (the label associated with the token $x_{i-k+j+1}$) and $s_i[j_1 \ldots j_2]$ for the sequence of elements between indices $j_1$ and $j_2$.
• We conceptually associate every character $x_i$, $i = 1, \ldots, L$, with a copy of $S$, $S^i = \{s_1^i, \ldots, s_M^i\}$; this set represents all the possible labelings of the characters $x_{i-k+1}^{i}$ at the stage where $x_i$ is examined.
• We then create links from the set $S^i$ to $S^{i+1}$, for all $i = 1, \ldots, L-1$, with the property that

$$w(s_{j_1}^{i}, s_{j_2}^{i+1}) = \begin{cases} p(s_{j_2}^{i+1}[k] \mid x_1^L,\ s_{j_2}^{i+1}[1..k-1]) & \text{if } s_{j_1}^{i}[2..k] = s_{j_2}^{i+1}[1..k-1] \\ 0 & \text{otherwise} \end{cases}$$

These weights correspond to the probability of a transition from state $s_{j_1}^{i}$ to state $s_{j_2}^{i+1}$.
• For every character $x_i$, we compute recursively³

³ For convenience, the index $i$ associated with state $s_j^i$ is moved to $\alpha$; the function $\alpha_i(s_j)$ is in fact $\alpha(s_j^i)$.


$$\alpha_0(s_j) = 0, \quad j = 1, \ldots, M$$
$$\alpha_i(s_j) = \max_{j_1 = 1, \ldots, M} \left[ \alpha_{i-1}(s_{j_1}) + \log w(s_{j_1}^{i-1}, s_j^{i}) \right]$$
$$\gamma_i(s_j) = \arg\max_{j_1 = 1, \ldots, M} \left[ \alpha_{i-1}(s_{j_1}) + \log w(s_{j_1}^{i-1}, s_j^{i}) \right]$$

Intuitively, $\alpha_i(s_j)$ represents the log-probability of the most probable path through the lattice that ends in state $s_j$ after $i$ steps, and $\gamma_i(s_j)$ represents the state just before $s_j$ on that particular path.⁴
• Having computed the $\alpha_i$ values, the algorithm for finding the best path, which corresponds to the solution of Eq. (9), is:
1. Identify $\hat{s}_L^L = \arg\max_{j = 1 \ldots M} \alpha_L(s_j)$
2. For $i = L-1, \ldots, 1$, compute $\hat{s}_i^i = \gamma_{i+1}(\hat{s}_{i+1}^{i+1})$
3. The solution for Eq. (9) is given by
$$\hat{y} = \{\hat{s}_1^1[k], \hat{s}_2^2[k], \ldots, \hat{s}_L^L[k]\}$$

The full algorithm is presented in Algorithm 1. The runtime of the algorithm is $\Theta(|Y|^k \cdot L)$, linear in the size of the sentence $L$ but exponential in the size of the Markov dependency $k$. To reduce the search space, we use beam search.


Algorithm 1. Viterbi search

Input: characters $x_1^L$.
Output: the most probable sequence of tags (i.e., diacritics) $\hat{y}_1^L = \arg\max_{y_1^L} P(y_1^L \mid x_1^L)$.

    Create $S = \{s_1, \ldots, s_M\}$, an enumeration of $Y^k$
    for $j = 1, \ldots, M$ do $\alpha_j \leftarrow 0$
    for $i = 1-k, \ldots, L+k$ do
        for $j = 1, \ldots, M$ do
            $\gamma_{ij} \leftarrow -1$; $\beta_j \leftarrow -\infty$
            for $j' = 1, \ldots, M$ such that $s_{j'}[2..k] = s_j[1..k-1]$ do
                $v \leftarrow \alpha_{j'} + \log w(s_{j'}^{i-1}, s_j^{i})$
                if $v > \beta_j$ then $\beta_j \leftarrow v$; $\gamma_{ij} \leftarrow j'$
        $\alpha \leftarrow \beta$
    $j \leftarrow \arg\max_{j} \alpha_j$
    for $i = L+k-1, \ldots, 1$ do $\hat{s}^i \leftarrow s_j$, with $j \leftarrow \gamma_{i+1,j}$
    $\hat{y}_1^L \leftarrow (\hat{s}^1[1], \hat{s}^2[1], \ldots, \hat{s}^L[1])$

⁴ For numerical reasons, the values $\alpha_i$ are computed in log space, since computing them in normal space would result in underflow for even short sentences. Alternatively, one can compute a normalized version of the $\alpha_i$ coefficients, normalized at each stage by the sum of all coefficients in the trellis column.

Anyone implementing Algorithm 1 faces a practical challenge: even for small values of $k$, the space $Y^k$ can be quite large, especially if the classification space is large, because the algorithm's search space is linear in $|Y|^k$. This is why in practice, for many natural language processing tasks, a beam-search algorithm is preferred. Beam search is built around the observation that many nodes in the trellis have such small $\alpha$-values that they will not be part of any "good" path, and can therefore be skipped without any loss in performance. To achieve this, the algorithm keeps only a few of the $M = |Y|^k$ states alive at any trellis stage $i$; after computing the expansion of those nodes for stage $i+1$, it eliminates some of the resulting states based on their $\alpha_i$ values. One can use a variety of filtering techniques, among which we mention:

• using a fixed beam: keep only the $n$ top-scoring candidates at each stage $i$ for expansion.

Please cite this article in press as: Zitouni, I., Sarikaya, R., Arabic diacritic restoration approach based ..., Computer Speech and Language (2008), doi:10.1016/j.csl.2008.06.001


• using a variable beam – keep only the candidates that are within a specified relative distance (in terms of α_i) from the top-scoring candidate at stage i.

Both options are good choices – in our experience one can use a beam of 5 and a relative beam of 30% to speed up the computation significantly (20–30 times) with almost no drop in performance; this might vary depending on the task, though.
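The fixed-beam pruning described above can be sketched as follows. This is a minimal Python illustration, not the paper's exact Algorithm 1; the `log_w` function standing in for the MaxEnt transition score is hypothetical.

```python
def beam_viterbi(x, tags, log_w, beam=5):
    """Left-to-right decoding with a fixed beam: at each position, expand
    every surviving hypothesis with every tag, then keep only the `beam`
    top-scoring hypotheses for the next stage."""
    hyps = [(0.0, ())]  # (log-score, tag sequence so far)
    for i in range(len(x)):
        expanded = [(score + log_w(seq, y, i), seq + (y,))
                    for score, seq in hyps
                    for y in tags]
        expanded.sort(key=lambda h: h[0], reverse=True)
        hyps = expanded[:beam]  # fixed-beam pruning
    return max(hyps, key=lambda h: h[0])[1]
```

A variable beam would instead keep, at each stage, every hypothesis whose score is within a fixed relative distance of the best one.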

4.3. Features employed


Within the MaxEnt framework, any type of feature can be used, enabling the system designer to experiment with interesting feature types, rather than worry about specific feature interactions. In contrast, with a rule-based system, the system designer would have to consider how, for instance, lexically derived information for a particular example interacts with character context information. That is not to say, ultimately, that rule-based systems are in some way inferior to statistical models – they are built using valuable insight which is hard to obtain from a statistical-model-only approach. Instead, we are merely suggesting that the output of such a rule-based system can easily be integrated into the MaxEnt framework as one of the input features, most likely leading to improved performance. Features employed in our system can be divided into three categories: lexical, segment-based, and part-of-speech tag (POS) features. We also use the two previously assigned diacritics as additional features. In the following, we briefly describe each category of features:


• Lexical features: we include the character n-grams spanning the current character x_i, both preceding and following it, in a window of 7: {x_{i-3}, …, x_{i+3}}. We use the current word w_i and its word context in a window of 5 (forward and backward trigram): {w_{i-2}, …, w_{i+2}}. We specify whether the character under analysis is at the beginning or at the end of a word. We also add joint features over the above sources of information.

• Segment-based features: Arabic blank-delimited words are composed of zero or more prefixes, followed by a stem and zero or more suffixes. Each prefix, stem, or suffix will be called a segment in this paper. The segmentation process consists of separating the Arabic white-space delimited words into segments. Segments are often the unit of analysis when processing Arabic (Zitouni et al., 2005); syntactic information such as POS or parse information is usually computed on segments rather than words. As an example, the Arabic white-space delimited word contains a verb, a third-person feminine singular subject-marker (she), and a pronoun suffix (them); it is also a complete sentence meaning "she met them." To separate the Arabic white-space delimited words into segments, we use a segmentation model similar to the one presented by Lee et al. (2003). It is important to note that we conduct deep segmentation: we split the word into zero or more prefixes, followed by a stem (i.e., root) and zero or more suffixes. The white-space delimited word meaning "she meets them," for example, is segmented into three morphs: a prefix (she), followed by a stem and a suffix (them). Another example is the word meaning "their location," which should be segmented into two tokens: the noun (location) and the possessive pronoun (their) that is carried as a suffix. The model obtains an accuracy of about 98% on development test data extracted from the LDC Arabic Treebank corpus, which is good given the deep segmentation we perform.

In order to simulate real applications, we only use segments generated by the model rather than true segments. In the diacritization system, we include the current segment a_i and its segment context in a window of 5 (forward and backward trigram): {a_{i-2}, …, a_{i+2}}. We specify whether the character under analysis is at the beginning or at the end of a segment. We also add joint information with lexical features.

• POS features: we attach to the segment a_i of the current character its POS tag: POS(a_i). This is combined with joint features that include the lexical and segment-based information. We use a statistical POS tagging system built on Arabic Treebank data with the MaxEnt framework (Ratnaparkhi, 1996). We use a set of 121 POS tags extracted from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0. The model has an accuracy of about 96%. We do not use the true POS tags because we would not have access to such information in real applications.
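As a concrete illustration, the lexical features for one character can be assembled roughly as follows. This is a simplified Python sketch; the feature names and padding token are our own, not the paper's.

```python
def char_features(words, wi, ci):
    """Lexical features for character ci of word wi: a 7-character window,
    a 5-word window, and word-boundary indicators."""
    text = " ".join(words)
    # absolute offset of the character in the running text
    offset = sum(len(w) + 1 for w in words[:wi]) + ci
    feats = {}
    for d in range(-3, 4):  # characters x_{i-3} .. x_{i+3}
        j = offset + d
        feats["char[%+d]" % d] = text[j] if 0 <= j < len(text) else "<pad>"
    for d in range(-2, 3):  # words w_{i-2} .. w_{i+2}
        j = wi + d
        feats["word[%+d]" % d] = words[j] if 0 <= j < len(words) else "<pad>"
    feats["word_initial"] = ci == 0
    feats["word_final"] = ci == len(words[wi]) - 1
    return feats
```

Joint features would then be formed by concatenating pairs of these atomic features; segment-based features follow the same pattern over segments instead of words.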



5. Experiments on LDC's Arabic Treebank corpus


5.1. Data


We show in this section the performance of the diacritic restoration system on data extracted from the LDC's Arabic Treebank of diacritized news stories. The corpus contains modern standard Arabic text and includes complete vocalization (including case-endings). We introduce here a clearly defined and replicable split of the corpus, so that our results can be reproduced and future investigations accurately compared. The data includes documents from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0. This corpus comprises 1834 documents from Agence France press, Umaah, and An Nahar News Text. We split the corpus into three sets: training data, development data (devset), and test data (testset). The training data contains approximately 506,000 words, whereas the devset contains close to 59,700 words and the testset close to 59,300 words. The 180 documents of the devset and the 195 documents of the testset are created by taking the last (in chronological order) 12.5% and 13.5% of documents, respectively, from every LDC Arabic Treebank data source. The devset contains documents from LDC's Arabic Treebank Part 1 v3.0 – from "20000715_0006" (i.e., July 15, 2000) to "20001115_0136" (i.e., November 15, 2000); LDC's Arabic Treebank Part 2 v2.0 – from "20020120_0002" (i.e., January 20, 2002) to "20020929_0017" (i.e., September 29, 2002) and from "backissue_01-a0_024" to "backissue_33-e0_009"; as well as LDC's Arabic Treebank Part 3 v2.0 – from "20020115_0010" (i.e., January 15, 2002) to "20021115_0020" (i.e., November 15, 2002).

The testset also contains documents from LDC's Arabic Treebank Part 1 v3.0 – from "20001115_0138" (i.e., November 15, 2000) to "20001115_0236" (i.e., November 15, 2000); LDC's Arabic Treebank Part 2 v2.0 – from "backissue_34-a0_020" to "backissue_40-e0_025"; and LDC's Arabic Treebank Part 3 v2.0 – from "20021115_0010" (i.e., November 15, 2002) to "20021215_0045" (i.e., December 15, 2002). The time spans of the training set, devset, and testset are intentionally non-overlapping, as this models how the system will perform in the real world. Previously published papers use proprietary corpora or lack a clear description of the training/test data split, which makes comparison with other techniques difficult. By clearly reporting the split of the publicly available LDC Arabic Treebank corpus in this section, we want future comparisons to be correctly established. It is important to note that we do not remove digits, punctuation, or any other characters from the text during decoding or when computing scores. Also, the devset and testset are initially undiacritized and unsegmented; this is true for all experiments shown in this paper. We let the system decide on every character, including the prediction of the non-diacritic label for digits and punctuation, and we count an error if the system assigns a diacritic to a digit or a punctuation mark. Therefore, we apply no special processing to non-Arabic characters.
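Because the LDC document identifiers begin with a zero-padded YYYYMMDD date, a chronological split of the kind described above reduces to sorting the identifiers and cutting off the tail. A small sketch, where the fraction argument is illustrative:

```python
def chronological_split(doc_ids, test_fraction):
    """Return (train, test) with the chronologically last `test_fraction`
    of documents as the test set. IDs of the form YYYYMMDD_NNNN sort
    chronologically because the date field is zero-padded."""
    docs = sorted(doc_ids)
    n_test = round(len(docs) * test_fraction)
    if n_test == 0:
        return docs, []
    return docs[:-n_test], docs[-n_test:]
```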


5.2. Evaluation results


Experiments are reported in terms of word error rate (WER), segment error rate (SER), and diacritization error rate (DER). The DER is the proportion of incorrectly restored diacritics. The WER is the percentage of incorrectly diacritized white-space delimited words: to be counted as incorrect, at least one character in the word must have a diacritization error. The SER is similar to WER but indicates the proportion of incorrectly diacritized segments. A segment can be a prefix, a stem, or a suffix. Segments are often the unit of analysis when processing Arabic (Zitouni et al., 2005), and syntactic information such as POS or parse information is based on segments rather than words. Consequently, it is important to know the SER in cases where the diacritization system may be used to help disambiguate syntactic information. We also report in this section the performance on each diacritic in terms of precision (P), recall (R), and F-measure (F): precision is the number of correctly predicted diacritics divided by the number predicted by the system, recall is the number of correctly predicted diacritics divided by the number of true diacritics, and F-measure is twice the product of precision and recall divided by their sum. We notice that on the devset the MaxEnt model converges to an optimum after fewer than 6 iterations, and consequently not much tuning is required. The devset in this experiment is used for feature selection and tuning
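These scores can be computed directly from aligned reference and hypothesis labelings. A sketch of the character-level metrics, where the list-of-labels representation is ours:

```python
def der(ref, hyp):
    """Diacritization error rate: the proportion of characters whose
    predicted diacritic differs from the reference."""
    assert len(ref) == len(hyp)
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

def unit_error_rate(ref_units, hyp_units):
    """WER/SER: a word (or segment) counts as wrong if any of its
    characters' diacritics is wrong, i.e. if the label tuples differ."""
    assert len(ref_units) == len(hyp_units)
    return sum(r != h for r, h in zip(ref_units, hyp_units)) / len(ref_units)

def prf(ref, hyp, label):
    """Precision, recall, and F-measure for a single diacritic label."""
    correct = sum(r == h == label for r, h in zip(ref, hyp))
    predicted = sum(h == label for h in hyp)
    true = sum(r == label for r in ref)
    p = correct / predicted if predicted else 0.0
    r = correct / true if true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```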



model training parameters. Once decided on the devset, both the features and the model training parameters remain the same for testing on the testset. Many modern Arabic scripts contain the consonant doubling "shadda"; it is common for native speakers to write without diacritics except for the shadda. In this case the role of the diacritization system is to restore the short vowels, the doubled case endings, and the vowel absence "sukuun". We run three batches of experiments: (1) a first experiment where documents contain the original shadda; (2) a second experiment (cascaded model) where shadda is predicted first and then the other diacritics; and (3) a third experiment (joint model) where all diacritics, including shadda, are predicted at once. The diacritization system using the cascaded model proceeds in two steps: a first step where only shadda is restored, and a second step where the other diacritics (excluding shadda) are predicted. The advantage of such a model is a smaller search space: seven labels to predict versus twelve (6 diacritics × 2), since shadda is usually combined with the other vowels, with the exception of sukuun and shadda itself. To assess the performance of the system under different conditions, we consider three cases based on the kind of features employed:


1. a system that has access to lexical features only;
2. a system that has access to lexical and segment-based features;
3. a system that has access to lexical, segment-based, and POS features.


In addition to these features, we always use the two previously assigned diacritics as additional features. The precision of the shadda restoration step on the devset is 90% when we use lexical features only, 96.8% when we add segment-based information, and 97.1% when we employ lexical, POS, and segment-based features. On the testset, the precision of shadda restoration is 89.6% when we use lexical features only, 96.5% when we add segment-based information, and also 96.5% when we employ lexical, POS, and segment-based features. Table 3 reports experimental results of the diacritization system with different feature sets. Using only lexical features, the cascaded approach gives a DER of 8.1% and a WER of 25.0% on the testset, which is competitive with a previously published system evaluated on Arabic Treebank Part 2: Nelken and Shieber (2005) report a DER of 12.79% and a WER of 23.61%, using lexical, segment-based, and morphological information. Table 3 also shows that, when segment-based information is added to our system, a significant improvement is achieved: 27% for WER (18.6 vs. 25.0), 31% for SER (9.5 vs. 13.2), and 35% for DER (5.5 vs. 8.1). Similar behavior is observed when the documents contain the original shadda. POS features are also helpful in improving the performance of the system. The use

Table 3
Performance of diacritization system on data from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0

                                          True shadda          Cascaded model       Joint model
                                          WER   SER   DER      WER   SER   DER      WER   SER   DER
Lexical features
  devset                                  24.2  12.3  7.6      24.8  13.0  8.0      25.4  13.2  8.4
  testset                                 24.5  12.5  7.8      25.0  13.2  8.1      26.0  13.6  8.6
Lexical + segment-based features
  devset                                  15.5  7.5   4.5      17.7  8.9   5.2      18.1  9.2   5.5
  testset                                 16.5  8.1   4.8      18.6  9.5   5.5      18.7  9.6   5.8
Lexical + segment-based + POS features
  devset                                  15.1  7.2   4.3      17.3  8.7   5.0      17.4  8.8   5.2
  testset                                 15.9  7.8   4.6      17.4  8.9   5.1      17.8  9.2   5.6

The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively. The columns marked "True shadda" represent results on documents containing the original consonant doubling "shadda", while the columns marked "Cascaded model" and "Joint model" represent results where the system restored diacritics using the cascaded and joint approaches, respectively.


of POS features improved the WER by 6% (17.4 vs. 18.6), SER by 6% (8.9 vs. 9.5), and DER by 7% (5.1 vs. 5.5). Results also show that the cascaded approach outperforms the joint approach when all features are used: 4% for WER (17.4 vs. 18.2), 3% for SER (8.9 vs. 9.2), and 9% for DER (5.1 vs. 5.6). It is also natural to compare the performance of our MaxEnt approach to a baseline model using a dictionary lookup. We hence build a diacritic restoration system where, for each undiacritized word, we predict the diacritization most frequently observed for it in the training data; we do not add any diacritics to previously unseen words. Such a system has a DER of 23.5% and a WER of 57.7% on the devset. On the testset we obtain comparable results: a DER of 23.9% and a WER of 57.7%. To better understand the behavior of our system, we also show in Table 4 the performance of the diacritization system on each diacritic with different feature sets. For this experiment, we use the cascaded approach, since it is the one that gives better results, as shown in Table 3. Results in Table 4 are presented in terms of precision (P), recall (R), and F-measure (F), and show that the doubled case endings ("tanween") are among the hardest to predict. This can be explained by their relatively low frequency, and by the fact that tanween mostly appears at the end of a word, where it is harder to restore diacritics. To confirm this, we show in the next section the performance of the diacritization system without case-endings.
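The dictionary-lookup baseline described above can be sketched as follows. For illustration we strip lowercase vowels as stand-in "diacritics"; with real data, `strip` would remove the Arabic diacritic marks instead.

```python
from collections import Counter, defaultdict

def train_lookup(diacritized_words, strip):
    """Map each undiacritized form to its most frequently observed
    diacritization in the training data."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip(word)][word] += 1
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

def restore(lookup, bare_word):
    # Previously unseen words are left undiacritized, as in the baseline.
    return lookup.get(bare_word, bare_word)

# Toy illustration: treat lowercase vowels as the "diacritics" to strip.
strip = lambda w: "".join(ch for ch in w if ch not in "aiu")
lookup = train_lookup(["kataba", "kataba", "kutiba"], strip)
```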


5.3. Case-ending


Case-ending in Arabic documents is the diacritic attached to the last character of a white-space delimited word. Restoring case endings is the most difficult part of diacritizing a document. Case endings are


Table 4
Diacritic performance of the cascaded approach with different sets of features

                                     Lexical               Lexical + Segment     Lexical + Segment + POS
Diacritic                  Nb        P     R     F         P     R     F         P     R     F
Short vowels
devset:
  fatha /a/                83,594    93.0  93.1  93.0      95.8  96.0  95.9      96.0  96.2  96.1
  damma /u/                19,105    80.4  84.7  82.5      90.5  90.8  90.6      91.0  91.2  91.1
  kasra /i/                49,704    91.6  85.4  88.4      95.4  91.1  93.2      95.6  91.3  93.4
testset:
  fatha /a/                81,809    92.8  93.0  92.9      95.8  95.7  95.8      95.9  96.0  96.0
  damma /u/                18,370    80.1  84.6  82.2      89.4  90.3  89.8      90.2  91.0  90.5
  kasra /i/                48,583    91.3  85.1  88.1      95.0  90.4  92.6      95.1  90.7  92.8
Doubled case ending ("tanween")
devset:
  tanween al-fatha /an/    1,874     58.6  74.0  65.4      84.8  90.4  87.5      84.7  90.5  87.5
  tanween al-damma /un/    735       24.4  68.5  36.0      49.3  75.4  59.6      50.8  76.1  60.9
  tanween al-kasra /in/    2,549     58.8  74.1  65.6      80.8  78.1  79.4      81.4  78.6  80.0
testset:
  tanween al-fatha /an/    1,814     58.4  73.8  65.2      83.0  87.7  85.3      83.0  88.1  85.4
  tanween al-damma /un/    677       23.9  68.1  35.4      47.0  72.0  56.9      48.7  73.3  58.5
  tanween al-kasra /in/    2,695     58.4  73.8  65.2      80.6  77.8  79.1      81.0  78.0  79.4
Syllabification marks
devset:
  shadda                   14,838    90.0  93.6  91.7      96.8  96.5  96.6      97.1  96.4  96.8
  sukuun /o/               17,761    96.7  96.4  96.5      98.1  98.2  98.2      98.4  98.5  98.4
testset:
  shadda                   14,581    89.6  93.0  91.3      96.5  96.4  96.5      96.5  96.6  96.6
  sukuun /o/               17,416    96.5  96.3  96.4      97.6  97.9  97.7      98.2  98.3  98.2

Performance is presented in terms of precision (P), recall (R), and F-measure (F). The term "Nb" stands for the number of characters of a specific diacritic. Evaluation is conducted on devset and testset data that contain close to 260K characters and 558K characters, respectively.


Table 5
Performance of the diacritization system based on employed features

                                          True shadda          Cascaded model       Joint model
                                          WER   SER   DER      WER   SER   DER      WER   SER   DER
Lexical features
  devset                                  11.4  6.4   3.3      12.1  6.8   3.6      13.2  7.2   3.9
  testset                                 11.6  6.6   3.4      12.2  6.9   3.7      13.5  7.6   4.0
Lexical + segment-based features
  devset                                  6.5   3.7   2.0      8.5   4.8   2.3      8.1   4.6   2.6
  testset                                 7.2   4.1   2.2      8.7   5.1   2.5      8.6   4.9   2.7
Lexical + segment-based + POS features
  devset                                  6.1   3.4   1.9      7.9   4.6   2.2      7.6   4.4   2.3
  testset                                 6.7   3.8   2.1      8.2   4.9   2.4      8.0   4.6   2.6

System is trained and evaluated using LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0 without case-endings. The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively. The columns marked "True shadda" represent results on documents containing the original consonant doubling "shadda", while the columns marked "Cascaded model" and "Joint model" represent results where the system restored diacritics using the cascaded and joint approaches, respectively.


only present in formal or highly literary scripts, and only educated speakers of modern standard Arabic master their use. Technically, every noun has such an ending, although at the end of a sentence no inflection is pronounced, even in formal speech, because of the rules of 'pause'. For this reason, we conduct another experiment in which case-endings are stripped throughout the training and testing data, with no attempt to restore them. Case-endings are stripped from all words, including the moods on verbs. This experiment is particularly important for text-to-speech systems, since its results reflect how accurate a text-to-speech system can be. We present in Table 5 the performance of the diacritization system on documents without case-endings. Results clearly show that when case-endings are omitted, the WER declines by 58% (6.7% vs. 15.9%), the SER decreases by 51% (3.8% vs. 7.8%), and the DER is reduced by 54% (2.1% vs. 4.6%). Also, Table 5 shows again that a richer set of features results in better performance; compared to a system using lexical features only,
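Stripping case endings amounts to deleting the diacritic carried by the last character of every white-space delimited word. A sketch, with words represented as lists of (character, diacritic) pairs; this representation is ours, for illustration only:

```python
def strip_case_endings(sentence):
    """Drop the case ending, i.e. the diacritic on the final character of
    each word; all other diacritics are kept. `sentence` is a list of
    words, each a non-empty list of (character, diacritic) pairs."""
    stripped = []
    for word in sentence:
        word = [(ch, diac) for ch, diac in word]  # copy, leave input intact
        last_ch, _ = word[-1]
        word[-1] = (last_ch, "")  # remove the final (case-ending) diacritic
        stripped.append(word)
    return stripped
```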

Table 6
Diacritic performance of the cascaded approach with different sets of features

                              Lexical               Lexical + Segment     Lexical + Segment + POS
Diacritic          Nb         P     R     F         P     R     F         P     R     F
Short vowels
devset:
  fatha /a/        74,493     95.2  94.4  94.8      97.8  96.9  97.4      98.0  97.1  97.6
  damma /u/        13,728     81.8  90.2  85.8      89.9  94.3  92.0      90.4  94.7  92.5
  kasra /i/        34,460     92.6  93.2  92.9      96.1  96.4  96.2      96.3  96.7  96.5
testset:
  fatha /a/        73,520     95.1  94.3  94.7      97.8  96.7  97.2      98.0  97.0  97.5
  damma /u/        13,325     81.0  89.6  85.1      88.6  94.0  91.2      89.1  94.6  91.8
  kasra /i/        33,641     92.0  92.5  92.3      95.6  95.8  95.7      95.7  96.0  95.9
Syllabification marks
devset:
  shadda           11,388     92.1  92.7  92.4      97.2  96.4  96.8      97.8  96.7  97.2
  sukuun /o/       17,582     98.1  93.3  94.7      99.3  97.4  98.4      99.4  98.2  98.8
testset:
  shadda           11,155     91.5  93.3  92.4      97.2  96.1  96.6      97.7  96.6  97.1
  sukuun /o/       17,283     95.7  93.4  94.5      98.9  97.6  98.2      99.1  98.2  98.6

Performance is presented in terms of precision (P), recall (R), and F-measure (F). The term "Nb" stands for the number of characters of a specific diacritic. Evaluation is conducted on devset and testset data that contain close to 260K characters and 558K characters, respectively.


adding POS and segment-based features improved the WER by 42% (6.7% vs. 11.6%), the SER by 41% (3.8% vs. 6.6%), and the DER by 38% (2.1% vs. 3.4%). Similar to the results reported in Table 3, the performance of the system is almost the same whether or not the documents contain the original shadda. A system like this, trained on documents without case-endings, can be of interest to applications such as speech recognition, where the last state of a word HMM model can be defined to absorb all possible vowels (Afify et al., 2004). We show in Table 6 the performance of the diacritization system on each diacritic with different feature sets when no tanween is predicted. As in Table 4, we use the cascaded approach, since it is the one that gives better results, as shown in Table 5. Results in Table 6 confirm that the doubled case endings ("tanween") are the hardest to predict: without case-endings, performance improves considerably for each of the diacritics, in addition to the fact that no tanween has to be predicted at all.


6. Experiments on a smaller data size from a similar source


6.1. Data


The LDC Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0 corpus used to evaluate the diacritic restoration system in Section 5 includes documents from different sources: Agence France press, Umaah, and An Nahar News Text. In that case the MaxEnt model used to build the diacritization system has to discriminate between diacritics and generalize over data from different sources. In this section, we explore the performance of our approach using LDC's Arabic Treebank Part 3 only, which includes documents from the An Nahar News Text only. Our goal is to study the performance of our technique on a smaller data set, where documents are collected from one source only, and to show that, for the diacritic restoration task, the MaxEnt model is able to generalize well and achieve similar performance with a smaller data size. We train and evaluate the diacritization system on the LDC's Arabic Treebank of diacritized news stories – Part 3 only: catalog number LDC2004T11 and ISBN 1-58563-298-8. This corpus includes 600 documents from the An Nahar News Text, with a total of 340,281 words. Again, we introduce here a clearly defined and replicable split of the corpus, so that our results can be reproduced and future investigations accurately established. We split the corpus into two sets: training data and test data (testset). We did not create a development set for this data set, since we use the same MaxEnt parameters as those of the model trained on LDC's Arabic Treebank Parts 1, 2 and 3. The training data contains approximately 288,000 words, whereas the testset contains close to 52,000 words. The 90 documents of the testset are created by taking the last (in chronological order) 15% of documents, dating from "20021015_0101" (i.e., October 15, 2002) to "20021215_0045" (i.e., December 15, 2002).

The time span of the testset is intentionally non-overlapping with that of the training set, as this models how the system will perform in the real world.


6.2. Evaluation results


In this section, we repeat the experiments of Section 5 on the smaller data source. We again run three batches of experiments: (1) a first experiment where documents contain the original shadda; (2) a second experiment with the cascaded model (shadda is predicted first and then the other diacritics); and (3) a third experiment with the joint model, where all diacritics are predicted at once. The precision of shadda restoration is 91.1% when we use lexical features only and 96.2% when we add segment-based information; adding POS features did not improve the shadda precision further. Table 7 reports experimental results of the diacritization system with different feature sets using only LDC's Arabic Treebank Part 3 for training and decoding. As expected, the performance of the diacritization system trained on Part 3 only is close to that reported in Table 3, where Parts 1, 2 and 3 are used for training: we obtain the same SER (8.9) and a slight decrease in performance of 7% for DER (5.1 vs. 5.5) and 3% for WER (17.4 vs. 18.0). This confirms that even with a smaller data size, the MaxEnt model is able to generalize and achieve similar performance. Table 7 shows that, when segment-based information is added to our system, a significant improve-



Table 7
The impact of features on the diacritization system performance using LDC's Arabic Treebank Part 3 for training and decoding

                                          True shadda          Cascaded model       Joint model
                                          WER   SER   DER      WER   SER   DER      WER   SER   DER
Lexical features                          24.8  12.6  7.9      25.1  13.0  8.2      25.8  13.3  8.8
Lexical + segment-based features          18.2  9.0   5.5      18.8  9.4   5.8      19.1  9.5   6.0
Lexical + segment-based + POS features    17.3  8.5   5.1      18.0  8.9   5.5      18.4  9.2   5.7

The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively. The columns marked "True shadda" represent results on documents containing the original consonant doubling "shadda", while the columns marked "Cascaded model" and "Joint model" represent results where the system restored diacritics using the cascaded and joint approaches, respectively.


ment is achieved: 25% for WER (18.8 vs. 25.1), 38% for SER (9.4 vs. 13.0), and 41% for DER (5.8 vs. 8.2). Similar behavior is observed when the documents contain the original shadda. POS features are also helpful here in improving the performance of the system. The use of POS feature improved the WER by 4% (18.0 vs. 18.8), SER by 5% (8.9 vs. 9.4), and DER by 5% (5.5 vs. 5.8). Using only LDC’s Arabic Treebank Corpus Part 3 for training shows also that the cascaded approach is slightly better then the joint one where shadda and other diacritics are estimated in the same time. Using the dictionary lookup approach as described in the previous section on LDC’s Arabic Treebank Corpus Part 3, we obtain a DER of 24.1% and a WER of 53.9% on the testset. Hence, our technique outperforms the dictionary lookup approach by a relative 77% (5.5 vs. 24.1) on DER and by a relative 66% (18.0 vs. 53.9) on WER. Similar to Section 5, we show in Table 8 the performance of the diacritization system using the cascaded approach on each diacritic with different feature sets. Once again, the doubled case ending (‘‘tanween”) is the hardest to predict. This is because the doubled case ending appears at the end of a word where it is the most difficult part in the diacritization of a document.


Table 8
Diacritic performance of the cascaded approach with different sets of features

                                     Lexical               Lexical + Segment     Lexical + Segment + POS
Diacritic                  Nb        P     R     F         P     R     F         P     R     F
Short vowels
testset:
  fatha /a/                70,929    93.4  93.0  93.2      95.3  95.5  95.4      95.4  95.6  95.5
  damma /u/                16,590    79.3  89.6  84.1      88.1  90.7  89.3      88.6  91.2  89.9
  kasra /i/                42,101    91.3  85.0  88.0      94.7  89.8  92.2      94.8  90.0  92.3
Doubled case ending ("tanween")
testset:
  tanween al-fatha /an/    1,693     72.7  86.3  79.0      81.6  85.2  83.4      82.2  85.3  83.7
  tanween al-damma /un/    616       26.5  60.9  36.9      45.7  68.8  54.9      47.2  69.9  56.3
  tanween al-kasra /in/    2,186     54.3  69.9  61.2      76.7  75.4  76.0      77.2  75.6  76.4
Syllabification marks
testset:
  shadda                   13,608    89.3  93.0  91.1      95.7  96.2  95.9      96.0  96.3  96.2
  sukuun /o/               14,621    94.0  94.1  94.0      97.2  97.4  97.3      97.3  97.8  97.5

System is trained and evaluated using LDC's Arabic Treebank Part 3. Performance is presented in terms of precision (P), recall (R), and F-measure (F). The term "Nb" stands for the number of characters of a specific diacritic. Experiments are conducted on close to 223K characters.

Please cite this article in press as: Zitouni, I., Sarikaya, R., Arabic diacritic restoration approach based ..., Computer Speech and Language (2008), doi:10.1016/j.csl.2008.06.001


Table 9
Performance of the diacritization system based on employed features.

                                          True shadda         Cascaded model      Joint model
                                          WER   SER   DER     WER   SER   DER     WER   SER   DER
Lexical features                          11.8  6.6   3.6     12.4  7.0   3.9     12.9  7.5   4.4
Lexical + segment-based features          7.8   4.4   2.4     8.6   4.8   2.7     9.2   5.2   3.1
Lexical + segment-based + POS features    7.2   4.0   2.2     7.9   4.4   2.5     8.4   4.9   2.9

System is trained and evaluated using LDC's Arabic Treebank Corpus Part 3 on documents without case endings. Columns marked "True shadda" report results on documents containing the original consonant doubling ("shadda"), while columns marked "Cascaded model" and "Joint model" report results where the system restored diacritics using the cascaded and joint approaches, respectively.

Table 10
Diacritic performance of the cascaded approach with different sets of features.

                            Lexical             Lexical + Segment   Lexical + Segment + POS
Diacritic        Nb         P     R     F       P     R     F       P     R     F
Short vowels:
  fatha /a/      63,864     95.9  94.2  95.0    97.5  96.4  97.0    97.6  96.7  97.1
  damma /u/      11,911     83.8  88.7  86.2    90.8  91.8  91.3    91.4  92.3  91.8
  kasra /i/      28,839     92.0  92.6  92.3    95.1  95.3  95.2    95.2  95.5  95.3
Syllabification marks:
  shadda /~/     10,713     90.3  94.9  92.5    95.5  97.5  96.4    95.8  98.0  96.9
  sukuun /o/     14,489     94.0  94.1  94.0    97.2  97.3  97.2    97.5  97.8  97.6

System is trained and evaluated using LDC's Arabic Treebank Corpus Part 3 on documents without case endings. Performance is presented in terms of Precision (P), Recall (R), and F-measure (F). The term "Nb" stands for the number of characters of a specific diacritic. Experiments are conducted on close to 223K characters.

6.3. Case-ending

As stated before, the case ending is the diacritic attributed to the last character in a word. Restoring case endings is the most difficult part of the diacritization process. Table 9 shows the performance of the diacritization system when case endings were stripped throughout the training and testing data (see Table 10). Once again, results clearly show that when case endings are omitted, the WER declines by 58% (7.2% vs. 17.3%), the SER decreases by 52% (4.0% vs. 8.5%), and the DER is reduced by 56% (2.2% vs. 5.1%). Table 9 also shows that a richer set of features results in better performance; compared to a system using lexical features only, adding POS and segment-based features improved the WER by 38% (7.2% vs. 11.8%), the SER by 39% (4.0% vs. 6.6%), and the DER by 38% (2.2% vs. 3.6%). Similar to the results reported in Table 7, the performance of the system is similar whether or not the document contains the original shadda. We note that a system trained on documents without case endings can be of interest to applications such as speech recognition (Afify et al., 2004).
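The case-ending-less condition described above can be approximated with a few lines of code: the diacritics attached to a word's final base character are removed, while word-internal diacritics are kept. The sketch below assumes Buckwalter transliteration and an illustrative diacritic inventory; it is not the preprocessing script used in the paper.

```python
# Illustrative Buckwalter-style diacritic set: short vowels a/i/u,
# sukuun o, tanween F/N/K, shadda ~ (assumption, not the paper's exact list).
DIACRITICS = set("aiuoFNK~")

def strip_case_ending(word):
    # Drop trailing diacritic characters, i.e. everything after the
    # last base letter; word-internal diacritics are untouched.
    end = len(word)
    while end > 0 and word[end - 1] in DIACRITICS:
        end -= 1
    return word[:end]
```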

7. Comparison to other approaches

As stated in Section 3, the most recent and advanced approach to diacritic restoration is the one presented in (Nelken and Shieber, 2005): they showed a DER of 12.79% and a WER of 23.61% on the Arabic Treebank corpus using finite state transducers (FST) with Katz language modeling (LM) as described in (Chen and Goodman, 1999). Because they did not describe how they split their corpus into training and test sets, we were not able to use the same data for comparison.


In this section, our aim is essentially to duplicate the aforementioned FST result for comparison, using the identical training and test sets we use for our experiments. We also propose some new variations on the finite state machine modeling technique which improve performance considerably. The algorithm for FST-based vowel restoration could not be simpler: between every pair of characters we insert diacritics if doing so improves the likelihood of the sequence as scored by a statistical n-gram model trained on the training corpus. Thus, between every pair of characters we propose and score all possible diacritic insertions. Results reported in Table 11 indicate the error rates of diacritic restoration (including shadda). We show performance using both Kneser-Ney and Katz LMs (Chen and Goodman, 1999) with increasingly large n-grams. It is our opinion that large n-grams effectively duplicate the use of a lexicon. It is unfortunate but true that, even for a rich resource like the Arabic Treebank, the choice of modeling heuristic and the effects of small sample size are considerable. Using the finite state machine modeling technique, we obtain results similar to those reported in (Nelken and Shieber, 2005): a WER of 23% and a DER of 15%. Better performance is reached with the Kneser-Ney LM. These results still under-perform those obtained by the MaxEnt approach presented in Table 7. When all sources of information are included, the MaxEnt technique outperforms the FST model by 21% (22% vs. 18%) in terms of WER and by 39% (9% vs. 5.5%) in terms of DER. The SER reported in Tables 7 and 9 is based on the Arabic segmentation system we use in the MaxEnt approach. Since the FST model does not use such a system, we found it inappropriate to report SER in this section. In the following, we propose an extension to the aforementioned FST model in which we jointly determine not only the diacritics but also the segmentation into affixes, as described in (Lee et al., 2003).
Table 12 gives the performance of the extended FST model where the Kneser-Ney LM is used, since it produces better results. This is a much more difficult task, as there are more than twice as many possible insertions. However, the choice of diacritics is related to and dependent upon the choice of segmentation. Thus, we demonstrate that a richer internal representation produces a more powerful model.
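The insertion scheme just described can be sketched as a beam search over per-character hypotheses scored by a backoff n-gram model. Everything below is illustrative: the toy dictionary LM, the backoff floor, and the Buckwalter-style diacritic subset are stand-ins for the smoothed Katz/Kneser-Ney models and the weighted-FST implementation actually used.

```python
import math

DIACRITICS = ["", "a", "i", "u", "o"]  # "" = no diacritic; illustrative subset

def ngram_logp(lm, context, symbol, order=3):
    # Crude backoff: try the longest context suffix first, then shorter ones.
    # lm maps tuples of symbols (context..., symbol) to log-probabilities.
    for k in range(order - 1, -1, -1):
        key = tuple(context[max(0, len(context) - k):]) + (symbol,)
        if key in lm:
            return lm[key]
    return math.log(1e-6)  # floor for unseen events

def restore(consonants, lm, beam=4, order=3):
    hyps = [([], 0.0)]  # (symbol sequence, log score)
    for c in consonants:
        new_hyps = []
        for seq, score in hyps:
            s = score + ngram_logp(lm, seq, c, order)
            for d in DIACRITICS:  # propose every diacritic (or none) after c
                ext = seq + [c] + ([d] if d else [])
                ds = s + (ngram_logp(lm, seq + [c], d, order) if d else 0.0)
                new_hyps.append((ext, ds))
        hyps = sorted(new_hyps, key=lambda h: -h[1])[:beam]
    # Score an end-of-word symbol so a final diacritic can pay off.
    hyps = [(seq, sc + ngram_logp(lm, seq, "</w>", order)) for seq, sc in hyps]
    return "".join(max(hyps, key=lambda h: h[1])[0])
```

With a toy bigram LM favoring the pattern CaCaCa, `restore("ktb", lm)` picks the fully vowelized form; a real system would replace the dictionary with smoothed counts over the training corpus.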

Table 11
Error rate in % for n-gram diacritic restoration using FST.

              Katz LM        Kneser-Ney LM
n-gram size   WER   DER      WER   DER
3             63    31       55    28
4             54    25       38    19
5             51    21       28    13
6             44    18       24    11
7             39    16       23    11
8             37    15       23    10

Table 12
Error rate in % for n-gram diacritic restoration and segmentation using FST and a Kneser-Ney LM.

              True shadda    Predicted shadda
n-gram size   WER   DER      WER   DER
3             49    23       52    27
4             34    14       35    17
5             26    11       26    12
6             23    10       23    10
7             23    9        22    10
8             23    9        22    10

Columns marked "True shadda" report results on documents containing the original consonant doubling ("shadda"), while columns marked "Predicted shadda" report results where the system restored all diacritics, including shadda.


Table 13
The impact of features on the diacritization of dialectal Arabic data (predicted shadda).

                                   Train/test split   WER    SER    DER
Lexical features                   97%/3%             23.3   19.1   10.8
                                   87%/13%            23.8   19.4   11.0
Lexical + segment-based features   97%/3%             18.1   15.0   8.2
                                   87%/13%            18.5   15.3   8.4

The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively.

8. Robustness on dialectal Arabic data


The goal of this section is to study the effectiveness of the MaxEnt diacritic restoration approach on dialectal Arabic data, such as the dialect spoken in Iraq. For the experiments shown in this section, we use a manually diacritized dialectal Arabic corpus covering the dialect spoken in Iraq. The data consist of 30,891 sentences labeled by linguists who are native Iraqi speakers. The corpus is randomly split into training and test sets of 29,861 sentences (97% of the corpus) and 1,030 sentences (3% of the corpus), respectively. The training and test data contain 170K (24,953 unique) and 5,897 (2,578 unique) words, respectively. About 21% of the words in the lexicon of the test data are not covered in the training vocabulary. After removing diacritics, the numbers of unique words in the training and test vocabularies are reduced to 15,726 and 2,101, respectively. This implies that there are about 9K (undiacritized) words with multiple diacritizations. For a second set of experiments, we split the corpus into an 87% (26,968 sentences) training set and a 13% (3,918 sentences) test set. Similar to Sections 5 and 6, results are reported in terms of word error rate (WER), segment error rate (SER), and diacritic error rate (DER). To study the performance of the system under different conditions, we consider two cases based on the kind of features employed: (1) a system that has access to lexical features only, and (2) a system that has access to lexical and segment-based features (Afify et al., 2006). Because we do not have an Iraqi POS tagger, we did not experiment with features extracted from POS information. Table 13 shows experimental results of the diacritization system with different feature sets when it is trained on dialectal Iraqi Arabic data.
Results show the robustness of our approach on dialectal Arabic data whose structure is completely different from the modern standard Arabic data in LDC's Arabic Treebank Corpus; performance is comparable to that shown in Sections 5 and 6 using the publicly available LDC's Arabic Treebank Corpus. Using only lexical features, we observe a DER of 10.8% and a WER of 23.3% for the 97%/3% training/test split of the corpus. Similar to the results reported in Table 7, we also notice that when segment-based information is added to our system a significant improvement is achieved: 22% for WER (18.1 vs. 23.3), 21% for SER (15.0 vs. 19.1), and 24% for DER (8.2 vs. 10.8). The results for the 87%/13% training/test split are slightly worse than those for the 97%/3% split. This degradation is expected, as there are more unseen words in the test data compared to the 97%/3% split. However, the relatively small degradation also suggests that the impact of grapheme-based features is larger than that of word- or morpheme-based features; there is a sufficient number of grapheme-based features in both splits of the data to estimate reliable models.
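The out-of-vocabulary figure quoted above (about 21% of unique test words unseen in training) is the standard type-level OOV rate, which can be computed as below; the token lists in the usage are illustrative stand-ins, since the Iraqi corpus is not public.

```python
def oov_rate(train_tokens, test_tokens):
    """Percentage of unique test words that never occur in training."""
    train_vocab = set(train_tokens)
    test_vocab = set(test_tokens)
    unseen = test_vocab - train_vocab
    return 100.0 * len(unseen) / len(test_vocab)
```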


9. Conclusion


We presented in this paper a statistical model for Arabic diacritic restoration. The approach we propose is based on the maximum entropy framework, which gives the system the ability to integrate different sources of knowledge. Our model has the advantage of successfully combining diverse sources of information, including lexical, segment-based, and POS features. Both POS and segment-based features are generated by separate statistical systems, not extracted manually, in order to simulate real-world applications. The segment-based features are extracted from a statistical morphological analysis system using a WFST approach, and the



POS features are generated by a parsing model that also uses the maximum entropy framework. Evaluation results show that combining these sources of information leads to state-of-the-art performance. We also showed the effectiveness of our approach in processing dialectal Arabic documents with a different structure and annotation conventions, such as Iraqi Arabic. As future work, we plan to incorporate information from the Buckwalter morphological analyzer to extract new features that reduce the search space. One idea is to restrict the search to the hypotheses, if any, proposed by the morphological analyzer. We also plan to investigate additional conjunction features to improve the accuracy of the model.


Acknowledgements


Grateful thanks are extended to Jeffrey S. Sorensen for his contribution in conducting the experiments using finite state transducers.


References


Afify, M., Abdou, S., Makhoul, J., Nguyen, L., Xiang, B., 2004. The BBN RT04 BN Arabic system. In: RT04 Workshop, Palisades, NY.
Afify, M., Sarikaya, R., Kuo, H.-K.J., Besacier, L., Gao, Y., 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In: InterSpeech 2006, Pittsburgh, PA, USA.
Berger, A., Della Pietra, S., Della Pietra, V., 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22 (1), 39–71.
Buckwalter, T., 2002. Buckwalter Arabic morphological analyzer version 1.0. Technical Report, Linguistic Data Consortium, LDC2002L49, ISBN 1-58563-257-0.
Chen, S.F., Rosenfeld, R., 2000. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing.
Chen, S.F., Goodman, J., 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13 (4), 359–393.
Darroch, J.N., Ratcliff, D., 1972. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43 (5), 1470–1480.
Debili, F., Achour, H., Souissi, E., 2002. De l'étiquetage grammatical à la voyellation automatique de l'arabe. Technical Report, Correspondances de l'Institut de Recherche sur le Maghreb Contemporain 17.
El-Imam, Y., 2003. Phonetization of Arabic: rules and algorithms. Computer Speech and Language 18, 339–373.
El-Sadany, T., Hashish, M., 1988. Semi-automatic vowelization of Arabic verbs. In: 10th NC Conference, Jeddah, Saudi Arabia.
Emam, O., Fisher, V., 2004. A hierarchical approach for the statistical vowelization of Arabic text. Technical Report, IBM patent filed, DE9-2004-0006, US Patent Application US2005/0192809 A1.
Florian, R., Hassan, H., Ittycheriah, A., Jing, H., Kambhatla, N., Luo, X., Nicolov, N., Roukos, S., 2004. A statistical model for multilingual entity detection and tracking. In: Proceedings of HLT-NAACL 2004, pp. 1–8.
Gal, Y., 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In: ACL-02 Workshop on Computational Approaches to Semitic Languages.
Goodman, J., 2002. Sequential conditional generalized iterative scaling. In: Proceedings of ACL'02.
Goodman, J., 2004. Exponential priors for maximum entropy models. In: Dumais, S., Marcu, D., Roukos, S. (Eds.), HLT-NAACL 2004: Main Proceedings. Association for Computational Linguistics, Boston, MA, USA, pp. 305–312.
Habash, N., Rambow, O., 2007. Arabic diacritization through full morphological tagging. In: NAACL HLT 2007, Companion Volume, Short Papers, Rochester, NY, USA.
Kazama, J., Tsujii, J., 2003. Evaluation and extension of maximum entropy models with inequality constraints. In: Collins, M., Steedman, M. (Eds.), Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 137–144.
Khudanpur, S., 1995. A method of maximum entropy estimation with relaxed constraints. In: 1995 Johns Hopkins University Language Modeling Workshop.
Kirchhoff, K., Vergyri, D., 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Communication 46 (1), 37–51.
Lafferty, J., McCallum, A., Pereira, F., 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML.
Lee, Y.-S., Papineni, K., Roukos, S., Emam, O., Hassan, H., 2003. Language model based Arabic word segmentation. In: Proceedings of ACL'03, pp. 399–406.
Littlestone, N., 1988. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, 285–318.


Liu, D.C., Nocedal, J., 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (3, Ser. B), 503–528.
McCallum, A., Freitag, D., Pereira, F., 2000. Maximum entropy Markov models for information extraction and segmentation. In: ICML.
Nelken, R., Shieber, S.M., 2005. Arabic diacritization using weighted finite-state transducers. In: ACL-05 Workshop on Computational Approaches to Semitic Languages, Ann Arbor, MI, pp. 79–86.
Ratnaparkhi, A., 1996. A maximum entropy model for part-of-speech tagging. In: Conference on Empirical Methods in Natural Language Processing.
Sarikaya, R., Emam, O., Zitouni, I., Gao, Y., 2006. Maximum entropy modeling for diacritization of Arabic text. In: InterSpeech 2006, Pittsburgh, PA, USA.
Tayli, M., Al-Salamah, A., 1990. Building bilingual microcomputer systems. Communications of the ACM 33 (5), 495–505.
Vergyri, D., Kirchhoff, K., 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In: COLING Workshop on Arabic-Script Based Languages, Geneva.
Zhang, T., Damerau, F., Johnson, D.E., 2002. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research 2, 615–637.
Zitouni, I., Sorensen, J., Luo, X., Florian, R., 2005. The impact of morphological stemming on Arabic mention detection and coreference resolution. In: Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, pp. 63–70.
Zitouni, I., Sorensen, J.S., Sarikaya, R., 2006. Maximum entropy based restoration of Arabic diacritics. In: COLING/ACL 2006, Sydney, Australia.

