The Seventh Conference on Language Engineering, Cairo, Egypt, 5-6 December 2007
Arabic Anaphora Resolution Using the Web as Corpus

Khaled Elghamry, Ain Shams University, Egypt (elghamryk [at] ufl [dot] edu)
Rania Al-Sabbagh, Ain Shams University, Egypt
Najwa El-Zeiny, Cairo University, Egypt
Abstract—This paper presents a dynamic algorithm for Anaphora Resolution (AR) in Arabic unrestricted texts. The poor performance of current Arabic/English Machine Translation (MT) systems in terms of AR, and the fact that AR is an understudied issue in Arabic Natural Language Processing (ANLP), are the main motivations for this paper. The suggested algorithm follows a statistical approach to AR and makes use of the web as corpus to overcome the inherent problem of statistical approaches, namely sparse data. The algorithm achieves a performance rate of 87.6%, which is, to the best of the authors' knowledge, the first result for AR in Arabic using a generic corpus.

Index Terms—Arabic Anaphora Resolution (AAR), Statistical Approaches, Web as Corpus, Sparseness of Data

1. INTRODUCTION

Anaphora Resolution (AR) is the process of figuring out the antecedent (i.e. referent) of a given anaphor [10], [18], [19]. The paper focuses on the encliticized Arabic 3rd person personal pronouns: ها /hA/ (her/hers/it/its), ه /h/ (him/his/it/its), هم /hm/ (masculine: them/their) and هن /hn/ (feminine: them/their). In spite of being an area of active research in formal and computational linguistics, AR is understudied in ANLP. The only study of AR in Arabic, to the best of the authors' knowledge, is that of [17], which studied AR only in Arabic technical manuals, which are syntactically and lexically restricted, achieving a precision rate of 95.2%. However, that approach has never been tested on unrestricted (i.e. generic) texts. AR causes problems for some current Machine Translation (MT) systems dealing with Arabic generic
texts, such as the Sakhr MT system (www.ajeeb.com) and the Google MT system (www.googlelanguagetools.com). This is evident in the following example, extracted from Al-Ahram Newspaper:

(1) Transliteration¹: /SrHt Alsydp qrynp Alr}ys b>nhA stkvf EmlhA bAltEAwn/

Sakhr's Translation:
¹ Buckwalter's transliteration scheme (Buckwalter 2002, Diab et al. 2004). URL: http://www.qamus.org/transliteration.htm
The Mrs. announced the president's wife that it will intensify its work in cooperation …

Google's Translation: Mrs. Suzanne Mubarak, the President stated that it will intensify its collaboration …

Correct Translation: The president's spouse announced that she will intensify her work in cooperation …

AR errors made by such systems are basically due to syntactic, morphological and semantic differences between the Arabic and English pronominal systems. Syntactically speaking, Arabic pronouns, unlike English ones, have the same form in all grammatical cases: nominative, accusative and genitive. For instance, in (2), ها /hA/ (her/hers) is used in the genitive case, being encliticized to the preposition في /fy/ (in), and in (3) it is used in the accusative case, being encliticized to the verb لقي /lqy/ (encounter); yet in both cases it has the same form.

(2) Transliteration: /En AntxAbAt 1992 Alty fAz fyhA byl klyntwn/
Translation: … about the elections of 1992 in which Bill Clinton won …

(3) Transliteration: /Alhzymp Alty lqyhA Alfryq Alqwmy/
Translation: The defeat which the national team has encountered

Morphologically, Arabic 3rd person personal pronouns are sometimes encliticized, which makes them ambiguous. This is evident in المهن /Almhn/ (the professions or their pain), where the last two letters هن /hn/, being identical to the 3rd person personal feminine plural pronoun هن /hn/, are ambiguous; it
is not clear whether they are part of the word, ال /Al/ (the) + مهن /mhn/ (professions), or an encliticized pronoun, ألم /Alm/ (pain) + هن /hn/ (their). Semantically, the Arabic pronominal system, unlike the English one, does not linguistically differentiate between ±Human entities. As a result, both the –HUMAN FEMININE noun العولمة /AlEwlmp/ (globalization) and the +HUMAN FEMININE noun هيلاري /hylAry/ (Hillary) are referred to using the same 3rd person personal pronoun ها /hA/ (she/her/hers), as in (4) and (5).

(4) Transliteration: /hylAry klyntwn wAbnthA/
Translation: Hillary Clinton and her daughter

(5) Transliteration: /m>sAp AlEwlmp AnhA …/
Translation: The problem of globalization is that it is …

Such differences make AAR a non-trivial task and cause the poor performance of some current MT systems. Therefore, this paper presents a dynamic statistical algorithm for AAR in unrestricted, naturally-occurring texts. Statistical algorithms have been extensively used for AR, yet the contributions of the present paper are:

• Using the web as corpus to overcome the inherent problem of statistical approaches, namely sparseness of data
• Developing an AR algorithm for Arabic unrestricted texts
• Building the algorithm dynamically so as to guarantee a high recall rate

The rest of the paper falls into two parts. The first outlines work related to AAR, statistical AR systems and web-as-corpus approaches. The second discusses the proposed AAR algorithm.

2. Related Work
This part is divided into three subsections. The first deals with work related to AAR. The second briefly discusses statistical AR systems, and the last handles approaches that use the web as corpus to overcome the problems of statistical approaches.

2.1. Arabic Anaphora Resolution (AAR)

To the best of the authors' knowledge, the only study of AAR is that of [17]. The approach takes as input the output of a POS tagger. It identifies the noun phrases which precede the anaphor within a distance of two sentences, checks them for gender and number agreement with the anaphor, and then applies the so-called antecedent indicators to the remaining candidates by assigning positive or negative scores. The noun phrase with the highest aggregate score is proposed as the antecedent. The core of the approach lies in activating such empirically-based antecedent indicators, which play a decisive role in tracking down the antecedent from a set of possible candidates. They are definiteness/indefiniteness, givenness, indicating verbs, lexical reiteration, section heading preference, non-prepositional NPs, relative pronouns, collocations, immediate reference, sequential instructions, referential distance and preference of terms. The approach was evaluated against a corpus of technical manuals (223 pronouns) and achieved a success rate of 89.7% for English, 95.2% for Arabic and 93.3% for Polish.

2.2. Statistical AR Approaches

There are many statistical AR systems, such as [4], [7] and [24]. [4] perform an experiment to resolve references of the pronoun it in sentences randomly selected from a corpus, using co-occurrence patterns observed in the corpus as selectional patterns. Candidates for antecedents are substituted for the anaphor, and only those candidates found in frequent co-occurrence patterns are approved. They report an accuracy of 87%. [7] use a small training corpus from the Penn Wall Street Journal Treebank marked for coreference.
They obtain an accuracy of 65.3% using just distance and syntactic constraints. After adding word information to the model (gender, number and animacy), the performance rises to 75.7%. Adding information about "mention count" (i.e. the more times a referent has occurred in the preceding discourse, the more likely it is to be the antecedent) improves accuracy to the final value of 84.2%.
[24] investigates the usability of linguistically-motivated features for statistical AR. Linguistic features of lexicographic similarity, syntactic knowledge, semantic compatibility and salience are integrated into a statistical model for AR. According to her results, such features reduce the error rate of AR systems by 19.9%.

2.3. Web as Corpus

Although statistical approaches are robust, fast and computationally inexpensive, they usually suffer from the problem of sparse data, since they are mainly data-driven. According to [10], a training corpus is said to be sparse if it is bound to have a very large number of cases of zero-probability events that should really have some non-zero probability. Approaches to handling data sparseness are either statistical, making use of smoothing techniques [14], or linguistic, using the web as corpus to obtain massive corpora [11]. According to [11], web documents can be considered a corpus, since [15] define a corpus as "any collection of more than one text", provided the texts are sampled, representative, machine-readable and standard. [12] have even broadened the definition of the corpus, saying that it is simply "a certain amount of data from a certain domain of interest, without having any say in how it is constructed. In such cases, having more training data is normally more useful than any concerns of balance, and one should simply use all the text that is available". Using the web as corpus has many advantages. It helps avoid bias towards a certain language genre or domain [11]: the statistics of a language model usually change according to the type of texts used for building it, which limits the applicability of any language model, because it is usually applied to new texts that might not be of the same type as the texts from which the model was built.
The only way to guarantee good performance of the language model is to draw it from random samples of different language types and genres, which is quite easy using the web. Moreover, the web is a good source of massive monolingual, bilingual and multilingual corpora. Not only can it be used to collect such corpora [20], but bilingual web search engines can also be used to search for translations [11]. Web counts are shown by [12] to be reliable enough, due to the high correlation between web frequencies and corpus frequencies, the reliable correlation between web frequencies and human
plausibility judgments, and the reliable correlation between web frequencies and frequencies recreated using class-based smoothing, as well as the correlation with counts derived from a well-balanced corpus. Many studies are based on the web as corpus. [12] examined how useful the web is as a source of frequency information for rare items, especially for dependency relations. [22] gathered lexical statistics for resolving prepositional phrase attachments. [21] balanced their corpus using web documents. [16] built a WSD engine using hit counts to rank word sense frequencies. [9] built a language-specific corpus from the web starting from a single document in that language. [6] extracted term descriptions from the web. However, none of the previous studies has been applied to AR.

3. The AR Algorithm

3.1. Corpus Preprocessing

AR preprocessing handles a number of issues, including tokenization, Part-of-Speech (POS) tagging and disambiguation, Semantic Features Acquisition (SFA) and non-pleonastic pronoun identification.

3.1.1. Tokenization and POS Tagging

The tokenization and POS tagging are performed with the tools of [5], who have developed a Support Vector Machine (SVM) tokenizer (SVM-TOK) and POS tagger (SVM-POS), which are among the most widely used Arabic tokenizers and POS taggers, being freely available and highly accurate public-domain tools. SVMs are supervised learning algorithms that rely on annotated training data, taken in [5] from the Arabic TreeBank². According to standard evaluation metrics, SVM-TOK achieves an F-measure of 99.12% and SVM-POS achieves 95.49%. Although these results are comparable to state-of-the-art results on English text when trained on similarly sized data, SVM-POS does not tag semantic features, since it annotates the segmented words resulting from the tokenization module with POS tags drawn from the Arabic Penn TreeBank POS tagset, which does not include semantic features.
The absence of such features affects some higher NLP applications such as AR.
² This is a 1-million-word corpus that contains POS tags as well as parse trees. It is available through the LDC, catalogue number LDC2005T20, URL: http://www.ldc.upenn.edu/
3.1.2. Semantic Features Acquisition (SFA)
The importance of such semantic features lies in two facts. First, they are among the main differences between the English and Arabic pronominal systems. Second, they are among the most widely used semantic constraints in the AR literature ([10], [17], [23] and [13], among others). Consequently, the authors integrated monolingual and bilingual bootstrapping techniques to acquire them.

The monolingual bootstrapping technique is a cue-based algorithm that depends on gender, number and rationality cues extracted from the target language itself (i.e. Arabic). A number of cues have been used, and their output served as the seeds for the algorithm. The first set of monolingual cues is extracted from the output of AraMorph [3], which tags semantic features only when they are morphologically marked. Thus 32.8% of the nouns in the Al-Ahram Newspaper corpus (≈ 20,000,000 tokens; ≈ 971,000 types) are marked for number, 35.5% are marked for gender and 0% are marked for rationality. The second set is built using a set of Arabic number and/or gender cues, illustrated in table (1).

Arabic Cue | Cue Type | Features Indicated
ة /p/ | Suffix | Encliticized to singular, feminine nouns
ون /wn/ | Suffix | Encliticized to plural, masculine, +Human nouns
ات /At/ | Suffix | Encliticized to plural, feminine nouns
هذا /h*A/ (this), ذلك /*lk/ (that) | Demonstrative | Followed by singular, masculine nouns
هذه /h*h/ (this), تلك /tlk/ (that) | Demonstrative | Followed by singular or plural feminine nouns
هذان /h*An/, هذين /h*yn/ (these) | Demonstrative | Followed by dual, masculine nouns
هاتان /hAtAn/, هاتين /hAtyn/ (these) | Demonstrative | Followed by dual, feminine nouns
هؤلاء /h&lA'/ (these) | Demonstrative | Followed by plural, masculine or feminine nouns
أولئك />wl}k/ (those) | Demonstrative | Followed by plural, masculine nouns
الذي /Al*y/ (who/which) | Relative Pronoun | Preceded by singular, masculine nouns
التي /Alty/ (who/which) | Relative Pronoun | Preceded by singular or plural feminine nouns
اللذان /All*An/, اللذين /All*yn/ (who/which) | Relative Pronoun | Preceded by dual, masculine nouns
اللتان /AlltAn/, اللتين /Alltyn/ (who/which) | Relative Pronoun | Preceded by dual, feminine nouns
الذين /Al*yn/ (who/which) | Relative Pronoun | Preceded by plural, masculine, +Human nouns

Table (1): Arabic Cues for the Semantic Features of Gender and Number
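A minimal sketch of how suffix cues like those in table (1) could be applied to transliterated word forms; the rule set, feature names and function below are illustrative assumptions, not the authors' code:

```python
# Illustrative suffix cues modeled on table (1), keyed by Buckwalter
# transliteration; this rule set is a sketch, not the authors' actual data.
SUFFIX_CUES = {
    "wn": {"number": "plural", "gender": "masculine", "human": True},
    "At": {"number": "plural", "gender": "feminine"},
    "p":  {"number": "singular", "gender": "feminine"},
}

def tag_by_suffix(word):
    """Return the semantic features implied by a word-final suffix cue."""
    for suffix, feats in SUFFIX_CUES.items():  # longer suffixes listed first
        if word.endswith(suffix) and len(word) > len(suffix):
            return dict(feats)
    return None

print(tag_by_suffix("AlSHfywn"))  # plural, masculine, +Human
print(tag_by_suffix("ktAb"))     # None: no cue suffix
```

In practice such rules over-generate (e.g. word-final /p/ on non-nouns), which is why the output lists are manually filtered, as described below.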
The third set is built according to the following algorithm:

1. Words carrying any of the suffixes in table (1) are extracted from the corpus.
2. Each suffix is stripped off, provided that the resulting word exists in the corpus.
3. The resulting word is then tagged for number and gender according to the suffix stripped off.

One example is the noun الصحفيون /AlSHfywn/ (the journalists), extracted in the first step because it carries the plural, masculine, +Human suffix ون /wn/. Since the word الصحفي /AlSHfy/ (the journalist) is found in the corpus, the second step strips off the suffix ون /wn/. Finally, الصحفي /AlSHfy/ (the journalist) is tagged as a singular, masculine, +Human noun.

As for rationality, two sets are used. The first is a list of proper +Human nouns gathered using the Google search engine. The second is a list of verbs that are typically followed by a +Human noun; the verb list is given in table (2).

The Verb | Meaning
ذكر /*kr/ | mention
صرح /SrH/ | declare
أعلن />Eln/ | announce
قال /qAl/ | say
زعم /zEm/ | claim
ناقش /nAq$/ | discuss
قدم /qdm/ | present
أوضح />wDH/ | clarify
عرف /Erf/ | know
وصف /wSf/ | describe
عرض /ErD/ | show
اعتبر /AEtbr/ | consider

Table (2): Arabic Verbs Indicating the Rationality Semantic Feature
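The three-step suffix-stripping algorithm described above can be sketched as follows, assuming a corpus vocabulary set and a mapping from each suffix to the features of the resulting stem (all names here are illustrative, not the authors' implementation):

```python
# Sketch of the three-step suffix-stripping algorithm; the vocabulary and
# the suffix-to-features mapping below are illustrative toy data.
def strip_and_tag(word, vocab, suffix_features):
    """If stripping a cue suffix yields a word attested in the corpus,
    tag that stem with the features implied by the stripped suffix."""
    for suffix, feats in suffix_features.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            if stem in vocab:        # step 2: the stem must exist in the corpus
                return stem, feats   # step 3: tag the stem accordingly
    return None

vocab = {"AlSHfy"}  # the corpus attests /AlSHfy/ (the journalist)
suffixes = {"wn": ("singular", "masculine", "+Human")}
print(strip_and_tag("AlSHfywn", vocab, suffixes))
# ('AlSHfy', ('singular', 'masculine', '+Human'))
```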
These monolingual seeds have resulted in a list of ≈ 30,000 tokens/types that has been manually filtered.

The bilingual bootstrapping algorithm is also a cue-based algorithm, one that uses the cues of one language (here, English) to learn the semantic features of the target language (i.e. Arabic). The algorithm uses the following tools:

1. English electronic resources: the English WordNet 2.1 (Princeton University 2005) and English generic corpora (Cobb 2004).
2. A set of English cues used to search for words with specific semantic features in the aforementioned English resources; all English cues are illustrated in table (3).
3. English/Arabic MT systems: two systems have been used to guarantee good coverage, namely [1] and the Google statistical translation engine (http://www.google.com/language_tools).

English Cue | Cue Type | Features Indicated
a/an/any/every/each | Modifier | Followed by singular nouns
some/all/any/many | Modifier | Followed by plural nouns
who | Relative Pronoun | Preceded by +Human nouns
which | Relative Pronoun | Preceded by –Human nouns

Table (3): English Cues for the Semantic Features of Number and Rationality
The bilingual bootstrapping algorithm goes as follows:

1. The English cues illustrated in table (3) are used to extract words from generic English corpora; words tagged as ±Human, plural or singular in the English WordNet 2.1 are also compiled.
2. The resulting English words are submitted to [1] and to the Google SMT engine.
3. The number and rationality semantic features are added to the Arabic noun translations of the English nouns.
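The three steps above can be sketched as a simple projection loop; `translate()` is a stand-in for the MT engines, and all names here are illustrative assumptions rather than the authors' code:

```python
# Sketch of the bilingual bootstrapping loop: features attached to English
# words via cues/WordNet are projected onto their Arabic translations.
def bootstrap_features(english_words, features_of, translate):
    """Project number/rationality features from cue-tagged English words
    onto their Arabic translations (steps 2 and 3 of the algorithm)."""
    arabic_lexicon = {}
    for word in english_words:
        arabic = translate(word)                        # step 2: submit to MT
        if arabic is not None:
            arabic_lexicon[arabic] = features_of(word)  # step 3: project tags
    return arabic_lexicon

# Toy run with a stub translator (Buckwalter transliteration on the output):
stub_mt = {"motive": "dAfE"}.get
lex = bootstrap_features(["motive"], lambda w: {"human": False}, stub_mt)
print(lex)  # {'dAfE': {'human': False}}
```

In the paper's pipeline the translator is [1] or the Google SMT engine, and the features come from the English cues and WordNet; both are stubbed here.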
For example, the word 'motive' in "… the motive which led this family to …" has been extracted from the aforementioned English resources. Since the word precedes the relative pronoun 'which', it has been
tagged as –Human. Submitted to [1] and the Google SMT engine, the word is translated as دافع /dAfE/, which is accordingly tagged as –Human. The output list of the bilingual bootstrapping is manually filtered, resulting in a noun base of ≈ 240,000 types tagged for number and rationality. Combining the results of the monolingual and the bilingual algorithms yields the following:

NUMBER: singular 26,805 | plural 7,083
GENDER: feminine 16,490 | masculine 18,344
RATIONALITY: +Human 4,021 | –Human 20,477

Table (4): Final results of the monolingual and bilingual SFA algorithms
These final lists achieve a coverage rate of ≈ 59% for the SVM-POS tagger.

3.1.3. Non-Pleonastic Pronoun Identification

According to [2], Arabic has pleonastic pronouns (also known as redundant pronouns), which are non-anaphoric pronouns that are usually invisible in translation. One example of a pleonastic pronoun, according to [2], is (6), where the pronoun ه /h/ (he/him/his) encliticized to the particle أن />n/ (indeed) disappears from the English translation:

(6) Transliteration: /<$Ar >lY …/

Non-pleonastic pronouns, by contrast, occur in regular patterns such as those listed in table (5):

Relative Pronoun + Verb + Ø
Relative Pronoun + Negation + Verb + Pronoun
Relative Pronoun + Negation + Verb + Ø
Relative Pronoun + Verb + Preposition + Pronoun
Relative Pronoun + Verb + Preposition + Ø
tm/sytm/ytm + Verb + Pronoun
tm/sytm/ytm + Verb + Ø
tm/sytm/ytm + Negation + Verb + Pronoun
tm/sytm/ytm + Negation + Verb + Ø
tm/sytm/ytm + Verb + Preposition + Pronoun
tm/sytm/ytm + Verb + Preposition + Ø

Table (5): Regular Patterns of Non-Pleonastic Arabic Pronouns
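One way heuristics over such tag patterns could be implemented is sketched below; the tag names (REL, NEG, TM, PREP, PRON, VERB) are assumed stand-ins, not the actual tagset used by the authors:

```python
# Illustrative tag-sequence patterns modeled on table (5); "TM" stands for
# the tm/sytm/ytm forms and "REL" for relative pronouns (assumed tag names).
PATTERNS = [
    ("REL", "VERB", "PRON"),
    ("REL", "NEG", "VERB", "PRON"),
    ("REL", "VERB", "PREP", "PRON"),
    ("TM", "VERB", "PRON"),
    ("TM", "NEG", "VERB", "PRON"),
    ("TM", "VERB", "PREP", "PRON"),
]

def matches_pattern(tags, i):
    """Check whether the pronoun at position i closes one of the patterns."""
    for pat in PATTERNS:
        start = i - len(pat) + 1
        if start >= 0 and tuple(tags[start:i + 1]) == pat:
            return True
    return False

print(matches_pattern(["REL", "VERB", "PRON"], 2))   # True
print(matches_pattern(["NOUN", "VERB", "PRON"], 2))  # False
```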
Heuristics based on the aforementioned patterns were formed and tested on the LDC parallel Arabic-English corpora, where the pronouns captured by these heuristics account for 16.51% of the pronoun tokens and 18.83% of the types.

3.1.4. Web Size Estimation

In order to use the web as corpus, its size must be estimated. Previous studies [11] have estimated it for many languages, such as English, Italian and German, using the counts of function words as predictors of corpus size. Function words such as the, with and in occur with a frequency that is relatively stable over many different types of texts. From a corpus of known size, the frequencies of the function words can be calculated, their counts can be obtained from the web, and the web size can then be estimated as:

web size = (size of the known corpus × web counts of the function words) / counts of the function words in the corpus of known size

However, web-size results can be affected by the load of the search engine being used. Thus the authors used two search engines that support Arabic search and compared their results to obtain a rather stable estimate. The two search engines are www.alltheweb.com and www.search.com; their results are illustrated in table (6) below.

Al-Ahram Newspaper | www.alltheweb.com | www.search.com
394,030 | 8,894,438,640 | 4,293,947,945

Table (6): Counts of the Function Words in the Al-Ahram Corpus and on Two Search Engines
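Given function-word counts like those in table (6), the size estimate can be computed with the formula above; the figures in this sketch are toy numbers for illustration, not the paper's counts:

```python
# The web-size formula as a function: scale the known corpus size by the
# ratio of web hits to corpus counts for a fixed set of function words.
def estimate_web_size(known_corpus_size, fw_count_in_corpus, fw_count_on_web):
    """Estimate the number of words on the web from function-word counts."""
    return known_corpus_size * fw_count_on_web / fw_count_in_corpus

# A 1,000,000-word corpus in which the chosen function words occur 50,000
# times, against 500,000,000 web hits for the same words:
print(int(estimate_web_size(1_000_000, 50_000, 500_000_000)))  # 10000000000
```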
According to these results, the average size of the Arabic web can be estimated at roughly 4,500,000,000 words.

3.2. The AR Algorithm
The AR algorithm is a statistical, dynamic algorithm that makes use of the fewest possible features. This type of algorithm is appropriate from the point of view of machine learning, because it does not require much human intervention, and from the point of view of ANLP, whose main bottleneck is the absence of sufficient NLP resources. Besides the previously mentioned preprocessing phases, the present algorithm makes use of collocational evidence, recency and bands, and searches for candidate antecedents within a window of the 20 preceding words (-20). Collocational evidence depends on finding the collocational relation between candidate antecedents and the pronoun's carrier, using conditional probability as the association measure:

P(A|B) = P(A∩B) / P(B)

For instance, in (7), the collocational relation between the pronoun's carrier أهل />hl/ (citizens) and the candidate antecedent فلسطين /flsTyn/ (Palestine) is, according to the conditional probability measure, stronger than that between the same carrier and the candidate antecedent جماعة /jmAEp/ (a group). Therefore, فلسطين /flsTyn/ (Palestine) is selected as the correct antecedent.

(7) Transliteration: /lm ytmkn htlr mn t jmAEp mnhm >n thAjr hlhA/
Translation: Hitler could not exterminate the Jews, some of whom immigrated to Palestine to face its citizens

According to the recency feature, the closer the candidate is to the pronoun and its carrier, the more likely it is to be the correct antecedent. For example, in (8), there are two candidate antecedents for the pronoun ها /hA/, namely السيدة /Alsydp/ (the lady) and خطة /xTp/ (a plan); the correct antecedent is خطة /xTp/ (a plan), which is the closest to the pronoun.

(8) Transliteration:
/AElnt Alsydp AlAwlY En xTp AlEml Alty sytm mn xlAlhA/
Translation: The first lady has declared the work plan through which …

Finally, the algorithm makes use of bands to reduce the search space, and thus the number of candidate antecedents. The search space of -20 words, which covers 81% of the instances, is reduced from -20 to -10, to -5, to -2 and to -1, respectively. This is exemplified in (9):

(9) Transliteration: /Ely sandAl Hwl mwqf AlslTp AlflsTynyp mn wSf dyvyd sAtrfyld nA}b msAEd wzyr AlxArjyp AlAmryky lAlAntfADp AlflsTynyp bAnhA/
Translation: on a question about the attitude of the Palestinian Authority towards the description by David Satterfield, deputy assistant of the American Foreign Minister, of the Palestinian Intifada as …

The first step is to divide the -20 words into two equal -10-word bands:

Band 1: /Ely sandAl Hwl mwqf AlslTp AlflsTynyp mn wSf dyvyd/
Band 2: /sAtrfyld nA}b msAEd wzyr AlxArjyp AlAmryky lAlAntfADp AlflsTynyp bAnhA/

The second step is to score each band based on its bigram probabilities: band 1 scores 0.1603977518 and band 2 scores 0.7934184451. Since the score of band 2 is higher, it is further subdivided into two -5-word bands:

Band 3: /sAtrfyld nA}b msAEd wzyr AlxArjyp/ (0.165681848)
Band 4: /AlAmryky lAlAntfADp AlflsTynyp bAnhA/ (0.6277365971)

The score of band 4 is higher than that of band 3, so it is subdivided into bigrams, excluding function words:

Band 5: /AlAntfADp AlflsTynyp/ (0.6277365971)

As a result of using bands, the possible candidates are narrowed down to the correct antecedent.
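The band-narrowing procedure illustrated above can be sketched as follows; the scoring function and the weights are illustrative stand-ins for the web-derived bigram probabilities with the pronoun's carrier:

```python
# Sketch of the dynamic band-narrowing search: halve the candidate window,
# keep the higher-scoring band, and repeat until one candidate remains.
def narrow_bands(candidates, score):
    """Return the single candidate left after repeated band halving."""
    band = list(candidates)
    while len(band) > 1:
        mid = len(band) // 2
        left, right = band[:mid], band[mid:]
        band = left if score(left) > score(right) else right
    return band[0]

# Toy weights standing in for bigram probabilities (example (9) candidates):
weights = {"wzyr": 0.04, "AlxArjyp": 0.05, "AlAntfADp": 0.62, "AlflsTynyp": 0.17}
best = narrow_bands(list(weights), lambda band: sum(weights[w] for w in band))
print(best)  # AlAntfADp
```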
To sum up, the proposed AR algorithm goes as follows:

• The corpus passes through the preprocessing steps of tokenization and POS tagging, and pleonastic pronouns are removed,
• The -20-word search space is determined for each pronoun,
• Bigrams consisting of the candidate antecedents and the pronoun's carrier are compiled,
• Such bigrams are filtered using the semantic features of gender, number and rationality,
• Web counts are obtained for the bigrams that pass the semantic filtering,
• The -20 window is subdivided into two bands, of which the band with the highest score is chosen; the score of a band is computed as the sum of the web counts of the band's bigrams with the pronoun's carrier, scaled by the band's distance from the pronoun,
• The band with the highest score is further divided into smaller bands, whose scores are computed in the same way,
• The same procedure is repeated until the band consists of a single candidate, which is proposed as the antecedent.

4. Results and Discussion

Due to the lack of Arabic corpora annotated for anaphoric expressions, the authors built a gold standard set, consisting of 5,000 pronoun instances equally divided among the pronouns under study, to test the algorithm. The evaluation metrics used are the standard ones of precision, recall and F-measure. According to this gold standard, the AR algorithm achieves a precision rate of 78%, a recall rate of 100% and an F-measure of 87.6%.

According to the evaluation results, 12% of the errors are caused by insufficient window size; that is, the correct antecedent lies outside the -20 words. However, any attempt to widen the window reduces the precision rate, because the wider the window, the more nouns are introduced to the algorithm. Thus a band might get a higher score only because it contains more words than the competing one, even after applying the semantic feature filtering.
Moreover, 5% of the errors are related to POS tagging; that is, some words are tagged as ending in an enclitic pronoun although the apparent pronoun is actually part of the word. For
instance, لايفهم /lAyfhm/ (does not understand) is analyzed as carrying an enclitic pronoun, although /hm/ is part of the word. The authors propose a methodology to overcome this 5% error: typically, if the last part of a word really is a pronoun, then the stemmed word must occur in the corpus. The web is used as the corpus against which the stemmed words are compared: if the stemmed word is found on the web, the original word is considered to carry an enclitic pronoun; otherwise it is not. This method has reduced the 5% error rate to 4.45%, bringing precision to 77.6% and the overall performance rate of the algorithm to 87.4%.

As for the remaining 10% error rate, it is related to the collocational evidence and recency features. Sometimes such features do not work at all, as in (10), where both التقدم /Altqdm/ (progress) and الانسان /AlAnsAn/ (Man) agree in gender and number with the anaphor ه /h/ (i.e. both are SINGULAR and MASCULINE), and الانسان /AlAnsAn/ (Man) is the closest candidate to the anaphor, yet the correct antecedent is التقدم /Altqdm/ (progress).

(10) Transliteration: /Altqdm Altknwlwjy Al*y ytsArE AlAnsAn fy AlwSwl Alyh/
Translation: The technological progress which Man is eager to reach

Such cases are frequent when the pronoun's carrier is a function word, a preposition for instance, because function words are very frequent and can therefore occur frequently with all candidates.

5. Conclusion

This paper presented a statistical, dynamic algorithm for AR in Arabic unrestricted texts. For the preprocessing tasks required by such an algorithm, the authors used off-the-shelf tools for tokenization and POS tagging, and developed their own approach to acquire the necessary semantic features and to identify non-pleonastic Arabic pronouns. The algorithm made use of collocational evidence, recency and bands as AR-related features.
It made use of the web as corpus so as to overcome the problem of sparseness of data. As a result, the algorithm achieved a performance rate of 87.6% measured according to a gold standard set of 5000 pronouns.
References

[1] ATA Software Technology Ltd. (2002). Golden Al-Wafi Translator Software. Version 1.12.
[2] Badawi, E., Carter, M. G. and Gully, A. (2004). Modern Written Arabic: A Comprehensive Grammar. London; New York: Routledge.
[3] Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. LDC. Catalog number LDC2002L49, ISBN 1-58563-257-0.
[4] Dagan, I. and Itai, A. (1990). Automatic Processing of Large Corpora for the Resolution of Anaphora References. Proceedings of the 13th International Conference on Computational Linguistics (COLING'90), Finland, 330-332.
[5] Diab, M., Hacioglu, K. and Jurafsky, D. (2004). Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Dumais, S., Marcu, D. and Roukos, S. (Eds.), HLT-NAACL 2004: Short Papers (pp. 140-152). Boston: Association for Computational Linguistics.
[6] Fujii, A. and Ishikawa, T. (2000). Utilizing the World Wide Web as an Encyclopedia: Extracting Term Descriptions from Semi-Structured Text. Proceedings of the 38th Meeting of the ACL, 488-495, Hong Kong, October.
[7] Ge, N., Hale, J. and Charniak, E. (1998). A Statistical Approach to Anaphora Resolution. Proceedings of the 6th Workshop on Very Large Corpora, 161-170.
[8] Hasan, A. (1999). AlnHw AlwAfy. Vols. 1 and 2. Cairo: dAr AlmEArf.
[9] Jones, R. and Ghani, R. (2000). Automatically Building a Corpus for a Minority Language from the Web. Proceedings of the Student Workshop of the 38th Meeting of the ACL.
[10] Jurafsky, D. and Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. New Jersey: Prentice Hall.
[11] Kilgarriff, A. and Grefenstette, G. (2006). Web as Corpus.
[12] Keller, F., Lapata, M. and Ourioupina, O. (2002). Using the Web to Overcome Data Sparseness. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 230-237.
[13] Kennedy, C. and Boguraev, B. (1996). Anaphora for Everyone: Pronominal Anaphora Resolution without a Parser. Proceedings of the 16th International Conference on Computational Linguistics (COLING'96), Denmark, 113-118.
[14] Manning, C. and Schütze, H. (2002). Foundations of Statistical Natural Language Processing. London: The MIT Press.
[15] McEnery, T. and Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.
[16] Mihalcea, R. and Moldovan, D. (1999). A Method for Word Sense Disambiguation of Unrestricted Text. Proceedings of the 37th Meeting of the ACL, 152-158, Maryland, June.
[17] Mitkov, R. (1998). Robust Pronoun Resolution with Limited Knowledge. Proceedings of the 17th International Conference on Computational Linguistics (COLING'98)/ACL'98 Conference, Montreal, Canada, 869-875.
[18] Mitkov, R. (1999). Anaphora Resolution: The State of the Art. Technical report based on the COLING'98/ACL'98 tutorial on anaphora resolution, University of Wolverhampton.
[19] Mitkov, R. (2001). Outstanding Issues in Anaphora Resolution. Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing, Mexico City, 110-125.
[20] Resnik, P. (1999). Mining the Web for Bilingual Texts. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, 527-534.
[21] Villasenor-Pineda, L., Montes-y-Gomez, M., Perez-Coutino, M. and Vaufreydaz, D. (2003). A Corpus Balancing Method for Language Model Construction. Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2003), 393-401, Mexico City, February.
[22] Volk, M. (2001). Exploiting the WWW as a Corpus to Resolve PP Attachment Ambiguities. Proceedings of Corpus Linguistics 2001, Lancaster, UK.
[23] Williams, S., Harvey, M. and Preston, K. (1996). Rule-Based Reference Resolution for Unrestricted Text Using Part-of-Speech Tagging and Noun Phrase Parsing. Proceedings of the International Colloquium on DAAR, Lancaster, UK, 441-456.
[24] Uryupina, O. (2006). Coreference Resolution with and without Linguistic Knowledge. Proceedings of LREC 2006, 893-898.