A Web-Based Approach to Arabic PP Attachment

Rania Al-Sabbagh, Khaled Elghamry
[email protected], [email protected]
Faculty of Al-Alsun (Languages), Ain Shams University, Cairo, Egypt

Abstract

Motivated by the importance of PP attachment for parsing and the poor performance of current Arabic parsers, this paper presents an algorithm for Arabic PP attachment. The algorithm takes tokenized corpora as input and uses web-based bigrams to achieve a performance rate of ≈ 82%.

1. Introduction

Prepositional Phrase (PP) attachment is the process of determining which part of a sentence is modified by a given PP (Kulick et al. 2006). Sentence (1) below, where the PP من البنجر /mn Albnjr/¹ (from the beet) modifies السكر /Alskr/ (the sugar), is one example of PP attachment in Arabic.

(1) أكبر مصنع لانتاج السكر من البنجر ...
Transliteration: />kbr mSnE lAntAj Alskr mn Albnjr/
Translation: The biggest factory to produce the sugar from the beet ...

In Arabic, there are two types of prepositions. The first is حروف الجر المنفصلة /Hrwf Aljr AlmnfSlp/ (separate prepositions) and the second is حروف الجر المتصلة /Hrwf Aljr AlmtSlp/ (procliticized prepositions); both are outlined in table (1).

Separate Prepositions | Procliticized Prepositions
من /mn/, إلى / ...
PP attachment is important for parsing, which is an essential pre-processing step for many Natural Language Processing (NLP) tasks such as Machine Translation (MT) and Anaphora Resolution (AR). However, both Arabic PP attachment and parsing are understudied, and the performance of existing parsers for Arabic is unsatisfactory compared to that of English parsers (Habash 2007). The main problem of Arabic parsers and PP attachment approaches is the lack of sufficient syntactic and lexical Arabic NLP resources (Diab et al. 2004). Therefore, the authors present an algorithm that depends on tokenized corpora and uses web-based collocations.

The rest of the paper falls into four parts. The first discusses work related to web-based approaches to PP attachment in general and Arabic PP attachment in particular. The second deals with the proposed PP attachment algorithm and the evaluation methodology used. The third shows results and gives a brief error analysis. Finally, the last part highlights future directions to improve the proposed algorithm.

1 Buckwalter's transliteration scheme. URL: http://www.qamus.org/transliteration.htm

2. Related Work

Statistical PP-attachment approaches are robust, fast and computationally inexpensive; however, because they are mainly corpus-based, they usually suffer from sparse data. A corpus is sparse if it is bound to have a very large number of cases of zero-probability events that should really have some non-zero probability (Jurafsky and Martin 2000). Banko and Brill (2001) and Kilgarriff and Grefenstette (2003), among others, have considered corpora with millions of words as small data sets that contain only a sample of the dominant meanings and usage patterns, and in which rare words, rare meanings of common words and rare word combinations have almost no evidence. Therefore, data sparseness will always be a problem for statistical approaches and their applications, such as statistical PP attachment.

In order to overcome data sparseness in PP attachment, Volk (2001) used a web-based approach that gathers lexical collocations from web documents to handle German PP attachment. Volk's algorithm assumes that a preposition attaches to a noun simply when the noun appears within a fixed context window of the preposition. His algorithm achieves a performance rate of ≈ 75%.

To the best of the authors' knowledge, web-based PP attachment has not been applied to Arabic before. Moreover, no previous results on Arabic PP attachment are available. However, the performance of current Arabic parsers shows that PP attachment can still be largely improved. Compared to English, few parsers are available for Arabic. Bikel (2004) developed a multi-lingual statistical parser that deals with Arabic among other languages. Trained on Arabic TreeBank 1 (ATB1), the parser achieves an F-measure of ≈ 75.7% for ATB1 sentences of ≤ 40 words and 73% for all ATB1 sentences. Kulick et al. (2006) modified Bikel's (2004) parser using better punctuation schemes and improved tagset mapping. They achieve an F-measure of 79% for ATB3 sentences of ≤ 40 words and 74.6% for all ATB3 sentences.

3. The PP Attachment Algorithm

3.1. Corpus Preprocessing

As previously mentioned, the main problem with Arabic NLP in general is the lack of resources and tools (Diab et al. 2004). Thus, the proposed algorithm uses as few resources and tools as possible for corpus preprocessing: tokenization is the only preprocessing step. Tokenization is necessary in order to split off possible procliticized prepositions. The tokenizer used is the Support Vector Machine Tokenizer (SVM-TOK) proposed by Diab et al. (2004), which achieves a performance rate of 99.12% when tested on the Arabic TreeBank.

3.2. Training Corpus

Unlike previous corpus-based/statistical approaches, the proposed algorithm does not make use of any training corpora. Instead, the algorithm dynamically generates the collocations necessary to resolve PP attachment in the input sentences and searches the web for their frequencies.

3.3. PP Attachment Approach

The present algorithm uses collocational association between the PP and its candidate binders to resolve PP attachment ambiguities. Simply, the candidate binder (i.e. a noun, verb or adjective) with the highest association with the given PP is selected as the correct binder. Collocational association is measured using Conditional Probability (CP), which is calculated as follows:

$$P(x \mid y) = \frac{P(x \cap y)}{P(y)}$$

where x refers to the PP and y to the candidate binder. CP is not the only possible association measure; chi-square (χ²), the log-likelihood ratio and others could also be used. However, the availability of huge web-based frequencies motivates using CP.

Candidate binders are extracted from a −20-word window (i.e. the 20 words preceding the PP). Three reasons stand behind using a window-based search space instead of phrase/sentence boundaries. First, phrase/sentence boundaries are difficult to determine due to punctuation inconsistencies in Modern Standard Arabic (MSA) corpora (Chalabi 2001, Buckwalter 2002). Second, parsers are not accurate enough to determine phrase/sentence boundaries (see section 2 above). Finally, even if the phrase/sentence boundary is determined, a PP is not necessarily attached to the head noun/verb, as in sentence (1) above, where the PP من البنجر /mn Albnjr/ (from the beet) is not attached to the head noun أكبر مصنع />kbr mSnE/ (the largest factory) but to a noun in an internal NP, that is, السكر /Alskr/ (the sugar).

After bigrams are generated, their frequencies are extracted using web search engines. Four search engines are used. The first is www.araby.com, an Arabic-based engine that uses Arabic resources, such as dictionaries and morphological analyzers, in order to get as accurate search results as possible³. The other three search engines (www.search.com, www.exalead.com and www.findforward.com) are meta-search engines that support Arabic search. Using different types of search engines is motivated by testing the usability of the web-based PP attachment approach and by avoiding biased results.

In order to use web frequencies to calculate the conditional probabilities of the target collocations, the web size (i.e. the size of the web documents indexed by each of the search engines used) must be estimated. The authors use the same equation used in Elghamry et al. (2007):

$$\text{Web Size} = \frac{\text{SKC} \times \text{WFOFW}}{\text{FWFIKC}}$$

SKC stands for Size of a Known Corpus, which is the Al-Ahram newspaper from 1998 to 2006 (≈ 20,000,000 tokens). WFOFW refers to the Web Frequencies of Function Words and FWFIKC to the Function Word Frequencies in the Known Corpus. According to this equation, the estimated sizes of the Arabic documents available through each of the aforementioned search engines are⁴:

Website | Estimated Size of Arabic Documents
Araby.com | 1,713,418,618
Search.com | 43,911,978,275
Exalead.com | 21,497,417,252
FindForward.com | 39,028,890,693

Table (1): Arabic Web Page Size on Used Search Engines
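The two formulas of Section 3.3 can be illustrated with a minimal Python sketch. The function names and all numeric inputs in the usage lines are assumptions made for illustration; the paper does not report the function-word counts it used, and the sketch assumes that both probabilities are estimated as hit counts divided by the same estimated web size.

```python
def estimate_web_size(skc: int, wfofw: int, fwfikc: int) -> float:
    """Web Size = SKC * WFOFW / FWFIKC (equation from Elghamry et al. 2007)."""
    if fwfikc <= 0:
        raise ValueError("function-word frequency in the known corpus must be positive")
    return skc * wfofw / fwfikc


def conditional_probability(bigram_hits: int, binder_hits: int, web_size: float) -> float:
    """P(x|y) = P(x ∩ y) / P(y), with each probability estimated as hits / web_size.

    x is the PP and y the candidate binder, so bigram_hits counts the
    collocation "binder + PP" and binder_hits counts the binder alone.
    """
    if binder_hits <= 0 or web_size <= 0:
        return 0.0  # no evidence for the binder: treat the association as zero
    p_joint = bigram_hits / web_size
    p_binder = binder_hits / web_size
    return p_joint / p_binder


if __name__ == "__main__":
    # Hypothetical counts: only the 20,000,000-token corpus size comes from the paper.
    size = estimate_web_size(skc=20_000_000, wfofw=1_000_000, fwfikc=10_000)
    print(size)
    print(conditional_probability(bigram_hits=3, binder_hits=120, web_size=size))
```

Under the assumption above, the web size appears in both the numerator and the denominator, so it would only affect the score if the two counts were normalised by different totals.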

3.4. A Walk-Through Example

Briefly, in order to resolve PP attachment ambiguity, the algorithm uses the collocational association, measured using CP, between the PP and each candidate binder found within a window of −20 words. A walk-through example follows. Given a tokenized corpus, the −20-word window is extracted for each PP. This yields, for example, sentence (2) below:

3 See the search engine homepage available at www.araby.com. Accessed 2 January 2008.
4 All frequencies were acquired on 25 December 2007.

(2) في بعثة تدريبية لمكافحة الارهاب في الفيليبين
Transliteration: /fy bEvp tdrybyp lmkAfHp AlArhAb fy Alfylybyn/
Translation: In a training mission to fight terrorism in the Philippines

Collocations between the PP في الفيليبين /fy Alfylybyn/ (in the Philippines) and candidate binders are generated. A sample of the generated collocations is in table (2):

Collocation | Translit. | Translat.
الارهاب في الفيليبين | /AlArhAb fy Alfylybyn/ | Terrorism in the Philippines
لمكافحة في الفيليبين | /lmkAfHp fy Alfylybyn/ | To eliminate in the Philippines
تدريبية في الفيليبين | /tdrybyp fy Alfylybyn/ | Training in the Philippines
... في الفيليبين | / ... | ...
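The window extraction and collocation generation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the PP is assumed to be a preposition plus the single token that follows it, and no filtering of candidate binders is attempted, so every token in the window (including other prepositions) yields a collocation.

```python
from typing import List


def candidate_window(tokens: List[str], pp_start: int, size: int = 20) -> List[str]:
    """Return up to `size` tokens immediately preceding the PP that starts at pp_start."""
    return tokens[max(0, pp_start - size):pp_start]


def generate_collocations(tokens: List[str], pp_start: int,
                          pp_len: int = 2, size: int = 20) -> List[str]:
    """Pair every token in the window with the PP to form one web query per candidate."""
    pp = " ".join(tokens[pp_start:pp_start + pp_len])
    return [f"{binder} {pp}" for binder in candidate_window(tokens, pp_start, size)]


if __name__ == "__main__":
    # Sentence (2) in Buckwalter transliteration; the PP "fy Alfylybyn" starts at index 5.
    sentence = "fy bEvp tdrybyp lmkAfHp AlArhAb fy Alfylybyn".split()
    for collocation in generate_collocations(sentence, pp_start=5):
        print(collocation)
```

Each generated string would then be submitted to a search engine as a query to obtain its frequency.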

Web frequencies are collected for each candidate binder. Results for the sample candidates given in table (2) are shown in table (3):

Collocation | Araby | Search
الارهاب في الفيليبين | 3 | 15
لمكافحة في الفيليبين | 0 | 0
تدريبية في الفيليبين | 0 | 0
... في الفيليبين | 0 | 0

Table (3): Web Frequencies for Sample Bigrams

According to the web frequencies, CP is measured for each collocation. The candidate binder with the highest CP score is selected as the correct one. For sentence (2) above, this candidate was الارهاب /AlArhAb/ (terrorism).

3.5. Evaluation

The authors apply two evaluation methodologies. The first compares results to a baseline model, which attaches the target PP to the closest NP/VP. The second uses a gold-standard set.
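The selection step of Section 3.4 and the closest-binder baseline can be sketched side by side. The function names are hypothetical and all counts in the usage lines are illustrative. The sketch assumes that CP is computed as the collocation count divided by the binder count, that a PP is left unresolved when no collocation has a non-zero count, and that "closest" means the last candidate before the PP, since no chunker is assumed.

```python
from typing import Dict, List, Optional


def select_binder(candidates: List[str],
                  bigram_hits: Dict[str, int],
                  binder_hits: Dict[str, int]) -> Optional[str]:
    """Return the candidate with the highest CP, or None if every score is zero."""
    best, best_cp = None, 0.0
    for binder in candidates:
        y = binder_hits.get(binder, 0)
        if y == 0:
            continue  # no evidence for this binder at all
        cp = bigram_hits.get(binder, 0) / y
        if cp > best_cp:
            best, best_cp = binder, cp
    return best


def baseline_binder(candidates: List[str]) -> Optional[str]:
    """Baseline sketch: attach the PP to the closest preceding candidate."""
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    candidates = ["tdrybyp", "lmkAfHp", "AlArhAb"]
    bigram_hits = {"tdrybyp": 0, "lmkAfHp": 0, "AlArhAb": 3}          # illustrative
    binder_hits = {"tdrybyp": 500, "lmkAfHp": 800, "AlArhAb": 1200}   # hypothetical
    print(select_binder(candidates, bigram_hits, binder_hits))
    print(baseline_binder(candidates))
```

Returning None for an all-zero window is one way to model PPs that the algorithm cannot resolve, which is what keeps recall below 100% in this sketch.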

The gold-standard set consists of 1000 PPs, randomly extracted from Arabic TreeBank 3⁵, where all PPs are disambiguated. The evaluation metrics used are precision, recall and the F-measure. Precision is "a measure of the proportion of selected items that the system got right" (Manning and Schütze 2002: 268). It is calculated as follows:

$$\text{Precision} = \frac{\text{Correctly resolved events}}{\text{Correctly resolved events} + \text{Incorrectly resolved events}}$$

Recall is "the proportion of the target items that the system selected" (Manning and Schütze 2002: 268). It is calculated as:

$$\text{Recall} = \frac{\text{Correctly resolved events} + \text{Incorrectly resolved events}}{\text{Total number of events}}$$

F-measure is the harmonic mean of precision and recall. It is calculated as:

$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
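The three metrics, exactly as defined above, can be computed from raw counts with a short sketch. The function name is hypothetical and the counts in the usage line are illustrative: they describe a system that attempts all 1000 PPs and resolves 660 of them correctly.

```python
from typing import Tuple


def evaluate(correct: int, incorrect: int, total: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) using the definitions in Section 3.5.

    Recall here is the share of events the system attempted (correct + incorrect)
    out of all events, following the formula given in the text.
    """
    resolved = correct + incorrect
    precision = correct / resolved if resolved else 0.0
    recall = resolved / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    p, r, f = evaluate(correct=660, incorrect=340, total=1000)
    print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```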

4. Results and Discussion

Compared to the aforementioned gold-standard set, the baseline model achieves a precision of 66%, a recall of 100% and an F-measure of 79.5%. Using the same gold-standard set, the following performance rates were achieved, depending on the search engine used in computing the conditional probabilities:

Search Engine | Precision | Recall | F-Measure
Araby.com | 0.794 | 0.807 | 0.8004
Search.com | 0.731 | 0.9185 | 0.814
Exalead.com | 0.8475 | 0.747 | 0.794
FindForward.com | 0.776 | 0.8781 | 0.822

Table (4): Evaluation Results of the Used Search Engines

5 Arabic TreeBank is a 1-million-word corpus which contains POS tags and parses. It is available through LDC, Catalogue Number LDC2005T20, URL: http://www.ldc.upenn.edu/
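As an arithmetic check, the F-measure column of Table (4) can be recomputed from the precision and recall columns using the formula in Section 3.5. The sketch below uses only the figures reported in the table; the function and variable names are illustrative. The recomputed values agree with the reported ones to within about 0.002.

```python
def f_measure(precision: float, recall: float) -> float:
    """F-Measure = 2 * P * R / (P + R)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# (precision, recall, reported F-measure) as listed in Table (4)
TABLE_4 = {
    "Araby.com": (0.794, 0.807, 0.8004),
    "Search.com": (0.731, 0.9185, 0.814),
    "Exalead.com": (0.8475, 0.747, 0.794),
    "FindForward.com": (0.776, 0.8781, 0.822),
}

if __name__ == "__main__":
    for engine, (p, r, reported) in TABLE_4.items():
        print(f"{engine}: recomputed F = {f_measure(p, r):.4f}, reported F = {reported}")
```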

Recall errors are caused by deficiencies in the search engines used, not by the window size. The heuristically assumed window size of −20 words covers 100% of the tested cases. However, some collocations are assigned zero frequencies by the search engines although such collocations can exist, as in the following sentence:
Transliteration: /w y*kr An jmAEp " Abw syAf " tmknt mn AlnjAp mn Emlyp wAsEp nf*hA Aljy$ Alfylybyny fy AlAdgAl fy jnwb/
Translation: "Abu Saiaf" managed to survive an all-out operation carried out by the Philippine army in the jungle in the south of ...

5. Conclusion and Future Work

This paper presented an algorithm for Arabic PP attachment that depends on tokenized corpora and web-based collocations and achieves a performance rate of ≈ 82%. Compared to the baseline model, the precision of the algorithm is ≈ 10% higher, which is a significant improvement.

The proposed algorithm can be improved in various ways. First, other search engines can be tested for better recall/precision. There are many Arabic-based search engines, such as www.ayna.com, www.amamk.com and many others. These Arabic-based search engines need to be tested for their precision and recall.

Second, a filtration methodology should be developed to filter out unwanted binders. One option is to use an Arabic chunker, which would reduce the search space and limit it to complete linguistic units rather than individual words.

References

Banko, M. and Brill, E. (2001). Scaling to Very Large Corpora for Natural Language Disambiguation. Proceedings of ACL, 2001.

Bikel, D. (2004). On the Parameter Space of Generative Lexicalized Statistical Parsing Models. PhD Thesis. University of Pennsylvania.

Buckwalter, T. (2002). Issues in Arabic Orthography and Morphology Analysis. Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004), 31-34.

Chalabi, A. (2001). Sakhr Web-Based Arabic-English MT Engine. Proceedings of the Association for Machine Translation in the Americas (AMTA'98), Toulouse, France, July 2001, 518-521.

Diab, M., Hacioglu, K. and Jurafsky, D. (2004). Automatic Tagging of Arabic Text: from Raw Text to Base Phrase Chunks. In Dumas, S., Marcus, D. and Roukos, S. (Eds.). HLT-NAACL 2004: Short Papers (pp. 140-152). Boston: Association for Computational Linguistics.

Elghamry, K., El-Zeiny, N. and Al-Sabbagh, R. (2007). Arabic Anaphora Resolution Using the Web as Corpus. Proceedings of The 7th Conference on Language Engineering, Egypt, 5-6 December 2007.

Habash, N. (2007). Syntactic Preprocessing for Statistical Machine Translation. Proceedings of the Machine Translation Summit (MT-Summit), Copenhagen, Denmark, 2007.

Jurafsky, D. and Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. New Jersey: Prentice Hall Ltd.

Kulick, S., Gabbard, R. and Marcus, M. (2006). Parsing the Arabic Treebank: Analysis and Improvements. Treebanks and Linguistic Theories.

Manning, C. and Schütze, H. (2002). Foundations of Statistical Natural Language Processing. London: The MIT Press.

Volk, M. (2001). Exploiting the WWW as a Corpus to Resolve PP Attachment Ambiguities. Proceedings of Corpus Linguistics. Lancaster, UK, 2001.
