To be published in “Perspectives on Arabic Linguistics”

The Feasibility of Using the Web in Building Sense-Tagged Corpora for Arabic

Khaled Elghamry
e-mail address: [email protected]

Abstract

There are almost no resources for Arabic Word Sense Disambiguation (WSD). This paper presents a new method that uses an Arabic-English dictionary and the World Wide Web to build Arabic sense-tagged corpora. The method is based on two main assumptions. The first is that senses of ambiguous words in one language are often translated into distinct words in a second language. Accordingly, second-language translations can be used as an approximate sense inventory for the first language. The second assumption is that Web pages that contain the first-language ambiguous word and its second-language translations provide ‘pretty good’ information for word sense disambiguation. The feasibility of the suggested method is tested on a set of Arabic polysemous nouns extracted from the Linguistic Data Consortium Arabic Treebank. The initial results are encouraging and show that the Web is a good source of information for Arabic word sense disambiguation.

1. Introduction

Given an instance of a polysemous word in a certain context, word sense disambiguation (WSD) involves determining which sense of this word is used in this particular context. This means that there are two main requirements for WSD. The first is a sense inventory for the words in a given language. The second is an algorithm that matches senses with contexts. WSD is essential for many natural language processing applications and tasks such as machine translation, information retrieval, content and thematic analysis, parsing and speech recognition (Ide and Véronis, 1998). Therefore, much recent work has been done on finding sources of sense inventories and WSD algorithms. Sources for predefined word senses include lists of senses in monolingual dictionaries, thesauri, and

bilingual dictionaries and corpora. WSD algorithms depend on two main sources of information: information from an external knowledge source (knowledge-driven WSD), or information about the contexts of previously disambiguated instances of the word derived from corpora (data-driven or corpus-based WSD). Any of a variety of association methods is used to determine the best match between the current context and one of these sources of information, in order to assign a sense to each word occurrence. To the best of the author’s knowledge, there are no resources for Arabic Word Sense Disambiguation: no sense-tagged corpora for training and no sense inventory (with the exception of the ongoing work on the Arabic WordNet). There have been some attempts to use both parallel text and a sense inventory for the target language in order to bootstrap sense-tagged data for Arabic. For example, Diab (2004) exploits translation correspondences between words in an aligned parallel Arabic-English corpus to annotate Arabic text using the English WordNet taxonomy. However, this approach to WSD is limited by the amount of available parallel corpora, among other limitations (Ng et al. 2003). This paper follows the same approach of using bilingual bootstrapping to relieve the knowledge acquisition bottleneck in WSD, yet without the need for an “explicit” parallel corpus. Instead, this paper uses the whole World Wide Web as a bilingual corpus that can be mined for WSD-related information. This paper is organized as follows. Section 2 reviews related previous research on word sense disambiguation and on using the Web as a corpus in natural language processing tasks. Section 3 describes the suggested method for creating sense-tagged corpora for Arabic and discusses the results. The final section draws some conclusions and points to some directions for future research.

2. Previous Work

This section reviews related previous research on two main points. The first is the bilingual approach to WSD, and the other is Web-based research on natural language processing in general, and WSD in particular.

2.1 The Bilingual Approach to WSD

The main idea behind the bilingual approach to WSD is that different senses of ambiguous words in one language are often translated into distinct words in another language, with the particular choice depending on the translator and the contextualized meaning; thus the corresponding translation can be thought of as a sense indicator for the instance of the word in its context (Brown et al. 1991, Resnik and Yarowsky 1998, Chugur et al. 2002, Ide et al. 2002, Ng et al. 2003, Diab 2004, Diab and Resnik 2002). For this approach to give significant results, one of the two languages should be resource-rich, and there should exist an aligned parallel corpus of the two languages. This approach was used to build sense-tagged corpora for Arabic (Diab 2004) and for Chinese (Ng et al. 2003).

Diab (2004) used translational correspondences between words in a parallel Arabic-English corpus to annotate Arabic text using the English WordNet taxonomy. Diab reported 90% accuracy in sense-tagging the evaluated Arabic data. However, Diab did not mention a baseline performance against which this level of accuracy can be evaluated. As noted by Ng et al. (2003), by tying sense distinction to the different translations in a target language, this bilingual approach to acquiring a sense-tagged corpus introduces a “data-oriented” view of sense distinction and serves to add an element of objectivity to sense definition. Moreover, WSD has been criticized as addressing an isolated problem without being grounded in any real application. By defining sense distinction in terms of different target translations, the outcome of WSD of a source

language word is the selection of a target word, which directly corresponds to word selection in machine translation. However, while this use of parallel corpora for word sense disambiguation seems appealing, several practical issues arise in its implementation. The first is that parallel corpora of limited size are available for only a few languages. The other is that even if we can obtain large parallel corpora in the long run, having them manually word-aligned would be too time-consuming and would defeat the original purpose of getting a sense-tagged corpus without manual annotation. The present paper tries to address these issues by using a bilingual dictionary and the whole Web as a bilingual corpus.

2.2 Web as Corpus

With the comeback of the statistical data-driven approach to natural language processing tasks in the 1980s, the size of the corpus used as a source of linguistic information has become an important factor in the performance of the methods following this approach in a given task. The assumption is that the larger the corpus, the better the performance. The dynamic nature of the World Wide Web makes it a perfect place to acquire very large corpora for a given natural language processing task. It is free and immense; it contains hundreds of billions of words of text (Kilgarriff and Grefenstette 2003). Recently the Web has been tried for tasks on various linguistic levels: lexicography, syntax, semantics and translation. Jacquemin and Bush (2000) used the Web in learning and classifying English proper names. Web-based frequencies have been used to resolve PP-attachment ambiguities in German (Volk 2002b) and in Arabic (Al-Sabbagh and Elghamry 2008). Agirre et al. (2000) presented a method for enriching the WordNet ontology using the Web. Grefenstette (1999) has shown that Web frequencies can be used to find the correct translation of German compounds. Resnik

(1999) developed a method to automatically mine the Web for parallel texts that can later be used in other NLP tasks. Elghamry (2008) used the Web for building a hypernymy-hyponymy lexicon for Arabic. The Web has also been used in WSD-related tasks. In Santamaría et al. (2003), WordNet senses were automatically associated with Web directories. The hypothesis of Santamaría et al. is that one or more assignments of Web directories to a word sense can be an enormously rich and compact source of topical information about the word sense, which includes both the hierarchy of associated subdirectories and the Web pages beneath them. Mihalcea (2002) created a set of seeds extracted from WordNet. The Web was searched using queries formed with the seed expressions. Finally, the words surrounding the seed expressions were disambiguated, which in turn served as new seed expressions for a new bootstrapping iteration. The sense-tagged corpus generated with this approach was tested in the Senseval-2 WSD task, with excellent results: the system performed best in both the English ‘lexical sample’ and ‘all words’ tasks, and a good part of the success is due to the Web-acquired corpora. For instance, in the all-words task, the first-sense heuristic gives 63.9% precision; if only WordNet is used for training, the result is 65.1% (+1.2 absolute improvement). The same algorithm, trained with the Web-based corpus, achieves 69.3% precision (+5.4 absolute improvement).

3. Our Approach

The approach presented here for creating sense-tagged corpora for Arabic is based on two main assumptions. The first is that senses of ambiguous words in one language are often translated into distinct words in a second language. Accordingly, second-language translations are used as an approximate sense inventory for the first language. The result of this is that every sense of the first-language polysemous word is associated with a translation in the second language.

The second assumption is that Web pages that contain the first-language ambiguous word and its second-language translations provide ‘pretty good’ information for word sense disambiguation. This means that if we search the Web using the polysemous word together with each of its second-language translations, we can acquire contexts in which the polysemous word occurs alongside an indication of the translation, and consequently of the sense of the polysemous word in that particular context. It is important to mention here that the second language is used only in the initial phase of bootstrapping seed contexts where the first-language word co-occurs with its second-language translations. Then the monolingual part containing the first-language polysemous word in these contexts is used to bootstrap more contexts for this word.

3.1. Generation Algorithm

The process of acquiring training contexts for word sense disambiguation is carried out according to the following steps:

Step 1: An Arabic-English dictionary is used to get all the possible English translations of the Arabic polysemous word. For this purpose, the Arabic-English glosses in Buckwalter’s (2000) morphological analyzer were used as an Arabic-English dictionary. Of course, any Arabic-English machine-readable dictionary can be used for this purpose. Using Buckwalter’s was mainly motivated by its availability and by the richness of the linguistic information provided by the whole morphological analyzer.

Step 2: Search terms are made of the Arabic word and every English translation. To maximize the search results, different inflectional forms of the Arabic word are used separately in Web searches.

Step 3: A search engine is used to get contexts where these search terms occur.

Step 4: Trigrams centered around the Arabic polysemous word are extracted from the snippets acquired in the previous step and used as search terms.

Step 5: The search engine is used again to get contexts where these trigrams occur.

After cleaning the HTML markup, numbers, punctuation and Latin characters in the search results, the output of this process is a set of snippets containing the Arabic polysemous word in context, where every snippet is associated with an English translation of the Arabic word. A sketch of this generation pipeline is given after Figure 1 below. The following example shows three different snippets for the three different English translations of the Arabic polysemous noun AlnZAm (order/regime/system). (Throughout this paper, Arabic examples are given in Buckwalter transliteration, followed by their English translations; see Appendix A for the Buckwalter Arabic Transliteration Scheme.)

order:

qrA'p AlnZAm AlEAlmy Aljdyd wAlTEn fyh...
...understanding the new world order and criticizing it...

regime:
yzEmwn >nhm hm wHdhm qArEwA AlnZAm Alswry fy lbnAn...
...They claim that only they have rejected the Syrian regime in Lebanon...

system:
yEd brnAmj AstEAdp AlnZAm mn >fDl AlbrAmj Alty tSAHb >nZmp...
...The system restore program is one of the best programs that come with the systems...

Figure 1: Sample snippets for the Arabic polysemous noun AlnZAm (order/regime/system)
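The pipeline behind these snippets can be summarized in code. The following is only a minimal sketch, not the original implementation: the gloss_dictionary mapping, the web_search() helper and all function names are assumptions introduced here for illustration; the paper itself used the Buckwalter glosses and the Google search results.

# A minimal sketch of the five-step generation algorithm (Python).
# The gloss dictionary format and the web_search() helper are hypothetical.

import re

def translations(arabic_word, gloss_dictionary):
    """Step 1: all English translations of the Arabic polysemous word."""
    return gloss_dictionary.get(arabic_word, [])

def seed_queries(arabic_forms, english_translations):
    """Step 2: pair every inflectional form with every English translation."""
    return [(form, en) for form in arabic_forms for en in english_translations]

def clean(snippet):
    """Strip HTML markup, digits, punctuation and Latin characters."""
    snippet = re.sub(r"<[^>]+>", " ", snippet)
    snippet = re.sub(r"[A-Za-z0-9]", " ", snippet)
    snippet = re.sub(r"[^\w\s]", " ", snippet)
    return re.sub(r"\s+", " ", snippet).strip()

def acquire_contexts(arabic_word, arabic_forms, gloss_dictionary, web_search):
    """Steps 3-5: bootstrap (translation, snippet) training contexts."""
    contexts = []
    for form, en in seed_queries(arabic_forms, translations(arabic_word, gloss_dictionary)):
        for snippet in web_search(form + " " + en):            # Step 3: seed search
            words = clean(snippet).split()
            if form not in words:
                continue
            i = words.index(form)
            trigram = " ".join(words[max(0, i - 1):i + 2])      # Step 4: center trigram
            for hit in web_search(trigram):                     # Step 5: expand
                contexts.append((en, clean(hit)))
    return contexts

The design choice worth noting is that the English translation is attached only during the seed search; the expansion in Step 5 is purely monolingual, as described above.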

These snippets represent the set of training examples for sense disambiguation. Every Arabic content word in a given snippet is expected to provide information about the sense of the Arabic polysemous word in this particular snippet, where different senses are

indicated by the English translations associated with the snippets. The contexts retrieved this way usually include bilingual dictionaries, bilingual glossaries, and in-text bilingual keywords, as well as parallel texts.

3.2. Experiments

The previous generation algorithm was tested on 10 polysemous nouns extracted from the Linguistic Data Consortium Arabic Treebank (Part 2 v 2.0, LDC Catalog No. LDC2004T02). These nouns were chosen such that every noun occurs at least 5 times in the Treebank and at least two senses of the target noun are represented in the corpus. All the sentences containing the test nouns were extracted from the corpus. If a target noun has more than one part of speech, only sentences where the target word occurs as a noun are kept as part of the test sentences. Then a gold standard was created in the following manner:

1. All these sentences were given to three professional translators who are native speakers of Arabic and are not directly or indirectly related to the present research project.

2. Every annotator was given the list of Arabic test nouns and their English translations as given by the Arabic-English dictionary explained above, and was asked to annotate every sentence containing the target noun with the closest English translation for this noun.

3. The agreement of two annotators on a translation was considered enough to accept this translation as the correct one for the target noun.

The output of this human annotation process is a set of test sentences containing the target nouns, where every sentence is annotated with the appropriate English translation of

the target noun in the given sentence. There were no disagreements among the annotators, and all test sentences were annotated. These annotated sentences were then used to establish a baseline performance for every test noun in the following manner. Every English translation for every test noun was counted to determine the most frequent translation for this particular noun. The count of the most frequent translation was then divided by the total count of all the translations of the same noun (a small sketch of this computation follows Table 1). Table 1 shows the Arabic test nouns, their English translations, the frequency of every translation, and the baseline for every noun.

Word        Translation      Freq    Baseline
AljAmEp     University       188     0.83
            League            38
AlqAEdp     Qaida             86     0.70
            Base              36
            Rule               0
AlnZAm      Regime            94     0.55
            System            56
            Order             21
AlbTwlp     Championship      75     0.90
            Heroism            4
            Starring           4
Al|vAr      Effects           14     0.54
            Antiquities        7
            Traces             5
Alrswm      Fees (taxes)      29     0.85
            Drawings           5
AlsyAsp     Policy           105     0.74
            Politics          37
Almrwr      Passing           49     0.86
            Traffic            8
Al$hAdp     Certificate       18     0.51
            Testimony         12
            Martyrdom          5
            Witness            0
AlAEtrAf    Recognition       17     0.57
            Admitting          9
            Confession         4
Average Baseline                     0.71

Table 1: Arabic test nouns, English translations, and baseline performance
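For concreteness, the baseline for each noun is simply the relative frequency of its most frequent gold translation. The sketch below is an assumption-laden illustration rather than the paper's code: it represents the gold annotations of one noun as a plain list of translation labels and reproduces the Table 1 baseline for AlnZAm.

from collections import Counter

def most_frequent_translation_baseline(gold_translations):
    # Count of the commonest English translation divided by the
    # total number of annotated instances of the noun.
    counts = Counter(gold_translations)
    return max(counts.values()) / sum(counts.values())

# Counts reported for AlnZAm in Table 1: 94 'regime', 56 'system', 21 'order'.
alnzam = ["regime"] * 94 + ["system"] * 56 + ["order"] * 21
print(round(most_frequent_translation_baseline(alnzam), 2))   # 0.55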

The performance of the suggested method is measured by how well it performs compared to this baseline performance. The generation algorithm was then used to automatically acquire a set of training contexts for each one of these nouns, as explained above. Google was used to get the Web

search results. The total number of snippets acquired from the Web for the test nouns was 15,160, with an average of about 1,500 snippets for every test noun. These contexts were used to tag the polysemous Arabic test nouns in the test sentences with the appropriate English translations using Web-based Pointwise Mutual Information. The idea of computing an association measure using statistics obtained from an Internet search engine was first introduced by Turney (2001), who proposed the Web-based Mutual Information (WMI) method. Generally, the Pointwise Mutual Information of two words w1 and w2 is:

1. $MI(w_1, w_2) = \log_2 \dfrac{P(w_1, w_2)}{P(w_1)\, P(w_2)}$

The mutual information between two words can be seen as the (log of the) ratio between the probability of seeing one of the two words given that we have seen the other and the context-independent probability of seeing that word. On mutual information see, e.g., Church and Hanks (1989). The Web-based version of Mutual Information is computed as follows; see Turney (2001) and Baroni and Vegnaduzzo (2004) for the derivation of the formula in 2 from the formula in 1.

2. $WMI(w_1, w_2) = \log_2 \dfrac{N \cdot hits(w_1\ \mathrm{NEAR}\ w_2)}{hits(w_1)\, hits(w_2)}$

Where:

a. hits(w1 NEAR w2) is the number of documents retrieved by AltaVista for a query in which the two target words are connected by the NEAR operator,
b. hits(wn) is the number of documents retrieved for a single-word query, and
c. N is the number of documents indexed by AltaVista (1 billion).

In this paper, the search operator NEAR was not used. Accordingly, Mutual Information was computed based on the snippets acquired from the search engine, using the following modified formula:

3. $WMI(E, A) = \log_2 \dfrac{M \cdot snips(E, A)}{snips(E)\, snips(A)}$

where:

a. M is the number of snippets for every polysemous Arabic noun,
b. E is the English translation associated with every snippet,
c. A is an Arabic content word co-occurring in the same snippet with the Arabic polysemous noun,
d. snips(E, A) is the number of times A occurs in a snippet associated with the English translation E,
e. snips(E) is the number of snippets associated with the English translation E, and
f. snips(A) is the number of snippets for every polysemous noun containing the content word A.

The formula in 3 was used for sense disambiguation in the following manner:

1. The sentence containing the polysemous noun was taken as the window size. Every test sentence containing the polysemous Arabic noun was compared to all the snippets containing this particular noun.

2. Every time there is a common content word between the test sentence and a snippet containing the target noun, the score of the English translation associated with that snippet is increased by the weight of this word as computed by the formula in 3.

3. The translation with the highest WMI score is then chosen as the correct translation (i.e., sense) for the ambiguous word in the given sentence.

A sketch of this scoring procedure is given below.
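The sketch assumes that each acquired snippet is represented as a pair of its associated English translation and its set of Arabic content words; the function names and this data representation are introduced here for illustration and are not taken from the paper.

import math
from collections import defaultdict

def wmi_scores(sentence_words, snippets):
    """Score candidate English translations for one test sentence.

    sentence_words: Arabic content words of the test sentence.
    snippets: list of (english_translation, set_of_arabic_content_words)
              acquired from the Web for the target polysemous noun.
    """
    M = len(snippets)                       # snippets for this noun
    snips_E = defaultdict(int)              # snippets per English translation
    snips_A = defaultdict(int)              # snippets containing each Arabic word
    snips_EA = defaultdict(int)             # joint counts snips(E, A)
    for english, words in snippets:
        snips_E[english] += 1
        for w in words:
            snips_A[w] += 1
            snips_EA[(english, w)] += 1

    scores = defaultdict(float)
    sentence_words = set(sentence_words)
    for english, words in snippets:
        for w in sentence_words & words:    # shared content word
            wmi = math.log2(M * snips_EA[(english, w)]
                            / (snips_E[english] * snips_A[w]))
            scores[english] += wmi          # add the word's weight (formula 3)
    return scores

def disambiguate(sentence_words, snippets):
    scores = wmi_scores(sentence_words, snippets)
    return max(scores, key=scores.get) if scores else None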

3.3. Results and Discussion

The ‘precision and recall’ evaluation metric was used to measure the performance of the algorithm in the disambiguation of the Arabic test nouns. ‘Recall’ is the ratio of the test examples the algorithm was able to disambiguate, correctly or incorrectly, to all the test examples. ‘Precision’ is the ratio of the examples correctly disambiguated to all the test examples that were disambiguated. A toy computation of these two metrics is sketched below. Table 2 shows the results of the performance of the suggested method in disambiguating the Arabic test nouns.
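The following restates the two definitions in code; the dictionary-based representation of predictions, with None marking an example the algorithm could not disambiguate, is an assumption made here for illustration only.

def precision_recall(predictions, gold):
    # predictions: test_id -> predicted translation, or None if the
    # algorithm could not disambiguate the example.
    # gold: test_id -> correct translation from the gold standard.
    attempted = {i: p for i, p in predictions.items() if p is not None}
    correct = sum(1 for i, p in attempted.items() if p == gold[i])
    recall = len(attempted) / len(gold) if gold else 0.0
    precision = correct / len(attempted) if attempted else 0.0
    return precision, recall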

Word        Translations                                  Baseline  Precision  Recall
AljAmEp     University, League                            0.83      0.80       1
AlqAEdp     Qaida, Base, Rule                             0.70      0.79       1
AlnZAm      Regime, System, Order                         0.55      0.61       1
AlbTwlp     Championship, Heroism, Starring               0.90      0.92       1
Al|vAr      Effects, Antiquities, Traces                  0.54      0.65       1
Alrswm      Fees, Drawings                                0.85      0.85       1
AlsyAsp     Policy, Politics                              0.74      0.75       1
Almrwr      Passing, Traffic                              0.86      0.82       1
Al$hAdp     Certificate, Testimony, Martyrdom, Witness    0.51      0.60       1
AlAEtrAf    Recognition, Admitting, Confession            0.57      0.57       1
Average Performance                                       0.71      0.74       1

Table 2: Precision and recall rates of the algorithm performance on the test nouns

As Table 2 shows, the overall average performance of the suggested method is 3 points higher than the baseline performance. The recall rate, on the other hand, was perfect. In more specific terms, the algorithm performed higher than the baseline on 6 out of the 10 test nouns, the same as the baseline on 2, and lower than the baseline on 2. The following error analysis highlights the main sources of errors that negatively affect the performance of the suggested method. The first source of errors is when the second-language

translation is itself ambiguous and systematically provides misleading features in the training contexts, as the following snippet illustrates:

wttmyz AljAmEp bwjwd 23000 TAlb bhA, whY tEd >kbr AljAmEAt AlmnDmp lrAbTp IVY LEAGUE...
...And the University has 23000 enrolled students, and is one of the largest Ivy League Universities...

Figure 2: A sample snippet containing an ambiguous English word

In this case, the snippet is incorrectly associated with the English translation ‘league’, which provides the disambiguation algorithm with inaccurate information. The second source of errors is machine-translated Web documents, a problem closely related to the ambiguity of the second-language translation, as illustrated by the following snippet of an English phrase and its MT Arabic translation.

Responses to “Touch Order Allows You To Place Order At McDonald’s Via Handset”
Alrdwd ElY "AtSAl AlnZAm ytyH lk mkAnA fy AlnZAm mAkdwnAldz Ebr smAEp"

Figure 3: A sample snippet of English-to-Arabic MT-translated text

The source of error in this snippet is the incorrect MT translation of the English word ‘order’ in the English phrase as ‘AlnZAm’. The manual examination of the first 100 snippets for every test noun showed that this type of ‘contaminated’ search results represented almost 4% of all the snippets. Given this relatively significant percentage, the automatic identification of this MT-contaminated portion of the Web becomes an important task for the Web-based approach to natural language processing.

The third source of errors results from the fact that the Arabic script is shared with Urdu, Farsi, Dari and Pashto. Errors will occur only when the snippet contains word forms that are common between Arabic and the other languages using the same script, as the following Farsi snippet shows.

Please express the entirety of the order and what is your opinion about the ....
...nyz dlAltY br ArtkAb bzh tblyg Elyh nZAm ndArnd mrjE tjdyd tnZyr...

Figure 4: A sample of non-Arabic snippets retrieved from the Web

The manual examination of the first 100 snippets for every test noun showed that this type of search results represented about 2.5% of all the snippets. However, restricting the search in Google to pages written in Arabic reduced the number of search results significantly, while excluding only very few of the non-Arabic pages written in the same Arabic script. Moreover, testing the algorithm using the snippets retrieved with the language restriction showed a significant deterioration in the performance of the algorithm in disambiguating the test nouns. This ‘script effect’ in the Web search results requires highly accurate, if not perfect, language identification algorithms in order to restrict the Web search results to the language under investigation. The last source of errors is that the search process yields Web documents that contain the Arabic word and its English translation anywhere in the document; such a relatively large search window does not guarantee that the English and Arabic words are translations of each other. In an attempt to solve this problem, other search engines which allow proximity searches, such as AltaVista, were tried. However, the search terms using the proximity operator ‘near’

returned zero results in many cases, which would defeat the original purpose of using the Web for the task. As for the recall rate, Table 2 shows that the suggested method was able to disambiguate, correctly or incorrectly, all the test nouns. This ‘perfect’ recall rate can be mainly attributed to the relatively large number of training contexts acquired from the Web for every polysemous noun. Some changes to the disambiguation experiment were tried to trade recall for higher precision. For example, in one experiment disambiguation was tested on a smaller context around the target noun (a few words on both sides). In other experiments, different Mutual Information thresholds were tried. These changes reduced the recall rates for all nouns and increased the precision rates for some of the test nouns. However, the average overall precision rate was lower than that achieved in the original experiment. Though future research is still needed to establish the plausibility of the proposed approach, it is safe to claim that the overall performance of the suggested method (see Table 2) is encouraging in terms of the feasibility of using the Web as a source of information for sense disambiguation. One difficulty that should be emphasized in this context is that posed by the dynamic nature of the Web. The growing size of the Web, while an advantage in terms of corpus size, poses some challenges for replicating the results of the Web-based approach to natural language processing tasks. These challenges raise some interesting issues that deserve future research.

4. Conclusions and Future Directions

This paper presented a new method for acquiring sense-tagged corpora for Arabic using the Web as a corpus and an Arabic-English dictionary. The performance of the suggested

method in disambiguating a set of Arabic polysemous nouns showed that sense-tagged corpora can be acquired with reasonable accuracy with minimal resources and human supervision. Future work is still needed to test the feasibility of this method with a larger test set and other parts of speech, such as verbs and adjectives. More importantly, the implementation of this method raised two main issues concerning using the Web as a corpus in natural language processing tasks in general. The first is the machine-translation-contaminated portion of the Web, and the other is Web-based language identification. Designing accurate algorithms to handle these two issues is an important topic that deserves future research in order for the Web to be a rich and reliable source of linguistic information for natural language processing tasks in particular, and linguistic research in general.

5. Bibliography

Agirre, E., Ansa, O., Hovy, E. and Martínez, D. 2000. Enriching very large ontologies using the WWW. In Proceedings of the Ontology Learning Workshop, ECAI, Berlin, Germany, 2000.

Baroni, M. and Vegnaduzzo, S. 2004. Identifying Subjective Adjectives through Web-based Mutual Information. In Proceedings of the Conference for the Processing of Natural Language and Speech (KONVENS), Vienna, Austria, September 2004, pp. 17-24.

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. 1991. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 264-270.

Buckwalter, T. 2000. Buckwalter Arabic Morphological Analyzer Version 1.0. LDC Catalog No. LDC2002L49. Linguistic Data Consortium, University of Pennsylvania.

Chugur, I., Gonzalo, J. and Verdejo, F. 2002. Polysemy and sense proximity in the Senseval-2 test suite. In Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pp. 32-39.

Church, K. W. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83.

Diab, M. 2004. An Unsupervised Approach for Bootstrapping Arabic Sense Tagging. In Proceedings of the Arabic Script-Based Languages Workshop, COLING 2004.

Diab, M. and Resnik, P. 2002. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, July 2002.

Elghamry, K. 2008. Using the Web in Building a Corpus-Based Hypernymy-Hyponymy Lexicon with Hierarchical Structure for Arabic. In Proceedings of the 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, March 2008.

Grefenstette, G. 1999. The WWW as a Resource for Example-Based MT Tasks. In ASLIB'99 Translating and the Computer 21, London, UK, November 1999.

Ide, N. and Véronis, J. 1998. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics, 24(1), 1-39.

Ide, N., Erjavec, T. and Tufis, D. 2002. Sense discrimination with parallel corpora. In Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pp. 54-60.

Jacquemin, C. and Bush, C. 2000. Combining lexical and formatting cues for named entity acquisition from the web. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Hong Kong, pp. 181-189.

Kilgarriff, A. and Grefenstette, G. 2003. Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347.

Mihalcea, R. 2002. Bootstrapping large sense-tagged corpora. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Spain, May 2002.

Ng, H. T., Wang, B. and Chan, Y. S. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 2003, pp. 455-462.

Resnik, P. 1999. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland, June 1999.

Resnik, P. and Yarowsky, D. 1999. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 5(2), 113-133.

Al-Sabbagh, R. and Elghamry, K. 2008. A Web-Based Approach for Arabic PP Attachment. In Proceedings of the 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, March 2008.

Santamaría, C., Gonzalo, J. and Verdejo, F. 2003. Automatic association of web directories to word senses. Computational Linguistics, 29(3), 485-502.

Turney, P. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML 2001.

Volk, M. 2002b. Combining Unsupervised and Supervised Methods for PP Attachment Disambiguation. In Proceedings of COLING-2002, Taipei.

Appendix A: The Buckwalter Arabic Transliteration Scheme
