
Using the Web in Building a Corpus-Based Hypernymy-Hyponymy Lexicon with Hierarchical Structure for Arabic

Khaled Elghamry Alsun, Ain Shams University [email protected]

Abstract

Hypernymy-hyponymy links form the backbone of the noun hierarchy in a semantic lexicon. Building these links manually is labor-intensive and time-consuming, and can lead to inconsistencies and to problems in coverage, updating, and scaling up. This paper shows how a corpus-based hypernym-hyponym lexicon with partial hierarchical structure for Arabic can be created directly from the Web with minimal human supervision. The creation method bootstraps the acquisition process by searching the Web for the lexico-syntactic pattern “بعض x مثل y1…yn” (some x such as y1,…yn). The results reported in this paper show the effectiveness of the suggested method and, when compared to the current version of the Arabic WordNet, raise important theoretical and practical issues for (Web-)corpus-based approaches to linguistic knowledge acquisition in general, and to semantic lexicon acquisition in particular.

1. Introduction

Semantic taxonomies such as WordNet [4] are a key source of knowledge for natural language processing applications and provide structured information about semantic relations between words, but they are expensive to build, maintain, and extend. Building such taxonomies is an extremely slow and labor-intensive process. Further, semantic taxonomies are invariably limited in scope and domain, and the high cost of extending or customizing them for an application has often limited their usefulness. Consequently, there has been significant recent interest in finding methods for automatically learning taxonomic relations and constructing semantic hierarchies from large text corpora (e.g. [2], [3], [5], [7], and [8], among others). The main idea in this approach is that many lexical semantic relations, such as the hyponymy relation, can be extracted from text because they occur in detectable lexico-syntactic constructions, such as the English patterns proposed in [5].

INFOS2008, March 27-29, 2008 Cairo-Egypt © 2008 Faculty of Computers & Information-Cairo University

For example, the pattern NP0 such as {NP1, NP2 ..., (and | or)} NPn can be used to conclude that the noun phrases NP1...NPn are hyponyms of the noun phrase NP0. For instance, given the English example “Animals such as rabbits, pigs and goats…”, it can be concluded that rabbits, pigs and goats are kinds of animals.
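The "such as" pattern above can be illustrated with a small regular-expression sketch. This is only an illustration, not the implementation used in [5] or in this paper: it assumes that every noun phrase is a single alphabetic word, and the names `NP`, `SUCH_AS`, and `hearst_such_as` are hypothetical.

```python
import re

# Assumption: a noun phrase is one alphabetic word other than "and"/"or".
NP = r"(?!(?:and|or)\b)[A-Za-z]+"
SUCH_AS = re.compile(
    rf"({NP}) such as ((?:{NP}, )*{NP}(?:,? (?:and|or) {NP})?)"
)


def hearst_such_as(sentence):
    """Return (hypernym, [hyponyms]) for the first 'NP0 such as ...' match, else None."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    # Split the captured list on commas and on the final "and"/"or".
    hyponyms = re.split(r",?\s+(?:and|or)\s+|,\s*", match.group(2))
    return match.group(1), hyponyms


print(hearst_such_as("Animals such as rabbits, pigs and goats are common."))
```

Real noun phrases are often longer than one word, which is why approaches of this kind normally depend on a parser or, as in this paper, on more constrained patterns.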

The researchers mentioned above reported encouraging results on hypernymy-hyponymy acquisition for English and Swedish using this corpus-driven pattern-based approach. This paper uses the same approach, with two modifications, to create a hypernym-hyponym lexicon for Arabic.

The first modification is using more syntactically constrained patterns that are almost always indicative of the IS-A relation. The experiments reported in this paper are restricted to the lexico-syntactic pattern “بعض x مثل y1…yn” (some x such as y1,…yn) in the discovery of the hyponymy relation. This pattern was used because it is easily recognizable, occurs frequently and across text genre boundaries, and above all indisputably indicates the lexical relation of interest (i.e. hypernymy-hyponymy). It is also motivated by the lack of syntactic parsers for Arabic, and by the need for accurate identification of simple noun phrases in the corpus.

The other modification is using the whole World Wide Web to search for contexts of this pattern. This modification is motivated by the restrictive nature of the pattern, which requires a very large corpus in order to yield quantitatively significant results, and by the need for wide domain coverage of the relations.

The rest of the paper is structured as follows. Section 2 describes an implementation of pattern-based, corpus-driven hyponymy discovery on Arabic using the Web as corpus, and reports the results of this implementation. Section 3 describes two evaluation methods of these results. Section 4 discusses the results of the implementation and some of the theoretical and practical issues involved. The last section draws some conclusions and suggests future directions for improvement.

2. Pattern-Based Hyponymy Extraction

The original purpose of this paper was to use the Arabic equivalents of all the Hearst-style lexico-syntactic patterns in Arabic hyponymy extraction. However, initial experiments showed that using these English-based patterns as is with Arabic gave low precision in hyponymy extraction and brought in a lot of noise that requires significant syntactic processing. Examining these results showed that, for these English-based patterns to give good results for Arabic, a good parser is required for identifying noun phrases in particular and for handling Arabic word order in general. Given that such a parser is lacking for Arabic, the initial results were carefully analyzed in order to identify the patterns that gave the best results for Arabic. It was found that “…such as” was such a pattern. This pattern was constrained even further in order to circumvent the need for a full-fledged syntactic parser for Arabic, and to extract more restricted contexts, which support a simpler and more accurate identification of noun phrases. Then the Web was used as a corpus to identify contexts where this pattern occurs. Using the whole Web secures a good quantitative yield for this pattern while maintaining the accuracy that results from its restrictiveness.

2.1 Web Searches

The lexico-syntactic pattern “بعض x مثل y1…yn” was used as a search term on the Web in order to retrieve contexts where this pattern occurs. The pattern was used with variable distance between the two words of the term (“بعض * مثل”, “بعض ** مثل”, etc.) in order to maximize the number of search results returned, as well as the domain coverage of the documents returned. For this purpose, Google was first used to estimate the number of Web documents containing this pattern. Accessed on January 1st, 2nd, and 3rd, 2008, Google gave ≈ 550,000 Web documents for the different forms of the search term. Almost 110,000 of these documents were retrieved and downloaded to the local machine for further processing.
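The variable-distance search terms described above could be generated along the following lines. This is a hedged sketch: the quoting convention, the use of one `*` per intervening slot, and the `max_gap` parameter are assumptions for illustration, not details reported in the paper.

```python
SOME = "بعض"     # bED, "some"
SUCH_AS = "مثل"  # mvl, "such as"


def wildcard_queries(max_gap=3):
    """Build quoted search terms with 1..max_gap wildcard slots between the two words."""
    # Assumption: each "*" stands for one intervening word in the search engine's syntax.
    return [f'"{SOME} {"*" * gap} {SUCH_AS}"' for gap in range(1, max_gap + 1)]


for query in wildcard_queries(2):
    print(query)
```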

2.2 Noun Phrase Identification

Then hyponymy-hypernymy relations were extracted from these documents as follows. Given that the two sides of this relation are noun phrases (either single- or multi-word), some simple rules were used to identify relevant noun phrases. Before applying the NP identification rules, some spelling and textual normalization was applied. All the different forms of Alef /أ|آ|إ/ were normalized as /أ/. Final Haa /ه/ was normalized as taa marbuta /ة/. Final yaa was normalized as /ي/. The conjunction /او/ was changed into /و/ and attached to the beginning of the following word. All punctuation marks, numbers, and Latin characters were removed. (All transliterations in this paper were done using Buckwalter's transliteration scheme. URL: http://www.qamus.org/transliteration.htm)

For possible hypernyms, noun phrases were limited to two-word sequences that occur between the first part (بعض|bED|some) and the second part (مثل|mvl|such_as) of the pattern and satisfy either of the following conditions: (1) the first and second word both start with the definite marker (ال|Al), which identifies mainly simple NPs of the form (N ADJ); (2) the first word does not start with the definite marker but the second word does, which identifies mainly simple IDafa NPs of the form (N1 N2). Limiting the hypernym to two-word NPs was motivated by the results of some test experiments, which showed that (1) this NP length would give the most accurate results, highly specific types of hypernymy, and a good quantitative yield, (2) single-word NPs would give a better quantitative yield but mostly high-level relations, and (3) NPs longer than two words would bring in a lot of noise and a limited quantitative yield.
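The normalization steps listed in Section 2.2 can be sketched as follows. This is a minimal illustration rather than the code used in the experiments. It assumes undiacritized input, it reads "final yaa" as mapping word-final alef maqsura (U+0649) to yaa (U+064A), and the function name `normalize` is hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Apply the spelling normalization described in Section 2.2 (sketch)."""
    # Remove punctuation, digits, and Latin characters; keep Arabic letters and spaces.
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    # Alef forms (U+0623, U+0622, U+0625) -> U+0623, the target form stated in the paper.
    text = re.sub("[\u0623\u0622\u0625]", "\u0623", text)
    # Word-final haa (U+0647) -> taa marbuta (U+0629).
    text = re.sub(r"\u0647(?!\w)", "\u0629", text)
    # Assumption: "final yaa" means word-final alef maqsura (U+0649) -> yaa (U+064A).
    text = re.sub(r"\u0649(?!\w)", "\u064A", text)
    # Standalone conjunction alef+waw (U+0627 U+0648) -> waw, attached to the next word.
    text = re.sub(r"(?<!\w)\u0627\u0648\s+(?=\w)", "\u0648", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

Removing unwanted characters first means that a word followed directly by punctuation or a digit is still treated as word-final by the later rules.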

Possible hyponyms included single-word as well as two-word NPs and were looked for in the word sequence after the second part of the pattern (مثل|mvl|such_as), using the following rules. Two-word NPs were identified according to the same conditions used for hypernym NPs. Single-word NPs were identified using the definite marker (ال|Al). Other NPs were identified as any sequence of these NPs connected with the conjunction /و/. A general condition applied in NP identification was that none of the words making up an NP is among the function words that start with (ال|Al), such as الذي|Al*y, التي|Alty, الذين|Al*yn, اللاتي|AllAty, and الى|Aly.

2.3 Hyponymy Extraction

Identifying a hypernymy-hyponymy relation was based on the assumption that the NP occurring between the two elements of the pattern (بعض NP0) is a hypernym of the NP(s) occurring after the second element of the pattern (مثل NP1…NPn). Applying the NP and hyponymy extraction rules resulted in the identification of 3475 unique contexts and possible hyponymy-hypernymy relations, totaling 37203 context tokens. Table 1 shows some examples of these relations (conjunction /و/ and pattern words removed).
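The NP identification rules and the extraction assumption can be sketched together as below. This is an illustrative approximation, not the paper's implementation. It assumes already-normalized, whitespace-tokenized input and prefers a two-word NP over a single-word NP when both fit. All function names are hypothetical, and the example sentence is made up.

```python
SOME, SUCH_AS = "بعض", "مثل"  # the two pattern words (bED, mvl)
DEFINITE, CONJ = "ال", "و"    # definite marker (Al) and attached conjunction
# Function words listed in the paper; "الي" is the normalized spelling (assumption).
FUNCTION_WORDS = {"الذي", "التي", "الذين", "اللاتي", "الى", "الي"}


def np_kind(w1, w2):
    """Classify a two-word sequence as 'N_ADJ', 'IDAFA', or None."""
    if not w1 or w1 in FUNCTION_WORDS or w2 in FUNCTION_WORDS:
        return None
    if not w2.startswith(DEFINITE):
        return None
    return "N_ADJ" if w1.startswith(DEFINITE) else "IDAFA"


def read_hyponyms(tokens, start):
    """Read NPs joined by the attached conjunction, starting at tokens[start]."""
    hyponyms, j = [], start
    while j < len(tokens):
        word = tokens[j]
        if j > start:  # every NP after the first must carry the conjunction
            if not word.startswith(CONJ):
                break
            word = word[len(CONJ):]
        nxt = tokens[j + 1] if j + 1 < len(tokens) else None
        if nxt is not None and np_kind(word, nxt):
            hyponyms.append(f"{word} {nxt}")  # assumption: prefer the two-word NP
            j += 2
        elif word.startswith(DEFINITE) and word not in FUNCTION_WORDS:
            hyponyms.append(word)
            j += 1
        else:
            break
    return hyponyms


def extract_relations(tokens):
    """Return (hypernym, [hyponyms]) pairs for each 'SOME w1 w2 SUCH_AS ...' context."""
    relations = []
    for i, tok in enumerate(tokens):
        if tok != SOME or i + 3 >= len(tokens) or tokens[i + 3] != SUCH_AS:
            continue
        if np_kind(tokens[i + 1], tokens[i + 2]) is None:
            continue
        hyponyms = read_hyponyms(tokens, i + 4)
        if hyponyms:
            relations.append((f"{tokens[i + 1]} {tokens[i + 2]}", hyponyms))
    return relations


# Made-up example: "some chronic diseases such as diabetes and blood pressure in ..."
example = [SOME, "الأمراض", "المزمنة", SUCH_AS, "السكري", "والضغط", "في"]
print(extract_relations(example))
```

Because the rules rely only on the definite marker and the conjunction, a sketch like this inherits their limits: any word that naturally begins with و or ال is treated as if it carried the conjunction or the definite marker.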

Table 1: Example Hypernyms and Hyponyms

Then all contexts containing the same hypernym were grouped together into a larger set, on the assumption that all hyponyms that occur with the same hypernym in different contexts are examples of this same hypernym. As a result, the 3475 unique contexts were compressed into only 716 hypernym sets. These sets contain 4080 unique nouns and noun phrases, and 5899 tokens. The largest set was that of the hypernym (الأكلات الشعبية/popular foods), which contained 57 unique hyponyms. On the other hand, there were 150 one-member sets. Table 2 shows the breakdown of hypernym sets in terms of cardinality and density, i.e., the number of hyponyms in each set.

Table 2: Breakdown of Hypernym Sets Density

2.4 Hierarchy Construction

Then partial hierarchical structure was added to these relations in the following manner. Given the hypernym (N ADJ) and (N1 N2) noun phrases, the definite marker (ال|Al) was attached to the beginning of the first noun (N1) in the IDafa NP, and then all hypernyms with the same head noun were merged under a super-ordinate containing the original hypernyms as members. Adding (ال|Al) to the first indefinite noun in the IDafa NP was used to capture the similarity of the definite and indefinite forms of the same NP. For example, all hypernyms starting with الأمراض/AlAmrAD/diseases become child nodes of the parent node الأمراض/AlAmrAD/diseases, as illustrated below (see Appendix A for more examples). The result was 322 top-level super-ordinates containing the original 716 hypernym sets.

An important note in this respect is that there were cases of synonymy in hyponyms and hypernyms alike. For example, there were two separate hypernym sets under <الفعاليات_الاجتماعية/social_activities> and <الأنشطة_الاجتماعية/social_activities>. Such cases were treated as two independent sets. This point is not pursued any further in this paper and is left for future research.

3. Evaluation

Before describing how the results were evaluated, a disclaimer is in order. First, a quantitative precision/recall-based evaluation was not feasible given the lack of a gold standard and of a set of pre-defined theoretical criteria for comparing the results with the Arabic WordNet (discussed below). Second, some of the possible hyponymy relations identified required background (and technical) knowledge to be evaluated. Finally, there were some cases where points of view and value judgments were involved in the possible hyponymy relations identified. (Each of these points is elaborated below.) That said, the evaluation reported here is not claimed to be a STANDARD evaluation by any means. It is only meant to give an indication of the performance of the suggested method for hyponymy identification. (The author will make the results available to researchers for feedback, suggestions, and further examination.)

The evaluation was done in two different ways: a manual quantitative evaluation, and a comparison of the results with the Arabic WordNet (http://www.globalwordnet.org/AWN/). The manual evaluation was carried out by three native speakers of Arabic who were not directly involved in this work. Two membership decisions were used in the evaluation: (1) BELONGS, if the hyponym belongs under its hypernym; (2) DOES NOT BELONG, if the hyponym does not belong under its hypernym. If an evaluator was not sure of the relation between a given hyponym and its immediate hypernym, the original document containing the pattern where this relation occurs was given to the evaluator(s) in order to reach a final membership decision. This mostly happened in cases where the judgment required some background (technical) knowledge, such as names of persons, cities, local foods, and medications.

This evaluation was done on two levels: top-level and intermediate-level hypernyms. The only cases that were excluded from evaluation on the intermediate level were hypernym-hyponym relations that express a point of view or a value judgment, as in the following examples.

The agreement of two out of the three human evaluators was considered enough to establish (non)-membership of a given noun phrase in a given hyponymy set. Surprisingly, it was found that almost all nouns and noun phrases belonged to the right hypernymy set, on both (top and intermediate) levels. The underlined words in the following examples were the only exceptions:

In the first example, the error is on the intermediate level, where the noun الهجرة/Alhjrp/immigration is classified as a kind of criminal activity. This error seems to result from noun phrase identification, where this noun seems to be part of a larger noun phrase, الهجرة_غير_الشرعية/illegal_immigration. On the top level, however, there seems to be no problem, since this noun denotes a kind of activity. In the second example, the error occurs on both levels. The source of this error seems to be that the noun الدماغ/Aldmag/the_head is used in this context to mean المخ/Almx/the_brain. Though the results appear highly accurate, the evaluation disclaimer should be kept in mind until further research is conducted to establish evaluation techniques and methods for this type of corpus-driven results.

The evaluation of the results in comparison to the Arabic WordNet (AWN) [6] was restricted to coverage and to single-word noun phrases. Nouns were extracted from the AWN database and normalized in the same way described above. It was found that 548 nouns were common between our results and AWN, and that 2014 were found in our results but not in AWN. For samples of these nouns, see Appendixes B and C, respectively. Two important observations in this respect are that (1) most of the nouns that are in our list but not in AWN are names of persons, places, organizations, local foods, and medications, and (2) some nouns were in our list but not in AWN because of spelling variation, which mostly involved nouns of foreign origin. An interesting example of the second observation is the noun يونسكو/ywnskw, which was among the common nouns with one spelling, يونسكو/ywnskw, and in our list but not in AWN with another spelling, يونيسكو/ywnyskw. It was also found that the AWN database contained 4200 nouns that were not in our results. The two previous observations apply here as well.

4. Discussion

Given the evaluation disclaimer above, the results of the suggested method are cautiously encouraging in terms of accuracy, domain coverage, and specificity. Though there was no systematic evaluation of the domain coverage or the specificity of the obtained hypernymy sets, the results seem to cover a wide range of domains, such as politics, economy, society, medicine, sports, and agriculture, to give a few examples. As for the degree of specificity and the quality of the acquired relations, almost all the hyponymy relations are remarkably specific, such as home furniture, ethnicities, types of acids, types of diseases, ideologies, psychological problems, and local foods, to give a few examples. There were only two related high-level hypernymy sets: الأشياء/Al>$yA'/things|matters and الأمور/Al>mwr/matters. The wide domain coverage of the relations seems to result mainly from using the Web as corpus, whereas the degree of specificity is a byproduct of the specificity and restrictive nature of the lexico-syntactic pattern used in the acquisition process.

However, some issues arose as a result of this corpus-driven Web-based approach to hyponymy acquisition, and they could also apply whenever this approach is used in linguistic knowledge acquisition in general. The first issue is spelling variation. For example, the following noun has three different spellings: يونسيف/ywnsyf, يونيسف/ywnysf, and يونيسيف/ywnysyf. (Interestingly enough, the three different forms of the noun were hyponyms of the same hypernym, i.e., international organizations.) The second issue is the context-dependency and subjectivity of some relations, as mentioned above. This raises the question of how this type of knowledge can be encoded and evaluated, if at all. The third issue is what can be called the temporary status of some hypernymy-hyponymy relations, such as opposition parties. The last issue in this respect is that (technical) background knowledge is sometimes required to evaluate the validity of a given hypernymy relation, as with names of medications and local foods, as mentioned in the evaluation section above.

A number of important issues also emerged while comparing our results to the nouns in the AWN database. The first issue was that of spelling variation mentioned above. The second issue was the different nature of corpus-driven and human-constructed semantic classes. The last issue was that a good number of our corpus-driven relations reflected the point of view and value judgment of the language user, something that is absent in the AWN. These are all open issues that deserve considerable future research. The large number of nouns that were in our list but not in AWN calls for more work in order to augment AWN with these results.
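The coverage comparison against AWN reported in Section 3 amounts to set operations over normalized single-word nouns. The sketch below shows the idea on toy input. The function name `compare_coverage` is hypothetical, and the two Buckwalter-transliterated spellings are the UNESCO variants mentioned in the text, used here only as an illustration and not as the actual noun lists.

```python
def compare_coverage(ours, awn):
    """Split two noun collections into shared and exclusive members."""
    ours, awn = set(ours), set(awn)
    return {
        "common": ours & awn,      # nouns found in both resources
        "ours_only": ours - awn,   # candidates for augmenting AWN
        "awn_only": awn - ours,    # nouns the Web pattern did not yield
    }


# Toy input (assumption): a spelling variant counts as a different noun.
result = compare_coverage({"ywnskw", "ywnyskw"}, {"ywnskw"})
print(result["common"], result["ours_only"], result["awn_only"])
```

Exact string matching of this kind explains why spelling variants inflate the "ours only" count, as observed above for nouns of foreign origin.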

5. Future Directions and Conclusions

This paper presented an experiment on the corpus-driven, Web-based acquisition of a hypernymy-hyponymy lexicon with partial hierarchical structure for Arabic, using the lexico-syntactic pattern “بعض x مثل y1…yn” (some x such as y1,…yn). The results of the experiments were encouraging in terms of coverage, accuracy, and specificity. Similar methods could also be used to automatically acquire other types of lexical-semantic knowledge. Such a Web-based approach is claimed to be as dynamic as the Web itself, which makes updating the acquired resources feasible and less labor-intensive. It can also be used to augment and facilitate the construction of the Arabic WordNet. However, further work is still needed to establish proper techniques for evaluating the performance of this corpus-driven Web-based approach. Future research is also needed to benefit from the results of this approach in augmenting and updating the current Arabic WordNet.

6. Acknowledgments

The author is grateful to the anonymous reviewers for their useful comments and suggestions for improvements. The usual disclaimers apply.

References

[1] Caraballo, Sharon. 1999. Automatic construction of a hypernym-labeled noun hierarchy from text. In 37th Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference, pages 120–126.

[2] Caraballo, Sharon. 2001. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. Brown University Ph.D. Thesis.

[3] Cederberg, Scott, and Dominic Widdows. 2003. Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction. In Proceedings of the Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, pages 111–118.

[4] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

[5] Hearst, Marti A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics.

[6] Elkateb, Sabry, David Farwell, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Arabic Wordnet and the Challenges of Arabic. http://www.mt-archive.info/BCS-2006-Elkateb.pdf

[7] Rydin, Sara. 2002. Building a hyponymy lexicon with hierarchical structure. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, July 2002, pages 26–33. Association for Computational Linguistics.

[8] Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems 17.
