
A Constraint-based Algorithm for the Identification of Arabic Roots

Khaled Elghamry
Indiana University, Department of Linguistics

Abstract

This paper presents two distributional constraints on the relative positions of root letters in an Arabic word. Given these constraints, it is possible to significantly reduce the search space for roots. To test their utility, these constraints were used in an experimental root-identification algorithm that assumes no previous knowledge about Arabic other than that provided by these constraints. This algorithm was tested on an unedited Arabic corpus extracted from the Aljazeera website. The initial results demonstrate that these constraints can be used to improve and speed up the task of Arabic morphological parsing.

Introduction

The traditionally held view of Arabic morphology is that Arabic words are derived from a closed set of roots (see [1] for different counts of Arabic roots in different Arabic dictionaries). The vast majority of these roots have 3 letters, with a few 4- and 5-letter roots. The derivation of a word from a root is usually done in two stages. In the first stage, a stem is generated by applying a template to the root. In the second stage, affixes are added to the stem. For a very brief description of Arabic morphology, see the Appendix in [2].

Finding the root of an Arabic word is an interesting challenge in Arabic morphological parsing in general, and in information retrieval in particular. It has been reported that Arabic information retrieval is enhanced when roots are used in indexing and searching [3], [4], [5]. The majority of previous systems for detecting roots start with dictionaries of possible roots and affixes in Arabic, e.g. [6]. Typically, algorithms in these methods operate in a trial-and-error fashion, trying every possible root until the correct root is found. Some other systems start with less knowledge about Arabic morphology in order to identify roots.
For example, [7] utilizes a list of word-root pairs in order to derive a list of prefixes and suffixes that are then used in root identification. In [2], a clustering algorithm is used to group words that share the same root, based only on their morphological similarities.

This paper introduces a set of distributional constraints on the relative positions of root consonants in an Arabic word. These constraints can be used to significantly reduce the search space in the process of root identification. To test their utility, these constraints were used in an experimental root-identification algorithm that assumes no previous knowledge about Arabic other than that provided by these constraints. This algorithm was tested on an unedited Arabic corpus extracted from the Aljazeera website. The initial results demonstrate that incorporating these constraints in any system for the morphological analysis of Arabic has the promise of speeding up the system and improving its results.

This paper comprises five main sections. The first section overviews the problems and previous systems for Arabic morphology. The second section presents a formal description of the proposed distributional constraints. The third section presents an experimental root-identification algorithm. The fourth section discusses the results of this experiment and the limitations of the algorithm. The last section concludes the paper and points to possible future directions.

Distributional Constraints on Arabic Roots

Let us assume that root consonants occur in sequential order within a word, where the first radical precedes the second, the second precedes the third, and so on. Let us also assume for now that there are no constraints on the linear distance between every two consecutive radicals, where distance is measured in terms of the positions of the two radicals. Given this, the number of possible roots, R, can be calculated using the following equation [8],

$$R = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

where n is the length of the target word, and r is the length of the root. For example, given a word of length 8, the number of possible tri-lateral roots for this word is

$$R = \binom{8}{3} = 56$$

and the number of possible quad-lateral roots is

$$R = \binom{8}{4} = 70$$
Table 1 shows the number of possible tri- and quad-lateral roots as a function of word length.
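The unconstrained count described above (choosing r positions out of n while keeping their left-to-right order) can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `possible_roots` and the range of word lengths printed are assumptions, not taken from the paper.

```python
from math import comb


def possible_roots(word_length: int, root_length: int) -> int:
    """Count the ordered position choices for a root inside a word.

    Radicals keep their left-to-right order, so the count is the
    binomial coefficient C(word_length, root_length).
    The function name is hypothetical and used for illustration only.
    """
    return comb(word_length, root_length)


if __name__ == "__main__":
    # The range 4..10 is an arbitrary illustrative choice of word lengths.
    print("length  tri-lateral  quad-lateral")
    for n in range(4, 11):
        print(f"{n:>6}  {possible_roots(n, 3):>11}  {possible_roots(n, 4):>12}")
```

Because the root length is fixed, the count is a polynomial in the word length, which is why the candidate set becomes unwieldy for long words.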


It is clear from Table 1 that the number of possible roots grows rapidly with word length (on the order of n^3 for tri-lateral roots). Below, I present two distributional constraints on the relative positions of radicals within a word, and show how they provide a method to significantly reduce the number of candidate roots. The focus here is on tri-lateral roots only.

Constraints

The intuition behind these constraints is that if the root plays a central role in Arabic morphology, then the positions of radicals should respect some distributional properties that make them easy to identify. Accordingly, given a word W which has the root (r_i r_j r_k), where each index indicates the position of the respective letter in W, it was found that the following constraints should hold.

Constraint 1: The linear distance between every two consecutive radicals cannot be larger than 3. This means that for a tri-lateral root:

• The linear distance between the first and the second radicals cannot exceed 3.
• The linear distance between the second and the third radicals cannot exceed 3.

So if r_i, r_j, and r_k are the first, second, and third radicals in W, respectively, the following inequalities should hold.

$$j - i \le 3, \qquad k - j \le 3$$

From a generation perspective, this constraint dictates that no morphological operation is allowed to insert more than two letters between any two consecutive radicals. The second constraint controls the distance between the first and third radicals, i.e., the overall root-internal distance (RID).

Constraint 2: The distance between the first and third radicals cannot be larger than 5. This constraint can be written in terms of the distance between these two radicals or, alternatively, in terms of (j-i) and (k-j). Both forms give the same result.

$$k - i \le 5 \quad\Longleftrightarrow\quad (j - i) + (k - j) \le 5$$

The following example shows how the two constraints work. Given a word such as msAgYn (‘prisoners’), which can be represented as follows,

Position:  1  2  3  4  5  6
Letter:    m  s  A  g  Y  n


these constraints exclude the letter triplets in Table 2 from the set of possible roots for this word.

Table 2: Root candidates for ‘msAgYn’ excluded by the constraints

Naturally, the number of excluded roots should grow with word length. In more specific terms, the number of possible tri-lateral roots that satisfy these constraints, Rc, for a word of length n, where n>3, can now be calculated by the following linear equation. In other words, the number of possible roots is a linear function of word length. Figure 1 shows the significant difference between the number of roots with and without constraints. For example, given these constraints, the number of roots for a 9-letter word is only 34, compared to 84 without the constraints.
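The filtering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: "distance" is read literally as the difference between 1-based letter positions, the thresholds 3 and 5 are taken from the wording of the two constraints, and the names `satisfies_constraints` and `candidate_roots` are hypothetical. A different reading of "distance" would shift the thresholds, so the counts this sketch produces need not match the figures quoted in the text.

```python
from itertools import combinations

MAX_GAP = 3   # Constraint 1 (assumed reading): j - i <= 3 and k - j <= 3
MAX_SPAN = 5  # Constraint 2 (assumed reading): k - i <= 5


def satisfies_constraints(i, j, k, max_gap=MAX_GAP, max_span=MAX_SPAN):
    """Check both distributional constraints for positions i < j < k."""
    return (j - i) <= max_gap and (k - j) <= max_gap and (k - i) <= max_span


def candidate_roots(word, max_gap=MAX_GAP, max_span=MAX_SPAN):
    """Split all ordered letter triplets of a word into kept and excluded.

    Each entry is a pair ((i, j, k), letters) using 1-based positions.
    """
    kept, excluded = [], []
    for i, j, k in combinations(range(1, len(word) + 1), 3):
        entry = ((i, j, k), word[i - 1] + word[j - 1] + word[k - 1])
        if satisfies_constraints(i, j, k, max_gap, max_span):
            kept.append(entry)
        else:
            excluded.append(entry)
    return kept, excluded


if __name__ == "__main__":
    # Transliterated example word from the text.
    kept, excluded = candidate_roots("msAgYn")
    print(len(kept), "candidates kept,", len(excluded), "excluded")
    for positions, letters in excluded:
        print("excluded:", positions, letters)
```

Keeping the thresholds as parameters makes it easy to compare alternative readings of the constraints against the reported counts.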

The effect of this reduction is substantial for large corpora and long words. Below, I show how this reduction in the search space for roots can be used in an experimental algorithm for root identification.

Algorithm

The idea behind this algorithm is that with every reduction in the search space, evidence accumulates in favor of or against a certain letter being a member of the actual root. If a letter is frequently excluded by the constraints, this is evidence against this letter being part of the root, and evidence for it being an affix or part of an affix. This idea is implemented in the root identification algorithm as follows. The algorithm runs in two phases. In the first phase, for every word in the corpus, a set of possible roots and a set of possible affixes (i.e., letters that are not part of a possible root) are generated according to the constraints discussed above. Then in the second phase, for every letter, r, we calculate

• its frequency in the corpus, F(r),
• how many times it occurs in the set of possible roots, Fr(r),
• how many times it occurs in the set of possible affixes, Fa(r).

We then calculate

• the number of letter tokens in the set of possible roots, S,
• the number of letter tokens in the set of possible affixes, Z.

Then for every triplet, r_i r_j r_k, in the set of possible roots we get the following pieces of evidence that it is the actual root,

and the following pieces that it is not.

Then these pieces of conflicting evidence are used to calculate an overall score for every triplet as the actual root, in the following manner:

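The first phase and the letter statistics F(r), Fr(r), Fa(r), S, and Z described earlier in this section can be sketched as follows. This is a hedged illustration only. It assumes that "distance" is the difference between letter positions, that every corpus token is processed, and that for each candidate triplet the remaining letters of the word count as possible affix letters. The function names are hypothetical, and the evidence and score formulas are deliberately not implemented because they are not reproduced in this text.

```python
from collections import Counter
from itertools import combinations


def candidate_position_triplets(n, max_gap=3, max_span=5):
    """0-based position triplets allowed by the two constraints (assumed reading)."""
    return [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if j - i <= max_gap and k - j <= max_gap and k - i <= max_span
    ]


def collect_statistics(words):
    """Return F, Fr, Fa as Counters plus the totals S and Z."""
    letter_freq = Counter()  # F(r): letter frequency in the corpus
    root_freq = Counter()    # Fr(r): occurrences among possible roots
    affix_freq = Counter()   # Fa(r): occurrences among possible affixes
    for word in words:
        letter_freq.update(word)
        for positions in candidate_position_triplets(len(word)):
            root_freq.update(word[p] for p in positions)
            # Assumption: letters outside the candidate root are affix letters.
            affix_freq.update(
                word[p] for p in range(len(word)) if p not in positions
            )
    s = sum(root_freq.values())   # S: letter tokens in possible roots
    z = sum(affix_freq.values())  # Z: letter tokens in possible affixes
    return letter_freq, root_freq, affix_freq, s, z


if __name__ == "__main__":
    # Transliterated example words that appear in the text.
    corpus = ["msAgYn", "mtwrTyn", "drAstp", "xwfA"]
    f, fr, fa, s, z = collect_statistics(corpus)
    print("S =", s, "Z =", z)
    for letter in sorted(f):
        print(letter, f[letter], fr[letter], fa[letter])
```

A scoring function built on these counts would then rank the candidate triplets of each word.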

The actual root should be the triplet with the highest score, i.e.,

If two or more triplets share the same maximum score, the most frequent triplet is picked as the actual root, i.e.,

Experiment

This algorithm was tested on an unedited Arabic corpus extracted from the Aljazeera website (http://www.aljazeera.net) on March 20th, 2004. The corpus contained about 6000 tokens and about 2700 unique words. Two words were considered the same only if they were exact matches; otherwise they were considered two different words. No preprocessing was done to the text, and all the computations were done on Arabic script.

Results

The algorithm generated 35886 tokens and 3626 types of possible roots. The performance of the algorithm was evaluated manually in the following way. Foreign names, words that are known to have quad-lateral roots, and words fewer than four letters long were excluded. Having done that, the number of tri-lateral roots output by the algorithm as the correct roots was 2871. Out of these, the algorithm correctly identified 2642, which constituted about 92% of the tri-lateral roots in the corpus. Tables 3, 4, and 5 show the descending scores for every possible root for three example words from the corpus: mtwrTyn (‘involved’: 3rd, Masc., Plural), drAstp (‘study’: noun), and xwfA (‘in fear of’), respectively.


In almost 95% of the cases where the algorithm picked the wrong root, the correct root was among the highest scores. Table 6 shows two of these cases.

Most of the errors occurred in words with geminates (e.g., AlHd: ‘the limit/limiting’, possible root: Hdd), and words with weak roots (e.g., ltSl: ‘to reach’, possible root: WSl). In both cases, there is a missing radical in the surface form of the word. To see the effect of corpus size on the results, the algorithm was tested on bigger corpora. There was no clear correlation between the corpus size and the performance of the algorithm. Likewise, the algorithm did not show any sensitivity to letter normalization.

What is interesting here is that with only two constraints and an experimental algorithm, a correctness rate of 92% in root identification was achieved. Compared to the performance of other systems, this result is clearly promising, given the small number of constraints. This emphasizes the utility of these constraints in root identification.

Limitations and Conclusions

The algorithm presented above is very experimental and serves mainly as a proof-of-concept implementation. The errors in identification seem to result from the way the constraints were used in the construction of the algorithm. A more sophisticated algorithm that makes use of these constraints could give better results. The paper was limited to tri-lateral roots. Future research is still required to check the utility of these constraints with other types of roots.

The main point this paper stresses is that distributional constraints of this type provide a promising direction that is worth pursuing in the computational analysis of Arabic morphology. Moreover, the constraints and the algorithm using them do not seem to have problems with spelling variation in the Arabic script, since the algorithm was implemented without any letter normalization.

References

[1] http://lexicons.sakhr.com/intro/stat.asp
[2] De Roeck, A. N. and Al-Fares, W. A morphologically sensitive clustering algorithm for identifying Arabic roots. In Proceedings ACL-2000, Hong Kong.
[3] Al-Kharashi, I. and Evens, M. Comparing words, stems, and roots as index terms in an Arabic information retrieval. JASIS, 45(8): 548-560, 1994.
[4] Abu-Salem, H., Al-Omari, M., and Evens, M. Stemming morphologies over individual query words for Arabic information retrieval. JASIS, 50(6): 524-529, 1999.
[5] Hmeidi, I., Kanaan, G., and Evens, M. Designing and implementation of automatic indexing for information retrieval with Arabic documents. JASIS, 48(10): 867-881, 1997.
[6] Al-Shalabi, R. and Evens, M. A computational morphology system for Arabic. In Proceedings COLING-ACL, New Brunswick, NJ, 1997.
[7] Darwish, K. Building a shallow Arabic morphological analyzer in one day. In Proceedings of ACL-2002 Workshop on Computational Approaches to Semitic Languages, 2002.
[8] Kenneth H. Rosen (ed.). Handbook of discrete and combinatorial mathematics. CRC Press, New York, 2000.
