2009 First Asian Conference on Intelligent Information and Database Systems

Improved Letter Weighting Feature Selection on Arabic Script Language Identification

Choon-Ching Ng and Ali Selamat
Faculty of Computer Science & Information Systems
Universiti Teknologi Malaysia, Johor, Malaysia
Email: [email protected] & [email protected]

978-0-7695-3580-7/09 $25.00 © 2009 IEEE. DOI 10.1109/ACIIDS.2009.33

Abstract

Language identification is the process of automatically identifying the predefined language of a document; in this paper we focus on web documents. Initially, we applied letter frequencies as features, combined with neural networks, to Arabic script language identification. However, the reliability of the letters selected as features is a major issue to be overcome. We therefore propose an improved letter weighting feature selection in order to enhance the effectiveness of language identification, based on the concept of letter frequency document frequency. From the experiments, we found that the improved letter weighting feature selection achieves the highest accuracy of 99.75% on Arabic script language identification.

1. Introduction

Natural language processing has been a well-known research area since the 1950s. The purpose of language identification (LID) is to identify the predefined language of web documents [1], [2]; it is a core technology in any multilingual information retrieval application [3]. Many features for written language identification, such as letters, words, n-grams and semantics, have been explored in previous research [4], [5], [6]. Various multilingual information retrieval systems have utilized these features, which have proved effective for language identification [7].

A number of methods have been proposed for written language identification. For example, Tran and Sharma (2005) proposed Markov models for European written language identification [8] and found that letter representation performs better than trigrams. Biemann and Teresniak (2005) presented an unsupervised solution to European language identification [9]; their method sorts a multilingual text corpus into different languages on the basis of sentences. Furthermore, Vinosh Babu and Baskaran presented a language identification system that uses multivariate analysis (MVA) for dimensionality reduction and classification [10], and compared its performance with existing systems such as n-grams and compression. Sebastiani (2002) formulated language identification as a text categorization problem [11]; it can therefore take advantage of many methods in feature extraction, feature reduction and classifier design in order to improve the capability and performance of language identification.

Several limitations arise when dealing with web documents, such as additional information for the visual appearance of a web page, data formatted as lists, spelling and syntax errors, the character encodings applied, and a tremendous number of international terms and names [12]. According to Constable and Simons (2000), problems of language identification include changes in the knowledge of languages, different definitions of languages, inadequate operational definitions in existing systems, the scale of languages, and the documentation of existing systems [13]. Sibun and Reynar (1996) examined a number of issues in language identification, such as the type of significant features, the form of analysis, the form of encoding, the number of languages used, the form of input, the size of the text, and the appropriateness of the method [14].

Initially, we applied a letter frequency neural network [15] and a letter frequency fuzzy ARTMAP [2] to Arabic script language identification. However, any application involving machine learning methods is time consuming and computationally costly. We therefore developed the algorithm further by applying only simple statistical concepts to solve the Arabic script language identification problem, based on the letter frequencies of each document. For example, some English letter frequencies differ greatly from those of other languages, such as Malay [15]. Based on this assumption, we argue that text-based language identification can be solved using letter representation. Moreover, the fundamental units of any document are letters, which are used to construct words, sentences and paragraphs; it is therefore better to deal with letters instead of words.

This paper is organized as follows. Section 2 provides a description of the proposed letter weighting feature selection method. In Section 3, we discuss the experimental methodology, followed by the result analysis and discussion in


Section 4. The conclusion is summarized in Section 5.

2. Improved letter weighting

The term frequency inverse document frequency (TFIDF) is a method for weighting terms, often used in information retrieval and data mining in order to find the most important terms for representing each category [16]. In our proposed method, however, we use the letter frequency document frequency (LFDF) to weight the letters and to find the most frequently appearing letters, ordered by sequence1. The sequence of the selected features is another important aspect of our proposed method, used to find the converge point β and the prefix converge point γ.

The letter frequency in a given document is simply the number of times a given letter appears in that document. It gives a measure of the appearance of the letter Li within a particular document Dj and is given by

LFi,j = ni,j / Σk nk,j    (1)

DFi = log |{Dj : Li ∈ Dj}|    (2)

LFDFi,j = LFi,j × DFi    (3)

where ni,j is the number of occurrences of the considered letter Li in document Dj, the denominator Σk nk,j is the number of occurrences of all letters in document Dj, and the document frequency DFi is the logarithm of the number of documents in which the letter Li appears. The letter weight LFDFi,j is then the product of the letter frequency LFi,j and the document frequency DFi. A high LFDF weight is reached by a high letter frequency (in the given documents) combined with a high document frequency of the letter in the whole training collection; the weight hence tends to surface the most frequently appearing letters in order. We then select the letters as features in sequence, from the highest weight to the lowest. Because the proposed method focuses on letter frequency, it considers not only the local frequency but also the document frequency, in order to increase the weighting of letters characteristic of a particular language.

Table 1 illustrates the selected letters as features and their corresponding letter weightings. For example, the first feature of the Pashto language is the Arabic script letter 'WAW', whose decimal Unicode codepoint is 1608, with a weight of 162.14 units. It is followed by the second letter, the Arabic script letter 'ALEF', whose decimal Unicode codepoint is 1575, with a weight of 134.73 units, and so on.

Table 1. Letter weighting feature selection

The sequence of features contributes to the parameters converge point β and prefix converge point γ. Although some feature positions may hold the same letter across languages, the sequence of features differs in each language. We select the same number of features for each category, excluding letters whose weighting is zero; a letter with zero weighting is useless for classification because it appears never or only a few times in the given document or collection. Finally, the selected features for each language are used to predict a document's language: when a new document of unknown language is fetched by the system, the sum of features α is computed for each language, and the highest sum wins. Note that the sum is the total of the letter frequencies of the selected features for each language; the sums of features α for the different languages are then compared.
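A minimal Python sketch of the LFDF weighting and feature selection may clarify the computation. The function names are illustrative, and accumulating the per-document weights by summation over the training set is an assumption, since the paper does not state how the weights from individual documents are combined:

```python
import math
from collections import Counter

def lfdf_weights(docs):
    """Compute LFDF weights (eqs. (1)-(3)) for one language's training set.

    docs: list of strings, the training documents of a single language.
    Returns a Counter mapping letter -> accumulated LFDF weight; summation
    over documents is an assumption, not stated in the paper.
    """
    # Document frequency: in how many training documents each letter occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = Counter()
    for doc in docs:
        counts = Counter(doc)            # n_{i,j}: occurrences of each letter
        total = sum(counts.values())     # sum over k of n_{k,j}
        if total == 0:
            continue                     # skip empty documents
        for letter, n in counts.items():
            lf = n / total               # eq. (1): letter frequency
            dfi = math.log(df[letter])   # eq. (2): log document frequency
            weights[letter] += lf * dfi  # eq. (3), accumulated over docs
    return weights

def select_features(weights, k):
    """Keep the k highest-weighted letters with nonzero weight, in
    descending order of weight; this ordering is the 'sequence' later
    used for the converge points."""
    return [letter for letter, w in weights.most_common() if w > 0][:k]
```

Note that a letter appearing in only one training document receives log(1) = 0 and is therefore excluded by the zero-weight filter, consistent with the selection rule above.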

However, we noticed that applying only the sum of features α for comparison is not enough, because in certain cases the same highest score appears for several languages. Therefore, we also introduce two further parameters, namely the converge point β and the prefix converge point γ. The letter sequence representing each language is important at this stage: we noticed that the most likely language of a document converges faster than the other languages when the sum of features is calculated.

β = index_cur, if α_new > α_old; otherwise continue    (4)

γ = index_pre, if α_new > α_old; otherwise continue    (5)

The formulation of β and γ is given by (4) and (5), respectively, where index_cur is the position of the current feature, α_new is the total including the current feature frequency, α_old is the total up to the previous feature, and index_pre is the previous value of index_cur. For example, an unknown document is evaluated from the viewpoint of each language, such as Arabic and Persian, and the α, β and γ of each language are calculated from that language's features. Here the index is the position of a feature in the sequence of letter weightings in descending order, the codepoint is the decimal Unicode value, the letter is the actual character, the frequency is the number of occurrences of the selected letter in the particular document, and the total is the sum of the current letter frequency and the previous total; finally, the sum of features α is obtained. The converge points β and prefix converge points γ for Arabic and Persian are four, three, three and two, respectively. In Figure 1, we assume that the same highest frequency is found for both Arabic and Persian. This causes the sum of features to fail to identify the language; we therefore check a second condition by finding the lowest converge point, and if the converge points are also the same, we use a third condition by finding the lowest prefix converge point.

1. LFDF is similar to the concept of term frequency document frequency (TFDF), but focuses on the letter instead of the term.
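The scoring and the three-condition rule can be sketched in Python as follows. The function names are illustrative; the reading of eqs. (4) and (5) used here (β as the 1-based index of the last feature whose frequency increased the running total α, and γ as the preceding such index) is one plausible interpretation, and the tie-breaking on the lowest β and then the lowest γ follows the description in the text:

```python
from collections import Counter

def score_language(text, features):
    """Compute (alpha, beta, gamma) for one language's ranked feature list.

    features: letters in descending LFDF-weight order for this language.
    Assumed reading of eqs. (4)-(5): beta is the index of the last feature
    whose frequency increased the running total, gamma the index before it.
    """
    counts = Counter(text)
    alpha, beta, gamma = 0, 0, 0
    for index, letter in enumerate(features, start=1):
        new = alpha + counts[letter]
        if new > alpha:          # alpha_new > alpha_old, eq. (4)
            gamma = beta         # prefix converge point, eq. (5)
            beta = index         # converge point
        alpha = new
    return alpha, beta, gamma

def identify(text, language_features):
    """Rule-based decision: highest alpha wins; ties broken by the lowest
    beta, then the lowest gamma; otherwise the language is undefined."""
    scores = {lang: score_language(text, feats)
              for lang, feats in language_features.items()}
    best_a = max(a for a, b, g in scores.values())
    cands = [l for l, (a, b, g) in scores.items() if a == best_a]
    if len(cands) > 1:
        best_b = min(scores[l][1] for l in cands)
        cands = [l for l in cands if scores[l][1] == best_b]
    if len(cands) > 1:
        best_g = min(scores[l][2] for l in cands)
        cands = [l for l in cands if scores[l][2] == best_g]
    return cands[0] if len(cands) == 1 else None
```

If all three conditions tie, `identify` returns `None`, corresponding to the undefined condition discussed below.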

Figure 1. Converge point and prefix converge point

In other words, our proposed letter weighting feature selection is combined with a rule-based decision consisting of three conditions, with priority given to the sum of features (α), then the converge point (β), then the prefix converge point (γ). As soon as one of the conditions is satisfied, the language of the document is identified. The rule-based decision is derived as follows:

if sum of features (α) > all others then language identified
else if converge point (β) < all others then language identified
else if prefix converge point (γ) < all others then language identified
else continue

Note that if none of the conditions is satisfied, the language of the document cannot be identified by the rule-based decision, and the process continues with the next sample. We noticed that this undefined condition occurs when the language of the document is outside the scope of the proposed method.

3. Methodology

The languages used are Arabic, Persian, Urdu and Pashto, all of which belong to the Arabic script. Figure 2 shows the methodology applied in this work. First, we identify the Uniform Resource Locators (URLs) of the target documents on the BBC news website. Second, we use a crawler [17] to download the documents of interest into our repository. Third, we discard irrelevant strings in the documents, such as Hypertext Markup Language (HTML) code and characters outside the range of the Arabic script. Then, we apply our proposed letter weighting feature selection to the preprocessed documents. Finally, the predefined rule-based decisions are used to identify the language of the documents.

Figure 2. Process flow in language identification

4. Experimental results

Table 2. Arabic script news dataset

Language   Units   Source   Charset
Arabic     1000    BBC      Windows-1256
Persian    1000    BBC      UTF-8
Urdu       1000    BBC      UTF-8
Pashto     1000    BBC      UTF-8

To evaluate the improved letter weighting feature selection method, we used Arabic script languages. We collected 1000 documents for each language from the BBC news website [18], assuming that only one language is used in each web document. Since the total for each language is 1000 documents, when 800 documents are used for testing, the remaining 200 are used for training. Table 2 shows the dataset collected in this work, including the number of units, the source and the original charset. The standard Unicode encoding scheme is used in the experiments.

Figure 3 shows the accuracy of Arabic script language identification using the proposed method. With 800 testing documents, the accuracy for Urdu reached only 81.88%, which might be caused by the low number of training documents. In contrast, with 200 testing documents the accuracy for both Persian and Pashto is 100%. This indicates that the more documents are used for training, the better the results become. Figure 4 extends Figure 3 with the average accuracy of Arabic script language identification: the average accuracies for 800, 600, 400 and 200 testing documents are 94.5%, 97.75%, 98.06% and 99.75%, respectively.


Figure 3. Accuracy of Arabic script language identification

Figure 4. Average accuracy of Arabic script language identification

Initially, we used TFIDF as the feature selection method for the Arabic script language identification problem. We found that the results were not as good as predicted, because TFIDF is designed to find the most representative terms in data mining problems. We therefore adapted TFIDF into LFDF in order to find the most frequent letters in the dataset, according to the sequence of the letter weighting in descending order2. Although LFDF can find the most frequent letters among all languages, we faced the problem of how to use those frequent letters. In testing, we calculate the sum of features α for each language, and the highest is the winner. This led to a further issue: the same sum score may be found for more than one language, so the total alone cannot discriminate the language. The converge point β and the prefix converge point γ were therefore defined as second and third parameters to support the proposed method.

The sequence of the most frequent letters in each language is important at this stage. We noticed that when an unknown document (belonging, for instance, to the Arabic language) is tested, its converge point and prefix converge point are lower than those of the other languages, even when the same total is found at the end. These parameters are obtained indirectly while the sum of features is being calculated. The experimental results therefore show that our improved letter weighting feature selection method is workable for Arabic script language identification.

The proposed method is able to deal with all kinds of web documents. The basic elements of all web documents are letters; therefore, instead of word representation, it is better to use letters as features for written language identification. In a real application on web documents, recognizing the words of a document is problematic when its language is not yet known. Moreover, word representation in written language identification also suffers from the out-of-vocabulary (OOV) problem, in which the existing word library cannot support robust identification. Around 7000 languages in the world have been reported; it is difficult to gather expertise in all of them to provide word libraries, and maintaining such libraries would be another issue. We have noticed that some previous works apply language-specific stemming and stopping3 in the preprocessing step. This is doubtful, because the language of the document is still unknown at the preprocessing step. We therefore intend to overcome these issues by proposing written language identification based on letter representation only.

The disadvantage of this method lies in its inflexible features: it might not work on datasets other than BBC news, or on BBC news web documents ten years in the future, because the fundamentals of our improved method are based on the letter frequencies of a specific domain. We assume that the evolution of web applications will affect the style of language expression, in speaking or in writing, which may lead to changes in letter frequencies. Moreover, our proposed method might not function for very similar languages such as Malay and Indonesian, whose styles are very close; the possibility of finding similar features in both languages is very high, which may lead to failures in language identification. Improvements to the proposed method are still needed to overcome these limitations, for example by implementing n-grams in our proposed method or by hybridization with existing methods.

5. Conclusion

Language identification is the process of determining the predefined language of a web document. In this work, we propose an improved letter weighting feature selection method for Arabic script language identification. Initially, we used letter frequencies with machine learning methods such as a backpropagation neural network and fuzzy ARTMAP for this problem. We then further improved letter frequency into the letter weighting feature selection method, which is based on the sum of features (α), the converge point (β) and the prefix converge point (γ). The proposed method comes with some guarantees of optimality under certain limited conditions. The experimental results show that our improved method can achieve the highest accuracy of 99.75%. We conclude that written language identification can be solved using letter representation only, which directly addresses the out-of-vocabulary issue faced by systems that implement a library based on word representation.

2. That is, from the highest value of the letter weighting to the lowest one.
3. Stemming is the process of reducing inflected words to their stem, base or root form; stopping is the name given to the filtering out of words prior to, or after, processing of natural language data.

Acknowledgment

This work is supported by the Ministry of Science, Technology & Innovation (MOSTI), Malaysia and the Research Management Center, Universiti Teknologi Malaysia (UTM), under Vot 79200.

References

[1] A. Selamat and C.-C. Ng, "Arabic script documents language identification using fuzzy ART," in Proceedings of the Second Asia International Conference on Modeling & Simulation, IEEE Computer Society, Washington, DC, USA, 2008, pp. 528–533.

[2] A. Selamat, C.-C. Ng, and Y. Mikami, "Arabic script web documents language identification using decision tree-ARTMAP model," in Proceedings of the International Conference on Convergence Information Technology, 2007, pp. 721–726.

[3] G. Chowdhury, "Natural language processing," Annual Review of Information Science and Technology, vol. 37, no. 1, pp. 51–89, 2003.

[4] P. McNamee, "Character n-gram tokenization for European language text retrieval," Information Retrieval, vol. 7, pp. 73–97, 2004.

[5] S. Sagiroglu, U. Yavanoglu, and E. N. Guven, "Web based machine learning for language identification and translation," in Proceedings of the Sixth International Conference on Machine Learning and Applications, 2007, pp. 280–285.

[6] B. Martins and M. J. Silva, "Language identification in web pages," in Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, pp. 764–768.

[7] N. Ljubesic, N. Mikelic, and D. Boras, "Language identification: How to distinguish similar languages?" in Proceedings of the 29th International Conference on Information Technology Interfaces, Cavtat/Dubrovnik, Croatia, 2007, pp. 541–546.

[8] D. Tran and D. Sharma, "Markov models for written language identification," in Proceedings of the 12th International Conference on Neural Information Processing, Taiwan, 2005, pp. 67–70.

[9] C. Biemann and S. Teresniak, "Disentangling from Babylonian confusion: unsupervised language identification," in Proceedings of Computational Linguistics and Intelligent Text Processing, vol. 3406, Springer, 2005, pp. 762–773.

[10] J. Vinosh Babu and S. Baskaran, "Automatic language identification using multivariate analysis," in Computational Linguistics and Intelligent Text Processing, Springer Berlin / Heidelberg, 2005, pp. 789–792.

[11] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.

[12] A. Xafopoulos, C. Kotropoulos, G. Almpanidis, and I. Pitas, "Language identification in web documents using discrete HMMs," Pattern Recognition, vol. 37, no. 3, pp. 583–594, 2004.

[13] P. Constable and G. Simons, "Language identification and IT: Addressing problems of linguistic diversity on a global scale," SIL Electronic Working Papers, 2000. [Online]. Available: http://www.sil.org/silewp/2000/001/

[14] P. Sibun and J. C. Reynar, "Language identification: Examining the issues," in Proceedings of the Symposium on Document Analysis and Information Retrieval, 1996, pp. 125–135.

[15] A. Selamat and C.-C. Ng, "Arabic script language identification using letter frequency neural networks," International Journal of Web Information Systems, vol. 4, no. 4, pp. 484–500, 2008.

[16] D.-A. Chiang, H.-C. Keh, H.-H. Huang, and D. Chyr, "The Chinese text categorization system with association rule and category priority," Expert Systems with Applications, vol. 35, no. 1-2, pp. 102–110, 2008.

[17] X. Roche, "HTTrack website copier, offline browser," 2008, accessed June 2008. [Online]. Available: http://www.httrack.com

[18] M. Thompson, "British Broadcasting Corporation (BBC)," 2008, accessed June 2008. [Online]. Available: http://www.bbc.co.uk/
