A Statistical Method for Adding Case Ending Diacritics for Arabic Text

Khaled Shaalan, The Institute of Informatics, The British University in Dubai, [email protected]
Hitham M. Abo Bakr, Computer & System Dept., Zagazig University, [email protected]
Ibrahim Ziedan, Computer & System Dept., Zagazig University, [email protected]
Abstract
In this paper, we address the problem of adding case-ending diacritics to undiacritized Arabic text using statistical methods. The approach requires a large corpus of fully diacritized text from which the case endings are extracted. We train the system to detect the case-ending diacritic of each token based on its part of speech (POS) and base-phrase (BP) chunk position, as well as the position of the token in the statement. The case-ending diacritics are then efficiently obtained using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.
Keywords: Arabic NLP, MSA, case-ending diacritization, statistical approach, SVM, YamCha

1. Introduction
Arabic script consists of two classes of symbols: letters¹ and diacritics. Letters are always written, whereas diacritics are optional. Diacritics play a key role in disambiguating Arabic text. Most written Modern Standard Arabic (MSA) is not diacritized. The reader is expected to infer or predict vowels from the context of the sentence. Written Arabic can be fully diacritized (as in the Qur'an and some heritage literature), partially diacritized (when we want to disambiguate certain words such as عمّان (Amman, the capital of Jordan) and عُمان (Oman, the country)), or entirely undiacritized (unvoweled). There are three types of diacritics: vowels (Fat-ha "ـَ", Dama "ـُ", Kasra "ـِ"), nunation (Fathatan "ـً", Damatan "ـٌ", Kasratan "ـٍ"), and shadda ("ـّ") [2]. Case-ending diacritics play an important role in understanding the meaning of an Arabic statement: the case ending gives the correct understanding of the statement.

¹ The Arabic writing system consists of 36 letter forms which represent the Arabic consonants. These are: ا, آ, أ, إ, ئ, ؤ, ء, ى, ة, ب, ت, ث, ج, ح, خ, د, ذ, ر, ز, س, ش, ص, ض, ط, ظ, ع, غ, ف, ق, ك, ل, م, ن, ه, و, and ي.

There are many related works dealing with the problem of Arabic diacritization in general [2-5]. All of them handle the problem using statistical approaches, but they handle the case-ending (last letter) diacritics with the same technique used for the internal (any letter but the last) diacritics. In the literature, the detection of case-ending diacritics is treated as a syntactic problem, whereas the detection of internal diacritics is treated as a morphological problem. In this paper, we claim that this difference requires handling case-ending diacritization separately from internal diacritization. It is the main reason why the performance of current Arabic diacritizers decreases when case-ending diacritics are included in their evaluation [2, 5].

The problem of automatic restoration of the diacritic signs of Arabic text can be solved by two approaches. The first is a rule-based approach that involves a complex integration of Arabic morphological, syntactic, and semantic tools, with significant effort needed to acquire the respective linguistic rules. A morphological analyzer breaks down the undiacritized word according to known patterns or templates and recognizes its prefixes and suffixes. A syntax analyzer applies specific syntactic rules to determine the case-ending diacritics, usually using finite-state automata techniques. Semantic handling helps to resolve ambiguous cases and to filter out hypotheses. Rule-based diacritization is therefore a complicated process, and it takes a long time to process an Arabic sentence, which is usually long. The second is the statistical approach, where a large tagged corpus (in particular a treebank) is used to extract language statistics for estimating the missing diacritical marks. The approach is practical and fully automated. Results usually improve as the size of the corpus increases.

In this paper we demonstrate a statistical method for detecting the case-ending diacritics. The system is first trained using the Penn Arabic Treebank. The proposed method is efficient and can run in parallel with the detection of the internal diacritics, which has achieved acceptable results.

The paper is organized as follows. The next section gives an overview of the proposed Arabic diacritizer. This is followed by the experiment conducted to evaluate our diacritizer. Finally, we conclude the paper and give directions for future research.
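To make the distinction between diacritized and undiacritized text concrete, the following minimal sketch strips diacritics from a word. It is an illustration only and is not part of the system described in this paper; the function name and the choice of the Unicode range U+064B to U+0652 (which also covers sukun) are assumptions made for the example.

```python
import re

# Assumed range: U+064B..U+0652 covers fathatan, dammatan, kasratan,
# fatha, damma, kasra, shadda, and sukun in the Unicode Arabic block.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with all diacritic marks in the assumed range removed."""
    return ARABIC_DIACRITICS.sub("", text)


# The two partially diacritized words used as an example in the introduction.
AMMAN = "\u0639\u0645\u0651\u0627\u0646"  # Amman, written with a shadda
OMAN = "\u0639\u064F\u0645\u0627\u0646"   # Oman, written with a damma

# Once the diacritics are removed, both words have the same written form,
# which is the ambiguity a reader (or a diacritizer) has to resolve.
UNDIACRITIZED_AMMAN = strip_diacritics(AMMAN)
UNDIACRITIZED_OMAN = strip_diacritics(OMAN)
```

Both words reduce to the same four letters, which is why the undiacritized form alone cannot tell them apart.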
2. The Proposed Arabic Diacritizer
2.1 Overview of the Treebank
Treebanks are language resources that provide annotations of natural languages at various levels of structure: word level, phrase level, and sentence level. Treebanks have become crucially important for the development of data-driven approaches to natural language processing. The Arabic Treebank was created on top of a corpus that had already been annotated with POS tags. The Penn Arabic Treebank (ATB) began in the fall of 2001 [1] and has now completed four full releases of morphologically and syntactically annotated data. Version 1 of the ATB has three parts with different releases; some of them, such as Part 1 V3.0 and Part 2 V2.0, are fully diacritized trees. For example, the following undiacritized statement:
"لليوم الثاني على التوالي تظاهر طلاب ينتمون الى جماعة ...."
"llywm AlvAny ElY AltwAly tZAhr TlAb Yntmwn
(PREP
Abbreviation         Meaning
B-NCE                No Case Ending
CASE_DEF_GEN         Kasra ـِ
CASE_INDEF_GEN       Kasratan ـٍ
CASE_DEF_NOM         Dama ـُ
CASE_DEF_ACC         Fat-ha ـَ
CASE_DEF_ACCGEN      Maftouh ba Kasra ـِ
CASE_INDEF_NOM       Damatan ـٌ
CASE_INDEF_ACCGEN    Fathatan ـً
Table 1. Meaning of abbreviations used in the Treebank
The above representation is partially extracted from the tree file UMAAH_UM.ARB_20020120a.0007.tree, which is provided in the ATB Part 2 V.2. Figure 1 shows a graphical representation of this tree.
Figure 1. A graphical representation of an Arabic sentence extracted from the Penn Arabic Treebank
Figure 2. Highlighting the case-ending diacritics in the annotated parse tree of the Arabic Treebank²
Figure 2 uses circles to highlight the case ending in the graphical representation of the Treebank. The case ending is indicated by one of the following tags: CASE_DEF_GEN, CASE_INDEF_GEN, CASE_DEF_NOM, CASE_DEF_ACC, CASE_DEF_ACCGEN, CASE_INDEF_NOM, and CASE_INDEF_ACCGEN. Table 1 gives the complete meaning of these abbreviations.
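The correspondence in Table 1 can be read as a lookup from a predicted tag to the diacritic placed on the last letter of a word. The sketch below illustrates that reading; it is not the authors' implementation. The function name, the handling of the `B-` prefix, and the simplification of appending the mark at the very end of the word are assumptions (in standard orthography, for instance, Fathatan often interacts with a final alif).

```python
# Unicode code points for the diacritics named in Table 1.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

# Tag-to-diacritic mapping transcribed from Table 1.
CASE_TAG_TO_DIACRITIC = {
    "NCE": "",                      # no case ending
    "CASE_DEF_GEN": KASRA,
    "CASE_INDEF_GEN": KASRATAN,
    "CASE_DEF_NOM": DAMMA,
    "CASE_DEF_ACC": FATHA,
    "CASE_DEF_ACCGEN": KASRA,       # mark as listed in Table 1
    "CASE_INDEF_NOM": DAMMATAN,
    "CASE_INDEF_ACCGEN": FATHATAN,
}


def apply_case_ending(word: str, tag: str) -> str:
    """Append the diacritic for `tag` to an undiacritized word.

    Assumption: tags may carry a chunk-style "B-" or "I-" prefix,
    which is dropped before the lookup.
    """
    key = tag[2:] if tag.startswith(("B-", "I-")) else tag
    return word + CASE_TAG_TO_DIACRITIC[key]


# Example: the word for "students" tagged as indefinite nominative.
STUDENTS = "\u0637\u0644\u0627\u0628"
STUDENTS_WITH_CASE = apply_case_ending(STUDENTS, "B-CASE_INDEF_NOM")
```

A word tagged B-NCE is returned unchanged, so the function can be applied uniformly to every token of a sentence.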
² Figure 1 and Figure 2 are graphical representations of Treebank files, displayed using our Treebank Viewer tool; see http://www.staff.zu.edu.eg/hmabobakr/page.asp?id=53

2.2 Different Techniques for Diacritizing an Unvoweled Arabic Word
It is important to mention that in our current research we distinguish between internal diacritization and case-ending diacritization (علامات الاعراب), since the former requires morphological analysis while the latter depends on syntactic analysis. We have successfully solved the Arabic internal diacritization problem using three different techniques, each of which has its own strengths and weaknesses. We combined them to optimize the performance of our diacritizer and to remove ambiguities to a large extent. These techniques are: 1) lexicon lookup, 2) a statistical diacritizer, and 3) a bigram diacritizer (results for the internal diacritizer will be reported in another publication).

Case-ending diacritization is treated as a post-process of internal diacritization. We have built a new statistical approach for detecting case-ending diacritic signs. The final result is a fully diacritized Arabic word. The idea is to relate the case ending of each token to its POS and chunk position as well as its position in the sentence. We trained a Support Vector Machine (SVM) on undiacritized tokens because the input text expected by the system during evaluation and testing is undiacritized. We extract a sequence of tokens with their POS, BP chunk, and case ending from the Treebank using the YamCha File Creator (YFC) utility³. The basic approach used in YFC is inspired by the work of Sang and Buchholz [6] on a treebank-to-chunk conversion script, which we have extended for use with Arabic. This required adding some features, such as the case ending. The output of the YFC utility for the case-ending training process is shown in Table 2.
Token    POS   Chunk   Case Ending
l        IN    B-PP    B-NCE
Al       DT    B-NP    B-NCE
ywm      NN    I-NP    B-CASE_DEF_GEN
Al       DT    I-NP    B-NCE
vAny     JJ    I-NP    B-NCE
ElY      IN    B-PP    B-NCE
Al       DT    B-NP    B-NCE
twAly    NN    I-NP    B-NCE
tZAhr    VBD   B-VP    B-NCE
TlAb     NN    B-NP    B-CASE_INDEF_NOM
Yntmwn   VBP   B-VP    B-NCE
[...]    IN    B-PP    B-NCE
[...]    NN    B-NP    B-CASE_DEF_GEN
Table 2. Training file format for detecting case ending
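The rows of Table 2 can be turned into whitespace-separated training lines, and the context used for each token can be pictured as a window over neighbouring rows. The sketch below illustrates both under stated assumptions: the helper names, the padding symbol, and the feature naming are hypothetical, the tag spelling is normalized to `B-CASE_...`, and only the rows whose tokens are legible in Table 2 are included. A window of two tokens on each side is the size reported in the evaluation. The actual feature expansion is performed internally by the SVM tool and may differ.

```python
from typing import Dict, List, Tuple

# One row per token: (token, POS, chunk, case-ending tag), as in Table 2.
Row = Tuple[str, str, str, str]

SENTENCE: List[Row] = [
    ("l", "IN", "B-PP", "B-NCE"),
    ("Al", "DT", "B-NP", "B-NCE"),
    ("ywm", "NN", "I-NP", "B-CASE_DEF_GEN"),
    ("Al", "DT", "I-NP", "B-NCE"),
    ("vAny", "JJ", "I-NP", "B-NCE"),
    ("ElY", "IN", "B-PP", "B-NCE"),
    ("Al", "DT", "B-NP", "B-NCE"),
    ("twAly", "NN", "I-NP", "B-NCE"),
    ("tZAhr", "VBD", "B-VP", "B-NCE"),
    ("TlAb", "NN", "B-NP", "B-CASE_INDEF_NOM"),
    ("Yntmwn", "VBP", "B-VP", "B-NCE"),
]

PAD = "_PAD_"  # hypothetical symbol for positions outside the sentence


def to_training_lines(rows: List[Row]) -> List[str]:
    """Format rows as one whitespace-separated line per token, label last."""
    return [" ".join(row) for row in rows]


def window_features(rows: List[Row], i: int, size: int = 2) -> Dict[str, str]:
    """Collect token, POS, and chunk for positions i-size .. i+size.

    The case-ending column is the label to predict, so it is not
    included among the features.
    """
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(rows):
            tok, pos, chunk, _label = rows[j]
        else:
            tok = pos = chunk = PAD
        feats[f"tok[{offset:+d}]"] = tok
        feats[f"pos[{offset:+d}]"] = pos
        feats[f"chunk[{offset:+d}]"] = chunk
    return feats


# Context for "TlAb" (index 9): the preceding verb and the following verb
# are both visible inside the window.
TLAB_FEATURES = window_features(SENTENCE, 9)
```

With this window, the token TlAb sees the verb tZAhr immediately before it, which is the kind of local evidence that a nominative case ending depends on.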
³ The YFC utility is a command-line utility. We developed it in C++ to extract information from the Penn Arabic Treebank (ATB) and create a YamCha-format file to be used in the training process. http://www.staff.zu.edu.eg/hmabobakr/page.asp?id=53

3- Evaluation
We used the YFC utility to create the training and testing sets. This ensured that the definition of what constitutes a chunk is the same in both the training and the testing components. We used YamCha [8-10], a well-known and reliable SVM tool. The ATB Part 2 V2.0 corpus includes 501 stories from the Ummah Arabic News Text, with a total of 144,199 words (counting non-Arabic tokens such as numbers and punctuation). We used the vocalized version of the Treebank for the experiments and the same evaluation criteria described in [11]. We used 90% of the corpus for training and 10% for testing. All the data is derived from the parsed trees in the Treebank. We use a standard SVM with a polynomial kernel of degree 2 and C = 1.0 (cf. [11]). The system is evaluated by calculating accuracy (Acc), precision (Prec), recall (Rec), and the F-measure. The overall results are listed in Table 3.

Measurement   Overall Results
Accuracy      0.953485
Precision     0.809145
Recall        0.837309
F-Measure     0.822986
Table 3. Final results of Case Ending evaluation

Some detailed results are listed in Table 4.

Class           Precision   Recall
B-NCE           0.987359    0.987177
DEF_GEN         0.963374    0.932665
INDEF_GEN       0.931354    0.919414
DEF_NOM         0.704425    0.780392
DEF_ACC         0.757202    0.767733
DEF_ACCGEN      0.942982    0.903361
INDEF_NOM       0.692810    0.736111
INDEF_ACCGEN    0.896552    0.812500
Total           0.809145    0.837309
Table 4. Detailed Case Ending results for precision and recall

We observe the following from Table 4:
• We got good results for No Case Ending (B-NCE) and DEF_GEN detection because we used a training window of -2 to +2 around each token, which is sufficient for detecting DEF_GEN cases such as (الجار والمجرور) or (المضاف والمضاف إليه).
• Results are lower for DEF_NOM and DEF_ACC, which is inevitable even with the rule-based parsing approach [16]. The reasons for this lower accuracy are the following:
1. Long sentences with distant grammatical constituents, e.g.: "قَرَّرَ بَنْكُ مِصْرَ ثانِي أَكْبَرِ المَصارِفِ العامَّةِ المِصْرِيَّةِ وَمَصْرِفُ مِصْرَ الدُوَلِيّ اللَّذانِ يَرْأَسُ مَجْلِسَ إِدارَتِهِم الدُكْتُورُ بَهاء الدِين حِلْمِي رَئِيسُ إِتِّحادِ المَصارِفِ المِصْرِيَّةِ, التَدَخُّلَ لِمُسانَدَةِ مَصْرِفِ مِصْرَ إِكْستِرْيُور". In the above statement, the subject "التدخل" is far away from its verb and object, which is difficult to detect with our approach.
2. The free word order of the Arabic sentence, especially when the object precedes the subject (i.e. the order VOS instead of the common order VSO), e.g. "سجل الهدف اللاعب ...". The system failed to detect which word is the subject and has CASE_DEF_NOM and which word is the object and has CASE_DEF_ACC. This structural ambiguity can only be resolved at the semantic level, which is not included in the ATB.
3. The presence of an elliptic personal pronoun, "alDamiir almustatir" (e.g. "ولعب كرة عرضية ..."). This is a difficult problem to resolve under the statistical approach as it needs contextual analysis.
Increasing the window size in the training phase and using extra training data can help to enhance the performance of the system.
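The F-measure reported in Table 3 is consistent with the harmonic mean of the reported precision and recall. The sketch below shows that calculation and applies it to the per-class figures of Table 4. It assumes the balanced F-measure (equal weight on precision and recall), which the text does not state explicitly; the function and variable names are illustrative.

```python
from typing import Dict, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, transcribed from Table 4.
TABLE_4: Dict[str, Tuple[float, float]] = {
    "B-NCE": (0.987359, 0.987177),
    "DEF_GEN": (0.963374, 0.932665),
    "INDEF_GEN": (0.931354, 0.919414),
    "DEF_NOM": (0.704425, 0.780392),
    "DEF_ACC": (0.757202, 0.767733),
    "DEF_ACCGEN": (0.942982, 0.903361),
    "INDEF_NOM": (0.692810, 0.736111),
    "INDEF_ACCGEN": (0.896552, 0.812500),
}

# Overall F-measure from the "Total" row of Table 4.
OVERALL_F = f_measure(0.809145, 0.837309)

# Per-class F-measures, useful for ranking the classes by difficulty.
PER_CLASS_F = {name: f_measure(p, r) for name, (p, r) in TABLE_4.items()}
HARDEST_CLASS = min(PER_CLASS_F, key=PER_CLASS_F.get)
```

Ranking the classes this way reproduces the observation above: the nominative and accusative classes score lowest, while B-NCE and the genitive classes score highest.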
4- Conclusions and Future Work
In this paper, we proposed a statistical approach for diacritizing the case ending of an Arabic word using the SVM machine learning technique. SVMs give very good results for many NLP tasks, such as POS tagging and base NP chunking. This approach is practical and fully automated. The results are promising and can be useful in many applications that need real-time diacritization. The method is appealing compared to hand-crafted rule-based approaches. The evaluation of our system showed that the technique achieves 95.3% accuracy and an F-measure of 82%. Our future work is to increase the training set by using the latest fully diacritized Treebank, such as Part 1 V3.0 [13-15], which is currently not available to us because of budget limitations. Increasing the window size during training is also expected to enhance the results.

References
[1] Mohamed Maamouri, Ann Bies, and Tim Buckwalter, "The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus," In NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2004.
[2] Habash N. and Rambow O., "Arabic Diacritization through Full Morphological Tagging," In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Rochester, New York, 2007.
[3] Imed Zitouni, Jeffrey S. Sorensen, and Ruhi Sarikaya, "Maximum Entropy Based Restoration of Arabic Diacritics," In Proceedings of ACL'06, 2006.
[4] Ananthakrishnan, S. Narayanan, and S. Bangalore, "Automatic Diacritization of Arabic Transcripts for ASR," In Proceedings of ICON-05, Kanpur, India, 2005.
[5] Elshafei M., Al-Muhtaseb H., and Alghamdi M., "Statistical Methods for Automatic Diacritization of Arabic Text," The Saudi 18th National Computer Conference, Riyadh, 18: 301-306, 2006.
[6] Sang E. and Buchholz S., "Introduction to the CoNLL-2000 Shared Task: Chunking," Proceedings of CoNLL-2000 and LLL-2000, pages 127-132, Lisbon, Portugal, 2000.
[7] Taku Kudo and Yuji Matsumoto, "Fast Methods for Kernel-Based Text Analysis," In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1 (Sapporo, Japan, July 07-12, 2003), Association for Computational Linguistics, Morristown, 2003.
[8] Taku Kudoh and Yuji Matsumoto, "Use of Support Vector Learning for Chunk Identification," In Proceedings of the 4th Conference on CoNLL-2000 and LLL-2000, pages 142-144, 2000.
[9] Nello Cristianini and John Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," The Press Syndicate of the University of Cambridge, Cambridge, United Kingdom, 2000.
[10] Marti A. Hearst, "Support Vector Machines," IEEE Intelligent Systems, vol. 13, no. 4, pp. 18-28, Jul/Aug 1998.
[11] Mona Diab, Kadri Hacioglu, and Daniel Jurafsky, "Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks," In Proc. of HLT/NAACL 2004, Boston, 2004.
[12] Beatrice Santorini, "Part-of-Speech Tagging Guidelines for the Penn Treebank Project," ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz, 1990.
[13] Mohamed Maamouri, Seth Kulick, and Ann Bies, "Diacritic Annotation in the Arabic Treebank and Its Impact on Parser Evaluation," LREC 2008, Marrakech, Morocco, May 28-30, 2008.
[14] Mohamed Maamouri, Ann Bies, and Seth Kulick, "Enhanced Annotation and Parsing of the Arabic Treebank," INFOS 2008, Cairo, Egypt, March 27-29, 2008.
[15] Mohamed Maamouri, Ann Bies, and Seth Kulick, "Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines," LREC 2008, Marrakech, Morocco, May 28-30, 2008.
[16] Othman E., Shaalan K., and Rafea A., "Towards Resolving Ambiguity in Understanding Arabic Sentence," In NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2004.